As a first step in verifying Apache Spark, let's count the words in a file, the classic "word count" that anyone with Hadoop experience will recognize. The environment is Mac OS X, but the steps should be nearly identical on Linux. The complete code is here.
$ brew install apache-spark
It's OK if spark-shell launches and the `scala>` prompt is displayed.
$ /usr/local/bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/04/07 16:44:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:51 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/07 16:44:51 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/04/07 16:44:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/07 16:44:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.
scala>
This was written with reference to the description on the Official Site.
Prepare the files as follows.
$ tree
.
├── input
│   └── data          # Text to read
└── wordcount.py      # Execution script
1 directory, 2 files
Here we use Python. You could also write it in Scala or Java, but Python is the language I'm most comfortable with, so let's go with that. The script looks like this.
wordcount.py
#!/usr/bin/env python
# coding:utf-8
from pyspark import SparkContext


def execute(sc, src, dest):
    '''
    Perform word count
    '''
    # Read the src file
    text_file = sc.textFile(src)
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    # Export results
    counts.saveAsTextFile(dest)


if __name__ == '__main__':
    sc = SparkContext('local', 'WordCount')
    src = './input'
    dest = './output'
    execute(sc, src, dest)
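The flow is straightforward: flatMap splits each line into words, map turns each word into a (word, 1) pair, and reduceByKey adds up the 1s per word. If you just want to peek at the result without writing files, here is a minimal sketch using collect() (run in the pyspark interactive shell where sc already exists, or after creating the SparkContext as in the script; collect() pulls everything back to the driver, so this is only for tiny data like ours):

# Same pipeline, but collecting the result instead of saving it
words = sc.parallelize(['aaa', 'bbb', 'aaa'])
counts = words.map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('aaa', 2), ('bbb', 1)] (order may vary)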
Prepare the input data however you like. For example, like this.
./input/data
aaa
bbb
ccc
aaa
bbbb
ccc
aaa
Run it with the following commands.
$ which pyspark
/usr/local/bin/pyspark
#Run
$ pyspark ./wordcount.py
When you execute it, a stream of log lines will scroll by (much like Hadoop Streaming).
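If the console gets too noisy, the spark-shell banner above suggests sc.setLogLevel; a minimal sketch of turning the verbosity down near the top of wordcount.py (setLogLevel should be available in this Spark version, but check against your install):

from pyspark import SparkContext

sc = SparkContext('local', 'WordCount')
sc.setLogLevel('WARN')  # only warnings and errors from Spark from here on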
./output/part-00000
(u'aaa', 3)
(u'bbbb', 1)
(u'bbb', 1)
(u'ccc', 2)
It was counted correctly.
Note that if the output directory (./output) already exists, the next run will fail. It is handy to put a shell script like the one below in the same directory.
exec.sh
#!/bin/bash
rm -fR ./output
/usr/local/bin/pyspark ./wordcount.py
echo ">>>>> result"
cat ./output/*
$ sh exec.sh
...
>>>>> result
(u'aaa', 3)
(u'bbbb', 1)
(u'bbb', 1)
(u'ccc', 2)
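As an alternative to the wrapper script, you could also clear the old output from inside wordcount.py itself before saving; a minimal sketch using only the Python standard library (os and shutil, nothing Spark-specific):

import os
import shutil

dest = './output'
# saveAsTextFile refuses to overwrite, so remove any previous output first
if os.path.exists(dest):
    shutil.rmtree(dest)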