As a first step in verifying Apache Spark, let's count the words in a file, the classic "word count" that anyone with Hadoop experience will recognize. The environment is Mac OS X, but the steps should be nearly identical on Linux. The complete code is here.
$ brew install apache-spark
It's OK if spark-shell launches and the `scala>` prompt is displayed.
$ /usr/local/bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/04/07 16:44:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:51 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/07 16:44:51 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/04/07 16:44:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/07 16:44:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.
scala>
This was written with reference to the description on the Official Site.
Prepare the files as follows.
$ tree
.
├── input
│   └── data          # Text to read
└── wordcount.py      # Execution script
1 directory, 2 files
Here we use Python. You could also write it in Scala or Java, but Python is the language I'm most comfortable with, so let's go with that. The script looks like this.
wordcount.py
#!/usr/bin/env python
# coding:utf-8
from pyspark import SparkContext


def execute(sc, src, dest):
    '''
    Perform word count
    '''
    # Read the src file
    text_file = sc.textFile(src)
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    # Export results
    counts.saveAsTextFile(dest)


if __name__ == '__main__':
    sc = SparkContext('local', 'WordCount')
    src = './input'
    dest = './output'
    execute(sc, src, dest)
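The flow is straightforward: flatMap splits each line into words, map turns each word into a (word, 1) pair, and reduceByKey adds up the 1s per word. If you just want to peek at the result without writing files, here is a minimal sketch using collect() (run in the pyspark interactive shell where sc already exists, or after creating the SparkContext as in the script; collect() pulls everything back to the driver, so this is only for tiny data like ours):

# Same pipeline, but collecting the result instead of saving it
words = sc.parallelize(['aaa', 'bbb', 'aaa'])
counts = words.map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('aaa', 2), ('bbb', 1)] (order may vary)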
Prepare the input data however you like. For example, like this.
./input/data
aaa
bbb
ccc
aaa
bbbb
ccc
aaa
Run it with the following commands.
$ which pyspark
/usr/local/bin/pyspark
#Run
$ pyspark ./wordcount.py
When you execute it, a stream of log lines will scroll by (much like Hadoop Streaming).
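If the console gets too noisy, the spark-shell banner above suggests sc.setLogLevel; a minimal sketch of turning the verbosity down near the top of wordcount.py (setLogLevel should be available in this Spark version, but check against your install):

from pyspark import SparkContext

sc = SparkContext('local', 'WordCount')
sc.setLogLevel('WARN')  # only warnings and errors from Spark from here on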
./output/part-00000
(u'aaa', 3)
(u'bbbb', 1)
(u'bbb', 1)
(u'ccc', 2)
It was counted correctly.
Note that if the output directory (./output) already exists, the next run will fail. It is handy to put a shell script like the one below in the same directory.
exec.sh
#!/bin/bash
rm -fR ./output
/usr/local/bin/pyspark ./wordcount.py
echo ">>>>> result"
cat ./output/*
$ sh exec.sh
...
>>>>> result
(u'aaa', 3)
(u'bbbb', 1)
(u'bbb', 1)
(u'ccc', 2)
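As an alternative to the wrapper script, you could also clear the old output from inside wordcount.py itself before saving; a minimal sketch using only the Python standard library (os and shutil, nothing Spark-specific):

import os
import shutil

dest = './output'
# saveAsTextFile refuses to overwrite, so remove any previous output first
if os.path.exists(dest):
    shutil.rmtree(dest)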