Word Count with Apache Spark and Python (Mac OS X)


As a first step in trying out Apache Spark, I ran a word count. As anyone with Hadoop experience knows, it counts how many times each word appears in a file. The environment is Mac OS X, but the steps should be almost the same on Linux. The complete code is here.


$ brew install apache-spark

Verify the installation

The installation is OK if spark-shell starts and the `scala>` prompt is displayed.

$ /usr/local/bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/04/07 16:44:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:51 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/07 16:44:51 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/04/07 16:44:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/07 16:44:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.


Try a word count on local files with Python

This was written with reference to the examples on the official site.

Directory structure

Prepare the following layout.

$ tree
├── input
│   └── data #Text to read
└── wordcount.py #Execution script

1 directory, 2 files

Write the code

Here we use Python. You could also write it in Scala or Java, but Python is what I'm most comfortable with, so I'll go with that. It looks like this.


#!/usr/bin/env python
# coding:utf-8

from pyspark import SparkContext

def execute(sc, src, dest):
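    """Perform word count."""
    # Read the src file(s)
    text_file = sc.textFile(src)
    # Split each line into words, pair each word with 1, then sum per word
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    # Write the results to dest
    counts.saveAsTextFile(dest)
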
if __name__ == '__main__':
    sc = SparkContext('local', 'WordCount')
    src  = './input'
    dest = './output'
    execute(sc, src, dest)
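
To see what the three chained transformations in the script do, here is a rough plain-Python equivalent that runs without Spark. It is only a sketch for illustration: the function name `word_count` and the sample lines are made up here and are not part of the original script or its input file.

```python
from collections import defaultdict

def word_count(lines):
    """Plain-Python stand-in for the flatMap -> map -> reduceByKey pipeline."""
    # flatMap: split each line on single spaces and flatten into one word list
    words = [word for line in lines for word in line.split(" ")]
    # map: pair each word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts of all pairs that share the same word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Hypothetical sample lines, not the article's input file
sample = ["aaa bbb", "aaa ccc"]
print(sorted(word_count(sample).items()))
# prints [('aaa', 2), ('bbb', 1), ('ccc', 1)]
```

Note that splitting on a single space, as the script does, turns consecutive spaces into empty-string "words", which are counted like any other key.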

Prepare the input file

Any text will do. Put a few space-separated words in input/data.




Run it with the following command.

$ which pyspark

$ pyspark ./wordcount.py

When you run it, log output streams by (much like Hadoop Streaming).



(u'aaa', 3)
(u'bbbb', 1)
(u'bbb', 1)
(u'ccc', 2)

It was counted correctly.
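
If you want to check the result from Python rather than by eye, the output files can be read back into a dict. This is a minimal sketch, assuming the output consists of `part-*` files whose lines are tuple reprs like the ones shown above. The helper name `read_counts` is hypothetical and not part of the original code.

```python
import ast
import glob
import os

def read_counts(output_dir):
    """Parse the (word, count) tuples written under output_dir into a dict."""
    counts = {}
    # saveAsTextFile writes one part-* file per partition
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    # Each line is the repr of a tuple, e.g. (u'aaa', 3)
                    word, n = ast.literal_eval(line)
                    counts[word] = n
    return counts
```

For example, `read_counts('./output')` could be called after the job finishes and compared against the counts you expect.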


Note that if the output directory (./output) already exists, the next run will fail. It is a good idea to keep a shell script like the one below in the same directory.



rm -fR ./output
/usr/local/bin/pyspark ./wordcount.py

echo ">>>>> result"
cat ./output/*

$ sh exec.sh
・ ・ ・
>>>>> result
(u'aaa', 3)
(u'bbbb', 1)
(u'bbb', 1)
(u'ccc', 2)
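
As an alternative to the wrapper script, the cleanup could also be done in Python before the job starts. This is a minimal sketch; the helper name `clear_output` is hypothetical and not part of the original wordcount.py.

```python
import os
import shutil

def clear_output(dest):
    """Delete the output directory if it exists.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    if os.path.isdir(dest):
        shutil.rmtree(dest)
        return True
    return False
```

Calling `clear_output(dest)` just before `execute(sc, src, dest)` would have the same effect as the `rm -fR ./output` line in the script above.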
