Since Hadoop jobs can be written in languages other than Java, I spent the Python Mokumokukai and the New Year holidays writing a wrapper called SkipJack that lets you use Python, with its strong machine learning ecosystem, on Hadoop.
The code is on GitHub (not installable via pip): GitHub - SkipJack
The details are below.
Hadoop Streaming
Hadoop offers two ways to execute a job:

- Run Java code on the slaves (Hadoop MR Tutorial)
- Execute arbitrary programs on the slaves via standard I/O (Hadoop Streaming Tutorial)

With the second method, Hadoop can be used from any language that can handle standard I/O (Hadoop Streaming).
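As a minimal illustration of the standard-I/O style (this word-count pair is my own sketch, not taken from the tutorials; the file names are just examples), the mapper and reducer can be ordinary Python scripts that read from stdin and write tab-separated key/value pairs to stdout:

```python
#!/usr/bin/env python
# mapper.py -- reads raw text lines from stdin, emits one "word<TAB>1" pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- the framework sorts mapper output by key, so counts can be
# accumulated word by word as lines arrive on stdin
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))
```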
So you don't have to use Mahout just because you're doing machine learning on Hadoop; you can implement it with your favorite libraries in Python, which is strong in machine learning.
For the general flow of setting up Hadoop, refer to Introduction to Hadoop and MapReduce with Python.
scikit-learn
This is the most widely used machine learning library implemented in Python. To use it you also need NumPy and SciPy, which are hard to install with pip alone, so I downloaded the Python 3 series of Anaconda, which bundles these libraries from the start, and installed it on all the slaves.
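To illustrate why the libraries have to be present on every slave, here is a hypothetical reducer that imports NumPy to average the values for each key; the script name and the averaging task are assumptions made for this example:

```python
#!/usr/bin/env python
# reducer_mean.py -- hypothetical reducer that averages the values for each key;
# it imports NumPy, so NumPy must be installed on every slave that runs it.
import sys
import numpy as np

def emit(key, values):
    print("%s\t%f" % (key, np.mean(values)))

current_key, values = None, []
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            emit(current_key, values)
        current_key, values = key, []
    values.append(float(value))
if current_key is not None:
    emit(current_key, values)
```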
SkipJack
With plain Hadoop Streaming, the hadoop commands had to be typed by hand every time, which was tedious. So I wrote a wrapper that, from a single Python run, performs

**decide the job to execute → run it on Hadoop → evaluate the result → decide the next job → repeat until a stopping condition is met**

If you implement the mapper, the reducer, and a result-evaluation method, you don't have to write the routine plumbing yourself. Internally it does nothing more complicated than running Hadoop commands: launching the job (run), placing files on HDFS (put), and reading back the results (cat).
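As a rough sketch of that idea (illustrative only, not SkipJack's actual API; the jar path, HDFS paths, and file names below are assumptions), the three steps can be wrapped with subprocess calls to the hadoop CLI:

```python
# driver_sketch.py -- illustrative sketch only, not SkipJack's actual API;
# the paths and file names below are assumptions.
import subprocess

STREAMING_JAR = "/path/to/hadoop-streaming.jar"  # adjust to your installation

def put(local_path, hdfs_path):
    # place an input file on HDFS
    subprocess.check_call(["hadoop", "fs", "-put", "-f", local_path, hdfs_path])

def run(mapper, reducer, hdfs_input, hdfs_output):
    # launch one Hadoop Streaming job with the given mapper/reducer scripts
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-files", "%s,%s" % (mapper, reducer),
        "-mapper", "python %s" % mapper,
        "-reducer", "python %s" % reducer,
        "-input", hdfs_input,
        "-output", hdfs_output,
    ])

def cat(hdfs_output):
    # read the job result back from HDFS as text
    return subprocess.check_output(
        ["hadoop", "fs", "-cat", hdfs_output + "/part-*"]).decode()

if __name__ == "__main__":
    put("input.txt", "/user/me/input.txt")
    run("mapper.py", "reducer.py", "/user/me/input.txt", "/user/me/wordcount_out")
    print(cat("/user/me/wordcount_out"))
```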
Two samples are included:

- WordCount plus a little extra
- Refinement using grid search (a rough sketch of this kind of loop follows below)
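To show what a grid-search loop over Hadoop jobs can look like (again, this is not SkipJack's actual code; the parameter names, HDFS paths, and the convention that the reducer prints a single score line are all assumptions), a driver can submit one Streaming job per parameter combination, read each result back, and keep the best one:

```python
# grid_search_sketch.py -- hypothetical grid-search driver, not SkipJack's code;
# parameter names, HDFS paths, and the "reducer prints one score line"
# convention are assumptions made for this example.
import subprocess
from itertools import product

STREAMING_JAR = "/path/to/hadoop-streaming.jar"

def run_job(params, hdfs_output):
    # submit one Streaming job, handing the parameters to the scripts
    # through environment variables (-cmdenv)
    cmd = ["hadoop", "jar", STREAMING_JAR,
           "-files", "train_mapper.py,train_reducer.py",
           "-mapper", "python train_mapper.py",
           "-reducer", "python train_reducer.py",
           "-input", "/user/me/train",
           "-output", hdfs_output]
    for name, value in params.items():
        cmd += ["-cmdenv", "%s=%s" % (name, value)]
    subprocess.check_call(cmd)
    # assume the reducer writes a single line ending with the score
    out = subprocess.check_output(
        ["hadoop", "fs", "-cat", hdfs_output + "/part-*"]).decode()
    return float(out.split()[-1])

param_grid = {"ALPHA": [0.01, 0.1, 1.0], "MAX_DEPTH": [3, 5, 7]}  # illustrative
best = None
for i, values in enumerate(product(*param_grid.values())):
    params = dict(zip(param_grid.keys(), values))
    score = run_job(params, "/user/me/grid/run%d" % i)
    if best is None or score > best[0]:
        best = (score, params)
print("best score %.4f with %s" % best)
```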