Install Jupyter on a Spark cluster launched with Amazon EMR, plus a summary of the points I got stuck on when using PySpark on it.
Environment used for this verification:
Applications: All Applications (Hadoop 2.6.0, Hive 1.0.0, Hue 3.7.1, Mahout 0.11.0, Pig 0.14.0, Spark 1.5.0)
Instance type: m3.xlarge
Number of instances: 1
Permissions: Default
Preparation
If Hue is included, it takes port 8888, so Jupyter can no longer use its default port 8888. In that case, open a different port in the security group so that Jupyter is reachable from your PC.
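If you prefer the AWS CLI to the console, something like the following should do it; the security group ID, port, and source CIDR below are placeholders of mine, not values from the original article.
# Allow an alternative Jupyter port (8889 as an example) from your own IP only; sg-xxxxxxxx and 203.0.113.10/32 are placeholders
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 8889 --cidr 203.0.113.10/32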
The EC2 instances launched by EMR come with Python 2.6.9 as the default, so switch to 2.7. Python 2.7 is already installed, so just repoint the symlink.
sudo unlink /usr/bin/python
sudo ln -s /usr/bin/python2.7 /usr/bin/python
Upgrade pip and repoint its symlink as well.
sudo pip install -U pip
sudo ln -s /usr/bin/pip-2.7 /usr/bin/pip
At the time of writing (October 2015), the following installs Jupyter 4.0.6.
sudo pip install jupyter
jupyter-notebook
Create a template configuration file (it is written to ~/.jupyter/jupyter_notebook_config.py).
jupyter notebook --generate-config
~/.jupyter/jupyter_notebook_config.py
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
If Hue is included, set c.NotebookApp.port to the non-8888 port you opened in the security group.
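For example, sticking with the 8889 example above (8889 is just my assumed port, not a value from the original article), the relevant line becomes:
c.NotebookApp.port = 8889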
Profiles seem to have gone away in Jupyter 4.x. You can instead specify a configuration file with the --config option. Example:
jupyter-notebook --config='~/.ipython/profile_nbservers/ipython_config.py'
If you set the environment variable JUPYTER_CONFIG_DIR to a directory path, the jupyter_notebook_config.py in that directory will be read.
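For example (the directory below is an assumed example of mine, not from the original article):
# Read jupyter_notebook_config.py from a non-default directory before launching
export JUPYTER_CONFIG_DIR=~/jupyter_emr_config
jupyter-notebook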
Change spark.master from yarn to local. (If you don't, the SparkContext stops working.)
/usr/lib/spark/conf/spark-defaults.conf
# spark.master yarn
spark.master local
Previously I did the Spark setup in ~/.ipython/profile_<profile name>/startup/00-<profile name>-setup.py, but that no longer works either, so now I set SPARK_HOME and run the following on the Jupyter Notebook.
export SPARK_HOME='/usr/lib/spark'
import os
import sys
# Pick up the SPARK_HOME set above
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
# Put PySpark and the bundled Py4J on the Python path
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Run the PySpark shell bootstrap, which creates sc (Python 2 execfile)
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
You could also save this snippet and load it from a file instead of pasting it into the notebook.
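Once shell.py has run, sc should be available in the notebook. A quick sanity check (my own example, not from the original article):
# Confirm that the SparkContext created by shell.py actually works
rdd = sc.parallelize(range(100))
print(rdd.filter(lambda x: x % 2 == 0).count())  # expect 50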