Install Jupyter on a Spark cluster launched with Amazon EMR, plus a summary of the points I got stuck on when using PySpark on it.
Environment used for this verification:
Applications: All Applications (Hadoop 2.6.0, Hive 1.0.0, Hue 3.7.1, Mahout 0.11.0, Pig 0.14.0, Spark 1.5.0)
Instance type: m3.xlarge
Number of instances: 1
Permissions: Default
Preparation
If Hue is included, it takes port 8888, so Jupyter can no longer use its default port 8888. In that case, open a different port in the security group so that Jupyter is reachable from your PC.
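If you prefer the AWS CLI to the console, something like the following should do it; the security group ID, port, and source CIDR below are placeholders of mine, not values from the original article.
# Allow an alternative Jupyter port (8889 as an example) from your own IP only; sg-xxxxxxxx and 203.0.113.10/32 are placeholders
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 8889 --cidr 203.0.113.10/32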
The EC2 instances launched by EMR come with Python 2.6.9 as the default, so switch to 2.7. Python 2.7 is already installed, so just repoint the symlink.
sudo unlink /usr/bin/python
sudo ln -s /usr/bin/python2.7 /usr/bin/python
Upgrade pip and repoint its symlink as well.
sudo pip install -U pip
sudo ln -s /usr/bin/pip-2.7 /usr/bin/pip
At the time of writing (October 2015), the following installs Jupyter 4.0.6.
sudo pip install jupyter
jupyter-notebook
Create a template configuration file (it is written to ~/.jupyter/jupyter_notebook_config.py).
jupyter notebook --generate-config
~/.jupyter/jupyter_notebook_config.py
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
If Hue is included, set c.NotebookApp.port to the non-8888 port you opened in the security group.
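For example, sticking with the 8889 example above (8889 is just my assumed port, not a value from the original article), the relevant line becomes:
c.NotebookApp.port = 8889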
Profiles seem to have gone away in Jupyter 4.x. You can instead specify a configuration file with the --config option. Example:
jupyter-notebook --config='~/.ipython/profile_nbservers/ipython_config.py'
If you set the environment variable JUPYTER_CONFIG_DIR to a directory path, the jupyter_notebook_config.py in that directory will be read.
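For example (the directory below is an assumed example of mine, not from the original article):
# Read jupyter_notebook_config.py from a non-default directory before launching
export JUPYTER_CONFIG_DIR=~/jupyter_emr_config
jupyter-notebook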
Change spark.master from yarn to local. (If you don't, the SparkContext stops working.)
/usr/lib/spark/conf/spark-defaults.conf
# spark.master yarn
spark.master local
Previously I did the Spark setup in ~/.ipython/profile_<profile name>/startup/00-<profile name>-setup.py, but that no longer works either, so now I set SPARK_HOME and run the following on the Jupyter Notebook.
export SPARK_HOME='/usr/lib/spark'
import os
import sys
# Pick up the SPARK_HOME set above
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
# Put PySpark and the bundled Py4J on the Python path
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Run the PySpark shell bootstrap, which creates sc (Python 2 execfile)
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
You could also save this snippet and load it from a file instead of pasting it into the notebook.
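Once shell.py has run, sc should be available in the notebook. A quick sanity check (my own example, not from the original article):
# Confirm that the SparkContext created by shell.py actually works
rdd = sc.parallelize(range(100))
print(rdd.filter(lambda x: x % 2 == 0).count())  # expect 50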