This is an updated version of the following article, revised now that Jupyter has moved to the 5.x series and the Docker images have been reorganized.
- Try using Jupyter's Docker image - Qiita
Jupyter began as IPython, a Python tool, but because it lets you share the whole process of data analysis on the Web, it has also come to be used with R, and more recently it integrates with Spark running on Hadoop infrastructure as the back end. As Numpy and pandas illustrate, the libraries needed for data analysis often have complicated dependencies and can be difficult to install, but Docker improves the portability of the execution environment. Also, since execution environments can be isolated per container, a mechanism for running Docker on Hadoop ([Docker Container Executor - Apache Hadoop 2.7.4](https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html)) is also being explored (though I think it is still a little early for practical use). The likely direction is to use lightweight images suited to each purpose, so this article summarizes the images as reorganized by the Jupyter project.
The Jupyter notebook server is intended for single-user use, so only one password can be set. Use JupyterHub if multiple people need access.
Jupyter's Docker images are managed as follows. There is a single *jupyter/docker-stacks* repository on GitHub, but multiple images are published on Docker Hub.
Image name | Description |
---|---|
base-notebook | Provides Jupyter Notebook 5.0.x. Scientific computing libraries are not included. |
minimal-notebook | Adds document conversion tools such as pandoc and texlive to base-notebook. |
scipy-notebook | Includes Python data analysis libraries such as pandas and scikit-learn. |
datascience-notebook | Adds R and Julia to scipy-notebook. R packages such as plyr are managed with conda. |
tensorflow-notebook | Adds TensorFlow to scipy-notebook. There is no GPU support. |
pyspark-notebook | Adds Spark 2.2.0 and Hadoop 2.7 to scipy-notebook. The Mesos 1.2 client is also included. |
all-spark-notebook | Adds R, Toree, and Spylon to pyspark-notebook. |
r-notebook | Adds R to minimal-notebook. Packages such as plyr are managed with conda. If you do not need Python or Julia, this can be set up lighter than datascience-notebook. |
The relationship between the images is easy to grasp from the Visual Overview diagram (`internal/inherit-diagram.png`) in the GitHub repository.
Since startup settings are managed by the base-notebook shell script, it is worth reading the README of the "Base Jupyter Notebook Stack" for the available options.
[off topic] With Toree and Spylon, you can write notebooks that use Spark from Scala. However, these projects are still at alpha-to-beta status, so take care about which version you use. If you are not tied to Jupyter and just want a notebook that uses Spark, Apache Zeppelin is also a good option.
Let's try *base-notebook*. You probably would not run production workloads on this image, but at roughly 600 MB it is just right for starting up quickly.
First, download the image with the pull command.
$ docker pull jupyter/base-notebook
$ docker images jupyter/base-notebook
REPOSITORY TAG IMAGE ID CREATED SIZE
jupyter/base-notebook latest 749eef0adf19 4 days ago 599MB
Start the container on the default port 8888. When it starts normally, an authentication token is printed to the console. Copy the URL and open it in a browser. Alternatively, you can open the web UI without the token in the URL and paste the value after *token=* into the input box.
$ docker run -it --rm -p 8888:8888 jupyter/base-notebook
(abridgement)
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=1898d7309d85940ac77acc59b15b4f9c572c96afdfc0e198
Next, to run in the background, start the container with different options. In this case nothing is printed to standard output after the command returns. You still need the authentication token to access the notebook, so check the startup logs with the *logs* command, passing the name given to the container at startup as the argument.
$ docker run -d --name basenb -p 8888:8888 jupyter/base-notebook
$ docker logs basenb
Digging the token out of the logs is hard to read later. The `jupyter` command can also list the running servers, so you can get clean output by entering the container with the *exec* command and running `jupyter notebook list`.
$ docker exec basenb jupyter notebook list
Currently running servers:
http://localhost:8888/?token=1898d7309d85940ac77acc59b15b4f9c572c96afdfc0e198 :: /home/jovyan
The security model of the Jupyter notebook server is described in the following documentation:
There are two methods: using an authentication token and using a password. The authentication token can be passed either in the HTTP *Authorization* header or in the *token* URL parameter. Token authentication is enabled by default, and the URL shown in the startup message is simply the form that passes the token via the *token* parameter.
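As a rough sketch, the header method can also be exercised from Python; the example below assumes the `requests` library is available and reuses the token printed at startup, and `/api/contents` is just one of the server's REST API endpoints:

```python
import requests

# Token value printed in the startup log
TOKEN = "1898d7309d85940ac77acc59b15b4f9c572c96afdfc0e198"

# Authenticate via the HTTP Authorization header instead of the ?token= URL parameter
resp = requests.get(
    "http://localhost:8888/api/contents",
    headers={"Authorization": "token " + TOKEN},
)
print(resp.status_code)   # 200 if the token is accepted, 403 otherwise
print(resp.json())        # JSON listing of the notebook server's root directory
```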
The behavior of enabling an authentication token by default was introduced in Jupyter 4.3. However, the release does not seem to have been fully documented, and combining the various options causes a lot of confusion.
When the server runs on your local machine and opens a browser automatically at startup, it visits a URL containing a one-time token and saves the resulting authorization in a cookie, so everything feels seamless. However, when starting from Docker or on a remote server, the automatic browser launch is disabled, so authentication takes noticeably more work.
The authentication token and password can be specified as startup options.
`--NotebookApp.token`
: Specifies a fixed string instead of the default auto-generated value. The configured value is not written to the startup log.
`--NotebookApp.password`
: Uses password authentication instead of the authentication token in the URL or header. The password string is hashed with the `notebook.auth.passwd` function.
In principle, you can disable authentication entirely by setting both to an empty string, but this is *strongly discouraged* unless you restrict access at another layer of your web application.
[off topic] Option handling is implemented in notebook/notebookapp.py, so reading the source code is the most accurate way to understand how the options relate to each other. The implementation uses Traitlets, so it helps to keep its reference open while reading.
The `notebook.auth.passwd` function is simply imported and called with no arguments. Enter the password you want to set, re-enter it for confirmation, and the hashed result is printed.
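For example, a minimal sketch of generating a hash in a notebook cell (the call prompts interactively for the password and a confirmation):

```python
from notebook.auth import passwd

# Prompts twice for the password and returns the salted hash string,
# e.g. 'sha1:<salt>:<digest>', which can then be pasted into the configuration.
hashed = passwd()
print(hashed)
```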
To save it as a configuration file, run `jupyter notebook password` or `python -m notebook.auth password`; the former is preferable. Enter the password you want to set twice, and the configuration file is saved as `.jupyter/jupyter_notebook_config.json`. Keeping it as a file means you can manage it by mounting the file from the host machine into the Docker container. The configuration can also be written as a Python file (a sketch of that form follows the JSON example below).
$ docker exec -it basenb jupyter notebook password
$ docker exec -it basenb cat .jupyter/jupyter_notebook_config.json
{
"NotebookApp": {
"password": "sha1:2165e2ddd92d:8131245514d60dd9eb91433af30bf1ccbbc36962"
}
}
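As mentioned above, the same setting can also be expressed as a Python configuration file. A minimal sketch of a `~/.jupyter/jupyter_notebook_config.py` equivalent to the JSON above, reusing the same hash value:

```python
# ~/.jupyter/jupyter_notebook_config.py
# Traitlets-style configuration equivalent to the JSON file shown above.
c = get_config()  # provided by Jupyter when it loads the config file
c.NotebookApp.password = 'sha1:2165e2ddd92d:8131245514d60dd9eb91433af30bf1ccbbc36962'
```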
Now restart the container so that the configuration file takes effect. When you reload the browser, a text box for entering the password is displayed; confirm that you can log in with the password set above.
$ docker stop basenb
$ docker start basenb
Also, because an authentication token is no longer used, the tokenized URL is not printed in the log, and the `jupyter notebook list` command no longer shows a token.
We have now switched from the default token authentication to password authentication.
In some environments it is inconvenient to check the log or mount a file every time you start the container. In that case, pass an authentication token or password hash as a startup option.
Use `start-notebook.sh` when starting the container and specify the options described above.
When giving an authentication token:
$ docker run -d --name basenb-token -p 8080:8888 jupyter/base-notebook start-notebook.sh --NotebookApp.token=foobarbaz
When giving a password hash:
$ docker run -d --name basenb-passwd -p 8088:8888 jupyter/base-notebook start-notebook.sh --NotebookApp.password=sha1:2165e2ddd92d:8131245514d60dd9eb91433af30bf1ccbbc36962
If you want to give both an authentication token and a password hash, run with both options.
Basically it is better to manage access with a password, but if you run in a cloud environment and want to share access easily with just a URL, the token method is also fine. In either case, follow your environment's security policy regarding restrictions on source IP addresses.
For SSL and other settings, refer to the following documentation.
Now that we have sorted out the startup issues, let's draw some graphs using a different Docker image.
Download the *scipy-notebook* image and check its size.
$ docker pull jupyter/scipy-notebook
$ docker images jupyter/scipy-notebook
REPOSITORY TAG IMAGE ID CREATED SIZE
jupyter/scipy-notebook latest 092599e85093 5 days ago 3.91GB
Launch in the background to see the token.
$ docker run -d --name scipynb -p 8888:8888 jupyter/scipy-notebook
$ docker exec scipynb jupyter notebook list
Access the notebook server with a web browser and create a new Python 3 notebook.
First, execute the standard module loading as shown below.
Load Python modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
Next, generate trigonometric sine and cosine data.
Definition of sin/cos
import math
cycle = math.pi * 2
x = np.linspace(-1 * cycle, cycle, 100)
y1 = np.sin(x)
y2 = np.cos(x)
Let's draw with the matplotlib API. Specify the two data series, label the X and Y axes, and give the graph a title. If you want to use Japanese, place a Japanese font file in the container. [^1]
[^1]: See this article for how to label graph axes in Japanese: Japanese settings for matplotlib and Seaborn axes - Qiita
Draw with matplotlib API
sns.set_style('whitegrid')
plt.plot(x, y1, color='red', linewidth=2, label='sin')
plt.plot(x, y2, color='blue', linewidth=1, label='cos')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.title('sin/cos curve')
plt.legend()
plt.show()
Draw the same thing with the pandas DataFrame API. Plotting itself is easy, but the returned `matplotlib.AxesSubplot` is used for axis settings and the like. To make the difference from the graph above easier to see, the Seaborn style is switched to darker grid lines.
Draw with DataFrame API
df = pd.DataFrame({'sin': y1, 'cos': y2}, index=x)
sns.set_style('darkgrid')
ax = df.plot(title='sin/cos curve')
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
Next, let's draw a stock chart using Bokeh. Stock price data is obtained from Yahoo! using *pandas-datareader*, the package that was split out of *pandas.io* into an independent project.
Use pip to install the module. Prefix the command with "!" to execute it from the notebook.
Install the module using pip
!pip install pandas-datareader
Get the data by specifying the date.
Get data with API
import datetime
import pandas_datareader.data as web
start = datetime.date(2014, 4, 1)
end = datetime.date.today()
stocks = web.DataReader("^N225", 'yahoo', start, end)
stocks.head(5)
Enable Bokeh on your notebook.
Enable Bokeh
import bokeh.plotting as bplt
bplt.output_notebook()
Draw a graph. Bokeh renders in a browser using JavaScript, so you can use Japanese without putting Japanese fonts on the container side.
Draw a graph with Bokeh
p = bplt.figure(title='Nikkei average', x_axis_type='datetime', plot_width=640, plot_height=320)
p.segment(stocks.index, stocks.Open, stocks.index, stocks.Close, color='black')
bplt.show(p)
Let's also use the PySpark image. Additional settings are required to connect to a running Spark cluster, but standalone operation is often enough for simple checks of the API.
Download the *pyspark-notebook* image and check its size.
$ docker pull jupyter/pyspark-notebook
$ docker images jupyter/pyspark-notebook
REPOSITORY TAG IMAGE ID CREATED SIZE
jupyter/pyspark-notebook latest 26c919b64b68 2 days ago 4.46GB
Launch in the background to see the token.
$ docker run -d --name pysparknb -p 8888:8888 jupyter/pyspark-notebook
$ docker exec pysparknb jupyter notebook list
Access the notebook server with a web browser and create a new Python 3 notebook.
Run through the official documentation from top to bottom. Since the Spark source tree is not cloned here, the sample data is downloaded separately.
Build a Spark session
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
Download the file using the `wget` command.
Download data
!wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/resources/people.json
Load the data using the DataFrame API. Display the data and its schema, and confirm that you can select data by specifying a column.
Data reading
df = spark.read.json("people.json")
df.show()
df.printSchema()
df.select("name").show()
Also make sure you can use SQL; it is often easier for simple aggregations (see the sketch after the example below).
Use of SQL
df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
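For instance, a simple aggregation reads naturally in SQL. A sketch assuming the people.json sample loaded above (which contains name and age columns):

```python
# Count people per age group using the temporary view registered above
agg_df = spark.sql("SELECT age, COUNT(*) AS n FROM people GROUP BY age ORDER BY n DESC")
agg_df.show()
```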
Let's also try the TensorFlow image. For trying out the TensorFlow Core API, a GPU is not necessary.
Download the *tensorflow-notebook* image and check its size.
$ docker pull jupyter/tensorflow-notebook
$ docker images jupyter/tensorflow-notebook
REPOSITORY TAG IMAGE ID CREATED SIZE
jupyter/tensorflow-notebook latest 374b8bc43218 3 days ago 4.5GB
Launch in the background to see the token.
$ docker run -d --name tfnb -p 8888:8888 jupyter/tensorflow-notebook
$ docker exec tfnb jupyter notebook list
Access the notebook server with a web browser and create a new Python 3 notebook.
Follow the official documentation to see how the API works.
First, import the module and define the constant node.
import tensorflow as tf
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0)
node1, node2
Define a session and evaluate the nodes. Pass a list of nodes to the session's *run()* method.
sess = tf.Session()
sess.run([node1, node2])
Define a node that applies the add operation to the two constant nodes and evaluate the node.
node3 = tf.add(node1, node2)
sess.run(node3)
Instead of constant nodes, we can now supply inputs at evaluation time. TensorFlow uses a node type called *placeholder* for this. Note that `a + b` is a shortcut for `tf.add(a, b)`.
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b
Specify the input values in the *feed_dict* argument when calling the session's *run()* method.
sess.run(adder_node, {a: 3, b: 4.5})
If you pass the arguments as lists, the addition is applied to each element. It is convenient that scalars and vectors can be handled in the same way.
sess.run(adder_node, {a: [1, 3], b: [2, 4]})
The flow of connecting nodes and evaluating them in a session is the same for other operations. However, floating-point rounding may differ from what you expect in decimal calculations, so be careful when verifying behavior with a simple calculation.
add_and_triple = adder_node * 3.
sess.run(add_and_triple, {a: [1, 2], b: [3.4, 5.6]})
The above is summarized in a notebook as follows.
From here the official documentation moves on to machine learning: defining variables to build a linear model, feeding it training data, and evaluating the results. You can also try MNIST image recognition with the *tensorflow.examples.tutorials.mnist* module, so it is worth reading through the tutorials in order.
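As a rough sketch of that next step, following the TensorFlow getting-started tutorial (assuming the TensorFlow 1.x API shipped in this image), a linear model is defined with variables and trained by gradient descent:

```python
import tensorflow as tf

# Trainable parameters and placeholders for the input and target data
W = tf.Variable([0.3], dtype=tf.float32)
b = tf.Variable([-0.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

linear_model = W * x + b
loss = tf.reduce_sum(tf.square(linear_model - y))  # sum of squared errors

train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for _ in range(1000):
    sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})

print(sess.run([W, b]))  # converges toward W = -1, b = 1
```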
Now that the Jupyter Docker images have been reorganized, I tried *base-notebook*, *scipy-notebook*, *pyspark-notebook*, and *tensorflow-notebook*.
The number of toolsets used in data science keeps growing, and managing each of them can be difficult, but Docker images remove much of the complexity of initial setup. Large-scale deployment and performance tuning remain concerns, but cloud environments such as AWS and GCP are making containers easier to use, so it is now straightforward to build a highly portable, reproducible environment. It is good that the range of choices keeps expanding, whether you use the public images as they are for a given purpose or extend them yourself.