Previously, I put Dist-Keras on Docker to build a scalable deep learning environment. http://qiita.com/cvusk/items/3e6c3bade8c0e1c0d9bf
My takeaway at the time was that performance fell short, but on closer review it turned out the parameter settings were wrong. So I revisited the setup and tried various things.
For a description of Dist-Keras itself, please refer to the previous post; in short, it is Keras running on a Spark cluster. I made it a Docker image to make it easier to scale out.
The Dockerfile is available on GitHub. https://github.com/shibuiwilliam/distkeras-docker
This time, I would like to verify Dist-Keras on Docker on both a single host and multiple hosts to improve performance. Last time, I launched multiple containers on a single host to form a Spark cluster. This time I test more patterns.
The workload is MNIST; the training program is customized from the one provided by dist-keras.
I validate on a single host and on multiple hosts. Both are configured as Spark master + workers, varying the number of workers and the resources per worker. The multi-host case uses two servers. Each host is an AWS EC2 m4.xlarge running CentOS 7.3.
no. | hosts | workers | resources per worker |
---|---|---|---|
1 | single | 1 | 1 processor, 2GB RAM |
2 | single | 2 | 2 processors, 5GB RAM |
3 | single | 3 | 1 processor, 3GB RAM |
4 | multihost | 2 | 2 processors, 5GB RAM |
5 | multihost | 2 | 3 processors, 8GB RAM |
6 | multihost | 4 | 2 processors, 5GB RAM |
The single-host configuration is as follows: multiple Docker containers are started on the same host, and the number of containers varies with the verification conditions.
# docker dist-keras for spark master and slave
docker run -it -p 18080:8080 -p 17077:7077 -p 18888:8888 -p 18081:8081 -p 14040:4040 -p 17001:7001 -p 17002:7002 \
-p 17003:7003 -p 17004:7004 -p 17005:7005 -p 17006:7006 --name spmaster -h spmaster distkeras /bin/bash
# docker dist-keras for spark slave1
docker run -it --link spmaster:master -p 28080:8080 -p 27077:7077 -p 28888:8888 -p 28081:8081 -p 24040:4040 -p 27001:7001 \
-p 27002:7002 -p 27003:7003 -p 27004:7004 -p 27005:7005 -p 27006:7006 --name spslave1 -h spslave1 distkeras /bin/bash
# docker dist-keras for spark slave2
docker run -it --link spmaster:master -p 38080:8080 -p 37077:7077 -p 38888:8888 -p 38081:8081 -p 34040:4040 -p 37001:7001 \
-p 37002:7002 -p 37003:7003 -p 37004:7004 -p 37005:7005 -p 37006:7006 --name spslave2 -h spslave2 distkeras /bin/bash
The Spark master container runs both the Spark master and a worker, while the Spark slave containers run a worker only.
# for spark master: start the Spark master process
${SPARK_HOME}/sbin/start-master.sh
# for spark worker (on master and slaves): -c sets cores, -m sets memory; adjust to the verification conditions
${SPARK_HOME}/sbin/start-slave.sh -c 1 -m 3G spark://spmaster:${SPARK_MASTER_PORT}
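To confirm that each worker has registered, check the Spark master web UI; container port 8080 is published as host port 18080 in the docker run commands above. A quick check from the Docker host (assuming curl is available there):
# the master UI (container port 8080) is mapped to host port 18080
curl -s http://localhost:18080 | grep -i worker
# or open http://<host>:18080 in a browser and check the Workers table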
Then customize the MNIST program on the Spark master. MNIST sample code is provided by Dist-Keras in /opt/dist-keras/examples, which contains the following sample data and programs.
[root@spm examples]# tree
.
|-- cifar-10-preprocessing.ipynb
|-- data
| |-- atlas_higgs.csv
| |-- mnist.csv
| |-- mnist.zip
| |-- mnist_test.csv
| `-- mnist_train.csv
|-- example_0_data_preprocessing.ipynb
|-- example_1_analysis.ipynb
|-- kafka_producer.py
|-- kafka_spark_high_throughput_ml_pipeline.ipynb
|-- mnist.ipynb
|-- mnist.py
|-- mnist_analysis.ipynb
|-- mnist_preprocessing.ipynb
|-- spark-warehouse
`-- workflow.ipynb
Back up the original file, then apply the following changes.
cp mnist.py mnist.py.bk
Add the following at the beginning.
from pyspark.sql import SparkSession
Change the Spark parameters for this environment. The intent of the changes is as follows:
- Use Spark 2
- Use the local environment
- Define the master URL for the local environment
- Change the number of processors to match the verification conditions
- Change the number of workers to match the verification conditions
# Modify these variables according to your needs.
application_name = "Distributed Keras MNIST"
using_spark_2 = True  # changed from False to True
local = True  # changed from False to True
path_train = "data/mnist_train.csv"
path_test = "data/mnist_test.csv"
if local:
    # Tell master to use local resources.
    # master = "local[*]"  # commented out
    master = "spark://spm:7077"  # added: the Spark master started above
    num_processes = 1  # set to the number of processors per worker
    num_executors = 3  # set to the number of workers
else:
    # Tell master to use YARN.
    master = "yarn-client"
    num_executors = 20
    num_processes = 1
Change the worker memory to match the verification conditions.
conf = SparkConf()
conf.set("spark.app.name", application_name)
conf.set("spark.master", master)
conf.set("spark.executor.cores", str(num_processes))  # str() instead of Python 2 backticks
conf.set("spark.executor.instances", str(num_executors))
conf.set("spark.executor.memory", "4g")  # change RAM size to match the verification conditions
conf.set("spark.locality.wait", "0")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
You can then run it on the Spark master with python mnist.py.
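The verification below measures time to completion; one simple way to record it (my own approach, not necessarily how the original numbers were taken) is the shell's time builtin:
cd /opt/dist-keras/examples
time python mnist.py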
The multi-host configuration is as follows. Multi-host requires the Docker containers to be connected via an overlay network. Please refer to the following for details on how to build a Docker network across multiple hosts. http://knowledge.sakura.ad.jp/knowledge/4786/ http://christina04.hatenablog.com/entry/2016/05/16/065853
I will write only my procedure here. Prepare host1 and host2 on EC2, install etcd on host1 and start it.
yum -y install etcd
vi /etc/etcd/etcd.conf
systemctl enable etcd
systemctl start etcd
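The exact edits to /etc/etcd/etcd.conf depend on the environment. As a minimal sketch, assuming etcd only needs to accept client connections from host2 on port 2379, the relevant keys look like this (<host1> is host1's private IP):
# /etc/etcd/etcd.conf (sketch)
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://<host1>:2379"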
Next, add docker-network settings to both host1 and host2.
# edit docker-network file
vi /etc/sysconfig/docker-network
# for host1
DOCKER_NETWORK_OPTIONS='--cluster-store=etcd://<host1>:2379 --cluster-advertise=<host1>:2376'
# for host2
DOCKER_NETWORK_OPTIONS='--cluster-store=etcd://<host1>:2379 --cluster-advertise=<host2>:2376'
# from host2 to ensure network connection to host1 etcd is available
curl -L http://<host1>:2379/version
{"etcdserver":"3.1.3","etcdcluster":"3.1.0"}
Now that the Docker daemons can reach etcd, create a Docker network from host1. Here, I create an overlay network called test1 on subnet 10.0.1.0/24.
# for host1
docker network create --subnet=10.0.1.0/24 -d overlay test1
Finally, run docker network ls; it's OK if the test1 network has been added.
NETWORK ID NAME DRIVER SCOPE
feb90a5a5901 bridge bridge local
de3c98c59ba6 docker_gwbridge bridge local
d7bd500d1822 host host local
d09ac0b6fed4 none null local
9d4c66170ea0 test1 overlay global
Then attach Docker containers to the test1 network, deploying one container on each of host1 and host2.
# for host1 as spark master
docker run -it --net=test1 --ip=10.0.1.10 -p 18080:8080 -p 17077:7077 -p 18888:8888 -p 18081:8081 -p 14040:4040 -p 17001:7001 -p 17002:7002 \
-p 17003:7003 -p 17004:7004 -p 17005:7005 -p 17006:7006 --name spm -h spm distkeras /bin/bash
# for host2 as spark slave
docker run -it --net=test1 --ip=10.0.1.20 --link=spm:master -p 28080:8080 -p 27077:7077 -p 28888:8888 -p 28081:8081 -p 24040:4040 -p 27001:7001 \
-p 27002:7002 -p 27003:7003 -p 27004:7004 -p 27005:7005 -p 27006:7006 --name sps1 -h sps1 distkeras /bin/bash
You have now deployed two Docker containers on the test1 network with multiple hosts.
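To confirm that the overlay network works across hosts, check that the containers can reach each other (a quick check, assuming ping is available in the image):
# from host1: the slave container's fixed IP (10.0.1.20) should be reachable from the master container
docker exec spm ping -c 3 10.0.1.20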
After that, start the Spark master and workers with the same procedure as on a single host, edit mnist.py, and run python mnist.py.
This is the result of verifying the performance of each configuration, measuring the time (in seconds) each run took to complete.
no. | hosts | workers | resources per worker | time (seconds) |
---|---|---|---|---|
1 | single | 1 | 1 processor, 2GB RAM | 1615.63757 |
2 | single | 2 | 2 processors, 5GB RAM | 1418.56935 |
3 | single | 3 | 1 processor, 3GB RAM | 1475.84212 |
4 | multihost | 2 | 2 processors, 5GB RAM | 805.382518 |
5 | multihost | 2 | 3 processors, 8GB RAM | 734.290324 |
6 | multihost | 4 | 2 processors, 5GB RAM | 723.878466 |
Performance is better with multi-host. I think the difference simply comes down to the amount of free resources. Verifications 2 and 4 use the same Docker container configuration and the same worker resources, yet still differ by roughly 600 seconds. Comparing verification 1 with 2, or verification 4 with 5 and 6, the number of Spark workers and the amount of resources by themselves do not seem to make a big difference. If you want a significant performance improvement, going multi-host is the straightforward choice.
[2017/05/26 postscript] I clustered it with Kubernetes. http://qiita.com/cvusk/items/42a5ffd4e3228963234d