Previously, I put Dist-Keras on Docker to build a scalable deep learning environment. http://qiita.com/cvusk/items/3e6c3bade8c0e1c0d9bf
My takeaway at the time was that performance fell short, but on closer review it turned out the parameter settings were wrong. So I revisited the setup and tried various things.
For a description of Dist-Keras itself, please refer to the previous post; in short, it is Keras running on a Spark cluster. I made it a Docker image to make it easier to scale out.
The Dockerfile is available on GitHub. https://github.com/shibuiwilliam/distkeras-docker
This time, I would like to verify Dist-Keras on Docker on both a single host and multiple hosts to improve performance. Last time, I launched multiple containers on a single host to form a Spark cluster. This time I test more patterns.
The workload is MNIST; the training program is customized from the one provided by dist-keras.
I validate on a single host and on multiple hosts. Both are configured as Spark master + workers, varying the number of workers and the resources per worker. The multi-host case uses two servers. Each host is an AWS EC2 m4.xlarge running CentOS 7.3.
no. | hosts | workers | resources per worker |
---|---|---|---|
1 | single | 1 | 1 processor, 2GB RAM |
2 | single | 2 | 2 processors, 5GB RAM |
3 | single | 3 | 1 processor, 3GB RAM |
4 | multihost | 2 | 2 processors, 5GB RAM |
5 | multihost | 2 | 3 processors, 8GB RAM |
6 | multihost | 4 | 2 processors, 5GB RAM |
The single-host configuration is as follows: multiple Docker containers are started on the same host, and the number of containers varies with the verification conditions.
# docker dist-keras for spark master and slave
docker run -it -p 18080:8080 -p 17077:7077 -p 18888:8888 -p 18081:8081 -p 14040:4040 -p 17001:7001 -p 17002:7002 \
-p 17003:7003 -p 17004:7004 -p 17005:7005 -p 17006:7006 --name spmaster -h spmaster distkeras /bin/bash
# docker dist-keras for spark slave1
docker run -it --link spmaster:master -p 28080:8080 -p 27077:7077 -p 28888:8888 -p 28081:8081 -p 24040:4040 -p 27001:7001 \
-p 27002:7002 -p 27003:7003 -p 27004:7004 -p 27005:7005 -p 27006:7006 --name spslave1 -h spslave1 distkeras /bin/bash
# docker dist-keras for spark slave2
docker run -it --link spmaster:master -p 38080:8080 -p 37077:7077 -p 38888:8888 -p 38081:8081 -p 34040:4040 -p 37001:7001 \
-p 37002:7002 -p 37003:7003 -p 37004:7004 -p 37005:7005 -p 37006:7006 --name spslave2 -h spslave2 distkeras /bin/bash
The Spark master container runs both the Spark master and a worker, while the Spark slave containers run a worker only.
# for spark master: start the Spark master process
${SPARK_HOME}/sbin/start-master.sh
# for spark worker (on master and slaves): -c sets cores, -m sets memory; adjust to the verification conditions
${SPARK_HOME}/sbin/start-slave.sh -c 1 -m 3G spark://spmaster:${SPARK_MASTER_PORT}
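To confirm that each worker has registered, check the Spark master web UI; container port 8080 is published as host port 18080 in the docker run commands above. A quick check from the Docker host (assuming curl is available there):
# the master UI (container port 8080) is mapped to host port 18080
curl -s http://localhost:18080 | grep -i worker
# or open http://<host>:18080 in a browser and check the Workers table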
Then customize the MNIST program on the Spark master. MNIST sample code is provided by Dist-Keras in /opt/dist-keras/examples, which contains the following sample data and programs.
[root@spm examples]# tree
.
|-- cifar-10-preprocessing.ipynb
|-- data
| |-- atlas_higgs.csv
| |-- mnist.csv
| |-- mnist.zip
| |-- mnist_test.csv
| `-- mnist_train.csv
|-- example_0_data_preprocessing.ipynb
|-- example_1_analysis.ipynb
|-- kafka_producer.py
|-- kafka_spark_high_throughput_ml_pipeline.ipynb
|-- mnist.ipynb
|-- mnist.py
|-- mnist_analysis.ipynb
|-- mnist_preprocessing.ipynb
|-- spark-warehouse
`-- workflow.ipynb
Back up the original file, then apply the following changes.
cp mnist.py mnist.py.bk
Add the following at the beginning.
from pyspark.sql import SparkSession
Change the Spark parameters for this environment. The intent of the changes is as follows:
- Use Spark 2
- Use the local environment
- Define the master URL for the local environment
- Change the number of processors to match the verification conditions
- Change the number of workers to match the verification conditions
# Modify these variables according to your needs.
application_name = "Distributed Keras MNIST"
using_spark_2 = True  # changed from False to True
local = True  # changed from False to True
path_train = "data/mnist_train.csv"
path_test = "data/mnist_test.csv"
if local:
    # Tell master to use local resources.
    # master = "local[*]"  # commented out
    master = "spark://spm:7077"  # added: the Spark master started above
    num_processes = 1  # set to the number of processors per worker
    num_executors = 3  # set to the number of workers
else:
    # Tell master to use YARN.
    master = "yarn-client"
    num_executors = 20
    num_processes = 1
Change the worker memory to match the verification conditions.
conf = SparkConf()
conf.set("spark.app.name", application_name)
conf.set("spark.master", master)
conf.set("spark.executor.cores", str(num_processes))  # str() instead of Python 2 backticks
conf.set("spark.executor.instances", str(num_executors))
conf.set("spark.executor.memory", "4g")  # change RAM size to match the verification conditions
conf.set("spark.locality.wait", "0")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
You can then run it on the Spark master with python mnist.py.
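The verification below measures time to completion; one simple way to record it (my own approach, not necessarily how the original numbers were taken) is the shell's time builtin:
cd /opt/dist-keras/examples
time python mnist.py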
The multi-host configuration is as follows. Multi-host requires the Docker containers to be connected via an overlay network. Please refer to the following for details on how to build a Docker network across multiple hosts. http://knowledge.sakura.ad.jp/knowledge/4786/ http://christina04.hatenablog.com/entry/2016/05/16/065853
I will write only my procedure here. Prepare host1 and host2 on EC2, install etcd on host1 and start it.
yum -y install etcd
vi /etc/etcd/etcd.conf
systemctl enable etcd
systemctl start etcd
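The exact edits to /etc/etcd/etcd.conf depend on the environment. As a minimal sketch, assuming etcd only needs to accept client connections from host2 on port 2379, the relevant keys look like this (<host1> is host1's private IP):
# /etc/etcd/etcd.conf (sketch)
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://<host1>:2379"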
Next, add docker-network settings to both host1 and host2.
# edit docker-network file
vi /etc/sysconfig/docker-network
# for host1
DOCKER_NETWORK_OPTIONS='--cluster-store=etcd://<host1>:2379 --cluster-advertise=<host1>:2376'
# for host2
DOCKER_NETWORK_OPTIONS='--cluster-store=etcd://<host1>:2379 --cluster-advertise=<host2>:2376'
# from host2 to ensure network connection to host1 etcd is available
curl -L http://<host1>:2379/version
{"etcdserver":"3.1.3","etcdcluster":"3.1.0"}
Now that the Docker daemons can reach etcd, create a Docker network from host1. Here, I create an overlay network called test1 on subnet 10.0.1.0/24.
# for host1
docker network create --subnet=10.0.1.0/24 -d overlay test1
Finally, run docker network ls; it's OK if the test1 network has been added.
NETWORK ID NAME DRIVER SCOPE
feb90a5a5901 bridge bridge local
de3c98c59ba6 docker_gwbridge bridge local
d7bd500d1822 host host local
d09ac0b6fed4 none null local
9d4c66170ea0 test1 overlay global
Then attach Docker containers to the test1 network, deploying one container on each of host1 and host2.
# for host1 as spark master
docker run -it --net=test1 --ip=10.0.1.10 -p 18080:8080 -p 17077:7077 -p 18888:8888 -p 18081:8081 -p 14040:4040 -p 17001:7001 -p 17002:7002 \
-p 17003:7003 -p 17004:7004 -p 17005:7005 -p 17006:7006 --name spm -h spm distkeras /bin/bash
# for host2 as spark slave
docker run -it --net=test1 --ip=10.0.1.20 --link=spm:master -p 28080:8080 -p 27077:7077 -p 28888:8888 -p 28081:8081 -p 24040:4040 -p 27001:7001 \
-p 27002:7002 -p 27003:7003 -p 27004:7004 -p 27005:7005 -p 27006:7006 --name sps1 -h sps1 distkeras /bin/bash
You have now deployed two Docker containers on the test1 network with multiple hosts.
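To confirm that the overlay network works across hosts, check that the containers can reach each other (a quick check, assuming ping is available in the image):
# from host1: the slave container's fixed IP (10.0.1.20) should be reachable from the master container
docker exec spm ping -c 3 10.0.1.20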
After that, start the Spark master and workers with the same procedure as on a single host, edit mnist.py, and run python mnist.py.
This is the result of verifying the performance of each configuration, measuring the time (in seconds) each run took to complete.
no. | hosts | workers | resources per worker | time (seconds) |
---|---|---|---|---|
1 | single | 1 | 1 processor, 2GB RAM | 1615.63757 |
2 | single | 2 | 2 processors, 5GB RAM | 1418.56935 |
3 | single | 3 | 1 processor, 3GB RAM | 1475.84212 |
4 | multihost | 2 | 2 processors, 5GB RAM | 805.382518 |
5 | multihost | 2 | 3 processors, 8GB RAM | 734.290324 |
6 | multihost | 4 | 2 processors, 5GB RAM | 723.878466 |
Performance is better with multi-host. I think the difference simply comes down to the amount of free resources. Verifications 2 and 4 use the same Docker container configuration and the same worker resources, yet still differ by roughly 600 seconds. Comparing verification 1 with 2, or verification 4 with 5 and 6, the number of Spark workers and the amount of resources by themselves do not seem to make a big difference. If you want a significant performance improvement, going multi-host is the straightforward choice.
[2017/05/26 postscript] I clustered it with Kubernetes. http://qiita.com/cvusk/items/42a5ffd4e3228963234d