An IPython cluster starts up locally as soon as ipyparallel is installed, but that's not very interesting, so I put the distributed engines on AWS. This is a memo of the procedure.
Cluster
The setup described here has the composition shown below: a Client on the local machine talks to a Controller on AWS, which launches Engines on other AWS hosts. Since the hosts are connected via SSH, the connection information for each of them has to be set up in advance. In practice, it would be easier to take the trouble to prepare an AMI that works as an Engine node simply by launching it.
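Roughly, the composition I have in mind looks like this (host names are placeholders):

Client (local machine)
  | SSH
Controller (<CONTROLLER_HOST> on EC2)
  | SSH
Engine x 3 (<ENGINE_HOST1>)    Engine x 3 (<ENGINE_HOST2>)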
Create instances from Amazon Linux as appropriate.
Here I used `Amazon Linux AMI 2015.09 (HVM)` and created t2.micro instances in the Tokyo region.
Once the instances are up, install `ipython` and `ipyparallel`:
sudo yum groupinstall -y 'Development Tools'
sudo pip install ipython ipyparallel
The Engines are started via SSH, so make sure you can ssh from the Controller to each Engine server as ec2-user without a password.
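As one way to set this up (my own sketch, not part of the original procedure, and assuming you can already reach the Engine hosts with your EC2 key pair):

# On the Controller: create a key pair without a passphrase
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
# Append the public key to ec2-user's authorized_keys on each Engine host
cat ~/.ssh/id_rsa.pub | ssh ec2-user@<ENGINE_HOST1> 'cat >> ~/.ssh/authorized_keys'
# Confirm that login now happens without a password prompt
ssh ec2-user@<ENGINE_HOST1> hostname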
Next, create a profile for the cluster on the Controller server.
ipython profile create --parallel --profile=myprofile
This generates files like the following.
tree .ipython/profile_myprofile/
~/.ipython/profile_myprofile/
├── ipcluster_config.py
├── ipcontroller_config.py
├── ipengine_config.py
├── ipython_config.py
├── ipython_kernel_config.py
├── log
├── pid
├── security
└── startup
└── README
Now we'll configure it for cluster operation.
ipcluster_config.py
Let's add the following code at the beginning.
ipcluster_config.py
c = get_config()

# Launch the engines over SSH
c.IPClusterEngines.engine_launcher_class = 'SSHEngineSetLauncher'

# Engine host -> number of engines to start there
# (keys can also be written as 'user@host')
c.SSHEngineSetLauncher.engines = {
    '<ENGINE_HOST1>': 3,
    '<ENGINE_HOST2>': 3,
}
ipcontroller_config.py
Find the corresponding commented-out lines inside and change them as follows.
ipcontroller_config.py
...(omitted)...
c.IPControllerApp.reuse_files = True
...(omitted)...
c.RegistrationFactory.ip = u'*'
...(omitted)...
If `reuse_files` is set to True, the files under `.ipython/profile_<PROFILE_NAME>/security/` are reused. What reuse means is that the cluster key is not regenerated, so you don't have to redistribute the connection file every time you restart the controller.
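For reference, after the controller has started once, that directory holds the two connection files that get handed out (the same ones that appear in the logs below):

ls ~/.ipython/profile_myprofile/security/
ipcontroller-client.json  ipcontroller-engine.json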
Execute the following command on the controller server.
ipcluster start --profile=myprofile
If you want more detailed output, add the `--debug` option. The log files are under `.ipython/profile_myprofile/log/`.
If it works, after the controller starts you will see log output indicating that `ipcontroller-client.json` and `ipcontroller-engine.json` have been transferred to each node. If the controller fails to start, the `ipcontroller-*` files are not generated. (In that case, it casually logs that the controller has stopped, and then complains that `ipcontroller-client.json` cannot be transferred to each Engine node. Not very friendly.)
2015-10-04 11:15:21.568 [IPClusterStart] Engines appear to have started successfully
When a log line like this appears, startup is complete.
If you want to run it as a daemon, add the `--daemonize` option.
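For what it's worth, a running cluster can be stopped with the matching subcommand:

ipcluster stop --profile=myprofile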
From here on, we work on the local machine (the Client).
The connection file `ipcontroller-client.json` is created when the Controller starts. Copy it down to the Client with scp or the like.
scp ec2-user@<CONTROLLER_HOST>:./.ipython/profile_myprofile/security/ipcontroller-client.json ./
To check the connection, I put together the following script, referring to various sources.
cluster_check.py
from ipyparallel import Client

# Connect to the controller via SSH
rc = Client("ipcontroller-client.json", sshserver="ec2-user@<CONTROLLER_HOST>")
print rc.ids

dview = rc[:]

@dview.remote(block=True)
def gethostname():
    import socket
    return socket.getfqdn()

print gethostname()
Running it should produce output like this.
$ python cluster_check.py
[0, 1, 2, 3, 4, 5]
[<HOSTNAME>, <HOSTNAME>, <HOSTNAME>, <HOSTNAME2>, <HOSTNAME2>, <HOSTNAME2>]
With the settings above, three Engines are started on each of the two hosts, which is why six ids appear.
The IPython cluster can now distribute processing across multiple nodes.
As in the check script above, code that runs on the nodes has to do its own imports and any other setup it needs.
Use the `@require` and `@depend` decorators to make sure code runs only on nodes where it can execute properly (see the documentation around this area); a small sketch follows.
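For instance, `@require` declares that a module must be importable on the engine; modules named as strings are also imported into the function's execution namespace. A minimal sketch, assuming the same hosts as above:

from ipyparallel import Client, require

rc = Client("ipcontroller-client.json", sshserver="ec2-user@<CONTROLLER_HOST>")
dview = rc[:]

# Only runs where 'socket' is importable; the module is
# made available inside the function's namespace
@require('socket')
def fqdn():
    return socket.getfqdn()

print dview.apply_sync(fqdn)  # one hostname per engine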
Use a LoadBalancedView if you want tasks dispatched across multiple nodes as Engines become free.
In [1]: from ipyparallel import Client
In [3]: a = Client("ipcontroller-client.json", sshserver="ec2-user@<CONTROLLER_HOST>")
In [4]: views = a.load_balanced_view()
In [5]: result = views.map_async(lambda x: x*x , (x for x in xrange(10000)))
In [6]: result
Out[6]: <AsyncMapResult: <lambda>>
In this example the overhead makes it pointless: for every element of the list, the value is sent to the controller, the lambda is shipped to an engine, the computation runs, and the result comes back. But it's just an example.
In [8]: result.progress
Out[8]: 8617
In [9]: result.progress
Out[9]: 9557
In [11]: result.progress
Out[11]: 10000
In [12]: result.ready()
Out[12]: True
In [13]: result
Out[13]: <AsyncMapResult: finished>
The results are available via `.result`.
In [14]: len(result.result)
Out[14]: 10000
In [15]: type(result.result)
Out[15]: list
Note that if you access `.result` carelessly, you will block until the distributed processing has completed.
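So if you don't want to block, check `ready()` first (a small sketch; `AsyncResult.get()` with a timeout is another option):

if result.ready():
    print len(result.result)
else:
    print "progress: %d" % result.progress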
So, we can now build a cluster and run distributed processing.
I had thought of IPython as just a "convenient REPL environment", but this is impressive.
Dynamically allocating nodes or scaling the cluster up and down with load still looks difficult, but above all it's a dream to take code written against a local distribution target and put it on a large-scale one. Conversely, it's also appealing that code which works in a large-scale distributed setup can run as-is in local mode. As the scale grows, cluster management will probably surface as the next challenge.
I referred to this post a great deal:
http://qiita.com/chokkan/items/750cc12fb19314636eb7