Distributed TensorFlow has been published (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/distributed_runtime/README.md). The explanation from the TensorFlow team is easy to follow, but in short it is a mechanism for supporting parallel computation in a distributed TensorFlow environment.
The intended usage seems to be spinning up lots of Docker images, but as someone at the low-firepower end of computing, for now I'll just run the server on Ubuntu 14.04 (64bit) on my home desktop PC and hit it from my MacBook.
In my home environment the desktop PC has the weaker CPU and no GPU, so there is no practical benefit, but it's good practice.
At the moment (February 28, 2016) you need to build from source. Following the official TensorFlow instructions, let's go from setting up the environment through the build.
First, install bazel.
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
$ sudo apt-get install pkg-config zip g++ zlib1g-dev unzip
$ wget https://github.com/bazelbuild/bazel/releases/download/0.2.0/bazel-0.2.0-installer-linux-x86_64.sh
$ chmod +x ./bazel-0.2.0-installer-linux-x86_64.sh
$ ./bazel-0.2.0-installer-linux-x86_64.sh --user
Bazel gets installed into ~/bin, so add that to your PATH. Next, install the dependencies other than bazel.
$ sudo apt-get install python-numpy swig python-dev
Clone TensorFlow itself from git and build the server.
$ git clone --recurse-submodules https://github.com/tensorflow/tensorflow
$ cd tensorflow
$ ./configure
$ bazel build --jobs 2 -c opt //tensorflow/core/distributed_runtime/rpc:grpc_tensorflow_server
If you forget "--jobs 2", the build fails from running out of resources (at least it did on my machine).
If you want a gRPC-compatible TensorFlow itself, you also need to build that from source. It isn't needed on the server side, but I'll build it anyway to check that things work. On Ubuntu 14.04 (64bit) it's exactly as instructed.
$ sudo pip install wheel
$ bazel build --jobs 2 -c opt //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.7.1-py2-none-any.whl
It takes a very long time... well, it's an underpowered PC, so that can't be helped.
Start the server on the desktop PC and hit it from a client on the same desktop. This is just like the tutorial.
bash@desktop(server)
$ bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server --cluster_spec='local|localhost:2222' --job_name=local --task_index=0 &
bash@desktop(client)
$ python
>>> import tensorflow as tf
>>> c = tf.constant("Hello, distributed TensorFlow!")
>>> sess = tf.Session("grpc://localhost:2222")
>>> sess.run(c)
'Hello, distributed TensorFlow!'
I got it.
Next, from the laptop. The client side apparently also needs to be rebuilt from source.
I'll omit the build procedure; since it's OS X, following the instructions for that platform should do it. In my case I got an error along the lines of "unknown linker option -Bsymbolic", so I removed that option from the BUILD file where it appeared and the build went through.
Furthermore, after installing the resulting wheel with pip, importing tensorflow gave
ImportError: No module named core.framework.graph_pb2
After poking around, virtualenv turned out to be an easy way to avoid this, so I tried that:
bash@note(client)
$ virtualenv -p /usr/local/bin/python distributed_tensorflow
$ . distributed_tensorflow/bin/activate
$ pip install /tmp/tensorflow_pkg/tensorflow-0.7.1-py2-none-any.whl
and with that it works. Phew...
I also tried building the server on OS X, but linking failed with the following error. (Update: as of 2016/3/4, after a git pull to the latest version the build passes.)
duplicate symbol __ZNK10tensorflow17BuildGraphOptions11DebugStringEv in:
bazel-out/local_darwin-opt/bin/tensorflow/core/distributed_runtime/libsimple_graph_execution_state.a(simple_graph_execution_state.o) bazel-out/local_darwin-opt/bin/tensorflow/core/distributed_runtime/libbuild_graph_options.a(build_graph_options.o)
ld: 1 duplicate symbol for architecture x86_64
Now, let's call desktop (home.local) from note.
bash@note(client)
$ python
>>> import tensorflow as tf
>>> c = tf.constant("Hello, distributed TensorFlow!")
>>> sess = tf.Session("grpc://home.local:2222")
>>> sess.run(c)
'Hello, distributed TensorFlow!'
It worked. Now you can exchange graphs over the network.
Now, let's actually use it.
Let's parallelize the TensorFlow version (ashitani/jupyter_examples/blob/master/tensorflow.ipynb) of "deep learning learned by function approximation with Chainer", which does the same processing as the notebook I posted to Qiita the other day. It's not a heavy workload to begin with, so there's no point in distributing it, but again, it's practice.
The cluster consists of two parameter servers (ps) and one master server (master) that does the actual work.
The server startup arguments appear to take the cluster configuration via --cluster_spec, and the server's own name and task number via --job_name and --task_index.
grpc_tensorflow_server --cluster_spec='master|localhost:2222,ps|localhost:2223,ps_|localhost:2224' --job_name=master --task_index=0 &
grpc_tensorflow_server --cluster_spec='master|localhost:2222,ps|localhost:2223,ps_|localhost:2224' --job_name=ps --task_index=0 &
grpc_tensorflow_server --cluster_spec='master|localhost:2222,ps|localhost:2223,ps_|localhost:2224' --job_name=ps_ --task_index=0 &
As far as I can tell from the documentation and help, it should be possible to distinguish tasks with job_name=ps and task_index=0,1, but the ps task that was supposed to listen on port 2224 kept failing while trying to grab 2223, so I gave up and renamed it to ps_.
Now we have a cluster of three servers. It's a poor man's setup with everything running on the same machine, but it looks easy to assign them to separate containers instead.
The parameter servers are split (whether it's meaningful or not) so that one holds the weights $W$ and the other holds the biases $b$. The master server is responsible for the session and for the error part of the graph.
Plotting the machine configuration and how the graph is divided among the servers looks like this.
The client-side code looks like this:
python@note(client)
import tensorflow as tf
import numpy as np

# Training data: approximate y = exp(x) for x in [0, 1)
def get_batch(n):
    x = np.random.random(n)
    y = np.exp(x)
    return x, y

def leaky_relu(x, alpha=0.2):
    return tf.maximum(alpha * x, x)

x_ = tf.placeholder(tf.float32, shape=[None, 1])
t_ = tf.placeholder(tf.float32, shape=[None, 1])

# Weights live on the first parameter server
with tf.device("/job:ps/task:0"):
    W1 = tf.Variable(tf.zeros([1, 16]))
    W2 = tf.Variable(tf.zeros([16, 32]))
    W3 = tf.Variable(tf.zeros([32, 1]))

# Biases live on the second parameter server
with tf.device("/job:ps_/task:0"):
    b1 = tf.Variable(tf.zeros([16]))
    b2 = tf.Variable(tf.zeros([32]))
    b3 = tf.Variable(tf.zeros([1]))

# The master holds the forward pass, the loss and the optimizer
with tf.device("/job:master/task:0"):
    h1 = leaky_relu(tf.matmul(x_, W1) + b1)
    h2 = leaky_relu(tf.matmul(h1, W2) + b2)
    y = leaky_relu(tf.matmul(h2, W3) + b3)
    e = tf.nn.l2_loss(y - t_)
    opt = tf.train.AdamOptimizer()
    train_step = opt.minimize(e)

with tf.Session("grpc://home.local:2222") as sess:
    sess.run(tf.initialize_all_variables())
    for i in range(10000):
        x0, t0 = get_batch(100)
        x = x0.astype(np.float32).reshape(100, 1)
        t = t0.astype(np.float32).reshape(100, 1)
        sess.run(train_step, feed_dict={x_: x, t_: t})
        if i % 100 == 0:
            print "loss,", sess.run(e, feed_dict={x_: x, t_: t})
The result: it was much slower than running on a single machine (laughs). Not surprising, given the network bandwidth and the machines themselves are slow.
The job name, by the way, is just a label for telling servers apart; naming a job ps does not make it a parameter server. Likewise, the session can be pointed at any of the servers, not only the one called master. Normally you would allocate jobs according to the type of resource they run on, so being able to distinguish them by name is convenient.
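As a quick illustration of that last point, here is a minimal sketch (assuming the three-server cluster above is still running on home.local): the hello-world session from earlier should work just as well when pointed at the server named ps on port 2223 instead of master.
python@note(client)
import tensorflow as tf

# Sketch: the hello-world from earlier, but aimed at the "ps" server
# (port 2223 in the cluster_spec above) rather than "master".
c = tf.constant("Hello, distributed TensorFlow!")
with tf.Session("grpc://home.local:2223") as sess:
    print sess.run(c)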
What I really wanted to try was data parallelism, but I couldn't get that far this time. The idea, as I understand it, is that across multiple sessions the batch is subdivided and each server trains on its share, the parameter gradients produced by each server are collected, and the parameters are updated with their average.
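For reference, a rough sketch of what that might look like, not something I actually ran: the "/job:worker" device names, the single linear layer, and the plain gradient-descent optimizer below are all my own placeholders, not part of the setup above. The batch is split across two devices, each computes gradients on its half, the gradients are averaged, and one update is applied.
python@note(client)
import tensorflow as tf

# Hypothetical sketch of synchronous data parallelism (untested here).
x_ = tf.placeholder(tf.float32, shape=[None, 1])
t_ = tf.placeholder(tf.float32, shape=[None, 1])

# Parameters kept on a parameter server
with tf.device("/job:ps/task:0"):
    W = tf.Variable(tf.zeros([1, 1]))
    b = tf.Variable(tf.zeros([1]))

opt = tf.train.GradientDescentOptimizer(0.01)
xs = tf.split(0, 2, x_)   # split the fed batch into two halves
ts = tf.split(0, 2, t_)
tower_grads = []
for i, dev in enumerate(["/job:worker/task:0", "/job:worker/task:1"]):
    with tf.device(dev):
        y = tf.matmul(xs[i], W) + b
        loss = tf.nn.l2_loss(y - ts[i])
        tower_grads.append(opt.compute_gradients(loss, [W, b]))

# Average each variable's gradient over the towers and apply one update
avg_grads = []
for grad_and_vars in zip(*tower_grads):
    grads = [g for g, v in grad_and_vars]
    avg_grads.append((tf.add_n(grads) / len(grads), grad_and_vars[0][1]))
train_step = opt.apply_gradients(avg_grads)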
Still, I got a feel for it. The great thing is that everything is done from code on the client side. Forming the cluster is a bit of a hassle, but a mechanism for managing it easily with Kubernetes is apparently being considered, so I'm looking forward to that. Not that it matters to a low-firepower person like me (laughs).
The builds took a lot of time. The OS X one in particular had me stuck for a while. Well, official Docker images will no doubt appear soon, and the client side will presumably make it into the binary releases as well.
Until now, spending money on GPUs has been the mainstream way to speed up training, but we seem to be moving to a stage where the question is how much money you can pour into a cloud computing environment.
A hobbyist like me can't keep up with that, so I'll keep looking for cheap GPUs. I'm looking forward to the day when boards or boxes stuffed with Altera chips, which have been getting cheaper, start showing up.
I wrote a sequel.
- Try data parallelism with Distributed TensorFlow
- Try running Distributed TensorFlow on Google Cloud Platform