Last time I went from building Distributed TensorFlow through to model parallelism; this time I will try training with data parallelism. There are two kinds of parallelization, model parallelism and data parallelism, and roughly speaking they break down as follows.
- Model parallelism: one huge computation over 1000 data points is split among 100 people
- Data parallelism: the data is divided into 10 points per person and shared among 100 people
Of course, how to split a model depends on the model itself, so data parallelism, which simply reduces the amount of data handled at one time, can be called the more versatile approach.
In data parallelism for training, multiple copies of a model with the same parameters are created, the batch is subdivided and passed to each copy, and each copy computes its gradients. In other words, every device needs to hold a model with identical parameters, and the handling around that is a little hard to follow. I won't use a GPU this time, but for computation and parameter sharing across multiple devices, the official How To pages on Using GPUs and Sharing Variables are helpful.
The scope of a variable is defined with tf.variable_scope(). If you want to refer to a variable of the same name within the same scope, call get_variable() with the reuse flag set. get_variable() returns a newly created variable when the reuse flag is not set, and a reference to the existing variable of the same name when it is set. This is what we use to share parameters.
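As a minimal sketch of that behavior (the names here are only for illustration):

```python
import tensorflow as tf

# First call: the variable "ps/v" does not exist yet, so it is created.
with tf.variable_scope("ps"):
    v1 = tf.get_variable("v", shape=[1], initializer=tf.constant_initializer(0.0))

# Second call with reuse=True: the existing "ps/v" is returned instead.
with tf.variable_scope("ps", reuse=True):
    v2 = tf.get_variable("v", shape=[1])

print(v1 is v2)  # True: both names refer to the same shared variable
```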
Collections are used to refer back to parts of the graph later.
The Distributed TensorFlow white paper describes a parameter device, a server that holds and updates the Variables, and a master that hands work out to each worker (sorry, I haven't read it properly...). After skimming that part, I settled on a configuration in which the master manages the Variables collectively and two workers handle the subdivided batches. It is also possible to run a separate parameter server (ps), but this time its role is folded into the master.
- Create the ps-scoped variables on the master device
- Reuse the ps-scoped variables to compute gradients on each worker device
- On the master device, update the parameters using the average of the gradients computed by the workers
That is the division of labor. Shown as a diagram, it looks like the following.
Start the gRPC servers. The cluster consists of one master and two workers, so the servers are started like this:
grpc_tensorflow_server --cluster_spec='master|localhost:2222,worker|localhost:2223,worker_|localhost:2224' --job_name=master --task_index=0 &
grpc_tensorflow_server --cluster_spec='master|localhost:2222,worker|localhost:2223,worker_|localhost:2224' --job_name=worker --task_index=0 &
grpc_tensorflow_server --cluster_spec='master|localhost:2222,worker|localhost:2223,worker_|localhost:2224' --job_name=worker_ --task_index=0 &
That brings up the three servers.
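Once they are running, a Python client can attach a session to the master over gRPC. A minimal check might look like this (assuming the servers above are running at the same addresses):

```python
import tensorflow as tf

# Connect to the master's gRPC endpoint from the cluster_spec above.
sess = tf.Session("grpc://localhost:2222")

# Place a trivial op on the master device and run it remotely.
with tf.device("/job:master/task:0"):
    hello = tf.constant("hello from the master")

print(sess.run(hello))
```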
As last time, let's train an approximation of $y = e^x$, this time with data parallelism. See here for the code. Incidentally, I put a single-CPU version and a model-parallel version in the same place.
To share parameters between the master and the workers, you just specify the device and the scope, and replace the variable creation done with tf.Variable() by get_variable(). Since I want the variables to be reusable, I put everything under a single "ps" scope. Initialization is handled by passing an initializer.
W1 = tf.Variable(tf.random_uniform([1,16], 0.0, 1.0)) # before
W1 = tf.get_variable("W1",shape=[1,16],initializer=tf.random_uniform_initializer(0,1)) # after
If you call get_variable() again in the same scope with the reuse flag set, the existing variable is reused. One point that confused me a little: because this is a variable scope, only the variables are shared; the graph operations built on each device are separate instances.
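Putting the device placement and the variable scope together, the pattern looks roughly like this (a sketch only; the inference() helper and its shapes are just for illustration, and the job names match the cluster above):

```python
import tensorflow as tf

def inference(x):
    # All parameters are looked up with get_variable(), so they can be shared.
    W1 = tf.get_variable("W1", shape=[1, 16],
                         initializer=tf.random_uniform_initializer(0, 1))
    b1 = tf.get_variable("b1", shape=[16],
                         initializer=tf.constant_initializer(0.0))
    return tf.nn.relu(tf.matmul(x, W1) + b1)

# Create the variables once, on the master device.
with tf.device("/job:master/task:0"):
    with tf.variable_scope("ps"):
        x_master = tf.placeholder(tf.float32, shape=[None, 1])
        y_master = inference(x_master)

# Reuse the same variables on a worker device. Only the variables are shared;
# the ops built here are a separate copy of the graph.
with tf.device("/job:worker/task:0"):
    with tf.variable_scope("ps", reuse=True):
        x_worker = tf.placeholder(tf.float32, shape=[None, 1])
        y_worker = inference(x_worker)
```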
The batches will be handed to each worker later in the main loop, so we use collections to be able to tell which worker a placeholder belongs to at that point.
tf.add_to_collection("x",x) #Collect x for later use
...
x0= tf.get_collection("x")[0] #Extract the 0th of the x collection
The gradients are collected in the same way.
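For example, a worker's gradient tensor can be stashed in a collection just like the placeholder above (the tiny model and the key name are only for this sketch):

```python
import tensorflow as tf

# A tiny illustrative model.
x = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.get_variable("w", shape=[1, 1], initializer=tf.constant_initializer(0.5))
cost = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Collect this worker's gradient tensor under its own key.
grad_w = tf.gradients(cost, [w])[0]
tf.add_to_collection("grad_w", grad_w)

# Later, on the master side, pull it back out of the collection.
grad_w_worker0 = tf.get_collection("grad_w")[0]
```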
Optimization is the sequence cost calculation → gradient calculation → parameter update. Normally you would use something like optimizer.minimize(), which performs the gradient calculation and the parameter update in one go. This time, however, the computed gradients are not applied immediately, so we call compute_gradients() and then apply_gradients() separately. The overall flow is:
- compute_gradients() on each worker
- average those gradients on the master
- apply_gradients() using the averaged gradients
That is the procedure. The part that averages the gradients is adapted from the TensorFlow sample code.
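Here is a rough, self-contained sketch of those three steps; the average_gradients() helper is my own simplification rather than the exact code from the sample, and device placement is omitted for brevity:

```python
import tensorflow as tf

def average_gradients(worker_grads):
    # worker_grads: one [(gradient, variable), ...] list per worker,
    # with the variables in the same order.
    averaged = []
    for grads_and_vars in zip(*worker_grads):
        grads = [g for g, _ in grads_and_vars]
        var = grads_and_vars[0][1]  # the variables are shared, so any copy will do
        averaged.append((tf.add_n(grads) / float(len(grads)), var))
    return averaged

# A shared parameter and an optimizer.
w = tf.get_variable("w", shape=[1], initializer=tf.constant_initializer(0.0))
optimizer = tf.train.GradientDescentOptimizer(0.1)

# 1. Each worker computes gradients on its own sub-batch.
x0 = tf.placeholder(tf.float32, shape=[None])
x1 = tf.placeholder(tf.float32, shape=[None])
grads_w0 = optimizer.compute_gradients(tf.reduce_mean(tf.square(x0 - w)))
grads_w1 = optimizer.compute_gradients(tf.reduce_mean(tf.square(x1 - w)))

# 2. & 3. The master averages the gradients and applies the update.
train_op = optimizer.apply_gradients(average_gradients([grads_w0, grads_w1]))
```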
Let's compare the degree of convergence.
Blue is the single-CPU version, red is the data-parallel version. Since the random seeds are the same, the curves almost completely overlap (the values are not exactly identical).
This example is handicapped in several ways: the number of server machines hasn't actually increased, the overhead is large, and the per-batch computation is light to begin with, so the parallel version ends up slower. On my machine it was roughly twice as slow.
Up to this point the goal has been to understand the mechanism rather than to gain speed, so every experiment has ended with parallelization making things slower. Next time this finally pays off: I'll package it as a Docker container and run it on Google Cloud Platform. If I spin up containers with enough momentum to use up the \$300 free tier, it should actually get faster. The much-talked-about AlphaGo is said to use 1200 CPUs; I'm looking forward to seeing how far \$300 can take me.