Deep learning takes time not only for training, but also for running a trained model. But suppose you want to do real-time object recognition with SSD (Single Shot MultiBox Detector), or pit a DQN (Deep Q-Network) agent against a human player in a real-time action game: then how fast the model executes becomes very important.
"Just buy a good PC!" you might say, but I don't have that kind of money, and I may want to run it portably on a laptop. So this time, let's look at how to execute Keras (TensorFlow) models at high speed.
Let's get to it. This time I will use the MNIST beginner and expert tutorials as examples, since they are easy to understand and easy to try. The code for this article is on GitHub.
This is a blunt way to put it, but if you can buy a good graphics board, pay for AWS, or otherwise have no particular constraints on your execution environment, just plug in a Titan X and buy a PC stuffed with memory right now. Various sites compare TensorFlow's CPU and GPU speeds; for example, this article (Comparison of TensorFlow execution speed on CPU / GPU / AWS) reports differences of several tens of times between CPU and GPU.
Even with all the speedups in this article, we only get about 2 to 5 times faster than the baseline, so if you can throw hardware at the problem from the start, that is definitely better. If you can't take such measures, or you already have and still want more speed, read on.
First, let's check with the beginner-level MNIST sample. A straightforward Keras implementation looks like this:
```python
import numpy as np
from keras.models import Sequential
from keras.layers import InputLayer, Dense, Activation
from keras.optimizers import SGD

# Modeling
model = Sequential()
model.add(InputLayer(input_shape=input_shape, name='input'))
model.add(Dense(nb_classes))
model.add(Activation('softmax', name='softmax'))

optimizer = SGD(lr=0.5)
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
...

# Model training
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          verbose=1, validation_data=(X_test, Y_test))
...

# Model evaluation
score = model.evaluate(X_test, Y_test, verbose=0)
...

# Model execution
model.predict(np.array([x]))
```
Note that batch execution with evaluate and the like differs from the real-time situation where new data arrives one sample at a time, so this time:
```python
import time

# X_test is 10000 samples of 1-channel, 784-dimensional data
start = time.perf_counter()
n_loop = 5
for n in range(n_loop):
    predictions = [model.predict(np.array([x])) for x in X_test]
print('elapsed time for {} prediction {} [msec]'.format(
    len(X_test), (time.perf_counter() - start) * 1000 / n_loop))
```
The model execution speed is measured by running predict over the 10000 test samples one at a time and averaging the elapsed time over 5 loops (time.perf_counter() is used instead of time.time() so that millisecond accuracy can be measured).
By the way, the result of the above was:

```
elapsed time for 10000 prediction 3768.8394089927897 [msec]
```
K.function
```python
from keras import backend as K

pred = K.function([model.input], [model.output])
for n in range(n_loop):
    predictions = [pred([np.array([x])]) for x in X_test]
```

As mentioned in the official documentation, Keras lets you access its backend via `from keras import backend as K`, and `K.function` creates an instance of a backend function. Running the model through it is slightly faster than calling Keras as-is. In this case the result became:

```
elapsed time for 10000 prediction 3210.0291186012328 [msec]
```
To begin with, even when you build exactly the same model, there is a considerable difference in execution and training speed between Keras and TensorFlow. Here is the same model implemented directly in TensorFlow:
```python
import numpy as np
import tensorflow as tf

# Modeling
x = tf.placeholder(tf.float32, [None, imageDim], name="input")
W = tf.Variable(tf.zeros([imageDim, outputDim]), dtype=tf.float32, name="Weight")
b = tf.Variable(tf.zeros([outputDim]), dtype=tf.float32, name="bias")
y = tf.nn.softmax(tf.matmul(x, W) + b, name="softmax")
y_ = tf.placeholder(tf.float32, [None, outputDim])  # teacher labels

# Objective function setting
cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])
)

# Optimizer settings
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

# Model training
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(1000):
    batch_xs, batch_ys = tfmnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# Model evaluation
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
result = sess.run(
    accuracy,
    feed_dict={x: tfmnist.test.images, y_: tfmnist.test.labels}
)

# Model execution
sess.run(y, feed_dict={x: np.array([test_x])})
```
Fine-grained control is an advantage of TensorFlow over Keras, but even so, the model-definition code is inevitably more verbose. Still, executing the model built with TensorFlow gives:

```
elapsed time for 10000 prediction 2662.211540598946 [msec]
```

a considerable speedup over the Keras implementation.
This doesn't mean Keras users have to weep and migrate to TensorFlow. You can improve execution speed by doing only the model definition in Keras and driving everything else (training, prediction, and so on) from TensorFlow.
```python
import keras.backend.tensorflow_backend as KTF
import tensorflow as tf

old_session = KTF.get_session()
sess = tf.Session()
KTF.set_session(sess)

# Modeling
model = Sequential()
model.add(InputLayer(input_shape=input_shape, name='input'))
model.add(Dense(nb_classes))
model.add(Activation('softmax', name='softmax'))

x = tf.placeholder(tf.float32, [None, imageDim], name="input")
y = model(x)
y_ = tf.placeholder(tf.float32, [None, nb_classes])

# The objective function, optimizer creation, training, and evaluation
# are the same as above, so they are omitted.

KTF.set_session(old_session)
```
You get the output `y` by creating the input placeholder `x` and applying the model to it. After that, define the objective function and optimizer following the TensorFlow style and run the training.
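Prediction then goes through `sess.run` on the Keras-built output tensor, just as in the plain TensorFlow version; a minimal sketch, using the same names as above:

```python
# Run a single sample through the Keras-defined graph via the TF session.
prediction = sess.run(y, feed_dict={x: np.array([test_x])})
```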
With this method, the execution result is:

```
elapsed time for 10000 prediction 2685.7926497992594 [msec]
```
Even with the model part written in Keras, the execution speed comes quite close to the pure TensorFlow implementation.
The easy gains available from Python are roughly at the level above (PyPy or the like might buy a little more); to go faster, you need to execute the model from C++. TensorFlow publishes TensorFlow Serving, an API for using trained models in applications. With this API, you can load a TensorFlow model on the C++ side and execute it at high speed.
If you are a Linux user, following the tutorial should just work, but getting it running on OSX is still difficult (I also failed to set up the environment, so I couldn't cover the details this time...), and there are many OS X issues on GitHub. So this time I will call the TensorFlow C++ API directly without going through Serving. Even once Serving becomes usable, this should still be helpful when you want to drive a Keras model from C++.
Since we will work directly inside the TensorFlow source tree, put a link to the tensorflow folder (located under pyenv etc.) somewhere easy to work from. If you don't want to pollute the pip-installed directory, clone the corresponding version from GitHub.
Run `./configure` from the root directory of your tensorflow tree. You will be asked to specify the compiler to use and to set the default options; the defaults or yes are basically fine. However, if you do not have a GPU, answer N to the questions about enabling OpenCL and CUDA.
Compiling TensorFlow requires Bazel, an open-source build tool originally used internally at Google. Install it by following the instructions here.
On OSX you can install it in one shot with `brew install bazel && brew upgrade bazel`.
Export the model data in a form that can be read from C++:

```python
sess = tf.Session()

# For Keras
import keras.backend.tensorflow_backend as KTF
KTF.set_session(sess)
...

saver = tf.train.Saver()
saver.save(sess, "models/" + "model.ckpt")
tf.train.write_graph(sess.graph.as_graph_def(), "models/", "graph.pb")
```
Weights can be frozen for models that no longer need training. From the official documentation:

> What this does is load the GraphDef, pull in the values for all the variables from the latest checkpoint file, and then replace each Variable op with a Const that has the numerical data for the weights stored in its attributes. It then strips away all the extraneous nodes that aren't used for forward inference, and saves out the resulting GraphDef into an output file.

In other words, converting the parameter Variables to Const and deleting the nodes unnecessary for inference reduces the model data size (and since the parameters become Const, perhaps access gets a little faster too?).
Freezing is done with freeze_graph.py. Go to the root directory of tensorflow and run:
```
bazel build tensorflow/python/tools:freeze_graph && \
bazel-bin/tensorflow/python/tools/freeze_graph \
    --input_graph=/path/to/graph.pb \
    --input_checkpoint=/path/to/model.ckpt \
    --output_graph=/path/to/output/frozen_graph.pb --output_node_names=softmax
```
Running this generates a frozen graph at the path specified by `output_graph` (the first run takes a while because various compilations are executed). If you don't specify `output_node_names`, it will complain. In TensorFlow you can set the name like this:

```python
y = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2, name="softmax")
```

However, in Keras, even if you specify a name like

```python
model.add(Activation('softmax', name='softmax'))
```

an error is still printed, because internally the node carries a different name. In that case, print the node names directly and check what name was assigned internally (in my environment it was Softmax):

```python
[print(n.name) for n in sess.graph.as_graph_def().node]
```
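Before moving to C++, it can be handy to sanity-check from Python that the frozen graph loads and runs. A minimal sketch, assuming the paths and node names from the example above (adjust `softmax` to whatever node name you found):

```python
import numpy as np
import tensorflow as tf

# Load the frozen GraphDef produced by freeze_graph.
with tf.gfile.GFile('models/frozen_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# Import it into a fresh graph and run one dummy prediction.
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    x = graph.get_tensor_by_name('input:0')
    y = graph.get_tensor_by_name('softmax:0')
    with tf.Session(graph=graph) as sess:
        print(sess.run(y, feed_dict={x: np.zeros((1, 784), dtype=np.float32)}))
```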
Now let's write the C++ code that loads the model and executes it. See GitHub for the full code; I will excerpt the main parts here. First, create a directory for this project under tensorflow_ROOT/tensorflow (here, loadgraph), and create a cc file in it (tensorflow_ROOT/tensorflow/loadgraph/mnist_tf.cc).
```cpp
GraphDef graph_def;
status = ReadBinaryProto(Env::Default(), graph_file_name, &graph_def);
if (!status.ok()) {
    cout << status.ToString() << "\n";
    return 1;
}
cout << "loaded graph" << "\n";

// Add the graph to the session
status = session->Create(graph_def);
if (!status.ok()) {
    cout << status.ToString() << "\n";
    return 1;
}
```
First, load the graph data and register it with the session.
```cpp
Tensor x(DT_FLOAT, TensorShape({nTests, imageDim}));
MNIST mnist = MNIST("./MNIST_data/");
auto dst = x.flat<float>().data();
for (int i = 0; i < nTests; i++) {
    auto img = mnist.testData.at(i).pixelData;
    std::copy_n(img.begin(), imageDim, dst);
    dst += imageDim;
}

const char* input_name = "input";
vector<pair<string, Tensor>> inputs = {
    {input_name, x}
};
```
Next, create an input tensor `x` and fill it with the MNIST test data. Since `mnist.testData` holds 10000 784-dimensional float vectors, they are registered in `x` one by one. Then create a pair of the tensor and the name assigned on the Python side. This name has to match the one given when the model was defined:

```python
# TensorFlow
x = tf.placeholder(tf.float32, [None, imageDim], name="input")

# Keras
InputLayer(input_shape=input_shape, name='input')
```

On the output side, likewise prepare a vector of Tensors, then pass the output name (softmax here), the output vector, and the input vector created earlier to the session and run it:
```cpp
vector<Tensor> outputs;

// Run the session, evaluating our "softmax" operation from the graph
status = session->Run(inputs, {output_name}, {}, &outputs);
if (!status.ok()) {
    cout << status.ToString() << "\n";
    return 1;
} else {
    cout << "Success run graph !! " << "\n";
}
```
If the model runs successfully, the outputs should contain the output values.
```cpp
int nHits = 0;
// Loop over outputs; there is only one element this time,
// so this is synonymous with item = outputs.front()
for (vector<Tensor>::iterator it = outputs.begin(); it != outputs.end(); ++it) {
    // Classification results: 10 dimensions x 10000 test samples
    auto items = it->shaped<float, 2>({nTests, 10});
    for (int i = 0; i < nTests; i++) {
        // Find the index of the maximum value of the 10-dimensional vector
        int arg_max = 0;
        float val_max = items(i, 0);
        for (int j = 0; j < 10; j++) {
            if (items(i, j) > val_max) {
                arg_max = j;
                val_max = items(i, j);
            }
        }
        if (arg_max == mnist.testData.at(i).label) {
            nHits++;
        }
    }
}
float accuracy = (float)nHits / nTests;
```
The accuracy is computed by comparing the execution results against the labels, as shown above.
Finally, create a BUILD file (a file that describes dependencies and so on, like a Makefile) at the same level as the cpp files.
```
cc_binary(
    name = "mnistpredict_tf",
    srcs = ["mnist_tf.cc", "MNIST.h"],
    deps = [
        "//tensorflow/core:tensorflow",
    ],
)

cc_binary(
    name = "mnistpredict_keras",
    srcs = ["mnist_keras.cc", "MNIST.h"],
    deps = [
        "//tensorflow/core:tensorflow",
    ],
)
```
Build it.
```
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-msse4.2 --copt=-msse4.1 --copt=-msse3 --copt=-mfma :mnistpredict_tf
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-msse4.2 --copt=-msse4.1 --copt=-msse3 --copt=-mfma :mnistpredict_keras
```
The various --copt options are not essential, but without them I got warnings like:

```
The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
...
```

so I include them here.
If the compilation succeeds, executables with the names specified in BUILD are created in tensorflow_ROOT/bazel-bin/tensorflow/loadgraph.

```
cd tensorflow_ROOT/bazel-bin/tensorflow/loadgraph
./mnistpredict_tf
```

Go to that directory, bring in the MNIST_TEST folder and the frozen graph data, and run (the contents of the MNIST_TEST folder need to be unpacked first).
Now let's summarize the results for all the patterns introduced so far. The table shows the time (msec) to repeat, 10000 times, the prediction "given one 784-dimensional (28x28) pixel sample, return which digit it is judged to be" (averaged over 5 runs).
| Time (msec) | Keras | Keras (K.function) | Keras (tf) | TensorFlow |
|---|---|---|---|---|
| Python | 3787 | 3242 | 2711 | 2588 |
| C++ | 578 | - | 577 | 576 |
Python and C++ differ overwhelmingly in loop performance to begin with, so a simple comparison is unfair, but C++ still wins overwhelmingly. Among the Python implementations, plain TensorFlow is fastest, followed by the hybrid that defines the model in Keras and executes it via TensorFlow. Plain Keras is considerably slower than those, though calling through K.function gives a slight improvement.
Next, let's compare speeds on the expert version, which uses convolution layers; a sketch of the model follows below.
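For reference, here is a minimal Keras sketch of the standard Deep-MNIST-style CNN assumed here (written with Keras 2 layer names; the exact model used for the measurements is in the GitHub repo):

```python
from keras.models import Sequential
from keras.layers import (InputLayer, Conv2D, MaxPooling2D,
                          Flatten, Dense, Dropout, Activation)

# Two conv/pool stages, a dense layer with dropout, and a softmax output.
model = Sequential()
model.add(InputLayer(input_shape=(28, 28, 1), name='input'))
model.add(Conv2D(32, (5, 5), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (5, 5), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.5))  # this Dropout layer is why learning_phase matters below
model.add(Dense(10))
model.add(Activation('softmax', name='softmax'))
```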
Since it's mostly the same as the beginner version, I'll skip over the details, but there is one point to be careful about.
There is no problem when using Keras alone, but when using K.function or mixing Keras with TensorFlow, you need to be careful about the handling of learning_phase. Whenever the model behaves differently between training and test/inference, for example when there is a Dropout layer, the learning_phase flag must be supplied: 1 during training and 0 during execution. Add `K.learning_phase()` to the inputs and feed 0 at execution time:
```python
# When using K.function
pred = K.function([model.input, K.learning_phase()], [model.output])
[pred([np.array([x]), 0]) for x in X_test]

# When using the Keras model from TensorFlow
[sess.run(y, feed_dict={x: np.array([test_x]), K.learning_phase(): 0}) for test_x in X_test]
```
On the C++ side, create a Bool tensor, set it to 0, and register it in the inputs under the name keras_learning_phase:
```cpp
Tensor lp(DT_BOOL, TensorShape({}));
lp.flat<bool>().setZero();
...

vector<pair<string, Tensor>> inputs = {
    {input_name, x}, {"keras_learning_phase", lp}
};
```
The results, compared in the same way as the beginner version:
| Time (msec) | Keras | Keras (K.function) | Keras (tf) | TensorFlow |
|---|---|---|---|---|
| Python | 9693 | 9087 | 8571 | 8124 |
| C++ | 5528 | - | 5530 | 5512 |
Honestly, I expected a bigger performance gap between Python and C++, but the difference was smaller than for the beginner version. Comparing within Python, the ordering is almost the same as before.
Since it is not directly related to the main theme, I add this as a bonus: computation speed also depends on your choice of BLAS (Basic Linear Algebra Subprograms), the specification of basic matrix and vector operations.
- Reference BLAS: the reference implementation. Slow. Probably the default in most environments.
- OpenBLAS: a fast open-source implementation.
- ATLAS: an auto-tuning open-source implementation.
- Intel MKL: a blazingly fast implementation by Intel. Recently became free.
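To check which BLAS your NumPy build is actually linked against, you can ask NumPy itself:

```python
import numpy as np

# Prints the BLAS/LAPACK build info; look for entries such as
# openblas_info or mkl_info (the exact output format varies by version).
np.__config__.show()
```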
For how to use OpenBLAS, I wrote an article before, so please refer to that if you like. For Intel MKL, there are very clear guides such as "How to install mkl numpy" for Linux and "Building mkl & numpy on mac" for OSX, so I recommend looking there. Incidentally, if you install Python via Anaconda, MKL-compiled numpy and scipy are included by default (this is by far the easiest way). However, according to the "Building mkl & numpy on mac" article, building against MKL directly seems to perform better than going through Anaconda, so I can't say which is better.
As for how much the computation speed improves, see the articles above and the comparison in "Comparing the speed of Python's singular value decomposition SVD"; it seems to range from 20% faster to several times faster (about 30% in my environment).
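As a rough illustration of that kind of benchmark, here is a minimal SVD timing sketch (the matrix size is my arbitrary choice):

```python
import time
import numpy as np

# Time a single SVD of a random 1000x1000 matrix.
a = np.random.rand(1000, 1000)
start = time.perf_counter()
u, s, v = np.linalg.svd(a)
print('SVD of a 1000x1000 matrix: {:.1f} msec'.format(
    (time.perf_counter() - start) * 1000))
```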
It got a bit long, but I looked at how much execution speed changes across Keras, TensorFlow, and their C++ execution. In short: if you stay on the Python side, TensorFlow is fastest, and if writing models in raw TensorFlow is a pain, the hybrid approach that uses Keras for the model looks good. With C++ execution, the speed barely changes whichever variant you pick, and all of them beat the Python implementations.
In the expert version, C++ was nearly 1.8 times faster than Python, a smaller difference than I expected. The model-execution part of TensorFlow should already be running compiled C++ under the hood, so the difference seen this time is probably not in model execution itself but in loop handling and other surrounding work. So except when you really need C++-level processing speed outside model execution, for example in combination with image processing, I felt there is no big problem running the model from the Python side. Next time, I will try SSD.