This article is the Day 18 entry of the TensorFlow Advent Calendar 2016.
Originally I was implementing ConvLSTM as a step toward implementing PredNet, but since ConvLSTM on its own can already predict video frames, I decided to try just that and write it up here. I also gave a talk on it at the previous TensorFlow User Group event, the "NN Paper Drinking Party", so if you are curious about an outline of the original paper, please see that slide deck.
The name makes it easy to imagine what it is. In a conventional LSTM the state carried across time is a rank-2 tensor (batch size, number of hidden units); here it becomes a rank-4 tensor (batch size, height, width, number of channels). Because the state being handled is image data, the connections between layers are changed from fully connected to convolutional. The conventional LSTM looks like this ↓ and the convolutional LSTM looks like this ↓
At a glance they do not look very different, but every matrix multiplication has been replaced by a convolution. Note, however, that the Hadamard products contributed by the peephole connections remain Hadamard products. (I always wonder: why is the peephole part a Hadamard product rather than a matrix multiplication...?)
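Since the equation figures themselves are not reproduced here, for reference the ConvLSTM update from the original paper can be written as follows, where $*$ denotes convolution and $\circ$ the Hadamard product:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i)\\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f)\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c)\\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o)\\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
$$

In the conventional LSTM the $*$ terms are matrix products; in ConvLSTM they become convolutions over the rank-4 feature maps, while the peephole terms $W_{c\cdot} \circ C$ stay element-wise.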
Of course, a convolutional LSTM is not available as a built-in feature, so you have to implement it yourself. Let's implement `ConvLSTMCell` by inheriting from `tf.nn.rnn_cell.RNNCell`. The source code is available here. I also wrote a script for downloading and preprocessing the dataset, but it is messy and the original dataset is enormous, so I preprocessed just the portion I use and committed it to the repository as-is.
The reference code is here.
What I built this time is a prediction of driving scenery using the KITTI dataset. Ideally we would predict several future frames from several past frames, but that increases the amount of code and the training time, so we build a network that predicts one future frame from the past four frames.
I tried various image sizes, but running on a GeForce GTX 1070, 128x128 felt like the limit, so I verified things at 64x64. In the paper the LSTM layers are stacked, but modifying the wrappers was a hassle, so I built a single layer. The paper uses cross entropy as the loss function for some reason, which felt odd to me, so I use the absolute error instead.
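To make the setup concrete, the constants used later in train.py might look roughly like this. This is only a sketch: the image size, stride, and sequence length follow the description above, while the kernel size is my assumption, not necessarily the repository's value.

    # Hypothetical configuration matching the description in this article.
    IMG_SIZE = (64, 64)       # verified at 64x64; 128x128 felt like the limit on a GTX 1070
    KERNEL_SIZE = (3, 3)      # kernel size: my assumption
    STRIDE = [1, 1, 1, 1]     # the stride must stay 1 so the spatial size never changes (see below)
    SEQ_LENGTH = 4            # predict one future frame from the past four frames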
There are at least three methods that must be overridden when inheriting from the original `tf.nn.rnn_cell.RNNCell`: `state_size`, `output_size`, and `__call__`. There is one more, `zero_state`, which creates the initial internal state filled with zeros; normally you would not need to override it, but here the internal state is a rank-4 tensor, so it has to be changed as well. As for the role of each: `output_size` is the number of units in the output (not the internal state). Due to the nature of an RNN it matches the number of hidden units held as internal state; if you project the output to reduce the amount of computation, change it accordingly.
rnn_cell.py

    if num_proj:
      self._state_size = (
          LSTMStateTuple(num_units, num_proj)
          if state_is_tuple else num_units + num_proj)
      self._output_size = num_proj
    else:
      self._state_size = (
          LSTMStateTuple(num_units, num_units)
          if state_is_tuple else 2 * num_units)
      self._output_size = num_units

    @property
    def state_size(self):
      return self._state_size

    @property
    def output_size(self):
      return self._output_size
Next is `state_size`, the number of units in the internal state. For a plain RNN or GRU it naturally matches the number of hidden units held as internal state, but in an LSTM both the cell state and the output affect the next step, so it is twice the size, as in the `rnn_cell.py` snippet above.
`zero_state` returns the initial state, filled with zeros to match this `state_size`.
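For the convolutional version, a minimal sketch of how the rank-4 state could be handled looks like the following. This is my own illustration with assumed attribute names (`self._img_size`, `self._num_units` set in `__init__`), not necessarily the repository's actual code.

    # Sketch: with state_is_tuple=False, the cell state c and the output h are
    # stacked along the channel axis, so the state carries 2 * num_units channels.
    def zero_state(self, batch_size, dtype):
        height, width = self._img_size
        # Initial state: a rank-4 tensor of zeros with the same spatial size as the input.
        # (Assumes batch_size is a Python int; with a dynamic batch you would build
        #  the shape with tf.pack instead.)
        return tf.zeros([batch_size, height, width, 2 * self._num_units], dtype=dtype)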
Finally there is `__call__`, the object's function-call operator. This is where the processing of each time step is described: the weights are actually applied to the input. TensorFlow's RNN-related code is fairly short, so if you are interested, please read it.
ConvLSTMCell
Now for the main subject, the implementation of `ConvLSTMCell`. There are two key points.
The first point is that when computing the input gate or the forget gate, you need to combine the current input with the output from the previous time step (the processing in the lower part of the figure below).
In a conventional LSTM the previous output and the current input can have different sizes, so you cannot simply add them; instead the tensors are concatenated along the input-length/output-length direction. Imagine PPAP. (figure below)
However, this time the input, output, and state are rank-4 tensors, so they cannot be concatenated as they are. The solution, which also touches on the second point, is to keep the height and width of the images identical and concatenate the input and output along the channel axis. The idea is shown in the figure below.
rnn_cell.py

    if len(args) == 1:
      res = math_ops.matmul(args[0], weights)
    else:
      res = math_ops.matmul(array_ops.concat(1, args), weights)

conv_lstm_cell.py

    # Because the weights are shared, always convolve with padding='SAME'.
    if len(args) == 1:
      res = tf.nn.conv2d(args[0], kernel, stride, padding='SAME')
    else:
      res = tf.nn.conv2d(array_ops.concat(3, args), kernel, stride, padding='SAME')
The if/else branch merely distinguishes the plain-RNN case from the LSTM case. The difference lies in the else branch, where `concat` is applied before the operation: in the conventional implementation the tensors are concatenated along axis 1, whereas in `conv_lstm_cell.py` you can see they are concatenated along axis 3 (the channel axis).
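As a quick shape check of that channel-axis concatenation (illustrative values only, using the old `concat(axis, values)` argument order shown above):

    import tensorflow as tf

    x = tf.zeros([8, 64, 64, 3])      # current input:   (batch, H, W, 3)
    h = tf.zeros([8, 64, 64, 12])     # previous output: (batch, H, W, num_units)
    combined = tf.concat(3, [x, h])   # -> (8, 64, 64, 15): spatial size kept, channels stacked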
The second point is related to the concatenation issue above: because an RNN propagates through time with shared weights, the tensor holding the internal state must always keep the same shape.
The convolution padding is therefore, of course, `SAME`. Also, padding only compensates for the filter size, so if you set the stride to anything greater than 1 the image shrinks; the stride must always stay fixed at [1, 1, 1, 1]. Because of this the computational cost is very high, and training makes no progress at all unless the images are reasonably small.
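Putting the two points together, the per-time-step update can be sketched roughly as follows. This is my own simplified illustration of a `__call__` method (no peepholes, cell clipping, projection, or biases; it assumes `import tensorflow as tf` and that `__init__` stored `self._kernel_size`, `self._num_units`, and `self._forget_bias`), not the repository's actual code.

    def __call__(self, inputs, state, scope=None):
        with tf.variable_scope(scope or type(self).__name__):
            # Split the rank-4 state into cell state c and previous output h (channel axis).
            c, h = tf.split(3, 2, state)
            # Point 1: concatenate input and previous output along the channel axis.
            concat = tf.concat(3, [inputs, h])
            # Point 2: shared weights, padding='SAME' and stride 1 so shapes never change.
            kernel = tf.get_variable(
                "kernel",
                [self._kernel_size[0], self._kernel_size[1],
                 concat.get_shape()[3].value, 4 * self._num_units])
            conv = tf.nn.conv2d(concat, kernel, [1, 1, 1, 1], padding='SAME')
            # One convolution produces all four gates at once.
            i, j, f, o = tf.split(3, 4, conv)
            new_c = c * tf.sigmoid(f + self._forget_bias) + tf.sigmoid(i) * tf.tanh(j)
            new_h = tf.tanh(new_c) * tf.sigmoid(o)
            # state_is_tuple=False: return the new state stacked on the channel axis.
            return new_h, tf.concat(3, [new_c, new_h])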
Now that the behaviour at each time step is implemented in `ConvLSTMCell`, we unroll it over time as an RNN. There are roughly two ways to unroll a cell in time: one is to write a for loop yourself while calling `reuse_variables()`, and the other is to use `tf.nn.rnn()` or `tf.nn.dynamic_rnn()`. Since the functions are already there, I will use them, namely `tf.nn.rnn()`. Personally I wanted to use `dynamic_rnn()`, which makes preparing the input data less tedious, but its `time_major` option and related handling fix the time axis onto a rank-2 part of a single packed tensor, and modifying that was a pain, so I use `rnn()` instead. Consequently, the input data is a list of rank-4 tensors (batch size, width, height, channels).
train.py

    # Input data: a time-series list of rank-4 (batch, width, height, channel) tensors
    images = []
    for i in xrange(4):
        input_ph = tf.placeholder(tf.float32, [None, IMG_SIZE[0], IMG_SIZE[1], 3])
        tf.add_to_collection("input_ph", input_ph)
        images.append(input_ph)

    # Ground-truth data: a rank-4 (batch, width, height, channel) tensor
    y = tf.placeholder(tf.float32, [None, IMG_SIZE[0], IMG_SIZE[1], 3])
Hmm, I ended up filling the `feed_dict` in a pretty clunky way; wasn't there a better approach?
train.py

    feed_dict = {}
    # Pick a random starting frame for each example in the batch
    target = []
    for i in xrange(FLAGS.batch_size):
        target.append(random.randint(0, 104))
    # Fill the feed_dict for each input-image placeholder
    for i in xrange(4):
        inputs = []
        for j in target:
            file = FLAGS.data_dir + str(i + j) + '.png'
            img = cv2.imread(file) / 255.0
            inputs.append(img)
        feed_dict[tf.get_collection("input_ph")[i]] = inputs
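To answer my own question, one slightly tidier variant might be to zip the placeholder list with the per-time-step batches. This is an untested sketch; `load_sequence_batch` is a hypothetical helper that would return a list of four numpy arrays, one per time step.

    # Hypothetical tidier version: build the feed_dict in one go.
    input_phs = tf.get_collection("input_ph")
    batches = load_sequence_batch(FLAGS.data_dir, target)   # hypothetical helper
    feed_dict = dict(zip(input_phs, batches))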
That said, the model construction for the time-unrolled part can be written very simply.
train.py

    cell = conv_lstm_cell.ConvLSTMCell(
        FLAGS.conv_channel, img_size=IMG_SIZE, kernel_size=KERNEL_SIZE,
        stride=STRIDE, use_peepholes=FLAGS.use_peepholes, cell_clip=FLAGS.cell_clip,
        initializer=initializer, forget_bias=FLAGS.forget_bias,
        state_is_tuple=False, activation=activation)
    outputs, state = tf.nn.rnn(cell=cell, inputs=images, dtype=tf.float32)
The image is generated from the output of the convolutional LSTM at the final time step. What we need is the last element of the list `outputs` returned by `tf.nn.rnn()`, so we take `outputs[-1]` and convolve that rank-4 (batch size, width, height, channels) tensor. As mentioned in the points about the convolutional LSTM, every image tensor appearing in the network has the same spatial size, so the predicted frame is produced simply by convolving with a 1x1 kernel.
train.py

    # Get the output at the final time step
    last_output = outputs[-1]
    # Convolve with a 1x1 kernel to map the result back to the same shape as the original image
    kernel = tf.Variable(tf.truncated_normal([1, 1, FLAGS.conv_channel, 3], stddev=0.1))
    result = tf.nn.conv2d(last_output, kernel, [1, 1, 1, 1], padding='SAME')
    result = tf.nn.sigmoid(result)
Since the pixel values of the output image must lie between 0 and 255, the activation function is a sigmoid and the result is multiplied by 255, so the image can be written out safely.
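For completeness, the loss and the image write-out could look roughly like the following continuation of train.py. This is a sketch: the optimizer choice, learning rate, and output file name are my assumptions (the article only says mean absolute error was used), and it assumes a `tf.Session()` named `sess` and that the ground-truth frame fed to `y` is also divided by 255.

    # Mean absolute error between the predicted frame and the ground truth (both in the 0-1 range).
    loss = tf.reduce_mean(tf.abs(result - y))
    # Optimizer and learning rate are my assumptions.
    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

    # After training, scale the sigmoid output back to 0-255 and write it out.
    pred = sess.run(result, feed_dict=feed_dict)
    cv2.imwrite('predicted.png', (pred[0] * 255.0).astype('uint8'))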
Well, I don't log to TensorBoard or write checkpoint files; I just let the results stream out and paste a sample here.
It is certainly learning, and toward the end the white lines on the road come out surprisingly well, but the hyperparameters were chosen rather casually, so this is about what you get. In the latter half the mean absolute error was around 0.1. Incidentally, when I increased the image size the training took far longer, but the absolute error became even smaller and the output became considerably sharper.
I haven't done any parameter tuning, so the results are so-so. Occasionally it worked quite well, but there are still plenty of problems, such as the trees along the road disappearing.
Since I implemented only the bare minimum of the convolutional LSTM, if you want to experiment further you will need to play with the cell wrappers, the tf.nn methods, and seq2seq. If I ever use this for work, I will implement and tune those parts. In any case, the biggest payoff was getting to read the RNN-related code thoroughly.
Have a nice year, everyone.