Environment

tensorflow == 2.2.0, keras == 2.3.1 (the default versions on Google Colab as of 2020.6.10)

Code

All of the code is available on GitHub: https://github.com/milky1210/Segnet The code shown in this article is an excerpt, so please download the repository if you want to actually run it.

Summary of the SegNet paper

Abstract

Semantic segmentation is the problem of inferring what each pixel of an image depicts. When it is solved with deep learning, the feature maps whose resolution was reduced by pooling and similar operations must be restored accurately to the original dimensions. The paper proposes a model that performs this mapping accurately right up to object boundaries.
(Figure: screenshot 2020-06-09 13.29.43.png)

Differences from other studies

Like an ordinary FCN, SegNet reduces the resolution with convolution and pooling layers and then upsamples. When increasing the resolution, however, it uses a technique called pooling indices to keep boundaries from becoming blurred.
(Figure: screenshot 2020-06-09 13.34.45.png)
Here, the encoder and decoder follow the layer layout of VGG16, a well-known image classification model.

Pooling indices

(Figure: screenshot 2020-06-09 13.39.25.png)
As shown in this figure, the position of each maximum is remembered when max pooling is performed, and each feature-map value is moved back to that position during upsampling.
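
To make the idea of pooling indices concrete, here is a minimal pure-Python sketch on a single 4x4 feature map. The function names `max_pool_with_indices` and `max_unpool` are made up for this illustration and are not taken from the paper or the repository.

```python
def max_pool_with_indices(fmap, k=2):
    """k x k max pooling on a 2-D list; also returns the (row, col) of each max."""
    h, w = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for i in range(0, h, k):
        prow, irow = [], []
        for j in range(0, w, k):
            # all (value, position) candidates inside the current window
            window = [(fmap[y][x], (y, x))
                      for y in range(i, i + k) for x in range(j, j + k)]
            value, pos = max(window, key=lambda t: t[0])
            prow.append(value)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices


def max_unpool(pooled, indices, out_h, out_w):
    """Put each pooled value back at its remembered position; zeros elsewhere."""
    out = [[0] * out_w for _ in range(out_h)]
    for prow, irow in zip(pooled, indices):
        for value, (y, x) in zip(prow, irow):
            out[y][x] = value
    return out


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 2, 3, 1]]
pooled, indices = max_pool_with_indices(fmap)
restored = max_unpool(pooled, indices, 4, 4)
```

Here `pooled` is the 2x2 map of window maxima, and `restored` is a sparse 4x4 map in which each maximum sits where it was found. Plain upsampling would instead copy each value into all four cells of its window, which is where the blurring of boundaries comes from.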

Performance comparison using VOC12

What is VOC12?

VOC12 is a dataset for tasks such as image recognition, object detection, and segmentation. It is also used in the SegNet paper for performance verification. You can download it from here.

Once downloaded, VOCdevkit/VOC2012/ contains the directories JPEGImages/ and SegmentationObject/. Training and validation use the files in JPEGImages/ as input images and the files in SegmentationObject/ as output images.

The files JPEGImages/~.jpg and SegmentationObject/~.png correspond to each other by file name. Each pixel is classified into one of 22 classes, including background and boundaries.
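
The correspondence by file name can be sketched as follows. This is only an illustration under the directory layout described above: the helper name `pair_voc_files` is hypothetical, and the repository may load the data differently.

```python
from pathlib import Path


def pair_voc_files(voc_root):
    """Return sorted (image_path, mask_path) pairs that share a file-name stem.

    voc_root is assumed to point at VOCdevkit/VOC2012.
    """
    root = Path(voc_root)
    image_dir = root / "JPEGImages"
    mask_dir = root / "SegmentationObject"
    pairs = []
    for mask_path in sorted(mask_dir.glob("*.png")):
        image_path = image_dir / (mask_path.stem + ".jpg")
        # skip masks without a matching image instead of failing
        if image_path.exists():
            pairs.append((image_path, mask_path))
    return pairs
```

Iterating over the masks rather than the images means that images without a segmentation mask are simply never paired.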

Implementation

This article covers only the model definition, the loss function definition, and training. Training and validation are both performed at a resolution of 64x64.

Model definition

First, as a baseline for comparison, an encoder-decoder network without pooling indices (SegNet minus the indices) is modeled after VGG16 as follows.

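# assumed imports; the repository may import Keras differently
from tensorflow.keras import layers, models

def build_FCN():
  ffc = 32  # number of filters in the first block
  inputs = layers.Input(shape=(64,64,3))
  x = inputs
  for i in range(2):
    # feed the previous output (not `inputs`) so that both convolutions are stacked
    x = layers.Conv2D(ffc,kernel_size=3,padding="same")(x)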
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
  x = layers.MaxPooling2D((2,2))(x)
  for i in range(2):
    x = layers.Conv2D(ffc*2,kernel_size=3,padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
  x = layers.MaxPooling2D((2,2))(x)
  for i in range(3):
    x = layers.Conv2D(ffc*4,kernel_size=3,padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
  x = layers.MaxPooling2D((2,2))(x)
  for i in range(3):
    x = layers.Conv2D(ffc*8,kernel_size=3,padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
  x = layers.MaxPooling2D((2,2))(x)
  for i in range(3):
    x = layers.Conv2D(ffc*8,kernel_size=3,padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
  x = layers.UpSampling2D((2,2))(x)
  for i in range(3):
    x = layers.Conv2D(ffc*4,kernel_size=3,padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
  x = layers.UpSampling2D((2,2))(x)
  for i in range(3):
    x = layers.Conv2D(ffc*2,kernel_size=3,padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
  x = layers.UpSampling2D((2,2))(x)
  for i in range(2):
    x = layers.Conv2D(ffc*2,kernel_size=3,padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
  x = layers.UpSampling2D((2,2))(x)
  for i in range(2):
    x = layers.Conv2D(ffc,kernel_size=3,padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
  x = layers.Conv2D(22,kernel_size=3,padding="same",activation="softmax")(x)
  return models.Model(inputs,x)

Modeled after VGG16, the network has the structure above, with 24 convolution layers. Note that MaxPooling2D shrinks the feature maps and UpSampling2D enlarges them. Next, let's look at how SegNet differs from this model. When SegNet performs max pooling, it also keeps the argmax of each pooling window, that is, the position each maximum came from. Keras has no layer for this, so TensorFlow's tf.nn.max_pool_with_argmax is used instead, which means a custom Keras layer has to be written. Defined as below, it works as a layer inside a Keras model.

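# assumed imports; the repository may import Keras differently
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer

class MaxPoolingWithArgmax2D(Layer):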
    def __init__(self):
        super(MaxPoolingWithArgmax2D,self).__init__()
    def call(self,inputs):
        output,argmax = tf.nn.max_pool_with_argmax(inputs,ksize=[1,2,2,1],strides=[1,2,2,1],padding='SAME')
        argmax = K.cast(argmax,K.floatx())
        return [output,argmax]
    def compute_output_shape(self,input_shape):
        ratio = (1,2,2,1)
        output_shape = [dim//ratio[idx] if dim is not None else None for idx, dim in enumerate(input_shape)]
        output_shape = tuple(output_shape)
        return [output_shape,output_shape]

Next, define a layer that, during upsampling, returns each value to the position recorded by the argmax (this one is fairly long).

class MaxUnpooling2D(Layer):
    def __init__(self):
        super(MaxUnpooling2D,self).__init__()
    def call(self,inputs,output_shape = None):
        updates, mask = inputs[0],inputs[1]
        # tf.variable_scope is not in the TF 2.x top-level API; tf.name_scope
        # is sufficient here because this layer creates no variables
        with tf.name_scope(self.name):
            mask = K.cast(mask, 'int32')
            input_shape = tf.shape(updates, out_type='int32')
            # compute the output shape (twice the input height and width)
            if output_shape is None:
                output_shape = (input_shape[0],input_shape[1]*2,input_shape[2]*2,input_shape[3])
            self.output_shape1 = output_shape
            # compute indices for batch, height, width and feature maps
            one_like_mask = K.ones_like(mask, dtype='int32')
            batch_shape = K.concatenate([[input_shape[0]], [1 ], [1], [1]],axis=0)
            batch_range = K.reshape(tf.range(output_shape[0], dtype='int32'),shape=batch_shape)
            b = one_like_mask * batch_range
            y = mask // (output_shape[2] * output_shape[3])
            x = (mask // output_shape[3]) % output_shape[2]
            feature_range = tf.range(output_shape[3], dtype='int32')
            f = one_like_mask * feature_range

            # transpose indices & reshape update values to one dimension
            updates_size = tf.size(updates)
            indices = K.transpose(K.reshape(
                K.stack([b, y, x, f]),
                [4, updates_size]))
            values = K.reshape(updates, [updates_size])
            ret = tf.scatter_nd(indices, values, output_shape)
            return ret
    def compute_output_shape(self,input_shape):
        shape = input_shape[1]
        return (shape[0],shape[1]*2,shape[2]*2,shape[3])
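
The y and x computations in this layer rely on how tf.nn.max_pool_with_argmax flattens positions: without the batch dimension, an index is (y * width + x) * channels + c. The following pure-Python sketch checks that arithmetic in isolation. The helper names are hypothetical and exist only for this illustration.

```python
def encode_flat_index(y, x, c, width, channels):
    """Flatten (y, x, c) the same way max_pool_with_argmax does per image."""
    return (y * width + x) * channels + c


def decode_flat_index(index, width, channels):
    """Inverse of encode_flat_index; mirrors the y and x formulas in the layer."""
    y = index // (width * channels)
    x = (index // channels) % width
    c = index % channels
    return y, x, c


# round trip on a small 4 x 6 x 3 feature map
idx = encode_flat_index(3, 5, 2, width=6, channels=3)
y, x, c = decode_flat_index(idx, width=6, channels=3)
```

In the layer itself the channel index is not decoded from the mask. It is taken from the position along the last axis, which gives the same value as long as the tensor being unpooled has the same number of channels as the tensor that was pooled.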

Using these layers, SegNet is defined as follows.

def build_Segnet():
    ffc = 32
    inputs = layers.Input(shape=(64,64,3))
    x = inputs
    for i in range(2):
      # feed the previous output (not `inputs`) so that both convolutions are stacked
      x = layers.Conv2D(ffc,kernel_size=3,padding="same")(x)
      x = layers.BatchNormalization()(x)
      x = layers.ReLU()(x)
    x,x1 = MaxPoolingWithArgmax2D()(x)
    for i in range(2):
      x = layers.Conv2D(ffc*2,kernel_size=3,padding="same")(x)
      x = layers.BatchNormalization()(x)
      x = layers.ReLU()(x)
    x,x2 = MaxPoolingWithArgmax2D()(x)
    for i in range(3):
      x = layers.Conv2D(ffc*4,kernel_size=3,padding="same")(x)
      x = layers.BatchNormalization()(x)
      x = layers.ReLU()(x)
    x,x3 = MaxPoolingWithArgmax2D()(x)
    for i in range(3):
      x = layers.Conv2D(ffc*8,kernel_size=3,padding="same")(x)
      x = layers.BatchNormalization()(x)
      x = layers.ReLU()(x)
    x,x4 = MaxPoolingWithArgmax2D()(x)
    for i in range(3):
      x = layers.Conv2D(ffc*8,kernel_size=3,padding="same")(x)
      x = layers.BatchNormalization()(x)
      x = layers.ReLU()(x)
    x = layers.Dropout(rate = 0.5)(x)
    x = MaxUnpooling2D()([x,x4])
    for i in range(3):
      x = layers.Conv2D(ffc*4,kernel_size=3,padding="same")(x)
      x = layers.BatchNormalization()(x)
      x = layers.ReLU()(x)
    x = MaxUnpooling2D()([x,x3])
    for i in range(3):
      x = layers.Conv2D(ffc*2,kernel_size=3,padding="same")(x)
      x = layers.BatchNormalization()(x)
      x = layers.ReLU()(x)
    x = MaxUnpooling2D()([x,x2])
    for i in range(2):
      x = layers.Conv2D(ffc,kernel_size=3,padding="same")(x)
      x = layers.BatchNormalization()(x)
      x = layers.ReLU()(x)
    x = MaxUnpooling2D()([x,x1])
    for i in range(2):
      x = layers.Conv2D(ffc,kernel_size=3,padding="same")(x)
      x = layers.BatchNormalization()(x)
      x = layers.ReLU()(x)
    x = layers.Conv2D(22,kernel_size=3,padding="same",activation="softmax")(x)
    return models.Model(inputs,x)

Loss function and optimization

The loss function is the cross entropy computed at each pixel. Adam (lr = 0.001, beta_1 = 0.9, beta_2 = 0.999) was used for optimization.
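
As a rough illustration of what a per-pixel cross entropy computes, here is a minimal pure-Python sketch. It assumes the loss is averaged over all pixels and that the labels are integer class ids. The function name and the `eps` clipping value are choices made for this example, not details from the repository.

```python
import math


def pixelwise_cross_entropy(probs, labels, eps=1e-7):
    """Mean of -log p(true class) over all pixels.

    probs:  H x W x C nested lists of softmax outputs.
    labels: H x W nested lists of integer class ids.
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for pixel_probs, label in zip(prob_row, label_row):
            # clip to avoid log(0) when the model assigns zero probability
            p = min(max(pixel_probs[label], eps), 1.0)
            total -= math.log(p)
            count += 1
    return total / count


# a 1 x 2 "image" with 3 classes
probs = [[[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]]
labels = [[0, 1]]
loss = pixelwise_cross_entropy(probs, labels)
```

In Keras, applying a loss such as categorical_crossentropy to the (64, 64, 22) softmax output with one-hot labels yields the same per-pixel quantity. Whether the repository uses exactly that loss is an assumption.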

Results

We checked how much the results change with and without pooling indices. The training loss and the mean per-pixel accuracy were plotted. First, the results for the model without pooling indices:
(Figure: acc (1).png)