Grad-CAM and dilated convolution

Introduction

Grad-CAM is a useful way to visualize which parts of an image a model bases its prediction on. One complaint I have, however, is that the Grad-CAM map has a low resolution of (14, 14) compared to the image size of (224, 224). The resolution is this low because, by the last convolution layer, the VGG16 model has gone through a total of 4 poolings. Without the pooling layers, though, plain convolutions can only extract short-range features, not long-range ones. I wondered whether I could get a high-resolution Grad-CAM by using dilated convolution instead, so I built a VGG16-equivalent model that uses dilated convolutions and ran an experiment. As a result, the resolution of the Grad-CAM map increased, but it did not come out as a genuinely high-resolution map. Left: normal Grad-CAM / Right: Grad-CAM using dilated convolution

What is dilated convolution?

As shown in the figure below, dilated convolution slides a filter whose taps are spread out with gaps, like a comb with missing teeth. By increasing dilation_rate, a small filter can cover long distances without any pooling, and because no pooling is used, the feature-map size is not reduced.
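How much a given dilation_rate enlarges the reach of a filter can be checked with a little arithmetic. The sketch below uses the standard receptive-field formula for stride-1 layers; the dilation rates are the ones used in the model later in this article:

```python
# Effective kernel size of a dilated convolution:
# a k x k kernel with dilation d spans d*(k-1) + 1 input pixels.
def effective_kernel(k, d):
    return d * (k - 1) + 1

# With stride 1 everywhere, each stacked layer grows the receptive
# field by (effective_kernel - 1).
def receptive_field(kernel_dilation_pairs):
    rf = 1
    for k, d in kernel_dilation_pairs:
        rf += effective_kernel(k, d) - 1
    return rf

# Dilation rates of the 13 conv layers in the dilated model (all 3x3).
rates = [1, 1, 2, 2, 4, 4, 4, 8, 8, 8, 16, 16, 16]

print(effective_kernel(3, 16))                   # 33: one dilated 3x3 at rate 16 spans 33 pixels
print(receptive_field([(3, d) for d in rates]))  # 181: receptive field of the whole conv stack
```

So a single 3×3 filter at dilation_rate=16 already covers a 33-pixel span, without shrinking the feature map.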

model

I built the model below in Keras. For convenience, I call it the dilated_VGG16 model. By adjusting dilation_rate, it can compute long-range convolutions while keeping the (224, 224) spatial size. As a result, the feature map just before the fully connected layers has a resolution of (224, 224) instead of (14, 14). The final convolution layer is named 'block5_conv3' for the Grad-CAM computation later. Notice that the VGG16 model and the dilated_VGG16 model have the same number of parameters.

python


    from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
    from keras.models import Model

    def build_dilated_model():
        inputs = Input(shape=(224, 224, 3))
        x = Conv2D( 64, (3, 3), padding='same', activation='relu', dilation_rate=1)(inputs)
        x = Conv2D( 64, (3, 3), padding='same', activation='relu', dilation_rate=1)(x)
        x = Conv2D(128, (3, 3), padding='same', activation='relu', dilation_rate=2)(x)
        x = Conv2D(128, (3, 3), padding='same', activation='relu', dilation_rate=2)(x)
        x = Conv2D(256, (3, 3), padding='same', activation='relu', dilation_rate=4)(x)
        x = Conv2D(256, (3, 3), padding='same', activation='relu', dilation_rate=4)(x)
        x = Conv2D(256, (3, 3), padding='same', activation='relu', dilation_rate=4)(x)
        x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=8)(x)
        x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=8)(x)
        x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=8)(x)
        x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=16)(x)
        x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=16)(x)
        x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=16, name='block5_conv3')(x)
        x = MaxPooling2D(pool_size=32)(x)
        x = Flatten()(x)
        x = Dense(4096, activation='relu')(x)
        x = Dense(4096, activation='relu')(x)
        y = Dense(1000, activation='softmax')(x)

        return Model(inputs=inputs, outputs=y)

dilated_VGG16

python


Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 224, 224, 64)      1792
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 224, 224, 64)      36928
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 224, 224, 128)     73856
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 224, 224, 128)     147584
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 224, 224, 256)     295168
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 224, 224, 256)     590080
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 224, 224, 256)     590080
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 224, 224, 512)     1180160
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 224, 224, 512)     2359808
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 224, 224, 512)     2359808
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 224, 224, 512)     2359808
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 224, 224, 512)     2359808
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 224, 224, 512)     2359808
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 512)         0
_________________________________________________________________
flatten_1 (Flatten)          (None, 25088)             0
_________________________________________________________________
dense_1 (Dense)              (None, 4096)              102764544
_________________________________________________________________
dense_2 (Dense)              (None, 4096)              16781312
_________________________________________________________________
dense_3 (Dense)              (None, 1000)              4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________________________________
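The totals in the summary above can be reproduced with a short calculation. Since dilation changes where a kernel samples but not how many weights it has, each Conv2D layer has (3·3·c_in + 1)·c_out parameters, exactly as in VGG16:

```python
# Parameter count of a Conv2D layer: (kernel_h * kernel_w * in_ch + 1) * out_ch
def conv_params(c_in, c_out, k=3):
    return (k * k * c_in + 1) * c_out

# Parameter count of a Dense layer, including biases.
def dense_params(n_in, n_out):
    return (n_in + 1) * n_out

# Channel progression of the 13 conv layers shared by VGG16 and dilated_VGG16.
channels = [(3, 64), (64, 64), (64, 128), (128, 128),
            (128, 256), (256, 256), (256, 256),
            (256, 512), (512, 512), (512, 512),
            (512, 512), (512, 512), (512, 512)]

total = sum(conv_params(ci, co) for ci, co in channels)
total += dense_params(7 * 7 * 512, 4096)   # flatten (25088) -> fc1
total += dense_params(4096, 4096)          # fc1 -> fc2
total += dense_params(4096, 1000)          # fc2 -> predictions

print(total)  # 138357544, matching both model summaries
```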

Reference: VGG16

python


_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312
_________________________________________________________________
predictions (Dense)          (None, 1000)              4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________________________________

Diversion of VGG16 weight

The problem with the dilated_VGG16 model is that, because it does not use pooling, the feature maps stay large and training takes a very long time. The deepest convolution layers operate on feature maps 16 times larger in each spatial dimension than in VGG16, so their compute cost is roughly **16 * 16 = 256 times** higher, and training this model from scratch is probably not realistic. Instead, I wrote the following to transfer the VGG16 weights to dilated_VGG16. This is possible because VGG16 and dilated_VGG16 have exactly the same parameter shapes.

python


    from keras.applications.vgg16 import VGG16

    model1 = build_dilated_model()
    model2 = VGG16(include_top=True, weights='imagenet')

    # The two models have identical weight shapes, so this copies layer by layer
    model1.set_weights(model2.get_weights())
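As a sanity check on the cost estimate, the slowdown of the deepest block comes purely from the number of spatial positions the kernel is applied at, since a dilated 3×3 kernel has the same number of taps as a plain one:

```python
# VGG16 computes block5 on 14x14 feature maps; dilated_VGG16 on 224x224.
# A dilated 3x3 kernel costs the same per position as a plain 3x3 kernel,
# so the cost ratio is just the ratio of spatial positions.
vgg_positions = 14 * 14
dilated_positions = 224 * 224
ratio = dilated_positions // vgg_positions
print(ratio)  # 256, i.e. the 16 * 16 slowdown estimated above
```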

Classification accuracy

I ran the usual image-classification prediction with dilated_VGG16 using the VGG16 weights. Classification accuracy degraded considerably with dilated_VGG16, but the transferred weights still seem to work to some extent.

Prediction with dilated_VGG16 using VGG16 weights

Model prediction:
        Saint_Bernard   (247)   with probability 0.029
        boxer           (242)   with probability 0.026
        whippet         (172)   with probability 0.020
        tiger_cat       (282)   with probability 0.019
        vacuum          (882)   with probability 0.017

Reference: Prediction with VGG16

Model prediction:
        boxer           (242)   with probability 0.420
        bull_mastiff    (243)   with probability 0.282
        tiger_cat       (282)   with probability 0.053
        tiger           (292)   with probability 0.050
        Great_Dane      (246)   with probability 0.050

Grad-CAM results

I plotted the Grad-CAM maps for the boxer prediction. The normal VGG16 Grad-CAM map has only (14, 14) resolution, while the dilated_VGG16 map has (224, 224) resolution. However, a grid pattern appeared, and the result is not genuinely high resolution. Left: normal Grad-CAM / Right: Grad-CAM using dilated convolution
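For reference, the Grad-CAM map itself is simple to compute once you have the activations of the target conv layer and the gradient of the class score with respect to them. A minimal numpy sketch (with random stand-ins for the activations and gradients, which in practice come from the model's forward and backward pass):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for what the backend would give you: activations of
# 'block5_conv3' and the gradient of the class score w.r.t. them.
# For VGG16 this is (14, 14, 512); for dilated_VGG16 it would be
# (224, 224, 512), which is why the map keeps full resolution.
activations = rng.standard_normal((14, 14, 512))
gradients = rng.standard_normal((14, 14, 512))

# Grad-CAM: global-average-pool the gradients to get one weight per
# channel, take the weighted sum of the activation maps, then ReLU.
weights = gradients.mean(axis=(0, 1))       # (512,)
cam = np.maximum(activations @ weights, 0)  # (14, 14)

# Normalize to [0, 1] before upsampling and overlaying on the image.
if cam.max() > 0:
    cam = cam / cam.max()

print(cam.shape)  # (14, 14)
```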

Summary

I wondered whether I could get a high-resolution Grad-CAM by using dilated convolution, but it did not work out. Searching around, the grid pattern produced by dilated convolution appears to be a known problem with proposed remedies, discussed in the following paper (Dilated Residual Networks): https://www.cs.princeton.edu/~funk/drn.pdf (I have not read it in detail ...)
