Image recognition is a technology that detects objects and features, such as characters and faces, in images and videos.
More specifically, it covers a variety of recognition technologies: classification of images, estimation of object positions (upper figure), segmentation of images into regions (lower figure), and so on.
In 2012, a team at the University of Toronto announced a study on high-precision image recognition using deep learning. Interest in deep learning has grown ever since, and it is now in practical use in fields such as character recognition, face recognition, autonomous driving, and domestic robots.
In this post, we will learn about the technology called the CNN (Convolutional Neural Network).
CNN
What is CNN (Convolutional Neural Network)?
A CNN is a neural network that extracts features using a layer called the "convolution layer", which has a structure modeled on the visual cortex of the human brain.
Compared with the neural networks built only from fully connected layers covered in the deep learning basics course, it demonstrates higher performance in fields such as image recognition.
Like the fully connected layer, the convolution layer extracts features. Unlike the fully connected layer, however, it can process image data while it remains two-dimensional, which makes it excellent at extracting 2D features such as lines and corners.
Also, in a CNN, a layer called the "pooling layer" is often used together with the convolution layer. The pooling layer reduces the information obtained from the convolution layer, and in the end the images are classified.
From the next session, we will learn about each layer, build a CNN model as shown in Fig. 2, and actually classify the images.
As shown in the figure, the convolution layer focuses on one part of the input data at a time and examines the features of that part of the image.
Which features to focus on is learned automatically, as long as the training data and loss function are defined appropriately.
For example, in a CNN that recognizes faces, if learning progresses properly, convolution layers close to the input layer come to focus on low-level concepts such as lines and points, while layers closer to the output layer focus on higher-level concepts such as eyes and noses.
(In fact, higher-order concepts such as eyes and noses are not detected directly from the original input image; they are detected from the positional combinations of the lower-order concepts found in the layers close to the input layer.)
Each feature of interest is handled internally as a weight matrix called a filter (kernel), with one filter used per feature.
In the figure below, an image of 9 x 9 x 3 (height x width x channels, the 3 channels being R, G, B) is convolved with a 3 x 3 x 3 (height x width x channels) filter.
One 3x3x3 filter creates a new 4x4x1 feature map (something like a monochrome image). Using several different filters, a total of N 4x4x1 maps are created, so overall this convolution layer transforms the 9x9x3 image into a 4x4xN feature map.
(In this session, including the exercises below, 2D filters are often used as examples to explain the convolution layer, but in practice 3D filters like the one in the figure below are common.)
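As a quick sanity check on these shapes (a sketch added here, not from the original figure): the spatial output size of a convolution is floor((input + 2*padding - kernel) / stride) + 1, so producing a 4x4 map from a 9x9 input with a 3x3 filter implies a stride of 2 with no padding.

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # floor((n + 2p - k) / s) + 1
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(9, 3, stride=2))  # 4
print(conv_output_size(9, 3, stride=1))  # 7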
Here is a simple implementation example. Let's implement it without using Keras + TensorFlow.
import numpy as np
import matplotlib.pyplot as plt

# Defines a very simple convolution layer.
class Conv:
    # To keep the example simple, W is fixed at 3x3, and strides and padding
    # (covered in later sessions) are not considered.
    def __init__(self, W):
        self.W = W

    def f_prop(self, X):
        out = np.zeros((X.shape[0]-2, X.shape[1]-2))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                x = X[i:i+3, j:j+3]
                # Sum of the element-wise products of the window and the kernel
                out[i,j] = np.dot(self.W.flatten(), x.flatten())
        return out
X = np.load('./5100_cnn_data/circle.npy')
plt.imshow(X)
plt.title("base image", fontsize=12)
plt.show()
# Define the kernels (each one picks up a different line orientation).
W1 = np.array([[0,1,0],
[0,1,0],
[0,1,0]])
W2 = np.array([[0,0,0],
[1,1,1],
[0,0,0]])
W3 = np.array([[1,0,0],
[0,1,0],
[0,0,1]])
W4 = np.array([[0,0,1],
[0,1,0],
[1,0,0]])
plt.subplot(1,4,1); plt.imshow(W1)
plt.subplot(1,4,2); plt.imshow(W2)
plt.subplot(1,4,3); plt.imshow(W3)
plt.subplot(1,4,4); plt.imshow(W4)
plt.suptitle("kernels", fontsize=12)
plt.show()
#Convolution
conv1 = Conv(W1); C1 = conv1.f_prop(X)
conv2 = Conv(W2); C2 = conv2.f_prop(X)
conv3 = Conv(W3); C3 = conv3.f_prop(X)
conv4 = Conv(W4); C4 = conv4.f_prop(X)
plt.subplot(1,4,1); plt.imshow(C1)
plt.subplot(1,4,2); plt.imshow(C2)
plt.subplot(1,4,3); plt.imshow(C3)
plt.subplot(1,4,4); plt.imshow(C4)
plt.suptitle("convolution results", fontsize=12)
plt.show()
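As an optional cross-check (not part of the original exercise, and assuming SciPy is available): the f_prop loop above computes a cross-correlation, so its output should match scipy.signal.correlate2d with mode='valid'.

from scipy.signal import correlate2d
print(np.allclose(C1, correlate2d(X, W1, mode='valid')))  # expected: True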
As shown in the figure, the pooling layer reduces the output of the convolution layer, shrinking the amount of data.
As shown in the figure, data can be compressed through operations such as:
Max pooling: take the maximum value of each subsection of the feature map
Average pooling: take the average of each subsection of the feature map
The convolution covered in the "Convolution Layer" session lets you examine the distribution of features in an image. The same features often cluster in similar locations, and regions where no features are found at all can be widespread, so the feature map output by the convolution layer is wasteful relative to the amount of data it holds.
Pooling trims this waste, compressing the data while keeping the loss of information small.
On the other hand, pooling discards detailed position information. Conversely, this means the features extracted by the pooling layer are unaffected by small translations of the original image, which gives the model robustness.
For example, when classifying handwritten digits in a photo, the exact position of the digits is not important; by removing such unimportant information, pooling lets you build a model that is resistant to shifts in the position of the target object within the input image.
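A tiny constructed illustration of this robustness (the arrays here are assumptions for this note, not from the original text): shifting a feature by one pixel within its pooling block leaves the max-pooled output unchanged.

import numpy as np

# Two 4x4 maps; each nonzero entry is shifted by one pixel in the second map,
# but stays inside the same 2x2 pooling block.
X1 = np.array([[1, 0, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 9, 0]])
X2 = np.array([[0, 1, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 9]])

def maxpool2(X):
    # 2x2 max pooling with stride 2, done with a reshape
    return X.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(np.array_equal(maxpool2(X1), maxpool2(X2)))  # True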
The figure below shows a 5x5 (vertical x horizontal) feature map being pooled every 3x3 (vertical x horizontal).
Max pooling
Average pooling
Here is a simple implementation.
import numpy as np
import matplotlib.pyplot as plt

# Defines a very simple convolution layer.
class Conv:
    # To keep the example simple, W is fixed at 3x3, and strides and padding
    # (covered in later sessions) are not considered.
    def __init__(self, W):
        self.W = W

    def f_prop(self, X):
        out = np.zeros((X.shape[0]-2, X.shape[1]-2))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                x = X[i:i+3, j:j+3]
                out[i,j] = np.dot(self.W.flatten(), x.flatten())
        return out

# Defines a very simple pooling layer.
class Pool:
    # To keep the example simple, strides and padding (covered in later
    # sessions) are not considered.
    def __init__(self, l):
        self.l = l

    def f_prop(self, X):
        l = self.l
        out = np.zeros((X.shape[0]//l, X.shape[1]//l))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # Max pooling: take the maximum of each l x l block.
                out[i,j] = np.max(X[i*l:(i+1)*l, j*l:(j+1)*l])
        return out
X = np.load('./5100_cnn_data/circle.npy')
plt.imshow(X)
plt.title("base image", fontsize=12)
plt.show()
#kernel
W1 = np.array([[0,1,0],
[0,1,0],
[0,1,0]])
W2 = np.array([[0,0,0],
[1,1,1],
[0,0,0]])
W3 = np.array([[1,0,0],
[0,1,0],
[0,0,1]])
W4 = np.array([[0,0,1],
[0,1,0],
[1,0,0]])
#Convolution
conv1 = Conv(W1); C1 = conv1.f_prop(X)
conv2 = Conv(W2); C2 = conv2.f_prop(X)
conv3 = Conv(W3); C3 = conv3.f_prop(X)
conv4 = Conv(W4); C4 = conv4.f_prop(X)
plt.subplot(1,4,1); plt.imshow(C1)
plt.subplot(1,4,2); plt.imshow(C2)
plt.subplot(1,4,3); plt.imshow(C3)
plt.subplot(1,4,4); plt.imshow(C4)
plt.suptitle("convolution images", fontsize=12)
plt.show()
#Pooling
pool = Pool(2)
P1 = pool.f_prop(C1)
P2 = pool.f_prop(C2)
P3 = pool.f_prop(C3)
P4 = pool.f_prop(C4)
plt.subplot(1,4,1); plt.imshow(P1)
plt.subplot(1,4,2); plt.imshow(P2)
plt.subplot(1,4,3); plt.imshow(P3)
plt.subplot(1,4,4); plt.imshow(P4)
plt.suptitle("pooling results", fontsize=12)
plt.show()
The bottom figure is the result of max pooling.
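For comparison, the average pooling mentioned earlier can be sketched by replacing np.max with np.mean in the Pool class above (an illustration, not part of the original exercise):

class AvgPool:
    def __init__(self, l):
        self.l = l

    def f_prop(self, X):
        l = self.l
        out = np.zeros((X.shape[0]//l, X.shape[1]//l))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # Average pooling: take the mean of each l x l block.
                out[i, j] = np.mean(X[i*l:(i+1)*l, j*l:(j+1)*l])
        return out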
Now let's implement a CNN using Keras + TensorFlow.
In practice, models are usually implemented with these libraries. In Keras, you first create an instance that manages the model, then define its layers one by one with the add method.
First, create an instance:
model = Sequential()
Layers are added one by one with the add method, as shown below. For example, a fully connected layer is defined as follows:
model.add(Dense(128))
A convolution layer is added as follows (its parameters are covered in a later session):
model.add(Conv2D(filters=64, kernel_size=(3, 3)))
A pooling layer is added as follows (its parameters are covered in a later session):
model.add(MaxPooling2D(pool_size=(2, 2)))
Finally, compile the model to finish generating the neural network:
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
The following outputs a table summarizing the model structure:
model.summary()
Here is a simple example.
from keras.layers import Activation, Conv2D, Dense, Flatten, MaxPooling2D
from keras.models import Sequential, load_model
from keras.utils.np_utils import to_categorical
#Model definition
model = Sequential()
#Implementation example
# --------------------------------------------------------------
model.add(Conv2D(input_shape=(28, 28, 1), filters=32, kernel_size=(2, 2), strides=(1, 1), padding="same"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding="same"))
model.add(Conv2D(filters=32, kernel_size=(2, 2), strides=(1, 1), padding="same"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(1,1)))
# --------------------------------------------------------------
model.add(Flatten())
model.add(Dense(256))
model.add(Activation('sigmoid'))
model.add(Dense(128))
model.add(Activation('sigmoid'))
model.add(Dense(10))
model.add(Activation('softmax'))
model.summary()
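The example above only defines the model and prints its summary. To actually train it, you would compile and fit it. A minimal sketch (the optimizer choice here is a placeholder assumption; X_train and y_train would be prepared as in the MNIST section that follows):

model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=128, epochs=1)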
MNIST is a dataset of handwritten digits, as shown in the figure below. Each image is 28 x 28 pixels in size with 1 channel (monochrome), and each has a class label from 0 to 9.
We will use a CNN to classify the MNIST dataset.
Here is an implementation example.
from keras.datasets import mnist
from keras.layers import Dense, Dropout, Flatten, Activation
from keras.layers import Conv2D, MaxPooling2D
from keras.models import Sequential, load_model
from keras.utils.np_utils import to_categorical
from keras.utils.vis_utils import plot_model
import numpy as np
import matplotlib.pyplot as plt
#Data loading
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# This time, we use 300 samples for training and 100 for testing.
# The Conv layer expects a 4D array (batch size x height x width x channels).
# MNIST data is originally a 3D array (the images are not RGB, so there is no channel axis), so we reshape it to 4D in advance.
X_train = X_train[:300].reshape(-1, 28, 28, 1)
X_test = X_test[:100].reshape(-1, 28, 28, 1)
y_train = to_categorical(y_train)[:300]
y_test = to_categorical(y_test)[:100]
#Model definition
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(3, 3),input_shape=(28,28,1)))
model.add(Activation('relu'))
model.add(Conv2D(filters=64, kernel_size=(3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adadelta',
metrics=['accuracy'])
model.fit(X_train, y_train,
batch_size=128,
epochs=1,
verbose=1,
validation_data=(X_test, y_test))
#Evaluation of accuracy
scores = model.evaluate(X_test, y_test, verbose=1)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])
# Visualization of the first 10 test images
for i in range(10):
    plt.subplot(2, 5, i+1)
    plt.imshow(X_test[i].reshape((28,28)), 'gray')
plt.suptitle("10 images of test data", fontsize=20)
plt.show()
# Prediction for the first 10 test images
pred = np.argmax(model.predict(X_test[0:10]), axis=1)
print(pred)
model.summary()
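As a small follow-up (not part of the original exercise): since y_test is one-hot encoded, the true labels of the same 10 images can be recovered with argmax and compared against pred.

true_labels = np.argmax(y_test[0:10], axis=1)
print(true_labels)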
CIFAR-10 is a dataset of images showing 10 kinds of objects, as shown in the picture below.
Each image is 32 x 32 pixels in size with 3 channels (R, G, B), and each has a class label from 0 to 9. The objects corresponding to the class labels are:
0: airplane, 1: car, 2: bird, 3: cat, 4: deer, 5: dog, 6: frog, 7: horse, 8: ship, 9: truck. We will use a CNN to classify the CIFAR-10 dataset.
import keras
from keras.datasets import cifar10
from keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.models import Sequential, load_model
from keras.utils.np_utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt
#Data loading
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
# This time, we use 300 samples for training and 100 for testing.
X_train = X_train[:300]
X_test = X_test[:100]
y_train = to_categorical(y_train)[:300]
y_test = to_categorical(y_test)[:100]
#Model definition
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',
input_shape=X_train.shape[1:]))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))
#compile
opt = keras.optimizers.rmsprop(lr=0.0001, decay=1e-6)
model.compile(loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
# Training takes a few minutes, so you could instead load weights obtained by training in advance:
# model.load_weights('./cnn_data/param_cifar10.hdf5')
#Learning
model.fit(X_train, y_train, batch_size=32, epochs=1)
# Use the following to save the weights (it cannot be executed here):
# model.save_weights('param_cifar10.hdf5')
#Evaluation of accuracy
scores = model.evaluate(X_test, y_test, verbose=1)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])
# Visualization of the first 10 test images
for i in range(10):
    plt.subplot(2, 5, i+1)
    plt.imshow(X_test[i])
plt.suptitle("10 images of test data", fontsize=20)
plt.show()
# Prediction for the first 10 test images
pred = np.argmax(model.predict(X_test[0:10]), axis=1)
print(pred)
model.summary()
The filters parameter of the convolution layer specifies the number of feature maps to generate, that is, the number of kinds of features to extract.
In the figure below, filters is 20 in the first convolution layer, and also 20 in the second convolution layer.
If filters is too small, the required features cannot be extracted and learning will not proceed well. On the other hand, if it is too large, the model will overfit easily, so be careful.
Let's implement it without using Keras + TensorFlow.
import numpy as np
import matplotlib.pyplot as plt

# Defines a very simple convolution layer.
# Only convolution of a 1-channel image is assumed.
# To keep the example simple, the kernel is fixed at 3x3, and strides and
# padding are not considered.
class Conv:
    def __init__(self, filters):
        self.filters = filters
        self.W = np.random.rand(filters, 3, 3)

    def f_prop(self, X):
        out = np.zeros((self.filters, X.shape[0]-2, X.shape[1]-2))
        for k in range(self.filters):
            for i in range(out[0].shape[0]):
                for j in range(out[0].shape[1]):
                    x = X[i:i+3, j:j+3]
                    out[k, i, j] = np.dot(self.W[k].flatten(), x.flatten())
        return out
X = np.load('./5100_cnn_data/circle.npy')
filters=10
#Generation of convolutional layer
conv = Conv(filters=filters)
#Performing convolution
C = conv.f_prop(X)
# --------------------------------------------------------------
#Below is all the code for visualization.
# --------------------------------------------------------------
plt.imshow(X)
plt.title('base image', fontsize=12)
plt.show()
plt.figure(figsize=(5,2))
for i in range(filters):
    plt.subplot(2, filters//2, i+1)  # integer division: subplot needs an int
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(conv.W[i])
plt.suptitle('kernels', fontsize=12)
plt.show()

plt.figure(figsize=(5,2))
for i in range(filters):
    plt.subplot(2, filters//2, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(C[i])
plt.suptitle('convolution results', fontsize=12)
plt.show()
The kernel_size parameter of the convolution layer specifies the size of the kernel (the weight matrix used for convolution).
As mentioned above, the feature map is computed from the sums of element-wise products of the input data and the kernel. The figure below shows a 3x3 kernel; each element takes whatever value makes the convolution work best.
Also, in the figure below, kernel_size is 5x5 for the first convolution layer.
If kernel_size is too small, even very small features cannot be detected, and learning cannot proceed well.
Conversely, if it is too large, even large features, which should be detected as combinations of small features, end up detected directly. This fails to exploit the strength of neural network models at capturing hierarchical structure and results in an inefficient model.
Here is an implementation example.
import numpy as np
import matplotlib.pyplot as plt

# Defines a very simple convolution layer.
# Only convolution of a 1-channel image is assumed.
# To keep the example simple, strides and padding are not considered.
class Conv:
    def __init__(self, filters, kernel_size):
        self.filters = filters
        self.kernel_size = kernel_size
        self.W = np.random.rand(filters, kernel_size[0], kernel_size[1])

    def f_prop(self, X):
        k_h, k_w = self.kernel_size
        out = np.zeros((self.filters, X.shape[0]-k_h+1, X.shape[1]-k_w+1))
        for k in range(self.filters):
            for i in range(out[0].shape[0]):
                for j in range(out[0].shape[1]):
                    x = X[i:i+k_h, j:j+k_w]
                    out[k,i,j] = np.dot(self.W[k].flatten(), x.flatten())
        return out
X = np.load('./5100_cnn_data/circle.npy')
#Convolution 1
filters = 4
kernel_size = (3,3)
#Generation of convolutional layer
conv1 = Conv(filters=filters, kernel_size=kernel_size)
#Performing convolution
C1 = conv1.f_prop(X)
#Convolution 2
filters = 4
kernel_size = (6,6)
#Generation of convolutional layer
conv2 = Conv(filters=filters, kernel_size=kernel_size)
#Performing convolution
C2 = conv2.f_prop(X)
#Below is all the code for visualization
plt.imshow(X)
plt.title('base image', fontsize=12)
plt.show()
plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(conv1.W[i])
plt.suptitle('kernel visualization', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(C1[i])
plt.suptitle('convolution results 1', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(conv2.W[i])
plt.suptitle('kernel visualization', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(C2[i])
plt.suptitle('convolution results 2', fontsize=12)
plt.show()
The strides parameter of the convolution layer specifies the interval at which features are extracted, in other words, the distance the kernel moves at each step.
strides=(1,1)
strides=(2,2)
The smaller the strides, the more finely features can be extracted, but the same feature in the same place in the image gets detected multiple times, which looks like a lot of wasted computation.
Nevertheless, smaller strides are generally considered better. In Keras Conv2D layers, strides defaults to (1,1).
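A quick numeric illustration (the 9x9 input size is an assumption for this note): with a 3x3 kernel and no padding, the output width is (9 - 3) // s + 1 for stride s.

for s in (1, 2):
    print((9 - 3) // s + 1)  # 7 with strides=(1,1), 4 with strides=(2,2)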
import numpy as np
import matplotlib.pyplot as plt

# Defines a very simple convolution layer.
# Only convolution of a 1-channel image is assumed.
# To keep the example simple, padding is not considered.
class Conv:
    def __init__(self, filters, kernel_size, strides):
        self.filters = filters
        self.kernel_size = kernel_size
        self.strides = strides
        self.W = np.random.rand(filters, kernel_size[0], kernel_size[1])

    def f_prop(self, X):
        k_h, k_w = self.kernel_size
        s_h, s_w = self.strides
        out = np.zeros((self.filters, (X.shape[0]-k_h)//s_h+1, (X.shape[1]-k_w)//s_w+1))
        for k in range(self.filters):
            for i in range(out[0].shape[0]):
                for j in range(out[0].shape[1]):
                    x = X[i*s_h:i*s_h+k_h, j*s_w:j*s_w+k_w]
                    out[k,i,j] = np.dot(self.W[k].flatten(), x.flatten())
        return out
X = np.load('./5100_cnn_data/circle.npy')
#Convolution 1
filters = 4
kernel_size = (3,3)
strides = (1,1)
#Generation of convolutional layer
conv1 = Conv(filters=filters, kernel_size=kernel_size, strides=strides)
#Performing convolution
C1 = conv1.f_prop(X)
#Convolution 2
filters = 4
kernel_size = (3,3)
strides = (2,2)
#Generation of convolutional layer
conv2 = Conv(filters=filters, kernel_size=kernel_size, strides=strides)
conv2.W = conv1.W  # Use the same kernels as conv1 for comparison
#Performing convolution
C2 = conv2.f_prop(X)
#Below is all the code for visualization
plt.imshow(X)
plt.title('base image', fontsize=12)
plt.show()
plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(conv1.W[i])
plt.suptitle('kernel visualization', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(C1[i])
plt.suptitle('convolution results 1', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(conv2.W[i])
plt.suptitle('kernel visualization', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(C2[i])
plt.suptitle('convolution results 2', fontsize=12)
plt.show()
Padding means adding pixels around the input image, for example to keep the image from shrinking when it is convolved.
In general, the added pixels are set to 0; filling the area around the input image with 0 is called zero padding.
Padding has other merits as well: the features at the edges of the data are taken into account, the edge pixels contribute to more updates, and the input/output sizes of each layer can be adjusted.
The white frame around the orange panel in the figure below represents the padding; here one row of padding is added at the top and bottom, and one column at the left and right.
In Keras' Conv2D layer, the padding method is specified as padding='valid' or padding='same'.
padding='valid': no padding is applied.
padding='same': the input is padded so that the output feature map matches the size of the input.
In the code below, the padding width is instead taken as an argument, such as padding=(1,1).
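One quick note before the example (a standard property, stated here for stride 1 and odd kernel sizes only): the padding width per side that reproduces 'same' output is (k - 1) // 2 for kernel size k.

def same_padding_width(kernel_size):
    # padding per side that keeps the output the same size at stride 1
    return (kernel_size - 1) // 2

print(same_padding_width(3))  # 1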
import numpy as np
import matplotlib.pyplot as plt

# Defines a very simple convolution layer.
# Only convolution of a 1-channel image is assumed.
class Conv:
    def __init__(self, filters, kernel_size, strides, padding):
        self.filters = filters
        self.kernel_size = kernel_size
        self.strides = strides
        self.padding = padding
        self.W = np.random.rand(filters, kernel_size[0], kernel_size[1])

    def f_prop(self, X):
        k_h, k_w = self.kernel_size
        s_h, s_w = self.strides
        p_h, p_w = self.padding
        out = np.zeros((self.filters, (X.shape[0]+p_h*2-k_h)//s_h+1, (X.shape[1]+p_w*2-k_w)//s_w+1))
        # Zero padding
        X = np.pad(X, ((p_h, p_h), (p_w, p_w)), 'constant', constant_values=((0,0),(0,0)))
        self.X = X  # saved for visualizing the padding result later
        for k in range(self.filters):
            for i in range(out[0].shape[0]):
                for j in range(out[0].shape[1]):
                    x = X[i*s_h:i*s_h+k_h, j*s_w:j*s_w+k_w]
                    out[k,i,j] = np.dot(self.W[k].flatten(), x.flatten())
        return out
X = np.load('./5100_cnn_data/circle.npy')
#Convolution 1
filters = 4
kernel_size = (3,3)
strides = (1,1)
padding = (0,0)
#Generation of convolutional layer
conv1 = Conv(filters=filters, kernel_size=kernel_size, strides=strides, padding=padding)
#Performing convolution
C1 = conv1.f_prop(X)
#Convolution 2
filters = 4
kernel_size = (3,3)
strides = (1,1)
padding = (2,2)
#Generation of convolutional layer
conv2 = Conv(filters=filters, kernel_size=kernel_size, strides=strides, padding=padding)
conv2.W = conv1.W  # Use the same kernels as conv1 for comparison
#Performing convolution
C2 = conv2.f_prop(X)
#Below is all the code for visualization.
plt.imshow(conv1.X)
plt.title('padding results of the convolution 1', fontsize=12)
plt.show()
plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(conv1.W[i])
plt.suptitle('kernel visualization of the convolution 1', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(C1[i])
plt.suptitle('results of the convolution 1', fontsize=12)
plt.show()

plt.imshow(conv2.X)
plt.title('padding results of the convolution 2', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(conv2.W[i])
plt.suptitle('kernel visualization of the convolution 2', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(C2[i])
plt.suptitle('results of the convolution 2', fontsize=12)
plt.show()
The pool_size parameter of the pooling layer specifies the size of the area that pooling is applied to at one time (the coarseness of pooling).
In the figure below, the first pooling size is 2x2 and the second pooling size is also 2x2.
Increasing pool_size increases the robustness to position (the output barely changes even if the position where the object appears in the image shifts slightly). Basically, pool_size should be 2x2.
import numpy as np
import matplotlib.pyplot as plt

# Defines a very simple convolution layer.
class Conv:
    def __init__(self, W, filters, kernel_size):
        self.filters = filters
        self.kernel_size = kernel_size
        self.W = W  # e.g. np.random.rand(filters, kernel_size[0], kernel_size[1])

    def f_prop(self, X):
        k_h, k_w = self.kernel_size
        out = np.zeros((self.filters, X.shape[0]-k_h+1, X.shape[1]-k_w+1))
        for k in range(self.filters):
            for i in range(out[0].shape[0]):
                for j in range(out[0].shape[1]):
                    x = X[i:i+k_h, j:j+k_w]
                    out[k,i,j] = np.dot(self.W[k].flatten(), x.flatten())
        return out

# Defines a very simple pooling layer.
# Only pooling of a 1-channel feature map is assumed (stride 1 here).
class Pool:
    def __init__(self, pool_size):
        self.pool_size = pool_size

    def f_prop(self, X):
        k_h, k_w = self.pool_size
        out = np.zeros((X.shape[0]-k_h+1, X.shape[1]-k_w+1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i,j] = np.max(X[i:i+k_h, j:j+k_w])
        return out
X = np.load('./5100_cnn_data/circle.npy')
W = np.load('./5100_cnn_data/weight.npy')
#Convolution
filters = 4
kernel_size = (3,3)
conv = Conv(W=W, filters=filters, kernel_size=kernel_size)
C = conv.f_prop(X)
#Pooling 1
pool_size = (2,2)
pool1 = Pool(pool_size)
P1 = [pool1.f_prop(C[i]) for i in range(len(C))]
#Pooling 2
pool_size = (4,4)
pool2 = Pool(pool_size)
P2 = [pool2.f_prop(C[i]) for i in range(len(C))]
#Below is all the code for visualization.
plt.imshow(X)
plt.title('base image', fontsize=12)
plt.show()
plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(C[i])
plt.suptitle('convolution results', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(P1[i])
plt.suptitle('pooling results 1', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(P2[i])
plt.suptitle('pooling results 2', fontsize=12)
plt.show()
The strides parameter of the pooling layer, like the strides parameter of the convolution layer, specifies the interval at which the feature map is pooled.
strides=(1,1)
strides=(2,2)
In Keras pooling layers, strides defaults to matching pool_size.
import numpy as np
import matplotlib.pyplot as plt

# Defines a very simple convolution layer.
class Conv:
    def __init__(self, W, filters, kernel_size):
        self.filters = filters
        self.kernel_size = kernel_size
        self.W = W  # e.g. np.random.rand(filters, kernel_size[0], kernel_size[1])

    def f_prop(self, X):
        k_h, k_w = self.kernel_size
        out = np.zeros((self.filters, X.shape[0]-k_h+1, X.shape[1]-k_w+1))
        for k in range(self.filters):
            for i in range(out[0].shape[0]):
                for j in range(out[0].shape[1]):
                    x = X[i:i+k_h, j:j+k_w]
                    out[k,i,j] = np.dot(self.W[k].flatten(), x.flatten())
        return out

# Defines a very simple pooling layer.
# Only pooling of a 1-channel feature map is assumed.
class Pool:
    def __init__(self, pool_size, strides):
        self.pool_size = pool_size
        self.strides = strides

    def f_prop(self, X):
        k_h, k_w = self.pool_size
        s_h, s_w = self.strides
        out = np.zeros(((X.shape[0]-k_h)//s_h+1, (X.shape[1]-k_w)//s_w+1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i,j] = np.max(X[i*s_h:i*s_h+k_h, j*s_w:j*s_w+k_w])
        return out
X = np.load('./5100_cnn_data/circle.npy')
W = np.load('./5100_cnn_data/weight.npy')
#Convolution
filters = 4
kernel_size = (3,3)
conv = Conv(W=W, filters=filters, kernel_size=kernel_size)
C = conv.f_prop(X)
#Pooling 1
pool_size = (2,2)
strides = (1,1)
pool1 = Pool(pool_size, strides)
P1 = [pool1.f_prop(C[i]) for i in range(len(C))]
#Pooling 2
pool_size = (3,3)
strides = (2,2)
pool2 = Pool(pool_size, strides)
P2 = [pool2.f_prop(C[i]) for i in range(len(C))]
# --------------------------------------------------------------
#Below is all the code for visualization.
# --------------------------------------------------------------
plt.imshow(X)
plt.title('base image', fontsize=12)
plt.show()
plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(C[i])
plt.suptitle('convolution results', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(P1[i])
plt.suptitle('pooling results 1', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(P2[i])
plt.suptitle('pooling results 2', fontsize=12)
plt.show()
Like the padding of the convolution layer, the padding parameter of the pooling layer specifies how to pad.
In Keras' MaxPooling2D layer, the padding method is specified as padding='valid' or padding='same'.
padding='valid': no padding is applied.
padding='same': the input is padded so that the output feature map matches the size of the input.
The code below instead takes the padding width as an argument, such as padding=(1,1).
import numpy as np
import matplotlib.pyplot as plt

# Defines a very simple convolution layer.
class Conv:
    def __init__(self, W, filters, kernel_size):
        self.filters = filters
        self.kernel_size = kernel_size
        self.W = W  # e.g. np.random.rand(filters, kernel_size[0], kernel_size[1])

    def f_prop(self, X):
        k_h, k_w = self.kernel_size
        out = np.zeros((self.filters, X.shape[0]-k_h+1, X.shape[1]-k_w+1))
        for k in range(self.filters):
            for i in range(out[0].shape[0]):
                for j in range(out[0].shape[1]):
                    x = X[i:i+k_h, j:j+k_w]
                    out[k,i,j] = np.dot(self.W[k].flatten(), x.flatten())
        return out

# Defines a very simple pooling layer.
# Only pooling of a 1-channel feature map is assumed.
class Pool:
    def __init__(self, pool_size, strides, padding):
        self.pool_size = pool_size
        self.strides = strides
        self.padding = padding

    def f_prop(self, X):
        k_h, k_w = self.pool_size
        s_h, s_w = self.strides
        p_h, p_w = self.padding
        out = np.zeros(((X.shape[0]+p_h*2-k_h)//s_h+1, (X.shape[1]+p_w*2-k_w)//s_w+1))
        # Zero padding
        X = np.pad(X, ((p_h,p_h),(p_w,p_w)), 'constant', constant_values=((0,0),(0,0)))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i,j] = np.max(X[i*s_h:i*s_h+k_h, j*s_w:j*s_w+k_w])
        return out
X = np.load('./5100_cnn_data/circle.npy')
W = np.load('./5100_cnn_data/weight.npy')
#Convolution
filters = 4
kernel_size = (3,3)
conv = Conv(W=W, filters=filters, kernel_size=kernel_size)
C = conv.f_prop(X)
#Pooling
pool_size = (2,2)
strides = (2,2)
padding = (0,0)
pool1 = Pool(pool_size=pool_size, strides=strides, padding=padding)
P1 = [pool1.f_prop(C[i]) for i in range(len(C))]
#Pooling
pool_size = (2,2)
strides = (2,2)
padding = (1,1)
pool2 = Pool(pool_size=pool_size, strides=strides, padding=padding)
P2 = [pool2.f_prop(C[i]) for i in range(len(C))]
# --------------------------------------------------------------
#Below is all the code for visualization.
# --------------------------------------------------------------
plt.imshow(X)
plt.title('base image', fontsize=12)
plt.show()
plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(C[i])
plt.suptitle('convolution results', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(P1[i])
plt.suptitle('pooling results 1', fontsize=12)
plt.show()

plt.figure(figsize=(10,1))
for i in range(filters):
    plt.subplot(1, filters, i+1)
    ax = plt.gca()  # get current axis
    ax.tick_params(labelbottom=False, labelleft=False, bottom=False, left=False)  # hide the axes
    plt.imshow(P2[i])
plt.suptitle('pooling results 2', fontsize=12)
plt.show()