The theme of Chapter 7 is the convolutional neural network: **CNN**
A CNN, like the neural networks we have seen so far, is built by combining layers like Lego blocks. Two new layer types appear:
・"Convolution layer"
・"Pooling layer"
Typical features of a CNN:
・The processing flow is "Convolution - ReLU - (Pooling)"
・The Pooling layer is sometimes omitted
・The combination "Affine - ReLU" is used in layers close to the output
・The final output layer is the combination "Affine - Softmax"
The following terms also appear:
・Padding
・Stride
In addition, three-dimensional data appears.
The problem with fully connected layers is that **the structure of the data is "ignored"**.
For example, an image usually has a three-dimensional shape in the height, width, and channel directions. This shape contains important spatial information, for example:
・Spatially close pixels tend to have similar values
・The RGB channels are closely related to each other
・Pixels that are far apart have little relationship to each other
The three-dimensional shape contains essential patterns that should be picked up.
A fully connected layer ignores this shape and treats all inputs as equivalent neurons (neurons of the same dimension). A convolution layer, on the other hand, maintains the shape.
In a CNN, the input/output data of a convolution layer is called a **feature map**; the input data is sometimes called the **input feature map** and the output data the **output feature map**.
"Convolution operation" Equivalent to "filter processing" in image processing In some literature, the term "filter" is sometimes referred to as "kernel".
The parameters used for this filter correspond to the "weights" in the fully coupled neural network.
Calculation example
Operation with a bias
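Since the figures are not reproduced here, a minimal NumPy sketch of the same kind of computation (single channel, stride 1, no padding, with a bias) may help; the input, filter, and bias values below are made up for illustration.

```python
import numpy as np

# Toy example: 4x4 input, 3x3 filter, stride 1, no padding (values are made up)
x = np.array([[1, 2, 3, 0],
              [0, 1, 2, 3],
              [3, 0, 1, 2],
              [2, 3, 0, 1]])
w = np.array([[2, 0, 1],
              [0, 1, 2],
              [1, 0, 2]])
b = 3  # the bias is added to every output element

out = np.zeros((2, 2))  # output size: (4 - 3) + 1 = 2 in each direction
for i in range(2):
    for j in range(2):
        # Multiply the filter with the overlapping 3x3 window element-wise and sum
        out[i, j] = np.sum(x[i:i+3, j:j+3] * w) + b
print(out)
# [[18. 19.]
#  [ 9. 18.]]
```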
Padding: filling fixed values (e.g. 0) around the input data.
In the figure below, the input is padded with a border of zeros.
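Such a zero border can be added with np.pad; below is a small sketch (the 1-pixel width is just an example, not taken from the figure).

```python
import numpy as np

x = np.arange(9).reshape(3, 3)
# Pad the array with a 1-pixel-wide border of zeros on every side
x_padded = np.pad(x, pad_width=1, mode='constant', constant_values=0)
print(x_padded.shape)  # (5, 5)
```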
Stride: the interval between positions where the filter is applied.
With input size (H, W), filter size (FH, FW), output size (OH, OW), padding P, and stride S, the output size is given by:
OH = \frac{H + 2P - FH}{S} + 1\\
OW = \frac{W + 2P - FW}{S} + 1
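As a sanity check, the formula can be wrapped in a small helper. The function name conv_output_size is my own, written directly from the formula above; it is not taken from the book's code.

```python
def conv_output_size(H, W, FH, FW, pad=0, stride=1):
    """Return (OH, OW) for the given input size, filter size, padding, and stride."""
    OH = (H + 2 * pad - FH) // stride + 1
    OW = (W + 2 * pad - FW) // stride + 1
    return OH, OW

print(conv_output_size(7, 7, 3, 3))            # (5, 5)
print(conv_output_size(28, 28, 5, 5))          # (24, 24)
print(conv_output_size(28, 28, 5, 5, pad=2))   # (28, 28)
```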
The 3D convolution operation is easier to understand when the data and filters are pictured as rectangular blocks, as follows.
The above produces a single output feature map, in other words a feature map with one channel.
The following diagram shows the case with multiple output channels.
Adding the bias term looks like this.
When batch-processing N pieces of data, the same operation is applied; the data simply gains a batch dimension in its shape (see the shape sketch below).
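The shapes involved can be summarized as follows. This is a shape-only sketch (no actual convolution is computed), and the filter count 16 is an arbitrary example, not a value from the book.

```python
import numpy as np

N, C, H, W = 10, 3, 28, 28              # batch of 10 three-channel 28x28 inputs
FN, FH, FW = 16, 5, 5                   # 16 filters of spatial size 5x5 (example)
x = np.zeros((N, C, H, W))              # input data
filters = np.zeros((FN, C, FH, FW))     # filter weights
bias = np.zeros((FN, 1, 1))             # one bias per output channel

OH = (H - FH) + 1                       # output height with pad=0, stride=1
OW = (W - FW) + 1                       # output width with pad=0, stride=1
print((N, FN, OH, OW))                  # expected output shape: (10, 16, 24, 24)
```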
Pooling: an operation that reduces the vertical and horizontal spatial size.
In the figure below, the spatial size is reduced by aggregating each 2x2 region into a single element.
This example shows 2x2 max pooling with a stride of 2.
Max pooling: an operation that takes the maximum value in the region. In general, the pooling window size and the stride are set to the same value (a small NumPy example appears after the list of properties below).
Besides max pooling, there is also average pooling, which takes the average value in the region.
・ There are no parameters to learn
Since pooling is a process that only takes the maximum value (or average value) from the target, there are no parameters to learn.
・ The number of channels does not change
The number of channels of the input and output data is not changed by the pooling operation (OH and OW change, but the number of channels C does not).
・ Robust against minute changes in position
Pooling returns similar results for small shifts in the input data, so it is robust to slight positional deviations.
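Here is a minimal NumPy illustration of the 2x2, stride-2 max pooling described above, applied to a single channel; the values are made up.

```python
import numpy as np

x = np.array([[1, 2, 1, 0],
              [0, 3, 2, 4],
              [5, 0, 1, 2],
              [2, 4, 3, 1]])

# 2x2 max pooling with stride 2: take the max of each non-overlapping 2x2 block
out = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(out)
# [[3 4]
#  [5 3]]
```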
#Randomly generate data
x = np.random.rand(10,1,28,28)
x.shape
# (10, 1, 28, 28)
x[0].shape
# (1, 28, 28)
x[1].shape
# (1, 28, 28)
x[0, 0].shape  # x[0][0] also works
# (28, 28)
Implementing the convolution as shown in the previous figure requires nesting several for loops, and NumPy becomes slow when elements are accessed with for loops.
Instead, we implement it with a function called im2col, which expands the input data into a form suited to the filter.
In the figure, for ease of understanding, the example uses filter regions that do not overlap.
Advantages and disadvantages of im2col:
・Advantage: the computation reduces to a matrix product, so optimized linear algebra libraries can be used effectively
・Disadvantage: it consumes more memory than usual
#----------------------------------------------------
# Parameters
#   input_data : input data as a 4-dimensional array of (number of data, channels, height, width)
#   filter_h : filter height
#   filter_w : filter width
#   stride : stride
#   pad : padding
# Returns
#   col : a two-dimensional array (each row is one expanded filter region)
#----------------------------------------------------
def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
N, C, H, W = input_data.shape
out_h = (H + 2*pad - filter_h)//stride + 1
out_w = (W + 2*pad - filter_w)//stride + 1
img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')
col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))
for y in range(filter_h):
y_max = y + stride*out_h
for x in range(filter_w):
x_max = x + stride*out_w
col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
return col
Trying out im2col:
import sys, os
sys.path.append(os.pardir)
from common.util import im2col
x1 = np.random.rand(1, 3, 7, 7)
col1 = im2col(x1, 5, 5, stride=1, pad=0)
print(col1.shape)
x2 = np.random.rand(10, 3, 7, 7)
col2 = im2col(x2, 5, 5, stride=1, pad=0)
print(col2.shape)
Result:
(9, 75)
(90, 75)
x1 is 7x7 data with batch size 1 and 3 channels; x2 is 7x7 data with batch size 10 and 3 channels.
In both cases, the second dimension has 75 elements, which is the total number of elements in a single filter region (3 channels x 5 x 5 = 75).
After expanding the data with im2col, all that remains is to expand each filter (weight) of the convolution layer into a single column and compute the matrix product of the two. This is almost the same as what we did in the Affine layer of the fully connected network.
class Convolution:
def __init__(self, W, b, stride=1, pad=0):
self.W = W
self.b = b
self.stride = stride
self.pad = pad
#Intermediate data (used during backward)
self.x = None
self.col = None
self.col_W = None
#Gradient of weight / bias parameters
self.dW = None
self.db = None
def forward(self, x):
FN, C, FH, FW = self.W.shape
N, C, H, W = x.shape
out_h = 1 + int((H + 2*self.pad - FH) / self.stride)
out_w = 1 + int((W + 2*self.pad - FW) / self.stride)
col = im2col(x, FH, FW, self.stride, self.pad)
        # Passing -1 to reshape makes NumPy infer that dimension so the total number of elements matches
col_W = self.W.reshape(FN, -1).T
out = np.dot(col, col_W) + self.b
        # Finally, rearrange the output into the appropriate shape:
        # reshape arranges the elements into the specified output shape,
        # transpose swaps the order of the axes
out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)
self.x = x
self.col = col
self.col_W = col_W
return out
def backward(self, dout):
FN, C, FH, FW = self.W.shape
dout = dout.transpose(0,2,3,1).reshape(-1, FN)
        # The backward computation itself is the following lines and is the same as in the Affine layer;
        # the only difference is aligning the dimensions of the matrices.
self.db = np.sum(dout, axis=0)
self.dW = np.dot(self.col.T, dout)
self.dW = self.dW.transpose(1, 0).reshape(FN, C, FH, FW)
dcol = np.dot(dout, self.col_W.T)
#Reverse processing of im2col
dx = col2im(dcol, self.x.shape, FH, FW, self.stride, self.pad)
return dx
As with the Convolution layer, the Pooling layer is implemented by expanding the input data with im2col. The difference is that, for pooling, the expansion is independent in the channel direction.
class Pooling:
def __init__(self, pool_h, pool_w, stride=1, pad=0):
self.pool_h = pool_h
self.pool_w = pool_w
self.stride = stride
self.pad = pad
self.x = None
self.arg_max = None
def forward(self, x):
N, C, H, W = x.shape
out_h = int(1 + (H - self.pool_h) / self.stride)
out_w = int(1 + (W - self.pool_w) / self.stride)
col = im2col(x, self.pool_h, self.pool_w, self.stride, self.pad)
col = col.reshape(-1, self.pool_h*self.pool_w)
arg_max = np.argmax(col, axis=1)
out = np.max(col, axis=1)
out = out.reshape(N, out_h, out_w, C).transpose(0, 3, 1, 2)
self.x = x
self.arg_max = arg_max
return out
def backward(self, dout):
dout = dout.transpose(0, 2, 3, 1)
pool_size = self.pool_h * self.pool_w
dmax = np.zeros((dout.size, pool_size))
        # flatten collapses the array into a one-dimensional array
dmax[np.arange(self.arg_max.size), self.arg_max.flatten()] = dout.flatten()
dmax = dmax.reshape(dout.shape + (pool_size,))
dcol = dmax.reshape(dmax.shape[0] * dmax.shape[1] * dmax.shape[2], -1)
dx = col2im(dcol, self.x.shape, self.pool_h, self.pool_w, self.stride, self.pad)
return dx
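As a quick shape check, here is a snippet of my own (not from the book) that assumes the Convolution and Pooling classes above are defined and that im2col/col2im from common.util are importable.

```python
import numpy as np

x = np.random.rand(10, 3, 28, 28)    # batch of 10, 3 channels, 28x28
W = np.random.rand(16, 3, 5, 5)      # 16 filters of shape 3x5x5 (example count)
b = np.zeros(16)

conv = Convolution(W, b, stride=1, pad=0)
out = conv.forward(x)
print(out.shape)                     # (10, 16, 24, 24)

pool = Pooling(pool_h=2, pool_w=2, stride=2)
print(pool.forward(out).shape)       # (10, 16, 12, 12)
```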
# coding: utf-8
import sys, os
sys.path.append(os.pardir) #Settings for importing files in the parent directory
import pickle
import numpy as np
from collections import OrderedDict
from common.layers import *
from common.gradient import numerical_gradient
#Simple ConvNet
# conv - relu - pool - affine - relu - affine - softmax
class SimpleConvNet:
#----------------------------------------------------
# Parameters
    #   input_size : input size (784 for MNIST)
    #   hidden_size_list : list of the numbers of neurons in the hidden layers (e.g. [100, 100, 100])
    #   output_size : output size (10 for MNIST)
    #   activation : 'relu' or 'sigmoid'
    #   weight_init_std : standard deviation of the weights (e.g. 0.01)
    #       If 'relu' or 'he' is specified, the "He initial value" is used.
    #       If 'sigmoid' or 'xavier' is specified, the "Xavier initial value" is used.
#----------------------------------------------------
def __init__(self, input_dim=(1, 28, 28),
conv_param={'filter_num':30, 'filter_size':5, 'pad':0, 'stride':1},
hidden_size=100, output_size=10, weight_init_std=0.01):
#Initialization of weights, calculation of output size of convolution layer
filter_num = conv_param['filter_num']
filter_size = conv_param['filter_size']
filter_pad = conv_param['pad']
filter_stride = conv_param['stride']
input_size = input_dim[1]
conv_output_size = (input_size - filter_size + 2*filter_pad) / filter_stride + 1
pool_output_size = int(filter_num * (conv_output_size/2) * (conv_output_size/2))
#Weight initialization
self.params = {}
self.params['W1'] = weight_init_std * \
np.random.randn(filter_num, input_dim[0], filter_size, filter_size)
self.params['b1'] = np.zeros(filter_num)
self.params['W2'] = weight_init_std * \
np.random.randn(pool_output_size, hidden_size)
self.params['b2'] = np.zeros(hidden_size)
self.params['W3'] = weight_init_std * \
np.random.randn(hidden_size, output_size)
self.params['b3'] = np.zeros(output_size)
#Layer generation
self.layers = OrderedDict()
self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
conv_param['stride'], conv_param['pad'])
self.layers['Relu1'] = Relu()
self.layers['Pool1'] = Pooling(pool_h=2, pool_w=2, stride=2)
self.layers['Affine1'] = Affine(self.params['W2'], self.params['b2'])
self.layers['Relu2'] = Relu()
self.layers['Affine2'] = Affine(self.params['W3'], self.params['b3'])
self.last_layer = SoftmaxWithLoss()
#Make inferences
def predict(self, x):
for layer in self.layers.values():
x = layer.forward(x)
return x
#Find the loss function
def loss(self, x, t):
"""Find the loss function
The argument x is the input data and t is the teacher label.
"""
y = self.predict(x)
return self.last_layer.forward(y, t)
def accuracy(self, x, t, batch_size=100):
if t.ndim != 1 : t = np.argmax(t, axis=1)
acc = 0.0
for i in range(int(x.shape[0] / batch_size)):
tx = x[i*batch_size:(i+1)*batch_size]
tt = t[i*batch_size:(i+1)*batch_size]
y = self.predict(tx)
y = np.argmax(y, axis=1)
acc += np.sum(y == tt)
return acc / x.shape[0]
def numerical_gradient(self, x, t):
"""Find the gradient (numerical differentiation)
Parameters
----------
x :Input data
t :Teacher label
Returns
-------
Dictionary variable with gradient for each layer
        grads['W1'], grads['W2'], ... are the weights of each layer
        grads['b1'], grads['b2'], ... are the biases of each layer
"""
loss_w = lambda w: self.loss(x, t)
grads = {}
for idx in (1, 2, 3):
grads['W' + str(idx)] = numerical_gradient(loss_w, self.params['W' + str(idx)])
grads['b' + str(idx)] = numerical_gradient(loss_w, self.params['b' + str(idx)])
return grads
def gradient(self, x, t):
"""Find the gradient (error backpropagation method)
Parameters
----------
x :Input data
t :Teacher label
Returns
-------
Dictionary variable with gradient for each layer
        grads['W1'], grads['W2'], ... are the weights of each layer
        grads['b1'], grads['b2'], ... are the biases of each layer
"""
# forward
self.loss(x, t)
# backward
dout = 1
dout = self.last_layer.backward(dout)
layers = list(self.layers.values())
layers.reverse()
for layer in layers:
dout = layer.backward(dout)
#Setting
grads = {}
grads['W1'], grads['b1'] = self.layers['Conv1'].dW, self.layers['Conv1'].db
grads['W2'], grads['b2'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
grads['W3'], grads['b3'] = self.layers['Affine2'].dW, self.layers['Affine2'].db
return grads
def save_params(self, file_name="params.pkl"):
params = {}
for key, val in self.params.items():
params[key] = val
with open(file_name, 'wb') as f:
pickle.dump(params, f)
def load_params(self, file_name="params.pkl"):
with open(file_name, 'rb') as f:
params = pickle.load(f)
for key, val in params.items():
self.params[key] = val
for i, key in enumerate(['Conv1', 'Affine1', 'Affine2']):
self.layers[key].W = self.params['W' + str(i+1)]
self.layers[key].b = self.params['b' + str(i+1)]
The point is that the network can be built simply by adding layers and adjusting the hyperparameters used in the hidden layers.
Now run the training. On my MacBook Air the CPU load was heavy, so I uncommented the data-reduction lines below and ran with the reduced dataset.
# coding: utf-8
import sys, os
sys.path.append(os.pardir) #Settings for importing files in the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from simple_convnet import SimpleConvNet
from common.trainer import Trainer
#Data reading
(x_train, t_train), (x_test, t_test) = load_mnist(flatten=False)
#Reduce data if processing takes time
#x_train, t_train = x_train[:5000], t_train[:5000]
#x_test, t_test = x_test[:1000], t_test[:1000]
max_epochs = 20
network = SimpleConvNet(input_dim=(1,28,28),
conv_param = {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
hidden_size=100, output_size=10, weight_init_std=0.01)
trainer = Trainer(network, x_train, t_train, x_test, t_test,
epochs=max_epochs, mini_batch_size=100,
optimizer='Adam', optimizer_param={'lr': 0.001},
evaluate_sample_num_per_epoch=1000)
trainer.train()
#Save parameters
network.save_params("params.pkl")
print("Saved Network Parameters!")
#Drawing a graph
markers = {'train': 'o', 'test': 's'}
x = np.arange(max_epochs)
plt.plot(x, trainer.train_acc_list, marker='o', label='train', markevery=2)
plt.plot(x, trainer.test_acc_list, marker='s', label='test', markevery=2)
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()
……
train loss:0.0145554384445
train loss:0.0275851756417
train loss:0.00785021651885
train loss:0.00986611950473
=============== Final Test Accuracy ===============
test acc:0.956
Saved Network Parameters!
Before learning: the filters are initialized randomly, so there is no regularity in the black-and-white shading.
After learning: regular patterns appear.
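To check this yourself, the first-layer filters can be drawn with matplotlib. This is my own sketch, assuming the network trained above is in scope as `network` (its W1 has shape (30, 1, 5, 5)); the helper name show_filters is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_filters(filters, ncols=8):
    """Draw each (1, FH, FW) filter as a grayscale image."""
    FN = filters.shape[0]
    nrows = int(np.ceil(FN / ncols))
    fig = plt.figure()
    for i in range(FN):
        ax = fig.add_subplot(nrows, ncols, i + 1, xticks=[], yticks=[])
        ax.imshow(filters[i, 0], cmap=plt.cm.gray_r, interpolation='nearest')
    plt.show()

show_filters(network.params['W1'])  # 30 filters of shape (1, 5, 5)
```

Running it once right after constructing SimpleConvNet and once after training makes the before/after contrast described above visible.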
What are these regular filters "looking at"?
・**Edge**: a boundary where the color changes
・**Blob**: a locally blob-like region
The first convolution layer extracts low-level information such as edges and blobs; stacking convolution layers extracts more complex and abstract information.
DEMO 1 below is quoted from http://vision03.csail.mit.edu/cnn_art/index.html#v_single
In the demo, the layers responded as follows:
Conv1: edges and blobs (Edge + Blob)
Conv3: textures
Conv5: object parts
Fc8: object classes such as dogs and cats
Thus, as the layers get deeper, the neurons change from responding to simple shapes to responding to more "advanced" information. In other words, what the neurons react to shifts, step by step, toward the "meaning" of things.
This book covers the following:
・LeNet, the original CNN, first proposed in 1998
・AlexNet, from 2012, when deep learning first attracted wide attention
LeNet
Compared with current CNNs, the following points differ:
・The sigmoid function is used as the activation function (currently the ReLU function is used)
・The size of intermediate data is reduced by subsampling (currently max pooling is used)
http://dx.doi.org/10.1109/5.726791
AlexNet
AlexNet stacks convolution and pooling layers and finally outputs the result through fully connected layers. It differs from LeNet in the following points:
・The ReLU function is used as the activation function
・A local normalization layer called LRN (Local Response Normalization) is used
・Dropout is used
There is no big difference between LeNet and AlexNet in network architecture, but there have been major advances in computing technology. In particular:
・Large amounts of data can now be obtained by anyone
・GPUs specialized in massively parallel computation have become widespread, making it possible to perform huge amounts of computation at high speed