For those who want to know more about the `im2col` function that appears in image recognition with CNNs, this article explains everything from the initial implementation to the improved version, the batch/channel-compatible version, and the stride/padding-compatible version, using gifs and images.
- [What is im2col](#what-is-im2col)
  - [Why you need it](#why-you-need-it)
  - [What is CNN](#what-is-cnn)
  - [Filtering](#filtering)
- [Behavior and initial implementation of im2col](#behavior-and-initial-implementation-of-im2col)
  - [Operation of im2col](#operation-of-im2col)
  - [Initial implementation of im2col](#initial-implementation-of-im2col)
  - [Problems with the early im2col](#problems-with-the-early-im2col)
- [Improved im2col (initial ver)](#improved-im2col-initial-ver)
  - [Change 1](#change-1)
  - [Change 2](#change-2)
  - [Change 3](#change-3)
- [Extension to multidimensional arrays](#extension-to-multidimensional-arrays)
  - [Following the formulas](#following-the-formulas)
  - [Trying to implement it](#trying-to-implement-it)
- [Stride and padding](#stride-and-padding)
  - [Stride](#stride)
  - [Padding](#padding)
`im2col` is a function used in image recognition. Its job is to convert a multidimensional array into a two-dimensional array (reversibly, so the transformation can be undone). Its biggest advantage is that it lets you **take full advantage of numpy's fast matrix operations**. It is no exaggeration to say that today's image recognition would not have developed without it (probably).
You might think of an image as inherently two-dimensional data, right?
It does look two-dimensional, but in actual machine learning you usually work with the image decomposed into its RGB components (each component is called a **channel**).
In other words, a color image has a three-dimensional data structure.
A black-and-white image has just one channel, but multiple images are fed through in one propagation (this group is called a **batch**), so the data is still three-dimensional.
In practice it is inefficient to implement a separate 3D path just for black-and-white images, so they are treated as one-channel color images, which gives every input a four-dimensional data structure.
With a double loop you could process the images one by one, but that erases the advantage of numpy (numpy is slow when you iterate over it with `for` loops and the like).
That is why we need a function like `im2col`, which turns the 4D data into 2D so that numpy's strengths can be used to the fullest.
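As a quick, minimal sketch of that last point (my own example, not part of the article's code): summing a matrix element by element with a Python double loop versus one vectorized numpy call. The exact timings depend on your machine, but the gap is typically a few orders of magnitude.

```python
import time
import numpy as np

x = np.random.rand(1000, 1000)

# element-wise access through a Python double loop: slow
start = time.time()
total = 0.0
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        total += x[i, j]
print("loop: {}".format(time.time() - start))

# a single vectorized numpy call doing the same work: fast
start = time.time()
total = x.sum()
print("vectorized: {}".format(time.time() - start))
```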
CNN is an abbreviation for Convolutional Neural Network, a network used for data in which each coordinate point is closely related to the points around it. Simple examples are images and video. Before CNNs, when learning from data such as images with neural networks, the 2D data was flattened and treated as 1D, discarding the important spatial correlations. CNNs caused a breakthrough in image recognition by extracting features while preserving the two-dimensional structure of the image. The technique is inspired by the processing performed when information is transmitted from the retina to the optic nerve, which makes processing closer to human perception possible.
The main operations inside a CNN are filtering (convolution layers) and pooling (pooling layers).
Filtering is the process of detecting features, such as vertical lines, in image data.
It is similar to what human retinal cells do (some retinal cells respond to specific patterns and emit electrical signals that carry the information to the optic nerve).
Pooling is the process of extracting the most characteristic features from those produced by filtering.
It is similar to what happens along the human optic nerve (the number of nerve cells decreases as information travels from the optic nerve to the brain, i.e. the information is compressed).
From a data-reduction point of view this is an excellent operation that saves memory and reduces computation while keeping the useful features.
`im2col` and `col2im`, which will be introduced in another article, also play an active role in implementing pooling, but this time we focus on filtering.
The gif above illustrates the filtering operation.
To understand the implementation of `im2col`, we will thoroughly dissect its behavior using formulas, images, and gifs.
Mathematically, the previous gif computes
```math
a = 1W + 2X + 5Y + 6Z \\
b = 2W + 3X + 6Y + 7Z \\
c = 3W + 4X + 7Y + 8Z \\
d = 5W + 6X + 9Y + 10Z \\
e = 6W + 7X + 10Y + 11Z \\
f = 7W + 8X + 11Y + 12Z \\
g = 9W + 10X + 13Y + 14Z \\
h = 10W + 11X + 14Y + 15Z \\
i = 11W + 12X + 15Y + 16Z
```
This is what `im2col` achieves: it rearranges the image data so that the whole computation becomes a single matrix product. Let's check the formula as well.
```math
\begin{align}
\left(
\begin{array}{c}
a \\
b \\
c \\
d \\
e \\
f \\
g \\
h \\
i
\end{array}
\right)^{\top}
&=
\left(
\begin{array}{cccc}
W & X & Y & Z
\end{array}
\right)
\left(
\begin{array}{ccccccccc}
1 & 2 & 3 & 5 & 6 & 7 & 9 & 10 & 11 \\
2 & 3 & 4 & 6 & 7 & 8 & 10 & 11 & 12 \\
5 & 6 & 7 & 9 & 10 & 11 & 13 & 14 & 15 \\
6 & 7 & 8 & 10 & 11 & 12 & 14 & 15 & 16
\end{array}
\right) \\
&=
\left(
\begin{array}{c}
1W + 2X + 5Y + 6Z \\
2W + 3X + 6Y + 7Z \\
3W + 4X + 7Y + 8Z \\
5W + 6X + 9Y + 10Z \\
6W + 7X + 10Y + 11Z \\
7W + 8X + 11Y + 12Z \\
9W + 10X + 13Y + 14Z \\
10W + 11X + 14Y + 15Z \\
11W + 12X + 15Y + 16Z
\end{array}
\right)^{\top}
\end{align}
```
So, let's first implement this in a straightforward way.
Filtering a $4 \times 4$ matrix with a $2 \times 2$ matrix outputs a $3 \times 3$ matrix. Let's generalize this.
Consider filtering an $I_h \times I_w$ matrix with an $F_h \times F_w$ filter.
At that point, the top-left index of the filter at its final position coincides with the size of the output matrix, because the number of filter placements equals the number of output elements.
From the image, the size of the output matrix is $(I_h - F_h + 1) \times (I_w - F_w + 1) = O_h \times O_w$.
In other words, $O_h O_w$ elements are needed, so the number of columns of the `im2col` output is $O_h O_w$. The number of rows, on the other hand, is determined by the filter size, so it is $F_h F_w$. Therefore, when an $I_h \times I_w$ input matrix is filtered with an $F_h \times F_w$ filter, the output matrix of `im2col` has shape $F_h F_w \times O_h O_w$.
The above can be incorporated into the program as follows.
```python:early_im2col.py
import time
import numpy as np


def im2col(image, F_h, F_w):
    I_h, I_w = image.shape
    O_h = I_h - F_h + 1
    O_w = I_w - F_w + 1
    col = np.empty((F_h*F_w, O_h*O_w))
    for h in range(O_h):
        for w in range(O_w):
            col[:, w + h*O_w] = image[h : h+F_h, w : w+F_w].reshape(-1)
    return col


x = np.arange(1, 17).reshape(4, 4)
f = np.arange(-4, 0).reshape(2, 2)
print(im2col(x, 2, 2))
print(im2col(f, 2, 2).T)
print(im2col(f, 2, 2).T @ im2col(x, 2, 2))
```
```python:early_im2col.py
    for h in range(O_h):
        for w in range(O_w):
            col[:, w + h*O_w] = image[h : h+F_h, w : w+F_w].reshape(-1)
```
The write destination in the output matrix for each `h, w` is as follows.
It is the location specified by `col[:, w + h*O_w]`. The corresponding part of the input matrix, `image[h : h+F_h, w : w+F_w]`, is flattened with `.reshape(-1)` and assigned there.
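To make the indexing concrete, here is one step of the loop worked out on the $4 \times 4$ example (a small check of my own): with $h = 1, w = 2$ and $O_w = 3$, the column index is $w + h O_w = 5$ and the patch is `image[1:3, 2:4]`.

```python
import numpy as np

x = np.arange(1, 17).reshape(4, 4)
h, w, F_h, F_w, O_w = 1, 2, 2, 2, 3
print(w + h*O_w)                            # 5
print(x[h : h+F_h, w : w+F_w].reshape(-1))  # [ 7  8 11 12] -> the sixth column of the matrix shown earlier
```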
So far it's straightforward.
However, early_im2col.py has a serious drawback: as mentioned earlier, numpy is slow when accessed through loop constructs such as `for`.
In practice, input arrays are much larger than the example array `x` in early_im2col.py (for instance, in the **very small** [MNIST](http://yann.lecun.com/exdb/mnist/) dataset, each handwritten-digit image is already a $28 \times 28$ matrix).
Let's measure the processing time.
```python:early_im2col.py
y = np.zeros((28, 28))

start = time.time()
for i in range(1000):
    im2col(y, 2, 2)
end = time.time()
print("time: {}".format(end - start))
```
Processing a mere $28 \times 28$ matrix 1000 times takes about 1.5 seconds.
Since the MNIST database contains 60,000 handwritten digits, a simple calculation says that filtering every image just once would take about **90 seconds**.
That is impractical, since real machine learning applies multiple filters many times over.
Looking back at the problem, the issue is that the `for` loop accesses the numpy array very frequently. So we should reduce the number of accesses.
In early_im2col.py, the numpy array `image` is accessed $O_h O_w$ times; filtering a $28 \times 28$ input matrix with a $2 \times 2$ filter already means $27 \times 27 = 729$ accesses. Filters, however, are generally much smaller than output matrices, and we can exploit that to perform the equivalent computation with dramatically fewer numpy array accesses. That is the improved `im2col` (initial ver).
It does something rather clever.
```python:improved_early_im2col.py
import time
import numpy as np


def im2col(image, F_h, F_w):
    I_h, I_w = image.shape
    O_h = I_h - F_h + 1
    O_w = I_w - F_w + 1
    col = np.empty((F_h, F_w, O_h, O_w))
    for h in range(F_h):
        for w in range(F_w):
            col[h, w, :, :] = image[h : h+O_h, w : w+O_w]
    return col.reshape(F_h*F_w, O_h*O_w)


x = np.arange(1, 17).reshape(4, 4)
f = np.arange(-4, 0).reshape(2, 2)
print(im2col(x, 2, 2))
print(im2col(f, 2, 2).T)
print(im2col(f, 2, 2).T @ im2col(x, 2, 2))

y = np.zeros((28, 28))

start = time.time()
for i in range(1000):
    im2col(y, 2, 2)
end = time.time()
print("time: {}".format(end - start))
```
The first change is the memory allocation part of the output matrix.
```python:improved_early_im2col.py
    col = np.empty((F_h, F_w, O_h, O_w))
```
The memory is now allocated as a four-dimensional data structure.
The next change is that the number of loop iterations was reduced from $O_h O_w$ to $F_h F_w$ to cut down on accesses.
```python:improved_early_im2col.py
    for h in range(F_h):
        for w in range(F_w):
            col[h, w, :, :] = image[h : h+O_h, w : w+O_w]
```
This reduces the number of numpy array accesses per MNIST image from 729 to just 4! The locations accessed in the output and input arrays at each loop iteration are shown below; accessing them this way builds the following output array.
Finally, the array is reshaped into the desired form on output.
```python:improved_early_im2col.py
    return col.reshape(F_h*F_w, O_h*O_w)
```
In terms of numpy operations, the $(F_h, F_w, O_h, O_w)$ array is flattened into one-dimensional data of shape $(F_h F_w O_h O_w,)$ and then reshaped into two-dimensional data of shape $(F_h F_w, O_h O_w)$. More concretely, each two-dimensional slice in the figure is flattened into one dimension and stacked under the previous one. Neat, isn't it?
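If you want to convince yourself of both claims, here is a minimal sketch. It assumes the early and improved versions above have been saved under the hypothetical names `im2col_early` and `im2col_improved` (names of my choosing):

```python
import numpy as np

# reshape lays out each (O_h, O_w) slice as one row, stacked in order
col = np.arange(2*2*3*3).reshape(2, 2, 3, 3)
print(col.reshape(2*2, 3*3))

# hypothetical names for the two im2col versions defined above
x = np.arange(1, 17).reshape(4, 4)
print(np.array_equal(im2col_early(x, 2, 2), im2col_improved(x, 2, 2)))  # True
```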
By the way, as mentioned in [What is im2col](#what-is-im2col), the arrays this function actually targets have a four-dimensional data structure. The filter is also four-dimensional: besides matching the number of channels of the input, a set of $M$ filters is prepared. Taking this into account, let's modify improved_early_im2col.py.
First, let's think about what shapes we need mathematically.
A batch of color images has the structure $(B, C, I_h, I_w)$, where $C$ is the number of channels and $B$ is the batch size.
The filters, on the other hand, have the structure $(M, C, F_h, F_w)$.
In improved_early_im2col.py, when an $(I_h, I_w)$ matrix was filtered by $(F_h, F_w)$, the transformed input was $(F_h F_w, O_h O_w)$ and the transformed filter was $(1, F_h F_w)$.
Assume $B = 1$ and $M = 1$. For the filtering to be computed as a matrix product, the rows of the transformed input must match the columns of the transformed filter, including the channel dimension, so they must be $(C F_h F_w, O_h O_w)$ and $(1, C F_h F_w)$. Also, since in general $B \ne M$, the batch and filter-count dimensions must each be folded into the side unrelated to $C F_h F_w$. Putting these facts together, the shapes that `im2col` should output are $(C F_h F_w, B O_h O_w)$ for the input and $(M, C F_h F_w)$ for the filter.
The filtering result is then $(M, C F_h F_w) \times (C F_h F_w, B O_h O_w) = (M, B O_h O_w)$, which is `reshape`d and has its dimensions transposed so that $(B, M, O_h, O_w) =: (B, C', I_h', I_w')$ propagates as the input to the next layer.
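As a shape-only sanity check of this pipeline (example sizes of my own choosing), you can trace everything with zero-filled arrays:

```python
import numpy as np

B, C, M = 2, 3, 4                        # batch size, channels, number of filters
I_h = I_w = 5
F_h = F_w = 2
O_h, O_w = I_h - F_h + 1, I_w - F_w + 1  # stride 1, no padding

cols = np.zeros((C*F_h*F_w, B*O_h*O_w))  # what im2col should produce from the images
w = np.zeros((M, C*F_h*F_w))             # the filters reshaped for the matrix product

y = (w @ cols).reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3)
print(y.shape)                           # (2, 4, 4, 4) = (B, M, O_h, O_w)
```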
The implementation is almost identical to improved_early_im2col.py; the batch and channel dimensions are simply added at the front.
```python:BC_support_im2col.py
import time
import numpy as np


def im2col(images, F_h, F_w):
    B, C, I_h, I_w = images.shape
    O_h = I_h - F_h + 1
    O_w = I_w - F_w + 1
    cols = np.empty((B, C, F_h, F_w, O_h, O_w))
    for h in range(F_h):
        for w in range(F_w):
            cols[:, :, h, w, :, :] = images[:, :, h : h+O_h, w : w+O_w]
    return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w)


x = np.arange(1, 3*3*4*4+1).reshape(3, 3, 4, 4)
f = np.arange(-3*3*2*2, 0).reshape(3, 3, 2, 2)
print(im2col(x, 2, 2))
print(im2col(f, 2, 2).T)
print(np.dot(im2col(f, 2, 2).T, im2col(x, 2, 2)))

y = np.zeros((100, 3, 28, 28))

start = time.time()
for i in range(10):
    im2col(y, 2, 2)
end = time.time()
print("time: {}".format(end - start))
```
```python:BC_support_im2col.py
    return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w)
```
Here, the order of the dimensions is changed with numpy's `transpose` function.
The dimensions correspond as shown below; changing the order first and then calling `reshape` yields the correct output.
```math
\begin{array}{ccccccc}
(&0, &1, &2, &3, &4, &5) \\
(&B, &C, &F_h, &F_w, &O_h, &O_w)
\end{array}
\xrightarrow{\textrm{transpose}}
\begin{array}{ccccccc}
(&1, &2, &3, &0, &4, &5) \\
(&C, &F_h, &F_w, &B, &O_h, &O_w)
\end{array}
\xrightarrow{\textrm{reshape}}
(C F_h F_w, B O_h O_w)
```
This completes `im2col` with batch and channel support!
But this is not the end yet. Lastly, I would like to introduce two operations called **stride** and **padding**. Both are essential for a more efficient and effective CNN implementation.
In the implementations so far, we have shifted the filter one cell at a time as a matter of course, right? This shift amount is called the **stride**, and there is no rule that it must be one cell. In fact, the stride is often larger than 1, because in a real image a shift of a single pixel rarely changes the information much.
Unlike the stride, **padding** has not appeared in the previous implementations at all. Its main roles are to **keep the output image the same size as the input after filtering** and to **pick up the information near the edges of the image**. Concretely, the border of the input image is filled with $0$s, which enlarges the range over which the filter moves.
Let's take a look at each implementation.
Implementing the stride isn't that difficult: it just allows the shift width, which was fixed at 1 until now, to be changed. Where we used to write

```python:BC_support_im2col.py
            cols[:, :, h, w, :, :] = images[:, :, h : h+O_h, w : w+O_w]
```

we now write

```python:im2col.py
            cols[:, :, h, w, :, :] = images[:, :, h : h + stride*O_h : stride, w : w + stride*O_w : stride]
```

The gif shows how the early version moves with a stride of 2. In formulas:
```math
a = 1W + 2X + 5Y + 6Z \\
b = 3W + 4X + 7Y + 8Z \\
c = 9W + 10X + 13Y + 14Z \\
d = 11W + 12X + 15Y + 16Z \\
\Leftrightarrow \left(
\begin{array}{c}
a \\
b \\
c \\
d
\end{array}
\right)^{\top}
=
\left(
\begin{array}{cccc}
W & X & Y & Z
\end{array}
\right)
\left(
\begin{array}{cccc}
1 & 3 & 9 & 11 \\
2 & 4 & 10 & 12 \\
5 & 7 & 13 & 15 \\
6 & 8 & 14 & 16
\end{array}
\right)
```
That is what it looks like, and the improved version moves as shown here. It really is a clever trick; it is impressive that anyone thought of this.
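The key mechanism here is numpy's step slicing. A minimal sketch of a single loop step with stride 2 on the $4 \times 4$ example (my own illustration):

```python
import numpy as np

x = np.arange(1, 17).reshape(4, 4)
h, w = 0, 0
stride, O_h, O_w = 2, 2, 2
# every stride-th row/column starting at (h, w): the (h, w) element of every patch at once
print(x[h : h + stride*O_h : stride, w : w + stride*O_w : stride])
# [[ 1  3]
#  [ 9 11]]
```

The values 1, 3, 9, 11 are exactly the first row of the stride-2 `im2col` matrix above.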
The padding implementation, on the other hand, is very simple. Using numpy's `pad` function,

```python:im2col.py
images = np.pad(images, [(0, 0), (0, 0), (pad, pad), (pad, pad)], "constant")
```

is all you need.
The behavior of the `pad` function is fairly involved (I will introduce it in detail another time), so for now I will only explain the call above.
The first argument of `pad` is the target array; that should be clear.
The issue is the second argument:

```python:im2col.py
[(0, 0), (0, 0), (pad, pad), (pad, pad)]
```

Passing this to the `pad` function means:

- the 1st dimension gets `(0, 0)`, that is, no padding
- the 2nd dimension gets `(0, 0)`, that is, no padding
- the 3rd dimension gets `(pad, pad)`, that is, `pad` rows of zeros are added above and below (`"constant"`)
- the 4th dimension gets `(pad, pad)`, that is, `pad` columns of zeros are added to the left and right (`"constant"`)

Several modes can be specified as the third argument; this time we want to fill with 0, so we specify `"constant"`.
See the official documentation (https://numpy.org/devdocs/reference/generated/numpy.pad.html) for more information.
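For instance, padding a single-image, single-channel $2 \times 2$ array with `pad = 1` looks like this (a minimal sketch):

```python
import numpy as np

x = np.arange(1, 5).reshape(1, 1, 2, 2)
pad = 1
y = np.pad(x, [(0, 0), (0, 0), (pad, pad), (pad, pad)], "constant")
print(y[0, 0])
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]
```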
Now, even with the above changes, running the code still raises an error. The reason, as you may have guessed, is that stride and padding change the output dimensions. Let's work out how they change.
Increasing the stride width decreases the number of filter placements in inverse proportion; you can see that applying the filter every other cell instead of every cell halves the count. Expressed as a formula:
```math
O_h = \cfrac{I_h - F_h}{\textrm{stride}} + 1 \\
O_w = \cfrac{I_w - F_w}{\textrm{stride}} + 1
```
For example, if $I_h = 4, F_h = 2, \textrm{stride} = 1$, then $O_h = (4 - 2)/1 + 1 = 3$, which matches the earlier example.
The effect of padding is also simple. Since each input image grows by $\textrm{pad}_{ud}$ at the top and bottom and by $\textrm{pad}_{lr}$ at the left and right, we can substitute

```math
I_h \leftarrow I_h + 2\textrm{pad}_{ud} \\
I_w \leftarrow I_w + 2\textrm{pad}_{lr}
```

which gives
```math
O_h = \cfrac{I_h - F_h + 2\textrm{pad}_{ud}}{\textrm{stride}} + 1 \\
O_w = \cfrac{I_w - F_w + 2\textrm{pad}_{lr}}{\textrm{stride}} + 1
```
Conversely, if you want the output image to keep the size of the input image, setting $O_h = I_h$ and $O_w = I_w$ gives
```math
\textrm{pad}_{ud} = \cfrac{1}{2}\left\{(I_h - 1) \textrm{stride} - I_h + F_h\right\} \\
\textrm{pad}_{lr} = \cfrac{1}{2}\left\{(I_w - 1) \textrm{stride} - I_w + F_w\right\}
```
Finally, let's also give the stride more freedom, with separate vertical and horizontal values:
```math
O_h = \cfrac{I_h - F_h + 2\textrm{pad}_{ud}}{\textrm{stride}_{ud}} + 1 \\
O_w = \cfrac{I_w - F_w + 2\textrm{pad}_{lr}}{\textrm{stride}_{lr}} + 1 \\
\textrm{pad}_{ud} = \cfrac{1}{2}\left\{(I_h - 1) \textrm{stride}_{ud} - I_h + F_h\right\} \\
\textrm{pad}_{lr} = \cfrac{1}{2}\left\{(I_w - 1) \textrm{stride}_{lr} - I_w + F_w\right\}
```
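Plugging in the MNIST-sized numbers used later ($I_h = 28, F_h = 7, \textrm{stride}_{ud} = 2$) as a quick check of these formulas:

```python
import numpy as np

I_h, F_h, stride_ud = 28, 7, 2
pad_ud = 0.5*((I_h - 1)*stride_ud - I_h + F_h)            # 16.5
O_h = int(np.ceil((I_h - F_h + 2*pad_ud)/stride_ud) + 1)  # 28
print(pad_ud, O_h)  # the output height equals the input height
```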
Adding stride and padding, the `im2col` with this extra freedom looks like the following. I also added a few customizations.
```python:im2col.py
import numpy as np


def im2col(images, filters, stride=1, pad=0, get_out_size=True):
    # Promote 2D/3D image arrays to 4D: (B, C, I_h, I_w)
    if images.ndim == 2:
        images = images.reshape(1, 1, *images.shape)
    elif images.ndim == 3:
        B, I_h, I_w = images.shape
        images = images.reshape(B, 1, I_h, I_w)
    # Promote 2D/3D filter arrays to 4D: (M, C, F_h, F_w)
    if filters.ndim == 2:
        filters = filters.reshape(1, 1, *filters.shape)
    elif filters.ndim == 3:
        M, F_h, F_w = filters.shape
        filters = filters.reshape(M, 1, F_h, F_w)
    B, C, I_h, I_w = images.shape
    _, _, F_h, F_w = filters.shape
    if isinstance(stride, tuple):
        stride_ud, stride_lr = stride
    else:
        stride_ud = stride
        stride_lr = stride
    if isinstance(pad, tuple):
        pad_ud, pad_lr = pad
    elif isinstance(pad, int):
        pad_ud = pad
        pad_lr = pad
    elif pad == "same":
        # Padding that keeps the output the same size as the input (kept as float)
        pad_ud = 0.5*((I_h - 1)*stride_ud - I_h + F_h)
        pad_lr = 0.5*((I_w - 1)*stride_lr - I_w + F_w)
    pad_zero = (0, 0)

    O_h = int(np.ceil((I_h - F_h + 2*pad_ud)/stride_ud) + 1)
    O_w = int(np.ceil((I_w - F_w + 2*pad_lr)/stride_lr) + 1)

    pad_ud = int(np.ceil(pad_ud))
    pad_lr = int(np.ceil(pad_lr))
    pad_ud = (pad_ud, pad_ud)
    pad_lr = (pad_lr, pad_lr)
    images = np.pad(images, [pad_zero, pad_zero, pad_ud, pad_lr],
                    "constant")

    cols = np.empty((B, C, F_h, F_w, O_h, O_w))
    for h in range(F_h):
        h_lim = h + stride_ud*O_h
        for w in range(F_w):
            w_lim = w + stride_lr*O_w
            cols[:, :, h, w, :, :] \
                = images[:, :, h:h_lim:stride_ud, w:w_lim:stride_lr]

    if get_out_size:
        return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w), (O_h, O_w)
    else:
        return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w)
```
I will explain briefly.
```python:im2col.py
def im2col(images, filters, stride=1, pad=0, get_out_size=True):
    if images.ndim == 2:
        images = images.reshape(1, 1, *images.shape)
    elif images.ndim == 3:
        B, I_h, I_w = images.shape
        images = images.reshape(B, 1, I_h, I_w)
    if filters.ndim == 2:
        filters = filters.reshape(1, 1, *filters.shape)
    elif filters.ndim == 3:
        M, F_h, F_w = filters.shape
        filters = filters.reshape(M, 1, F_h, F_w)
    B, C, I_h, I_w = images.shape
    _, _, F_h, F_w = filters.shape
    if isinstance(stride, tuple):
        stride_ud, stride_lr = stride
    else:
        stride_ud = stride
        stride_lr = stride
    if isinstance(pad, tuple):
        pad_ud, pad_lr = pad
    elif isinstance(pad, int):
        pad_ud = pad
        pad_lr = pad
    elif pad == "same":
        pad_ud = 0.5*((I_h - 1)*stride_ud - I_h + F_h)
        pad_lr = 0.5*((I_w - 1)*stride_lr - I_w + F_w)
    pad_zero = (0, 0)
```
Here the function:

- takes the filter array itself as an argument, reducing the number of arguments
- converts the input images to 4D if they are not already 4D
- converts the filters to 4D if they are not already 4D
- gets the batch size, the number of channels, and the size of one input image
- discards the number of filters and the filter channel count, which are not needed here (the `_, _, ...` part), and gets the size of one filter
- if `stride` is a `tuple`, treats the vertical and horizontal stride widths as individually specified; otherwise uses the same value for both
- if `pad` is a `tuple`, treats the vertical and horizontal padding widths as individually specified; otherwise uses the same value for both
- if `pad == "same"` is specified, computes the padding width that preserves the input image size as a **`float`** (needed for the output-size calculation that follows)
```python:im2col.py
    O_h = int(np.ceil((I_h - F_h + 2*pad_ud)/stride_ud) + 1)
    O_w = int(np.ceil((I_w - F_w + 2*pad_lr)/stride_lr) + 1)
    pad_ud = int(np.ceil(pad_ud))
    pad_lr = int(np.ceil(pad_lr))
    pad_ud = (pad_ud, pad_ud)
    pad_lr = (pad_lr, pad_lr)
    images = np.pad(images, [pad_zero, pad_zero, pad_ud, pad_lr],
                    "constant")
    cols = np.empty((B, C, F_h, F_w, O_h, O_w))
```

Here we

- calculate the size of the output image
- turn the padding widths into tuples for readability
- pad the input image
- allocate memory for the output array
```python:im2col.py
    for h in range(F_h):
        h_lim = h + stride_ud*O_h
        for w in range(F_w):
            w_lim = w + stride_lr*O_w
            cols[:, :, h, w, :, :] \
                = images[:, :, h:h_lim:stride_ud, w:w_lim:stride_lr]
    if get_out_size:
        return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w), (O_h, O_w)
    else:
        return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w)
```

Finally, the main loop and the return value:

- for readability, the new variables `h_lim` and `w_lim` mark the bottom and right limits of the slice taken at each filter position
- values are gathered from the input image at each stride width and stored in the output array `cols`
- the dimensions are swapped, the array reshaped, and the result returned (together with the output size if `get_out_size` is `True`)
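Before moving to real data, a minimal smoke test of the finished function (my own example, assuming the `im2col` above is in scope):

```python
import numpy as np

x = np.arange(1, 17).reshape(4, 4)  # promoted internally to (1, 1, 4, 4)
f = np.ones((2, 2))                 # promoted internally to (1, 1, 2, 2)
cols, (O_h, O_w) = im2col(x, f, stride=2)
print(O_h, O_w)    # 2 2
print(cols.shape)  # (4, 4) = (C*F_h*F_w, B*O_h*O_w)
```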
Download the MNIST data from the Keras dataset and experiment.
```python:mnist_test.py
#%pip install tensorflow
#%pip install keras
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
# Assumes the im2col function defined above is in scope
# (e.g. paste it here or import it from im2col.py).

# Number of images to use
B = 3

# Get the dataset
(x_train, _), (_, _) = mnist.load_data()
x_train = x_train[:B]

# Display the images
fig, ax = plt.subplots(1, B)
for i, x in enumerate(x_train):
    ax[i].imshow(x, cmap="gray")
fig.tight_layout()
plt.savefig("mnist_data.png")
plt.show()

# Try to detect vertical lines
M = 1
C = 1
F_h = 7
F_w = 7
_, I_h, I_w = x_train.shape
f = np.zeros((F_h, F_w))
f[:, int(F_w/2)] = 1
no_pad, (O_h, O_w) = im2col(x_train, f, stride=2, pad="same")
filters = im2col(f, f, get_out_size=False)
y = np.dot(filters.T, no_pad).reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3).reshape(B, O_h, O_w)
fig2, ax2 = plt.subplots(1, B)
for i, x in enumerate(y):
    ax2[i].imshow(x[F_h : I_h-F_h, F_w : I_w-F_w], cmap="gray")
fig2.tight_layout()
plt.savefig("vertical_filtering.png")
plt.show()

# Try to detect horizontal lines
f = np.zeros((F_h, F_w))
f[int(F_h/2), :] = 1
no_pad, (O_h, O_w) = im2col(x_train, f, stride=2, pad="same")
filters = im2col(f, f, get_out_size=False)
y = np.dot(filters.T, no_pad).reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3).reshape(B, O_h, O_w)
fig3, ax3 = plt.subplots(1, B)
for i, x in enumerate(y):
    ax3[i].imshow(x[F_h : I_h-F_h, F_w : I_w-F_w], cmap="gray")
fig3.tight_layout()
plt.savefig("horizontal_filtering.png")
plt.show()

# Try to detect lines falling to the right
f = np.zeros((F_h, F_w))
for i in range(F_h):
    f[i, i] = 1
no_pad, (O_h, O_w) = im2col(x_train, f, stride=2, pad="same")
filters = im2col(f, f, get_out_size=False)
y = np.dot(filters.T, no_pad).reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3).reshape(B, O_h, O_w)
fig4, ax4 = plt.subplots(1, B)
for i, x in enumerate(y):
    ax4[i].imshow(x[F_h : I_h-F_h, F_w : I_w-F_w], cmap="gray")
fig4.tight_layout()
plt.savefig("right_down_filtering.png")
plt.show()

# Try to detect lines rising to the right
f = np.zeros((F_h, F_w))
for i in range(F_h):
    f[F_h - i - 1, i] = 1
no_pad, (O_h, O_w) = im2col(x_train, f, stride=2, pad="same")
filters = im2col(f, f, get_out_size=False)
y = np.dot(filters.T, no_pad).reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3).reshape(B, O_h, O_w)
fig4, ax4 = plt.subplots(1, B)
for i, x in enumerate(y):
    ax4[i].imshow(x[F_h : I_h-F_h, F_w : I_w-F_w], cmap="gray")
fig4.tight_layout()
plt.savefig("right_up_filtering.png")
plt.show()
```
This is the end of the explanation of `im2col`. If you find any bugs or smarter ways to write things, I would appreciate it if you let me know in the comments.
- MNIST: image dataset of handwritten digits
- Introduction to Deep Learning ~ Basics ~
- Introduction to Deep Learning ~ Code Preparation ~
- Introduction to Deep Learning ~ Forward Propagation ~
- List of activation functions (2020)