A thorough understanding of im2col

Target audience

For those who want to know more about the `im2col` function that appears in CNN-based image recognition. We thoroughly explain everything from the initial implementation to the improved version, the batch/channel-compatible version, and the stride/padding-compatible version, using gifs and images.

Table of contents

- What is `im2col`?
- Why you need it
  - What is CNN?
  - Filtering
- `im2col` behavior and initial implementation
  - `im2col` behavior
  - Early implementation of `im2col`
  - Problems with the early `im2col`
- Improved `im2col` (initial ver)
  - Change 1
  - Change 2
  - Change 3
- Extension to multidimensional arrays
  - Following the math
  - Implementation
- Stride and padding
  - Stride
  - Padding
- Implementing stride and padding
  - Stride implementation
  - Padding implementation
  - Output dimension calculation
- Completed `im2col`
- Experiment with MNIST
- Conclusion

What is `im2col`?

`im2col` is a function used in image recognition. It converts a multidimensional array into a two-dimensional array in a reversible way. Its biggest advantage is that it lets you **take full advantage of numpy's fast matrix operations**. It is no exaggeration to say that today's image recognition would not have developed without it (probably).

Why you need it

You might think of an image as having a two-dimensional data structure. It looks two-dimensional, but in actual machine learning we usually work with images decomposed into RGB components (these are called **channels**), so a color image has a three-dimensional data structure.
color_image.png
A black-and-white image has only one channel, but since multiple images are fed through in one propagation (this is called a **batch**), the data is again three-dimensional. In practice it is inefficient to implement a separate 3D code path just for black-and-white images, so they are treated as one-channel images, aligned with color images, giving a four-dimensional data structure overall.
4dim_image.png
With a double loop you could process the images one by one, but that erases numpy's advantage (numpy is slow when you iterate over it with a Python for loop). This is why we need `im2col`, a function that turns the 4D data into 2D so that numpy's strengths can be used to the fullest.
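As a quick illustration of why this matters, here is a small benchmark sketch of my own (the array size and any timings are arbitrary, and the snippet is not part of the article's implementation): doubling every element of a matrix with a Python double loop versus one vectorized numpy expression.

import time
import numpy as np

x = np.random.rand(1000, 1000)

# Double every element with a Python double loop
start = time.time()
y = np.empty_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        y[i, j] = 2*x[i, j]
print("loop      : {:.4f}s".format(time.time() - start))

# The same operation as one vectorized numpy expression
start = time.time()
z = 2*x
print("vectorized: {:.6f}s".format(time.time() - start))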

What is CNN?

CNN is an abbreviation for Convolutional Neural Network, and it is used for data in which each coordinate point is closely related to the coordinate points around it. Simple examples are images and video. Before CNN, when learning from data such as images with neural networks, the 2D data was flattened and treated as 1D data, ignoring the important correlations within the 2D structure. CNN caused a breakthrough in image recognition by extracting features while preserving the two-dimensional structure of images. The technique is inspired by the processing performed when information is transmitted from the retina to the optic nerve, making processing closer to human recognition possible.

Filtering

CNN processing mainly consists of filtering (convolution layers) and pooling (pooling layers). Filtering is the process of detecting features, such as vertical lines, in image data. It is similar to what human retinal cells do (some retinal cells respond to specific patterns and emit electrical signals that convey the information to the optic nerve). Pooling is the process of extracting the more prominent features from those found by filtering. It is similar to what happens in the human optic nerve (the number of nerve cells decreases as information travels from the optic nerve to the brain, i.e. the information is compressed). From a data-reduction point of view this is a very effective operation: it saves memory and reduces computation while retaining good features. `im2col` and `col2im` (introduced in another article) also play an active role in implementing pooling, but here we focus on filtering.
filter_image.gif
The gif above illustrates filtering.
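For reference, filtering can also be written directly as a double loop over output positions, sliding the filter and summing element-wise products. This is a minimal sketch of my own (the name filter2d_naive is mine, not from the article); im2col will replace exactly this kind of loop.

import numpy as np


def filter2d_naive(image, f):
    I_h, I_w = image.shape
    F_h, F_w = f.shape
    O_h, O_w = I_h - F_h + 1, I_w - F_w + 1
    out = np.empty((O_h, O_w))
    for h in range(O_h):
        for w in range(O_w):
            # sum of element-wise products at this filter position
            out[h, w] = np.sum(image[h : h+F_h, w : w+F_w] * f)
    return out


x = np.arange(1, 17).reshape(4, 4)  # the 4x4 input from the gif
f = np.arange(-4, 0).reshape(2, 2)  # a 2x2 filter (W, X, Y, Z)
print(filter2d_naive(x, f))         # 3x3 output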

`im2col` behavior and initial implementation

To understand the implementation of `im2col`, let's thoroughly dissect its behavior using formulas, images, and gifs.

`im2col` behavior

Written out mathematically, the previous gif computes

a = 1W + 2X + 5Y + 6Z \\
b = 2W + 3X + 6Y + 7Z \\
c = 3W + 4X + 7Y + 8Z \\
d = 5W + 6X + 9Y + 10Z \\
e = 6W + 7X + 10Y + 11Z \\
f = 7W + 8X + 11Y + 12Z \\
g = 9W + 10X + 13Y + 14Z \\
h = 10W + 11X + 14Y + 15Z \\
i = 11W + 12X + 15Y + 16Z

`im2col` rearranges the image data so that this can be computed as a single matrix multiplication.
im2col_image.gif
Let's also check the formula.

\begin{align}
  \left(
    \begin{array}{c}
      a \\
      b \\
      c \\
      d \\
      e \\
      f \\
      g \\
      h \\
      i
    \end{array}
  \right)^{\top}
  &=
  \left(
    \begin{array}{cccc}
      W & X & Y & Z
    \end{array}
  \right)
  \left(
    \begin{array}{ccccccccc}
      1 & 2 & 3 & 5 & 6 & 7 & 9 & 10 & 11 \\
      2 & 3 & 4 & 6 & 7 & 8 & 10 & 11 & 12 \\
      5 & 6 & 7 & 9 & 10 & 11 & 13 & 14 & 15 \\
      6 & 7 & 8 & 10 & 11 & 12 & 14 & 15 & 16
    \end{array}
  \right) \\
  &=
  \left(
    \begin{array}{c}
      1W + 2X + 5Y + 6Z \\
      2W + 3X + 6Y + 7Z \\
      3W + 4X + 7Y + 8Z \\
      5W + 6X + 9Y + 10Z \\
      6W + 7X + 10Y + 11Z \\
      7W + 8X + 11Y + 12Z \\
      9W + 10X + 13Y + 14Z \\
      10W + 11X + 14Y + 15Z \\
      11W + 12X + 15Y + 16Z
    \end{array}
  \right)^{\top}
\end{align}

Early implementation of `im2col`

So, let's first implement this straightforwardly. Filtering a $4 \times 4$ matrix with a $2 \times 2$ matrix outputs a $3 \times 3$ matrix. Let's generalize this and consider filtering an $I_h \times I_w$ matrix with an $F_h \times F_w$ filter. The index of the filter's upper-left corner at the last filter position matches the size of the output matrix, because the number of filter positions and the size of the output matrix coincide. cal_output_size.png From the image, the size of the output matrix is $(I_h - F_h + 1) \times (I_w - F_w + 1) = O_h \times O_w$. In other words $O_h O_w$ elements are needed, so the number of columns of the `im2col` output is $O_h O_w$. The number of rows is determined by the size of the filter, giving $F_h F_w$, so when filtering an $I_h \times I_w$ input matrix with an $F_h \times F_w$ filter, the output matrix of `im2col` is $F_h F_w \times O_h O_w$. This can be turned into a program as follows.

Early im2col

early_im2col.py


import time
import numpy as np


def im2col(image, F_h, F_w):
    I_h, I_w = image.shape
    O_h = I_h - F_h + 1  # output height
    O_w = I_w - F_w + 1  # output width
    col = np.empty((F_h*F_w, O_h*O_w))
    for h in range(O_h):
        for w in range(O_w):
            # flatten the patch at (h, w) into one column
            col[:, w + h*O_w] = image[h : h+F_h, w : w+F_w].reshape(-1)
    return col

x = np.arange(1, 17).reshape(4, 4)
f = np.arange(-4, 0).reshape(2, 2)
print(im2col(x, 2, 2))
print(im2col(f, 2, 2).T)
print(im2col(f, 2, 2).T @ im2col(x, 2, 2))
early_im2col_ex.png

The matrix sizes work out as computed above. Next, let's look at the part of the implementation that actually performs the rearrangement.

early_im2col.py


for h in range(O_h):
    for w in range(O_w):
        col[:, w + h*O_w] = image[h : h+F_h, w : w+F_w].reshape(-1)

The write location in the output matrix corresponding to each h, w is as follows. early_image.png The write location is specified by `col[:, w + h*O_w]`, and the corresponding part of the input matrix, `image[h : h+F_h, w : w+F_w]`, is flattened with `.reshape(-1)` and assigned there. So far, so simple.
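For example, for h = w = 0 on the 4x4 input above, the assigned column is the flattened top-left patch (a small check of my own):

import numpy as np

x = np.arange(1, 17).reshape(4, 4)
print(x[0:2, 0:2])              # [[1 2] [5 6]]: the top-left patch
print(x[0:2, 0:2].reshape(-1))  # [1 2 5 6]: written to column w + h*O_w = 0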

Problems with the early `im2col`

Now, early_im2col.py has a serious drawback: as mentioned earlier, numpy is slow when accessed in a loop such as for. In practice the input arrays are much larger than the example x used in early_im2col.py (for example, each handwritten digit image in MNIST (http://yann.lecun.com/exdb/mnist/), a **very small** dataset, is a $28 \times 28$ matrix). Let's measure the processing time.

early_im2col.py


y = np.zeros((28, 28))
start = time.time()
for i in range(1000):
    im2col(y, 2, 2)
end = time.time()
print("time: {}".format(end - start))
early_im2col_time.png

It takes about 15 seconds to process a mere $28 \times 28$ matrix 1000 times. Since the MNIST database contains 60,000 handwritten digits, a simple calculation says it would take **900 seconds** just to filter every image once. This is not practical, since real machine learning applies many filters many times.

Improved `im2col` (initial ver)

Reviewing the problem, the issue is that the for loop accesses the numpy array many times, which means we can speed things up by reducing the number of accesses. In early_im2col.py the numpy array `image` is accessed $O_h O_w$ times; filtering a $28 \times 28$ input matrix with a $2 \times 2$ filter means $27 \times 27 = 729$ accesses. However, filters are generally much smaller than output matrices, and we can exploit this to dramatically reduce the number of numpy array accesses while doing equivalent work. That is the improved `im2col` (initial ver). It does something rather clever.

Improved `im2col` (initial ver)

improved_early_im2col.py


import time
import numpy as np


def im2col(image, F_h, F_w):
    I_h, I_w = image.shape
    O_h = I_h - F_h + 1
    O_w = I_w - F_w + 1
    col = np.empty((F_h, F_w, O_h, O_w))
    for h in range(F_h):
        for w in range(F_w):
            # gather, for filter cell (h, w), all the input values it touches
            col[h, w, :, :] = image[h : h+O_h, w : w+O_w]
    return col.reshape(F_h*F_w, O_h*O_w)

x = np.arange(1, 17).reshape(4, 4)
f = np.arange(-4, 0).reshape(2, 2)
print(im2col(x, 2, 2))
print(im2col(f, 2, 2).T)
print(im2col(f, 2, 2).T @ im2col(x, 2, 2))

y = np.zeros((28, 28))
start = time.time()
for i in range(1000):
    im2col(y, 2, 2)
end = time.time()
print("time: {}".format(end - start))
improved_early_im2col.png

As you can see, the result is about 150 times faster! At this speed it is practical (at least more so than before): processing all 60,000 images once takes only about 6 seconds. Now let's follow what changed, and how.

Change 1

The first change is how memory is allocated for the output matrix.

improved_early_im2col.py


col = np.empty((F_h, F_w, O_h, O_w))

improved_col.png

Memory is allocated as a four-dimensional data structure, as shown.

Change 2

The next change is that the number of loop iterations has been reduced from $O_h O_w$ to $F_h F_w$, cutting down the number of accesses.

improved_early_im2col.py


for h in range(F_h):
    for w in range(F_w):
        col[h, w, :, :] = image[h : h+O_h, w : w+O_w]

This reduces the number of numpy array accesses per MNIST image from 729 to a whopping 4! The locations accessed in the output array and in the input array at each loop iteration are shown below; accessing them this way builds the following output array.
improved_im2col_numbering.png

Change 3

Finally, the array is reshaped into the desired form when it is returned.

improved_early_im2col.py


return col.reshape(F_h*F_w, O_h*O_w)

In terms of numpy's behavior, the $(F_h, F_w, O_h, O_w)$ array is first flattened into one-dimensional data of shape $(F_h F_w O_h O_w,)$ and then reshaped into two-dimensional data of shape $(F_h F_w, O_h O_w)$. Put more concretely, each two-dimensional $(O_h, O_w)$ block in the figure is flattened into one dimension, and the results are stacked one below the other.
improved_im2col_reshape.png
Neat, isn't it?
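A minimal check of this reshape behavior (dummy values, my own sketch): numpy reshapes in row-major (C) order, so each (O_h, O_w) plane becomes one row of the result.

import numpy as np

F_h = F_w = 2
O_h = O_w = 3
col = np.arange(F_h*F_w*O_h*O_w).reshape(F_h, F_w, O_h, O_w)

flat = col.reshape(F_h*F_w, O_h*O_w)
# row k of flat is the flattened (O_h, O_w) plane of filter cell k
print(np.array_equal(flat[0], col[0, 0].reshape(-1)))  # True
print(np.array_equal(flat[3], col[1, 1].reshape(-1)))  # True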

Extension to multidimensional arrays

By the way, as mentioned in "What is `im2col`?", the arrays this function actually targets have a four-dimensional data structure. The filters are four-dimensional too: besides matching the number of channels of the input, a set of $M$ filters is prepared.
color_image_and_filter.png
Taking this into account, let's modify improved_early_im2col.py.

Following the math

First, let's work out mathematically what shapes are needed. A batch of color images has shape $(B, C, I_h, I_w)$, where $C$ is the number of channels and $B$ is the batch size. The filters have shape $(M, C, F_h, F_w)$. In improved_early_im2col.py, filtering an $(I_h, I_w)$ matrix with an $(F_h, F_w)$ filter produced arrays of shape $(F_h F_w, O_h O_w)$ and $(1, F_h F_w)$. Assuming $B = 1$ and $M = 1$, for filtering to be computable as a matrix product, the rows of the `im2col`-transformed input must match the columns of the flattened filter, so the shapes must be $(C F_h F_w, O_h O_w)$ and $(1, C F_h F_w)$. Also, since in general $B \ne M$, $B$ and $M$ must go into the dimensions that have nothing to do with $C F_h F_w$. Putting these facts together, the shapes that `im2col` should output are $(C F_h F_w, B O_h O_w)$ for the input and $(M, C F_h F_w)$ for the filters. The result of the filtering computation is then $(M, C F_h F_w) \times (C F_h F_w, B O_h O_w) = (M, B O_h O_w)$, which is reshaped and has its dimensions permuted to $(B, M, O_h, O_w) =: (B, C', I_h', I_w')$, which propagates as the input to the next layer.
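Here is a shape-only sketch of that bookkeeping (all sizes are dummy values of mine; no actual filtering is performed):

import numpy as np

B, C, I_h, I_w = 2, 3, 4, 4  # batch, channels, image size
M, F_h, F_w = 5, 2, 2        # number of filters, filter size
O_h, O_w = I_h - F_h + 1, I_w - F_w + 1

cols = np.zeros((C*F_h*F_w, B*O_h*O_w))  # im2col output for the images
filt = np.zeros((M, C*F_h*F_w))          # im2col output for the filters

out = filt @ cols                        # (M, B*O_h*O_w)
out = out.reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3)
print(out.shape)                         # (2, 5, 3, 3) = (B, M, O_h, O_w)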

Implementation

The implementation is almost the same as improved_early_im2col.py; batch and channel dimensions are simply added at the front.
BC_cols.png

Batch/channel-compatible `im2col`

BC_support_im2col.py


import time
import numpy as np


def im2col(images, F_h, F_w):
    B, C, I_h, I_w = images.shape
    O_h = I_h - F_h + 1
    O_w = I_w - F_w + 1
    cols = np.empty((B, C, F_h, F_w, O_h, O_w))
    for h in range(F_h):
        for w in range(F_w):
            cols[:, :, h, w, :, :] = images[:, :, h : h+O_h, w : w+O_w]
    return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w)

x = np.arange(1, 3*3*4*4+1).reshape(3, 3, 4, 4)
f = np.arange(-3*3*2*2, 0).reshape(3, 3, 2, 2)
print(im2col(x, 2, 2))
print(im2col(f, 2, 2).T)
print(np.dot(im2col(f, 2, 2).T, im2col(x, 2, 2)))

y = np.zeros((100, 3, 28, 28))
start = time.time()
for i in range(10):
    im2col(y, 2, 2)
end = time.time()
print("time: {}".format(end - start))
BC_support_im2col_ex.png

The biggest change is the return value.

BC_support_im2col.py


return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w)

Here the order of the dimensions is changed using numpy's transpose, and reshaping after the reordering returns the correct output. The dimensions correspond as follows.

\begin{array}{ccccccc}
  (&0, &1, &2, &3, &4, &5) \\
  (&B, &C, &F_h, &F_w, &O_h, &O_w)
\end{array}
\xrightarrow[\textrm{transpose}]{\textrm{swap}}
\begin{array}{ccccccc}
  (&1, &2, &3, &0, &4, &5) \\
  (&C, &F_h, &F_w, &B, &O_h, &O_w)
\end{array}
\xrightarrow[\textrm{reshape}]{\textrm{deform}}
(C F_h F_w, B O_h O_w)

This completes an `im2col` that also supports batches and channels!

Stride and padding

Now, this is not quite the end. Finally, I would like to introduce **stride** and **padding**. Both are essential for a more efficient and effective CNN implementation.

Stride

In the implementations so far, the filter has been shifted one cell at a time as a matter of course. This shift width is called the **stride**, and there is no rule that it must be one cell. In fact, since a real image rarely changes much over a shift of a single pixel, the stride is often set larger than 1.

Padding

Unlike the stride, **padding** has not appeared at all in the implementations so far. Its main roles are to **keep the output image the same size as the input** and to **pick up the information toward the edges of the image**. Concretely, the range over which the filter moves is widened by filling the border of the input image with $0$.
pading_image.png

Implementing stride and padding

Let's take a look at each implementation.

Stride implementation

Implementing the stride isn't that difficult: we simply allow the shift width, which has been fixed at 1, to be changed. Until now we wrote

BC_support_im2col.py


cols[:, :, h, w, :, :] = images[:, :, h : h+O_h, w : w+O_w]

and this becomes

im2col.py


cols[:, :, h, w, :, :] = images[:, :, h : h + stride*O_h : stride, w : w + stride*O_w : stride]

The movement of the early version with stride 2 looks like this:
stride_image.gif
In formulas:

a = 1W + 2X + 5Y + 6Z \\
b = 3W + 4X + 7Y + 8Z \\
c = 9W + 10X + 13Y + 14Z \\
d = 11W + 12X + 15Y + 16Z \\
\Leftrightarrow \left(
  \begin{array}{c}
    a \\
    b \\
    c \\
    d
  \end{array}
\right)^{\top}
=
\left(
  \begin{array}{cccc}
    W & X & Y & Z
  \end{array}
\right)
\left(
  \begin{array}{cccc}
    1 & 3 & 9 & 11 \\
    2 & 4 & 10 & 12 \\
    5 & 7 & 13 & 15 \\
    6 & 8 & 14 & 16
  \end{array}
\right)

That is the early version; the improved version moves like this:
improved_stride_image.gif
stride_col.png
Tricky as ever... it's impressive that someone came up with this.
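You can check the strided slice against the matrix above on the same 4x4 input (a verification snippet of my own):

import numpy as np

x = np.arange(1, 17).reshape(4, 4)
stride, O_h, O_w = 2, 2, 2

# filter cell (h, w) = (0, 0): take every stride-th element
print(x[0 : stride*O_h : stride, 0 : stride*O_w : stride])
# [[ 1  3]
#  [ 9 11]] -> flattened, this is the first row (1, 3, 9, 11) above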

Padding implementation

On the other hand, implementing padding is very simple. Using numpy's pad function, we write

im2col.py


images = np.pad(images, [(0, 0), (0, 0), (pad, pad), (pad, pad)], "constant")

and that's all. The pad function's behavior is fairly involved (I may introduce it in detail elsewhere), so for now I will only explain the call above. The first argument of pad is the target array; that part should be clear. The issue is the second argument.

im2col.py


[(0, 0), (0, 0), (pad, pad), (pad, pad)]

Passing this to the pad function means:

- the 1st dimension gets (0, 0), i.e. no padding
- the 2nd dimension gets (0, 0), i.e. no padding
- the 3rd dimension gets (pad, pad), i.e. pad rows of 0s ("constant") are added above and below
- the 4th dimension gets (pad, pad), i.e. pad columns of 0s ("constant") are added on the left and right

Several values can be given as the third argument, but since we want to pad with 0 here, we specify "constant". See the official documentation (https://numpy.org/devdocs/reference/generated/numpy.pad.html) for details.
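A minimal example of this call on a tiny (1, 1, 2, 2) array (my own illustration):

import numpy as np

x = np.arange(1, 5).reshape(1, 1, 2, 2)
pad = 1

padded = np.pad(x, [(0, 0), (0, 0), (pad, pad), (pad, pad)], "constant")
print(padded.shape)  # (1, 1, 4, 4): only the spatial dimensions grew
print(padded[0, 0])
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]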

Output dimension calculation

Now, even after making the above changes, running the code still raises an error. The reason, as you might expect, is that the output dimensions change once stride and padding are introduced. Let's work out how they change.

Impact of stride

Increasing the stride width decreases the number of filter applications in inverse proportion: applying the filter every 2 cells instead of every cell halves the count. Expressed as formulas,

O_h = \cfrac{I_h - F_h}{\textrm{stride}} + 1\\
O_w = \cfrac{I_w - F_w}{\textrm{stride}} + 1

For $I_h = 4, F_h = 2, \textrm{stride} = 1$:

O_h = \cfrac{4 - 2}{1} + 1 = 3

and for $I_h = 4, F_h = 2, \textrm{stride} = 2$:

O_h = \cfrac{4 - 2}{2} + 1 = 2

which matches the earlier images.

Impact of padding

The effect of padding is very simple. Since each input image grows by $\textrm{pad}_{ud}$ on the top and bottom and by $\textrm{pad}_{lr}$ on the left and right, we can substitute

I_h \leftarrow I_h + 2\textrm{pad}_{ud} \\
I_w \leftarrow I_w + 2\textrm{pad}_{lr}

into the formulas above, giving

O_h = \cfrac{I_h - F_h + 2\textrm{pad}_{ud}}{\textrm{stride}} + 1 \\
O_w = \cfrac{I_w - F_w + 2\textrm{pad}_{lr}}{\textrm{stride}} + 1

Conversely, if you want the output image to keep the size of the input image, setting $O_h = I_h$ and $O_w = I_w$ yields

\textrm{pad}_{ud} = \cfrac{1}{2}\left\{(I_h - 1) \textrm{stride} - I_h + F_h\right\} \\
\textrm{pad}_{lr} = \cfrac{1}{2}\left\{(I_w - 1) \textrm{stride} - I_w + F_w\right\}

Finally, let's also allow separate vertical and horizontal strides:

O_h = \cfrac{I_h - F_h + 2\textrm{pad}_{ud}}{\textrm{stride}_{ud}} + 1 \\
O_w = \cfrac{I_w - F_w + 2\textrm{pad}_{lr}}{\textrm{stride}_{lr}} + 1 \\
\textrm{pad}_{ud} = \cfrac{1}{2}\left\{(I_h - 1) \textrm{stride}_{ud} - I_h + F_h\right\} \\
\textrm{pad}_{lr} = \cfrac{1}{2}\left\{(I_w - 1) \textrm{stride}_{lr} - I_w + F_w\right\}
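These formulas translate directly into small helpers (the names output_size and same_pad are mine; the completed im2col below computes the same quantities inline):

def output_size(I, F, stride, pad):
    # O = (I - F + 2*pad)/stride + 1, assuming the division is exact
    return (I - F + 2*pad)//stride + 1


def same_pad(I, F, stride):
    # padding that keeps the output the same size as the input;
    # may be fractional (the completed im2col rounds it with np.ceil)
    return 0.5*((I - 1)*stride - I + F)


print(output_size(4, 2, 1, 0))  # 3
print(output_size(4, 2, 2, 0))  # 2
print(same_pad(28, 7, 1))       # 3.0 -> output_size(28, 7, 1, 3) == 28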

Completed `im2col`

The `im2col` below adds the flexibility of stride and padding, plus a few other conveniences.

im2col.py


import numpy as np


def im2col(images, filters, stride=1, pad=0, get_out_size=True):
    if images.ndim == 2:
        images = images.reshape(1, 1, *images.shape)
    elif images.ndim == 3:
        B, I_h, I_w = images.shape
        images = images.reshape(B, 1, I_h, I_w)
    if filters.ndim == 2:
        filters = filters.reshape(1, 1, *filters.shape)
    elif filters.ndim == 3:
        M, F_h, F_w = filters.shape
        filters = filters.reshape(M, 1, F_h, F_w)
    B, C, I_h, I_w = images.shape
    _, _, F_h, F_w = filters.shape
    
    if isinstance(stride, tuple):
        stride_ud, stride_lr = stride
    else:
        stride_ud = stride
        stride_lr = stride
    if isinstance(pad, tuple):
        pad_ud, pad_lr = pad
    elif isinstance(pad, int):
        pad_ud = pad
        pad_lr = pad
    elif pad == "same":
        pad_ud = 0.5*((I_h - 1)*stride_ud - I_h + F_h)
        pad_lr = 0.5*((I_w - 1)*stride_lr - I_w + F_w)
    pad_zero = (0, 0)
    
    O_h = int(np.ceil((I_h - F_h + 2*pad_ud)/stride_ud) + 1)
    O_w = int(np.ceil((I_w - F_w + 2*pad_lr)/stride_lr) + 1)
    
    pad_ud = int(np.ceil(pad_ud))
    pad_lr = int(np.ceil(pad_lr))
    pad_ud = (pad_ud, pad_ud)
    pad_lr = (pad_lr, pad_lr)
    images = np.pad(images, [pad_zero, pad_zero, pad_ud, pad_lr], \
                    "constant")
    
    cols = np.empty((B, C, F_h, F_w, O_h, O_w))
    for h in range(F_h):
        h_lim = h + stride_ud*O_h
        for w in range(F_w):
            w_lim = w + stride_lr*O_w
            cols[:, :, h, w, :, :] \
                = images[:, :, h:h_lim:stride_ud, w:w_lim:stride_lr]
    if get_out_size:
        return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w), (O_h, O_w)
    else:
        return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w)

Let me explain each part briefly.

Reshaping and setup

im2col.py


def im2col(images, filters, stride=1, pad=0, get_out_size=True):
    if images.ndim == 2:
        images = images.reshape(1, 1, *images.shape)
    elif images.ndim == 3:
        B, I_h, I_w = images.shape
        images = images.reshape(B, 1, I_h, I_w)
    if filters.ndim == 2:
        filters = filters.reshape(1, 1, *filters.shape)
    elif filters.ndim == 3:
        M, F_h, F_w = filters.shape
        filters = filters.reshape(M, 1, F_h, F_w)
    B, C, I_h, I_w = images.shape
    _, _, F_h, F_w = filters.shape
    
    if isinstance(stride, tuple):
        stride_ud, stride_lr = stride
    else:
        stride_ud = stride
        stride_lr = stride
    if isinstance(pad, tuple):
        pad_ud, pad_lr = pad
    elif isinstance(pad, int):
        pad_ud = pad
        pad_lr = pad
    elif pad == "same":
        pad_ud = 0.5*((I_h - 1)*stride_ud - I_h + F_h)
        pad_lr = 0.5*((I_w - 1)*stride_lr - I_w + F_w)
    pad_zero = (0, 0)
In this part, the function:

- takes the filters themselves as an argument, reducing the number of arguments
- converts the input images to 4D if they are not already
- converts the filters to 4D if they are not already
- gets the batch size, the number of channels, and the size of one input image
- discards the number of filters and the filter channel count, which are not needed here (the `_, _, ...` part), and gets the size of one filter
- if stride is a tuple, treats it as separate vertical and horizontal stride widths; otherwise uses the same value for both
- if pad is a tuple, treats it as separate vertical and horizontal padding widths; otherwise uses the same value for both
- if pad == "same" is specified, computes the padding width that preserves the input image size, as a **float** (for the output size calculation that follows)

Preparation

im2col.py


    O_h = int(np.ceil((I_h - F_h + 2*pad_ud)/stride_ud) + 1)
    O_w = int(np.ceil((I_w - F_w + 2*pad_lr)/stride_lr) + 1)
    
    pad_ud = int(np.ceil(pad_ud))
    pad_lr = int(np.ceil(pad_lr))
    pad_ud = (pad_ud, pad_ud)
    pad_lr = (pad_lr, pad_lr)
    images = np.pad(images, [pad_zero, pad_zero, pad_ud, pad_lr], \
                    "constant")
    
    cols = np.empty((B, C, F_h, F_w, O_h, O_w))

Here, the function:

- calculates the size of the output image
- converts the padding amounts to tuples for readability
- pads the input images
- allocates memory for the output array

Main processing and return value

im2col.py


    for h in range(F_h):
        h_lim = h + stride_ud*O_h
        for w in range(F_w):
            w_lim = w + stride_lr*O_w
            cols[:, :, h, w, :, :] \
                = images[:, :, h:h_lim:stride_ud, w:w_lim:stride_lr]
    if get_out_size:
        return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w), (O_h, O_w)
    else:
        return cols.transpose(1, 2, 3, 0, 4, 5).reshape(C*F_h*F_w, B*O_h*O_w)

Finally, the main processing and the return value:

- for readability, the new variables h_lim and w_lim define the bottom and right limits of each filter cell's range
- the values are gathered from the input image at each stride width and stored in the output array cols
- the dimensions are swapped, the array is reshaped, and the result is returned

Experiment with MNIST

Download the MNIST data from the Keras dataset and experiment.

mnist_test.py


#%pip install tensorflow
#%pip install keras
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist

# the completed im2col defined above (im2col.py) is assumed to be in scope


# Number of images to fetch
B = 3

# Fetch the dataset
(x_train, _), (_, _) = mnist.load_data()
x_train = x_train[:B]

# Display the originals
fig, ax = plt.subplots(1, B)
for i, x in enumerate(x_train):
    ax[i].imshow(x, cmap="gray")
fig.tight_layout()
plt.savefig("mnist_data.png ")
plt.show()

# Detect vertical lines
M = 1
C = 1
F_h = 7
F_w = 7
_, I_h, I_w = x_train.shape
f = np.zeros((F_h, F_w))
f[:, int(F_w/2)] = 1
no_pad, (O_h, O_w) = im2col(x_train, f, stride=2, pad="same")
filters = im2col(f, f, get_out_size=False)
y = np.dot(filters.T, no_pad).reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3).reshape(B, O_h, O_w)
fig2, ax2 = plt.subplots(1, B)
for i, x in enumerate(y):
    ax2[i].imshow(x[F_h : I_h-F_h, F_w : I_w-F_w], cmap="gray")
fig2.tight_layout()
plt.savefig("vertical_filtering.png ")
plt.show()

# Detect horizontal lines
f = np.zeros((F_h, F_w))
f[int(F_h / 2), :] = 1
no_pad, (O_h, O_w) = im2col(x_train, f, stride=2, pad="same")
filters = im2col(f, f, get_out_size=False)
y = np.dot(filters.T, no_pad).reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3).reshape(B, O_h, O_w)
fig3, ax3 = plt.subplots(1, B)
for i, x in enumerate(y):
    ax3[i].imshow(x[F_h : I_h-F_h, F_w : I_w-F_w], cmap="gray")
fig3.tight_layout()
plt.savefig("horizontal_filtering.png ")
plt.show()

# Detect right-down diagonals
f = np.zeros((F_h, F_w))
for i in range(F_h):
    f[i, i] = 1
no_pad, (O_h, O_w) = im2col(x_train, f, stride=2, pad="same")
filters = im2col(f, f, get_out_size=False)
y = np.dot(filters.T, no_pad).reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3).reshape(B, O_h, O_w)
fig4, ax4 = plt.subplots(1, B)
for i, x in enumerate(y):
    ax4[i].imshow(x[F_h : I_h-F_h, F_w : I_w-F_w], cmap="gray")
fig4.tight_layout()
plt.savefig("right_down_filtering.png ")
plt.show()

# Detect right-up diagonals
f = np.zeros((F_h, F_w))
for i in range(F_h):
    f[F_h - i - 1, i] = 1
no_pad, (O_h, O_w) = im2col(x_train, f, stride=2, pad="same")
filters = im2col(f, f, get_out_size=False)
y = np.dot(filters.T, no_pad).reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3).reshape(B, O_h, O_w)
fig4, ax4 = plt.subplots(1, B)
for i, x in enumerate(y):
    ax4[i].imshow(x[F_h : I_h-F_h, F_w : I_w-F_w], cmap="gray")
fig4.tight_layout()
plt.savefig("right_up_filtering.png ")
plt.show()
Output result:

Original data
![mnist_data.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/640911/0b3333f9-7c32-cef3-697b-8a24bdf8f5e3.png)

Vertical line detection result
![vertical_filtering.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/640911/a7561d2c-0951-dd53-d718-1a3d61f02d69.png)

Horizontal line detection result
![horizontal_filtering.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/640911/db838924-ab55-d460-6026-467cfc6ef391.png)

Right-down detection result
![right_down_filtering.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/640911/aeddeae2-60cc-ffc4-312c-5c075bf760fc.png)

Right-up detection result
![right_up_filtering.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/640911/66964a57-9e17-6e4f-7c0f-7fca58832920.png)
The first two lines are there because tensorflow and keras must be installed. If necessary, remove the leading # and run them; once installed, you can comment them out again. As the output shows, after each filter is applied only the targeted kind of line remains dark. This is feature detection.

Conclusion

This concludes the explanation of `im2col`. If you find any bugs or smarter ways to write this, I would appreciate it if you let me know in the comments.

References

- MNIST: image dataset of handwritten digits (http://yann.lecun.com/exdb/mnist/)

Deep learning series

- Introduction to Deep Learning ~ Basics ~
- Introduction to Deep Learning ~ Code Preparation ~
- Introduction to Deep Learning ~ Forward Propagation ~
- List of activation functions (2020)
