I studied "semantic segmentation", a deep learning technique for images that extracts the regions of interest from a picture.

Bear image generation

The first problem in learning semantic segmentation was getting images. I couldn't find a suitable dataset, so I started by automatically generating images for practice.

Using the function created in "Automatically generate koala and bear images", I will make the "image data" and "correct answer data" for semantic segmentation myself.

from PIL import Image, ImageFilter
import numpy as np
def getdata_for_semantic_segmentation(im):
    x_im = im.filter(ImageFilter.CONTOUR) #The outline is used as "image data" for input.
    a_im = np.asarray(im) #Convert to numpy
    #Pixels equal to 1 (the black bear) become white (255) and everything else becomes black (0); this is the "correct answer data".
    y_im = Image.fromarray(np.where(a_im == 1, 255, 0).astype(dtype='uint8'))
    return x_im, y_im
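
As a minimal sketch of the masking step, the same `np.where` call can be tried on a tiny hand-made array. The array below is hypothetical; the only thing taken from the function above is the assumption that the value 1 marks the bear.

```python
import numpy as np

# Hypothetical 2x3 label image: 1 marks bear pixels, other values are background
label = np.array([[0, 1, 1],
                  [2, 1, 0]], dtype=np.uint8)

# Bear pixels become white (255), everything else black (0)
mask = np.where(label == 1, 255, 0).astype(np.uint8)
print(mask)
```

Casting back to `uint8` matters because `Image.fromarray` picks the image mode from the array's dtype.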

I created a dataset of 2000 image pairs as follows.

X_data = [] #For storing image data
Y_data = [] #For storing correct answer data
for i in range(2000): #Generate 2000 images
    #Generate bear image
    im = koala_or_bear(bear=True, rotate=True , resize=64, others=True)
    #Processed for semantic segmentation
    x_im, y_im = getdata_for_semantic_segmentation(im)
    X_data.append(x_im) #image data
    Y_data.append(y_im) #Correct answer data

To check the result, let's display only the first 8 images of the image data and the correct answer data.

%matplotlib inline
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10,10))
for i in range(16):
    ax = fig.add_subplot(4, 4, i+1)
    ax.axis('off')
    if i < 8: #Display the top 8 of image data
        ax.set_title('input_{}'.format(i))
        ax.imshow(X_data[i],cmap=plt.cm.gray, interpolation='none')
    else: #Display the top 8 correct answer data
        ax.set_title('answer_{}'.format(i - 8))
        ax.imshow(Y_data[i - 8],cmap=plt.cm.gray, interpolation='none')
plt.show()

The goal of semantic segmentation here is to build a model that extracts the region corresponding to the bear from the "image data" above and outputs the "correct answer data".

Building a semantic segmentation model

Data shaping

import torch
from torch.utils.data import TensorDataset, DataLoader

#Convert image data and correct answer data to ndarray
X_a = np.array([[np.asarray(x).transpose((2, 0, 1))[0]] for x in X_data])
Y_a = np.array([[np.asarray(y).transpose((2, 0, 1))[0]] for y in Y_data])

#Convert ndarray image data and correct answer data to tensor
X_t = torch.tensor(X_a, dtype = torch.float32)               
Y_t = torch.tensor(Y_a, dtype = torch.float32)

#Stored in data loader for learning with PyTorch
data_set = TensorDataset(X_t, Y_t)
data_loader = DataLoader(data_set, batch_size = 100, shuffle = True)
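
PyTorch convolution layers expect input shaped (batch, channels, height, width). The following sketch, which uses hypothetical all-zero images instead of the generated bears, shows how the first-channel extraction used above produces that shape.

```python
import numpy as np

# Five hypothetical 64x64 RGB images in (height, width, channel) layout
images = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(5)]

# Move channels first, keep only channel 0, and wrap it in a list
# so that each sample has a channel axis of length 1
batch = np.array([[img.transpose((2, 0, 1))[0]] for img in images])
print(batch.shape)

# Slicing the last axis directly gives the same single-channel plane
same_plane = np.array_equal(batch[0, 0], images[0][..., 0])
```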

Model definition

The model for semantic segmentation is basically an autoencoder built with a convolutional neural network (CNN) (link: ...com/maskot1977/items/2fb459c66d49ba550db2).
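
As a rough sketch of what the encoder half does to the image size, the arithmetic below assumes the settings used in this article: a 64x64 input, 3x3 convolutions with padding 1, and 2x2 max pooling with stride 2. The helper names `conv_out` and `pool_out` are made up for illustration.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    # Standard output-size formula for a convolution along one axis
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # Output size of max pooling along one axis (no padding)
    return (size - kernel) // stride + 1

size = 64
sizes = [size]
for _ in range(3):  # three conv + pool stages
    size = pool_out(conv_out(size))
    sizes.append(size)
print(sizes)
```

The convolutions keep the size unchanged and each pooling step halves it, so the decoder has to undo three halvings to get back to the input size.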

from torch import nn, optim
from torch.nn import functional as F
class Kuma(nn.Module):
    def __init__(self):
        super(Kuma, self).__init__()
        #Encoder part
        self.encode1 = nn.Sequential(
            *[
              nn.Conv2d(
                  in_channels = 1, out_channels = 6, kernel_size = 3, padding = 1),
              nn.BatchNorm2d(6)
              ])
        self.encode2 = nn.Sequential(
            *[
              nn.Conv2d(
                  in_channels = 6, out_channels = 16, kernel_size = 3, padding = 1),
              nn.BatchNorm2d(16)
              ])
        self.encode3 = nn.Sequential(
            *[
              nn.Conv2d(
                  in_channels = 16, out_channels = 32, kernel_size = 3, padding = 1),
              nn.BatchNorm2d(32)
              ])

        #Decoder part
        self.decode3 = nn.Sequential(
            *[
              nn.ConvTranspose2d(
                  in_channels = 32, out_channels = 16, kernel_size = 3, padding = 1),
              nn.BatchNorm2d(16)
              ])
        self.decode2 = nn.Sequential(
            *[
              nn.ConvTranspose2d(
                  in_channels = 16, out_channels = 6, kernel_size = 3, padding = 1),
              nn.BatchNorm2d(6)
              ])
        self.decode1 = nn.Sequential(
            *[
              nn.ConvTranspose2d(
                  in_channels = 6, out_channels = 1, kernel_size = 3, padding = 1),
              ])

    def forward(self, x):
        #Encoder part
        dim_0 = x.size() #For restoring the size in the first layer of the decoder
        x = F.relu(self.encode1(x))
        # Set return_indices=True so the decoder's max_unpool2d can reuse the pooling positions (idx)
        x, idx_1 = F.max_pool2d(x, kernel_size = 2, stride = 2, return_indices = True)
        dim_1 = x.size() #For restoring the size in the second layer of the decoder
        x = F.relu(self.encode2(x))
        # Set return_indices=True so the decoder's max_unpool2d can reuse the pooling positions (idx)
        x, idx_2 = F.max_pool2d(x, kernel_size = 2, stride = 2, return_indices = True)
        dim_2 = x.size() #For restoring the size in the third layer of the decoder
        x = F.relu(self.encode3(x))
        # Set return_indices=True so the decoder's max_unpool2d can reuse the pooling positions (idx)
        x, idx_3 = F.max_pool2d(x, kernel_size = 2, stride = 2, return_indices = True)

        #Decoder part
        x = F.max_unpool2d(x, idx_3, kernel_size = 2, stride = 2, output_size = dim_2)
        x = F.relu(self.decode3(x))
        x = F.max_unpool2d(x, idx_2, kernel_size = 2, stride = 2, output_size = dim_1)           
        x = F.relu(self.decode2(x))                           
        x = F.max_unpool2d(x, idx_1, kernel_size = 2, stride = 2, output_size = dim_0)           
        x = F.relu(self.decode1(x))                           
        x = torch.sigmoid(x)                                     

        return x
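
The pooling and unpooling pair in `forward` can be seen in isolation with a small example. This is only a sketch on a hypothetical 4x4 tensor: `max_pool2d` with `return_indices=True` remembers where each maximum came from, and `max_unpool2d` puts the values back at those positions and fills the rest with zeros.

```python
import torch
from torch.nn import functional as F

# Hypothetical input with shape (batch=1, channels=1, 4, 4)
x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [9., 10., 13., 14.],
                    [11., 12., 15., 16.]]]])

# Keep the positions of the maxima while downsampling to 2x2
pooled, idx = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)

# Restore the 4x4 size; only the remembered positions are non-zero
restored = F.max_unpool2d(pooled, idx, kernel_size=2, stride=2,
                          output_size=x.size())
print(pooled)
print(restored)
```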

Learning

%%time

kuma = Kuma()
loss_fn = nn.MSELoss()                               
optimizer = optim.Adam(kuma.parameters(), lr = 0.01)

total_loss_history = []                                     
epoch_time = 50
for epoch in range(epoch_time):
    total_loss = 0.0                          
    kuma.train()
    for i, (XX, yy) in enumerate(data_loader):
        optimizer.zero_grad()       
        y_pred = kuma(XX)
        loss = loss_fn(y_pred, yy)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print("epoch:",epoch, " loss:", total_loss/(i + 1))
    total_loss_history.append(total_loss/(i + 1))

plt.plot(total_loss_history)
plt.ylabel("loss")
plt.xlabel("epoch time")
plt.savefig("total_loss_history")
plt.show()

epoch: 0  loss: 8388.166772460938
epoch: 2  loss: 8372.164868164062
epoch: 3  loss: 8372.035913085938
...
epoch: 48  loss: 8371.781372070312
epoch: 49  loss: 8371.78125

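The value of the loss function is absurdly large. Is that okay? The correct answer data takes the values 0 and 255 while the sigmoid output stays between 0 and 1, so a large loss is expected with this setup.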
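
A minimal sketch of this scale mismatch, using a hypothetical 64x64 mask in which 12.5% of the pixels are white (not the author's data): even an ideal prediction in the 0 to 1 range leaves a mean squared error in the thousands against a 0/255 target, while dividing the target by 255 brings it to zero.

```python
import numpy as np

# Hypothetical target mask: 512 of 4096 pixels are white (255)
target = np.zeros((64, 64), dtype=np.float64)
target[:16, :32] = 255.0

# Ideal prediction if the output is limited to the range 0..1
pred = (target > 0).astype(np.float64)

mse_raw = np.mean((pred - target) ** 2)             # target left at 0/255
mse_scaled = np.mean((pred - target / 255.0) ** 2)  # target scaled to 0/1
print(mse_raw, mse_scaled)
```

Scaling the correct answer data to 0/1 before training would be one way to get a loss that is easier to read; this is a suggestion, not what the code in this article does.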

The loss appears to have converged. The computation time was as follows.

CPU times: user 6min 7s, sys: 8.1 s, total: 6min 16s
Wall time: 6min 16s

Results

New data will be generated as test data.

X_test = [] #Stores image data for testing
Y_test = [] #Stores correct answer data for testing
Z_test = [] #Store prediction results for testing

for i in range(100): #Generate 100 new data not used for learning
    im = koala_or_bear(bear=True, rotate=True, resize=64, others=True)
    x_im, y_im = getdata_for_semantic_segmentation(im)
    X_test.append(x_im)
    Y_test.append(y_im)

Predict using a trained model.

#Format test image data for PyTorch
X_test_a = np.array([[np.asarray(x).transpose((2, 0, 1))[0]] for x in X_test])
X_test_t = torch.tensor(X_test_a, dtype = torch.float32)

#Calculate predictions using a trained model
Y_pred = kuma(X_test_t)

#Store predicted values as ndarray
for pred in Y_pred:
    Z_test.append(pred.detach().numpy())
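
One detail worth knowing when predicting with a model that contains BatchNorm layers: `model.eval()` makes them use their running statistics instead of the statistics of the current batch, and `torch.no_grad()` skips gradient tracking. The sketch below uses a hypothetical two-layer model, not the Kuma model above, to show that in this mode the prediction for a sample no longer depends on what else is in the batch.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical tiny model with a BatchNorm layer
model = nn.Sequential(
    nn.Conv2d(1, 2, kernel_size=3, padding=1),
    nn.BatchNorm2d(2),
)
batch = torch.rand(4, 1, 8, 8)

model.eval()            # BatchNorm now uses running statistics
with torch.no_grad():   # no gradient bookkeeping during prediction
    out_batch = model(batch)
    out_single = model(batch[:1])

# The first sample gets the same output alone or inside the batch
same = torch.allclose(out_batch[:1], out_single, atol=1e-5)
print(same, out_batch.requires_grad)
```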

Drawing the first 10 samples

Let's draw only the first 10 samples to see what the predictions look like. From left to right: input image data, correct answer data, and predicted data.

#Draw image data, correct answer data, and predicted values for the first 10 pieces of data
fig = plt.figure(figsize=(6,18))
for i in range(10):
    ax = fig.add_subplot(10, 3, (i * 3)+1)
    ax.axis('off')
    ax.set_title('input_{}'.format(i))
    ax.imshow(X_test[i])
    ax = fig.add_subplot(10, 3, (i * 3)+2)
    ax.axis('off')
    ax.set_title('answer_{}'.format(i))
    ax.imshow(Y_test[i])
    ax = fig.add_subplot(10, 3, (i * 3)+3)
    ax.axis('off')
    ax.set_title('predicted_{}'.format(i))
    yp2 = Y_pred[i].detach().numpy()[0] * 255
    z_im = Image.fromarray(np.array([yp2, yp2, yp2]).transpose((1, 2, 0)).astype(dtype='uint8'))
    ax.imshow(z_im)
plt.show()

The model can cut out the bear region, even when the bear changes size or is rotated!

However, it tends to cut out a slightly larger region than the correct answer. There are also quite a few mistakes.

Cutout area

Let's compare the area cut out in white between the correct answer data and the predicted data.

A_ans = []
A_pred = []
for yt, zt in zip(Y_test, Z_test):
    #Correct white area (divide by 3 because the image has 3 channels)
    A_ans.append(np.where(np.asarray(yt) > 0.5, 1, 0).sum() / 3) 
    A_pred.append(np.where(np.asarray(zt) > 0.5, 1, 0).sum()) #Predicted white area

plt.figure(figsize=(4, 4))
plt.scatter(A_ans, A_pred, alpha=0.5)
plt.grid()
plt.xlabel('Observed sizes of bears')
plt.ylabel('Predicted sizes of bears')
plt.xlim([0, 1700])
plt.ylim([0, 1700])
plt.show()

It is good that the correct values and the predicted values have an almost linear relationship, but the predicted values tend to be larger.

If you only want to know the size of the bear, correcting the prediction based on this relationship is a reasonable approach. However, cutting out the bear more accurately will require a more complex model.
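
One possible way to make such a correction is to fit a straight line to the pairs of correct and predicted areas and invert it. The sketch below uses made-up numbers in place of `A_ans` and `A_pred`, and the slope and offset of the overestimate are assumptions for illustration only.

```python
import numpy as np

# Hypothetical areas; real values would come from A_ans and A_pred
a_ans = np.array([200.0, 400.0, 600.0, 800.0, 1000.0])
a_pred = 1.2 * a_ans + 50.0  # assumed systematic overestimate

# Fit a_pred = slope * a_ans + intercept
slope, intercept = np.polyfit(a_ans, a_pred, deg=1)

# Invert the fitted line to correct the predicted areas
corrected = (a_pred - intercept) / slope
print(slope, intercept, corrected)
```

In practice the line should be fitted on data that is separate from the data being corrected, otherwise the correction looks better than it really is.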

Bear ... not semantic segmentation