In this article, the author, who is somewhat skeptical of deep learning, examines its classification ability. Some figures cite images from company websites and YouTube (sources are noted). They are used for academic purposes and will be removed if there is any problem. **Also, please collect the images yourself when using the program.**
Why did it decide this was a vacuum cleaner...?
Figure 0. Result of classifying an image of a man operating R2D2 with a neural network (VGG16). In the graph below the image, the predicted classes are arranged from left to right in descending order of confidence.
Figure 1. Orders received in February 2020 by seven major machine tool companies. If you are interested in the article itself, see the original web page: https://www.nikkan.co.jp/articles/view/00551108
- It has nothing to do with the content of the article (each company's sales), but I wondered whether the products of these seven companies could be classified by deep learning, so I tried it.
- Since machine tools are industrial products, data seemed easy to collect and the objects seemed easy to model, so I wanted to see how they would be classified.
- This problem setting is not practical in itself, but the classification framework should be reusable.
At the end, the criteria the classifier uses are visualized with Grad-CAM, although that part is not included in the published program (I may publish it at some point). The question is whether the model judges by the logo even though it was never explicitly told to learn the logo, or whether it judges by some unexpected features. Not every product carries a logo, so this may be a good showcase for what deep learning can do.
The word "sommelier" originally refers only to wine, but recently it also seems to be used for things other than wine in the form "X sommelier" (hence the title "Machine Sommelier").
- Images are collected online (mainly from the product pages of each company's website).
- The dataset is narrowed down for a certain degree of uniformity (gate-type machines and the like are removed; in particular, only the TUE-100 is used for Toshiba Machine, because mixing in other machines seemed to make the classes inseparable).
- Name the dataset files 1_0001.jpg and so on (explained at the beginning of the published program) and put them in the dataset folder of the program (Figure 2-a, Figure 2-b).
- In the Jupyter notebook you can choose whether to train or only infer, and choose ResNet50 or Mobilenet V1 (described later) as the backbone through a GUI (Tkinter).
- Details of the notebook's parameters and usage are explained at the beginning of the notebook.
- The program is on GitHub: https://github.com/moriitkys/MachineSommelier
Figure 2-a | Figure 2-b |
Deep learning is a kind of machine learning. Machine learning gives sample data to a model, learns the patterns hidden in the data, and finds the relationship between inputs and outputs. Among such methods, neural networks are models that try to realize learning by imitating human neural circuits. The origin of neural networks is the simple perceptron proposed in 1958. Deep learning stacks many layers of a neural network, and its basis is the feedforward neural network (Figure 3-b). As shown in Figure 3-a, a simple perceptron multiplies the $n$ inputs $x_1, \dots, x_n$ by the weights $w_1, \dots, w_n$, sums them, and adds a bias $w_{n+1}$; the function $f(x)$ then outputs 1 if the result is above a certain threshold and 0 otherwise (a minimal numerical sketch follows Figure 3).
Figure 3-a.Conceptual diagram of a simple perceptron | Figure 3-b.Conceptual diagram of feedforward neural network |
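For concreteness, here is a minimal numerical sketch of the simple perceptron described above (the function name and the example numbers are just illustrations, not part of the published program):

import numpy as np

def simple_perceptron(x, w, bias):
    # weighted sum of the inputs plus the bias term w_{n+1}
    s = np.dot(x, w) + bias
    # step activation: output 1 at or above the threshold (here 0), otherwise 0
    return 1 if s >= 0 else 0

print(simple_perceptron(np.array([1.0, 0.5]), np.array([0.4, -0.2]), 0.1))  # -> 1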
Furthermore, in image recognition, a type of feedforward neural network called a convolutional neural network is widely used.
A convolutional neural network (CNN) is a feedforward neural network in which each unit on the output side is connected only to specific units in the adjacent layer on the input side. A CNN has special layers called convolution layers and pooling layers. In a convolution layer, a feature map is obtained by applying a filter (kernel) to the input data. The details are omitted in this article, but convolution is a very important operation in image recognition. Figure 4 below shows an example of CNN processing, and a minimal Keras sketch follows it.
Figure 4.Example of CNN processing |
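As a rough illustration of a convolution layer followed by a pooling layer, the following minimal Keras sketch (the sizes are arbitrary and not taken from the published program) builds one convolution + pooling stage and prints the shape of the resulting feature maps:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

model_sketch = Sequential([
    # convolution layer: 16 filters (kernels) of size 3x3 produce 16 feature maps
    Conv2D(16, (3, 3), activation="relu", input_shape=(197, 197, 3)),
    # pooling layer: downsample each feature map by taking the max over 2x2 windows
    MaxPooling2D(pool_size=(2, 2)),
])
model_sketch.summary()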
There are roughly four types of tasks in image recognition. Some applications have more complex outputs, but they are omitted here.
**・Classification (image classification)**: Guess which class the input belongs to (output a probability for each class) (Figure 5-a)
Figure 5-a. Classification example. Classification with VGG16 (a type of network) trained on ImageNet (described later) fails on this example (explained at the end).
**・Detection (object detection)**: Guess which class of object exists where in the input, expressed as rectangles (Figure 5-b)
Figure 5-b. Detection example. YOLO (a type of detection architecture) trained on Pascal VOC 2012 (described later) detects the person well in this example, but the robot is misrecognized as a "parking meter".
**・Semantic Segmentation**: Classify every pixel of the input (Figure 5-c)
Figure 5-c. Semantic Segmentation example. Deeplab V3 trained on Pascal VOC correctly infers the person region in this example. The robot is recognized as background.
**・Instance Segmentation**: Detection + Segmentation (Figure 5-d)
Figure 5-d. Instance Segmentation example. Mask R-CNN trained on Microsoft COCO (described later) can infer the person region in this example, though roughly. The robot is recognized as background.
This time, we apply softmax at the end of the network and perform **Classification** to guess which class the input belongs to.
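For reference, softmax converts the raw outputs $z_{i}$ of the final layer into probabilities that are non-negative and sum to 1:

$$\mathrm{softmax}(z)_{i} = \frac{e^{z_{i}}}{\sum_{j} e^{z_{j}}}$$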
Backbone The backbone is the feature extraction network: the part of the model that remains when the input layer and the output part (Head) are removed. This time we use ResNet50 and Mobilenet (described later). Both are provided by Keras, so there is no need to build the network yourself. You can import ResNet50 as follows.
from keras.applications.resnet50 import ResNet50
base_model = ResNet50(
include_top = False,
weights = "imagenet",
input_shape = INPUT_SHAPE
)
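For the other backbone, the import is analogous. A minimal sketch, assuming the same Keras version as above (INPUT_SHAPE is the same variable as in the ResNet50 example):

from keras.applications.mobilenet import MobileNet

base_model = MobileNet(
    include_top = False,
    weights = "imagenet",
    input_shape = INPUT_SHAPE
)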
Since the backbone is a neural network, training it on a certain dataset determines a certain set of weights (the weights argument in the program above).
- **"A certain dataset"** means a dataset published on the web that can be downloaded (although they are very large, from several GB to several tens of GB). Famous examples are Microsoft COCO, ImageNet, and Pascal VOC.
- **"A certain set of weights"** is also publicly available. What matters is which network was trained on which dataset. For example, ResNet50 and Mobilenet V1 here each use weights trained on ImageNet. This time the Keras functions make it easy, but if you set up the problem yourself and obtain weights on your own, you need to pay attention to both the network and the dataset.
Head (Top) This is the part near the output of the neural network. By setting include_top=False when loading an existing backbone, you can replace the output part yourself. It can be adapted to the purpose, for example by adding layers to improve accuracy or by changing the number of classification classes (the number of classes can probably also be changed with an argument). This time the Head part is constructed as follows.
from keras.models import Sequential, Model
from keras.layers import Flatten, Dropout, Dense

top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dropout(0.5))
top_model.add(Dense(nb_classes, activation='softmax'))
# Concatenate base_model (backbone) with the top model
model = Model(inputs=base_model.input, outputs=top_model(base_model.output))
Grad-CAM The Grad-CAM paper is the following; please read it for details. Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra, "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization", arXiv:1610.02391v4 [cs.CV], 3 Dec 2019 (https://arxiv.org/abs/1610.02391). It is referred to below as the Grad-CAM paper. Part of its abstract reads:
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable. Our approach–Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.
→ In other words: the proposed method produces visual explanations of the decisions made by a large class of CNN-based models, improving their transparency and interpretability. Grad-CAM uses the gradients of a target concept (such as the label "dog") flowing into the final convolutional layer to produce a coarse localization map that highlights the image regions important for the prediction.
Figure 6. Fig. 2 from the Grad-CAM paper.
The "coarse localization map that highlights the important regions" is the color map labeled Grad-CAM in the image above (Figure 6 (a)). This time we generate this color map and check which parts of the input image the model focused on when predicting the class.
**The processing flow of Grad-CAM is summarized below** (apologies if there are mistakes). **1. Input and prediction with the CNN**: The input image passes through the CNN, feature maps are produced, and based on them the network outputs an array of class probabilities (Figure 6 (b)).
probs_pred = model.predict(input_tests)           # probabilities for the whole test batch
cls_idx = np.argmax(model.predict(x))             # index of the predicted class for the single preprocessed input x
cls_output = model.output[:, cls_idx]             # symbolic score y^c for that class
conv_output = model.get_layer(layer_name).output  # symbolic output A^k of the final convolution layer
**2. Calculation of the gradients $\frac{\partial y^{c}}{\partial A_{ij}^{k}}$**: Backpropagation is performed with 1 for the target class and 0 for the others. Since the gradients are only needed at the final convolution layer, they are computed as follows. Here the variable layer_name is activation_49 for ResNet50 and conv_pw_13_relu for Mobilenet. $A_{ij}^{k}$ in the Grad-CAM paper corresponds to fmap_activation below. Note that $\frac{\partial y^{c}}{\partial A_{ij}^{k}}$ is the gradient of the class score $y^{c}$ with respect to $A_{ij}^{k}$; it is not a gradient computed from fmap_activation itself.
grads = keras.backend.gradients(cls_output, conv_output)[0]  # d y^c / d A^k
gradient_function = keras.backend.function([model.input], [conv_output, grads])
fmap_activation, grads_val = gradient_function([x])
fmap_activation, grads_val = fmap_activation[0], grads_val[0]  # drop the batch dimension
**Caution** The input x above must be preprocessed with img_to_array etc., for example as follows.
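A minimal sketch of that preprocessing (the file name and the 1/255 scaling are assumptions; use whatever scaling was applied during training):

from keras.preprocessing.image import load_img, img_to_array
import numpy as np

img = load_img("sample.jpg", target_size=(197, 197))  # hypothetical file; 197x197 is the ResNet50 input size used here
x = img_to_array(img) / 255.0                         # convert to a float array and scale (match the training preprocessing)
x = np.expand_dims(x, axis=0)                         # add the batch dimension: shape (1, 197, 197, 3)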
**3. Calculation of the weights $\alpha_{k}^{c}$**: The gradients are averaged over each channel and used as weights. This corresponds to equation (1) of the Grad-CAM paper. Before Grad-CAM, Class Activation Mapping (CAM) required adding a Global Average Pooling (GAP) layer to the model (CAM is computed as the product of the feature maps and the weights of the final layer, where the feature maps are pooled by GAP). With Grad-CAM, the GAP part is replaced by the gradient computation via backpropagation, so visualization is possible without modifying the model.
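Equation (1) of the Grad-CAM paper (global average pooling of the gradients over the spatial positions $i, j$, where $Z$ is the number of positions in the feature map):

$$\alpha_{k}^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}}$$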
alpha = np.mean(grads_val, axis=(0, 1))
**4. Calculation of the class-discriminative localization map (Grad-CAM) $L_{\mathrm{Grad\text{-}CAM}}^{c}$**: Equation (2) of the Grad-CAM paper is computed as follows. The summation is computed with numpy's dot product, and since ReLU clips negative values to zero as shown in Figure 7, numpy's maximum can be used for it.
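Equation (2) of the Grad-CAM paper (a weighted sum of the feature maps followed by ReLU):

$$L_{\mathrm{Grad\text{-}CAM}}^{c} = \mathrm{ReLU}\!\left(\sum_{k} \alpha_{k}^{c} A^{k}\right)$$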
Figure 7.ReLU graph |
l_gradcam = np.dot(fmap_activation, alpha)#Sigma part
l_gradcam = cv2.resize(l_gradcam, (img_h, img_w), cv2.INTER_LINEAR)
l_gradcam = np.maximum(l_gradcam, 0) #Equivalent to ReLU calculation
l_gradcam = l_gradcam / l_gradcam.max()
Finally, converting this map to a JET color map with cv2.applyColorMap and overlaying it on the input completes the visualization.
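A minimal sketch of that last step (img_bgr is an assumed variable name for the original image resized to the same height and width, and the 0.6/0.4 blending weights are arbitrary):

import cv2
import numpy as np

heatmap = cv2.applyColorMap(np.uint8(255 * l_gradcam), cv2.COLORMAP_JET)  # JET color map from the normalized Grad-CAM map
overlay = cv2.addWeighted(img_bgr, 0.6, heatmap, 0.4, 0)                  # blend the heatmap onto the original image
cv2.imwrite("gradcam_overlay.jpg", overlay)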
A technique called dropout is applied to prevent the network from overfitting (a phenomenon in which accuracy on validation data suffers because the model fits only the training data). Training is repeated while randomly disabling some nodes in the network, which improves generalization performance.
I explained this briefly last time (wine classification with Keras), so I will omit it here. Since this is again a multi-class classification, **categorical_crossentropy** is used. Wine classification with Keras: https://qiita.com/moriitkys/items/2ac240437a31131108c7
This was also explained briefly last time, so I will omit it here. After various trials, **SGD** gave the best training results this time, so it is adopted.
The above loss function and optimizer are specified where the model is compiled, in cell 2 of the code uploaded to GitHub.
model.compile(
optimizer = SGD(lr=0.001),
loss = 'categorical_crossentropy',
metrics = ["accuracy"]
)
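After compiling, training is started with model.fit. The actual notebook handles data loading and augmentation itself, so the following is only a minimal sketch under the settings described later (batch size 32, 30 epochs); X_train, y_train, X_val, y_val are assumed variable names:

history = model.fit(
    X_train, y_train,                  # training images and one-hot labels (assumed names)
    batch_size = 32,
    epochs = 30,
    validation_data = (X_val, y_val)   # validation split used for the loss/accuracy curves
)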
Tkinter Tkinter is Python's standard library (widget toolkit) for building and operating GUIs, so it can be used without installing anything as long as Python is installed. The GUI window used by the notebook posted on GitHub looks like the following.
The program is below; it should work even if you copy and paste it as it is. The Train button turns red when pressed (via the callback function), and the click_flag_train function toggles the mode flag: the mode starts as inference, one press switches it to train, another press switches it back, and the label changes accordingly. Radio buttons (tkinter.Radiobutton) let you select which backbone to use. Finally, pressing the Start button at the bottom closes the window and the program continues below it.
Figure 8.Tkinter runtime |
import os
import numpy as np
flag_train = False
type_backbone = "ResNet50"
layer_name_gradcam = "activation_49"
# --- Setting buttons ---
import tkinter
tki = tkinter.Tk()
tki.geometry('300x400')
tki.title('Settings')
radio_value_aug = tkinter.IntVar()
radio_value_split = tkinter.IntVar()
radio_value = tkinter.IntVar()
label = tkinter.Label(tki, text='Mode: Inference')
label.place(x=50, y=60)
def callback(event):
    if event.widget["bg"] == "SystemButtonFace":
        event.widget["bg"] = "red"
    else:
        event.widget["bg"] = "SystemButtonFace"

def click_flag_train():
    global flag_train
    if flag_train == True:
        label['text'] = 'Mode: Inference'
        flag_train = False
    else:
        label['text'] = 'Mode: Train'
        flag_train = True

def click_start():
    tki.destroy()
# Create buttons
btn_flag_train = tkinter.Button(tki, text='Train', command = click_flag_train)
btn_start = tkinter.Button(tki, text='Start', command = click_start)
label1 = tkinter.Label(tki,text="1. Select Train or Inference")
label1.place(x=50, y=30)
btn_flag_train.place(x=50, y=100)
label2 = tkinter.Label(tki,text="2. Select ResNet50 or Mobilenet")
label2.place(x=50, y=150)
rdio_one = tkinter.Radiobutton(tki, text='ResNet',
                               variable=radio_value, value=1)
rdio_two = tkinter.Radiobutton(tki, text='Mobilenet',
                               variable=radio_value, value=2)
rdio_one.place(x=50, y=180)
rdio_two.place(x=150, y=180)
label3 = tkinter.Label(tki,text="3. Start")
label3.place(x=50, y=250)
btn_start.place(x=50, y=280)
# Display the button window
btn_flag_train.bind("<1>",callback)
btn_start.bind("<1>",callback)
tki.mainloop()
if radio_value.get() == 1:
    type_backbone = "ResNet50"
    layer_name_gradcam = "activation_49"
elif radio_value.get() == 2:
    type_backbone = "Mobilenet"
    layer_name_gradcam = "conv_pw_13_relu"
print(flag_train)
print(type_backbone)
The Tkinter part became long even though it is not the main subject of this article.
** End of terminology **
ResNet50 VS Mobilenet V1
Roughly speaking, the difference between the two is that ResNet is heavy but highly accurate, while Mobilenet is light but slightly less accurate. A more detailed comparison follows.
Figure 9-a.ResNet model structure | Figure 9-b.Residual structure |
Figure 10-a.Number of parameters in ResNet | Figure 10-b.ResNet learning |
As shown in Figure 9-b, the residual structure adds the input (the identity mapping) directly to the output of the block. Normally, the gradient tends to vanish as layers are stacked deeper, but this skip connection mitigates that problem, so ResNet can be made very deep and gains more complex expressiveness. A minimal sketch of such a block follows.
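A minimal Keras sketch of a residual block (simplified; ResNet50 actually uses three-layer bottleneck blocks, and the layer sizes here are illustrative):

from keras.layers import Conv2D, BatchNormalization, Activation, Add

def residual_block(x, filters):
    # assumes x already has `filters` channels so the addition is shape-compatible
    shortcut = x                                   # identity mapping (skip connection)
    y = Conv2D(filters, (3, 3), padding="same")(x)
    y = BatchNormalization()(y)
    y = Activation("relu")(y)
    y = Conv2D(filters, (3, 3), padding="same")(y)
    y = BatchNormalization()(y)
    y = Add()([y, shortcut])                       # add the input back to the block's output
    return Activation("relu")(y)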
Mobilenet V1
- Paper published by Google in 2017
- Model structure: a deep CNN of 97 layers, as shown in Figure 11-a
- Uses Depthwise Separable Convolution (Figure 11-b)
- Number of trainable parameters in this program: 3,465,031 (Figure 12-a)
- About 65 s per epoch (Figure 12-b)
- Reference: Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", arXiv:1704.04861 [cs.CV], 17 Apr 2017 (https://arxiv.org/abs/1704.04861)
Figure 11-a.Mobilenet V1 structure | Figure 11-b. Depthwise Separable Convolution |
Figure 12-a.Number of parameters of Mobilenet V1 | Figure 12-b.Mobilenet V1 learning |
As shown in Figure 11-b, a Depthwise Separable Convolution performs the convolution in two steps: first a convolution in the spatial direction (height and width), then a convolution in the channel direction. A normal convolution does both at once. Why splitting it into two steps reduces the number of parameters is explained in detail in the paper and in other articles, so it is omitted here (I may write about it in the future). A minimal sketch of the two-step structure follows.
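A minimal Keras sketch of a depthwise separable convolution (layer sizes are illustrative, not Mobilenet's actual configuration; assumes a Keras version that provides DepthwiseConv2D):

from keras.layers import DepthwiseConv2D, Conv2D, BatchNormalization, Activation

def depthwise_separable_conv(x, filters):
    y = DepthwiseConv2D((3, 3), padding="same")(x)   # step 1: spatial (height x width) convolution, one filter per channel
    y = BatchNormalization()(y)
    y = Activation("relu")(y)
    y = Conv2D(filters, (1, 1), padding="same")(y)   # step 2: 1x1 pointwise convolution across channels
    y = BatchNormalization()(y)
    return Activation("relu")(y)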
・ There are 7 classes: categories = ["Makino", "Okuma", "OKK", "Toshiba", "JTEKT", "Tsugami", "Mitsubishi"]
・ The dataset sizes are listed in Table 1 below. Augmentation applies brightness changes, pepper noise, cutout, and deformation.
・ Dataset split: train:validation = 6:4
・ Image size is 197x197 for ResNet50 and 192x192 for Mobilenet.
・ Dropout, the loss function, and the optimizer are as explained above.
・ All layers are trainable.
・ Training is run with batch_size = 32 for 30 epochs on ResNet50 and Mobilenet V1 respectively.
Table 1. Data set breakdown
Class | Number of images before augmentation | Number of images after augmentation |
---|---|---|
1 Makino | 105 | 2088 |
2 Okuma | 72 | 1368 |
3 OKK | 75 | 1800 |
4 Toshiba | 7 | 504 |
5 JTEKT | 60 | 2160 |
6 Tsugami | 45 | 1620 |
7 Mitsubishi | 24 | 1152 |
Judging from the transition of loss and accuracy, training seems to have converged (although a validation accuracy of 1.0 is a little suspicious). First, the transition of loss and accuracy over 30 epochs of training with ResNet50 is shown below. The smaller the loss and the larger the accuracy, the better the training. Blue is the training data and orange is the validation data. You can also see the effect of dropout (the accuracy on the training data is comparatively low).
Figure 13-a.Loss for each Epoch in ResNet 50 | Figure 13-b.Accuracy for each Epoch of ResNet 50 |
Next, the transition of loss and accuracy over 30 epochs of training with Mobilenet V1 is shown below. Blue is the training data and orange is the validation data.
Figure 14-a.Loss for each Epoch on Mobilenet | Figure 14-b.Accuracy for each Epoch of Mobilenet |
Using the weights trained for 30 epochs, 28 unseen images that were used neither for training nor for validation were collected and classified; the results are summarized in Table 2 below.
Table 2. Guess results for each test image of each model
 | ResNet50 | Mobilenet V1 |
---|---|---|
Correct answers (out of 28 test images) | 26 | 23 |
Accuracy | 0.929 | 0.821 |
From this, ResNet is about 10 percentage points more accurate.
Also, visualizing the regions of interest with Grad-CAM, the models (ResNet in particular) seem to pay attention to parts that look like logos.
Figure 15-1.Makino test image estimation results and visualization https://www.youtube.com/watch?v=mMgyLnV7l6M |
Figure 15-2.Guess results and visualization of Okuma test images https://www.youtube.com/watch?v=f_qI1sxj9fU |
Figure 15-3.Guessing results and visualization of Tsugami's test images https://www.youtube.com/watch?v=rSTW2hEfSns |
Maybe not ...
Figure 15-4.JTEKT test image estimation results and visualization https://www.youtube.com/watch?v=SVtN08ASIrI |
Many of the visualizations gave the impression that ResNet focused sharply on logos, colors, and distinctive shapes (edges and so on), whereas Mobilenet was generally blurry and did not seem to fully capture the features.
OKK's machines seem to be distinguished by their color.
Figure 15-5.OKK test image estimation results and visualization https://www.youtube.com/watch?v=Xfk3wXWleUs&t=21s |
However, it is curious that "person" does not appear in the top 10 predictions. When I then had the model classify an image of a person facing the front, many of the guesses were about clothing such as shirts, ties, and suits. From this, I suspect that wearing a mask would reduce the accuracy of recognizing people with an ImageNet-trained model.
- Makino Milling's site was the easiest to collect a dataset from, because some products are photographed from several angles and there are many similar-looking images. Except for some OKK products, each maker's colors are characteristic, which probably helps the network.
- Since this is transfer learning, similar images could be learned to some extent even with fewer than 100 images per class, but this example is too particular to be a general reference. In particular, the loss and accuracy curves are so clean that I doubt the dataset and parameter tuning are really appropriate.
- The most tangible conclusion is that **datasets are very important**, but considering the visualization results and accuracy, the choice of network also has some influence.
- With the visualization I was hoping to discover features that are not immediately obvious, rather than features such as logos and colors, but I have not looked through all of the results yet. Even if I find some, I do not think it is good to publish many of the images here...
- The problem setting of this verification is not practical in itself, but I think the notebook is now easy to use if you just want to see training results quickly.
- Some rough parts remain in the article and the program, but one has to draw the line somewhere. If you have corrections or comments, please let me know.
https://www.ling.upenn.edu/courses/cogs501/Rosenblatt1958.pdf
https://keras.io/ja/models/sequential/
https://keras.io/ja/applications/
https://www.bigdata-navi.com/aidrops/2611/
https://qiita.com/simonritchie/items/f6d6196b1b0c41ca163c
https://qiita.com/kinziro/items/69f996065b4a658c42e8
http://www.image-net.org/
https://cocodataset.org/#home
http://host.robots.ox.ac.uk/pascal/VOC/
https://arxiv.org/pdf/1409.1556.pdf
https://pjreddie.com/darknet/yolo/
https://github.com/tensorflow/models/tree/master/research/deeplab
https://arxiv.org/abs/1703.06870
https://docs.python.org/ja/3/library/tkinter.html
https://qiita.com/simonritchie/items/da54ff0879ad8155f441
https://www.makino.co.jp/ja-jp/
https://www.okuma.co.jp/mold-industry/index.html
https://www.okk.co.jp/product/index.html
https://www.shibaura-machine.co.jp/jp/product/machinetool/
https://www.jtekt.co.jp/
https://www.tsugami.co.jp/
https://www.mhi-machinetool.com/
https://qiita.com/moriitkys/private/2ac240437a31131108c7