Training a convolutional neural network (CNN) takes time.
In particular, if you seriously try to improve accuracy without using transfer learning, training can sometimes take **about 5 days**. The paper introduced in this article quantitatively evaluates the difficulty of a dataset, which makes it possible to estimate classification accuracy quickly without training a CNN at all.
The key claim of this paper is that, in an image classification task, "the more the two classes overlap, the more difficult the classification problem becomes."
The figure above is my own illustration. Suppose we use a CNN to embed images into a 2D space.
Now consider the task of classifying the orange dots versus the blue dots: compared with the figure on the left, the problem on the right looks much easier to classify. In other words, the larger the overlap between classes, the more difficult the classification problem.
Expressed as a formula: for a given point, if many of the nearby points belong to the same group as that point, the overlap is small (an easy classification problem); if many of them belong to a different group, the overlap is large (a difficult classification problem).
The paper presents the following formula.
This formula expresses the overlap of class $C_j$ as seen from class $C_i$.
Since $M$ and $V$ are constants, set them aside for now. $K_{C_j}$ is the number of samples of class $C_j$ in the neighborhood of an image $x$ from class $C_i$; in other words, the larger $K_{C_j}$ is, the larger the overlap.
Here, $\phi(x) \in R^d$ is the embedded vector of an image $x$ from $C_i$, $M$ is the number of samples drawn from $C_j$, and $V$ is the volume of the hypercube. Finally, equation (4) is applied to the images of class $C_i$ and summed.
Since equation (4) can only measure the overlap between two classes, when there are more than two classes the overlap has to be computed for every pair of classes.
If the number of classes is $K$, a similarity matrix $S \in R^{K \times K}$ is defined, where each entry $S_{ij}$ is given by the sum in equation (4).
If an element of $S$ is large, it means that equation (4) takes a large value, i.e. seen from class $i$ there are many samples of class $j$ nearby. In other words, it is a difficult classification problem.
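As a concrete illustration, here is a minimal NumPy sketch of this step. The axis-aligned hypercube window, its side length `h`, the sample count `M`, the plain summation over $C_i$, and the function names are my own assumptions for illustration, not the authors' reference implementation.

```python
import numpy as np

def pair_overlap(emb_i, emb_j, h=1.0, M=100):
    """Sketch of equation (4): overlap of class C_j as seen from class C_i.

    emb_i : (N_i, d) embedded samples of class C_i
    emb_j : (N_j, d) embedded samples of class C_j
    h     : side length of the axis-aligned hypercube (assumption)
    M     : number of samples drawn from C_j (assumption)
    """
    d = emb_i.shape[1]
    V = h ** d                                        # hypercube volume
    idx = np.random.choice(len(emb_j), size=min(M, len(emb_j)), replace=False)
    sampled_j = emb_j[idx]
    total = 0.0
    for x in emb_i:
        # K_{C_j}: number of C_j samples inside the hypercube centred at phi(x)
        k = np.all(np.abs(sampled_j - x) <= h / 2, axis=1).sum()
        total += k / (M * V)
    return total                                      # summed over images of C_i

def similarity_matrix(embeddings_by_class, h=1.0, M=100):
    """S[i, j] = overlap of class j as seen from class i (K x K, not symmetric)."""
    K = len(embeddings_by_class)
    S = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            S[i, j] = pair_overlap(embeddings_by_class[i],
                                   embeddings_by_class[j], h, M)
    return S
```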
The next question is how to use $S$ to quantify the difficulty of the classification problem.
The paper uses spectral clustering.
Spectral clustering is originally a method used for unsupervised clustering. By cutting a graph where its connections are weak, it can cleanly handle partitioning problems that are difficult for methods such as k-means. See the links below for details. https://www.slideshare.net/pecorarista/ss-51761860 https://qiita.com/sakami/items/9b3d57d4be3ff1c70e1d
In spectral clustering, the difficulty of partitioning a graph can be quantified by computing the eigenvalues of the following Laplacian $L$.
$$L = D - W$$
Applied to this paper, $S_{ij}$ is regarded as (the weight matrix of) a graph, and the difficulty of partitioning it (the difficulty of classification) is quantified. In general, the larger the eigenvalues of $L$, the harder the graph is to partition, but as described later, this paper captures not only the magnitude of the eigenvalues but also their overall trend, which allows a **more accurate quantification**.
In the expression for $L$, $D$ is defined by $D_i = \sum_j w_{i,j}$. We would like to substitute $S$ directly for $W$, but $W$ is constrained to be a symmetric matrix, so $S$ is symmetrized with the following formula.
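A minimal NumPy sketch of this step, assuming the simple symmetrization $W = (S + S^T)/2$ (the paper's own symmetrization formula is not reproduced here):

```python
import numpy as np

def laplacian_eigenvalues(S):
    """Symmetrize S, build the graph Laplacian L = D - W,
    and return its eigenvalues in ascending order.

    The symmetrization W = (S + S.T) / 2 is an assumption made for
    illustration; the paper uses its own formula.
    """
    W = 0.5 * (S + S.T)            # force W to be symmetric
    D = np.diag(W.sum(axis=1))     # degree matrix: D_i = sum_j w_ij
    L = D - W                      # graph Laplacian
    return np.linalg.eigvalsh(L)   # eigenvalues, sorted ascending
```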
Showing the experimental results first: the difficulty of the dataset can be estimated from $W$ alone. The figure below shows the experimental results using CIFAR10.
The upper row is the result obtained from $W$, and the lower row is the confusion matrix of a trained AlexNet. You can see that the estimate from $W$ is very similar to AlexNet's predictions. It even captures the fact that "dog" and "cat" are hard to tell apart.
Let me review the processing up to this point: image → embedding → compute the overlap between classes with equation (4) → compute the eigenvalues of $L$.
As described above, the difficulty of partitioning the graph (in this paper, the difficulty of the dataset) can be estimated just by looking at the maximum eigenvalue of the Laplacian $L$.
Showing the experimental results first: the difficulty of a dataset can be estimated just by looking at the maximum or the sum of the eigenvalues.
The table above shows the experimental results using CIFAR10.
Even just looking at the sum of the eigenvalues, the performance exceeds that of the conventional method.
The table above compares the following four "embedding methods".
By the way, although the method in this paper claims to be "fast", a good CSG cannot be computed without training an autoencoder. The times in parentheses in the table above are the autoencoder training times. As you can see, it is about twice as fast as the conventional method. (That is still impressive.)
Finally, let me explain CSG (the Cumulative Spectral Gradient). With the eigenvalues arranged in ascending order, indexed by $i = 0, 1, 2, \dots$, they are first standardized with the following formula.
Then calculate the CSG.
Here, cummax is the cumulative maximum. For example, given the array [1, 4, 3, 2],

cummax([1, 4, 3, 2]) = [1, 4, 4, 4]

In other words, it is a function that, reading the array from left to right, records the maximum value seen so far.
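As a sketch, cummax corresponds to NumPy's `np.maximum.accumulate`, and the CSG can then be computed from the standardized eigenvalues. The normalization and the exact combination of steps below are my own reading, so treat this as an assumption rather than the paper's reference implementation.

```python
import numpy as np

def csg(eigenvalues):
    """Sketch of the CSG computation (my own reading, not the authors' code).

    1. Sort the eigenvalues in ascending order and standardize them
       (here simply divided by the largest eigenvalue, an assumption).
    2. Take the differences between consecutive eigenvalues.
    3. Apply cummax and sum the result.
    """
    lam = np.sort(np.asarray(eigenvalues, dtype=float))
    lam = lam / lam.max()                     # standardization (assumption)
    grad = np.diff(lam)                       # eigenvalue gradient
    return np.maximum.accumulate(grad).sum()  # cummax, then sum

# cummax example from the text:
print(np.maximum.accumulate([1, 4, 3, 2]))    # -> [1 4 4 4]
```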
The table below shows the results of experiments using MNIST.
As you can see, as the number of classes in MNIST increases, both the CSG and the error rate increase (they are correlated). By the way, the light blue "mnist 1" means MNIST with only one class.
The table below shows the results of experiments using different datasets (number of classes: 10).
Note that there is a correlation between CSG (red) and the AlexNet error rate (blue). You can see that CSG works properly even when the datasets are different.
The figure below shows the results of an experiment using a dataset called MioTCD.
Here, the CSG is calculated while the amount of data is reduced, and the error rate increases in proportion to the CSG. For this dataset, you can see that there is no problem even if the data is reduced by about 80%.
On the other hand, with CIFAR10, the error rate increases steadily as the data is reduced, as shown in the following table.
Again, you can see that the CSG increases in proportion to the error rate.
Also, by using the matrix $W$ and applying MDS to the following equation, the similarity between classes can be visualized.
$$S = 1 - W$$
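A minimal sketch with scikit-learn, assuming `W` is the symmetric $K \times K$ similarity matrix from above and that class names are supplied for the plot:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

def plot_class_similarity(W, class_names):
    """Embed the classes in 2D with MDS, using 1 - W as the dissimilarity."""
    D = 1.0 - W                                   # dissimilarity between classes
    np.fill_diagonal(D, 0.0)                      # a class is identical to itself
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), name in zip(coords, class_names):
        plt.annotate(name, (x, y))
    plt.show()
```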
The figure below shows the result of applying MDS.
In MNIST the classes are nicely separated, but in CIFAR10 "dog" and "cat", and "deer" and "bird", are close to each other, i.e. similar. In other words, these pairs are relatively hard to distinguish.
To state the conclusion first: I managed to get results for MDS and $W$, but the key point, **CSG, did not reproduce well.**
The purpose of the experiment is "to observe how the CSG changes while reducing the CIFAR10 data", in other words, to reproduce Table 6.
The autoencoder for embedding was tested under the following conditions.
The embedding method is the one with the highest score in the paper, namely CSG $CNN_{AE}$ t-SNE.
This is calculated by the following procedure.
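Roughly, the pipeline looks like the sketch below. This is a minimal outline under my own assumptions (a trained `encoder` callable standing in for the autoencoder, and the helper functions `similarity_matrix`, `laplacian_eigenvalues`, and `csg` sketched earlier in this article); it is not the paper's exact setup.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_and_score(images, labels, encoder, n_per_class=100):
    """Outline of the CSG CNN_AE t-SNE pipeline (my own reading).

    images  : (N, ...) raw images
    labels  : (N,) class labels 0..K-1
    encoder : the trained autoencoder's encoder, mapping images to an
              (N, d) feature array (its training is omitted here)
    """
    K = labels.max() + 1
    # 1. Draw n_per_class test images from each class.
    idx = np.concatenate([
        np.random.choice(np.where(labels == c)[0], n_per_class, replace=False)
        for c in range(K)
    ])
    feats = encoder(images[idx])                       # 2. CNN_AE features
    emb = TSNE(n_components=2).fit_transform(feats)    # 3. reduce with t-SNE
    y = labels[idx]
    # 4. Per-class embeddings -> S -> W -> L -> eigenvalues -> CSG,
    #    using the helpers sketched earlier in this article.
    by_class = [emb[y == c] for c in range(K)]
    S = similarity_matrix(by_class)
    eigvals = laplacian_eigenvalues(S)
    return csg(eigvals)
```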
The CSG produced in my experiments is not very stable. To begin with, the eigenvalues given in the paper could not be reproduced, so I feel something is wrong with my implementation, but I could not figure out what with my own ability.
The following measures are taken to stabilize it somehow.
Extraction of test data: the paper says it is enough to extract 100 test samples from each class. However, the CSG fluctuates considerably depending on which 100 samples are included, so the extraction is performed **5 times**, the CSG is calculated each time, and the average value is taken.
t-SNE: since the result of t-SNE differs on every run, it is also executed **5 times**, the CSG is calculated each time, and the average value is taken.
As a result of the above measures, the CSG was calculated 5 × 5 = 25 times for each data size, and the average value was used.
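In code, this stabilization amounts to a double loop like the sketch below (`embed_and_score` is the hypothetical helper from the previous sketch, which re-samples the test data internally, so this is a simplification of the 5 extractions × 5 t-SNE runs described above):

```python
import numpy as np

def averaged_csg(images, labels, encoder, n_extractions=5, n_tsne=5):
    """Average the CSG over repeated extractions and t-SNE runs (25 values)."""
    scores = []
    for _ in range(n_extractions):      # 5 random draws of 100 samples/class
        for _ in range(n_tsne):         # 5 t-SNE runs per draw
            scores.append(embed_and_score(images, labels, encoder))
    return float(np.mean(scores))       # 5 x 5 = 25 CSG values averaged
```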
I calculated the CSG, but the correlation coefficient (with the error rate) is sometimes around 0.9 and sometimes around 0, so I have to say it is completely unreliable.
However, W and MDS have similar results to the paper, so I will report them.
Although $W$ gave similar results, the similarity between "dog" and "cat" did not come out as a large value, unlike in the paper.
The MDS results are similar to the paper: "cat" and "dog", and "deer" and "bird", are properly placed close to each other.
This time, I verified it using the GPU of Colaboratory.
As for the GPU, **surprisingly** a `Tesla P100` was allocated, so I used it for the measurements. For reference, the times from when a `Tesla K80` was allocated are also noted in parentheses.
The times above are for the case where all of the CIFAR10 data is used. As you can see, this is much faster than directly training a CNN and checking the classification accuracy.
Applying the above knowledge, it may be effective for **efficiently reducing the amount of classification data and for data cleansing**. I will write about those details in the next article.