Hello, everyone. I'm Yaguchi, the software manager for the Virtual YouTuber Hiro Rinkai.
In this post, I'll explain the "Santa classifier" featured in the video.
↓ Here is the link to the YouTube video. Please watch it if you like, and subscribe to the channel if you enjoy it. ↓
The source code is available on GitHub. It's a very short piece of code, so please take a look there first.
https://github.com/hyaguchi947d/santa_recognition
The task I set was to "distinguish Santa illustrations from non-Santa illustrations". Since we finish a video once a week, there was only about one day to actually build the dataset, write the code, and validate the results, so everything was designed with a minimal setup.
This was the hardest part. We are a for-profit company and posting videos is a commercial activity, so in other words, we could not use any data other than data clearly licensed for commercial use.
So, first of all, I collected 20 images within the terms of use of "Irasutoya". Since that data was too biased, I added more from "Pixabay" and "Material Good". In the end there were 58 images in total, 29 positive + 29 negative.
Both the positive and negative data were then split into train:test = 22:7. Here too, the split was made so that images from any one source were not concentrated in either set.
Please note that we will not redistribute or link to the dataset, as doing so would violate the sites' terms of use.
I had decided to structure the video in three steps: the simplest approach, a color histogram, first, and YOLO v3, which I had already used for deep learning, last. For the middle step I wavered between HOG and Bag of Visual Words, and chose Bag of Visual Words. To minimize the amount of code I had to write myself, the condition was that each method either could be implemented with OpenCV in about an hour or could be run without implementing anything at all.
Color histogram

The given problem is a simple two-class classification. Once a color histogram is computed, the rest can be decided by an SVM.
So I put it together quickly with OpenCV (Python) and scikit-learn.
OpenCV has a tutorial on computing color histograms. https://docs.opencv.org/master/dd/d0d/tutorial_py_2d_histogram.html
scikit-learn's documentation for SVM is also easy to follow. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
What the code does is very simple:
import cv2 as cv
import numpy as np

def color_hist_from_file(imgfile):
    img = cv.imread(imgfile)
    # 8 x 8 x 8 = 512-bin histogram over the B, G, R channels
    hist = cv.calcHist([img], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    hist_1d = hist.flatten()[1:-1]  # remove bins with (0,0,0) and (255,255,255)
    n_hist_1d = hist_1d / np.linalg.norm(hist_1d)  # L2-normalize
    return n_hist_1d
This function generates a histogram that divides each of the B, G, R channels into 8 bins (8 * 8 * 8 = 512 dimensions). The first and last bins (black and white, which are the background colors) are removed, and the remaining 510-dimensional vector is normalized and used as the feature vector for classification.
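For reference, feeding the training images through this function might look like the following minimal sketch. The file-list names santa_train_files and other_train_files are placeholders of mine, not names from the repository.

# Hypothetical file lists; the actual repository may organize them differently.
hist_train_p = [color_hist_from_file(f) for f in santa_train_files]   # 22 Santa images
hist_train_n = [color_hist_from_file(f) for f in other_train_files]   # 22 non-Santa images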
For SVM training, concatenate the lists of positive and negative examples and generate the corresponding list of labels (1 and 0 this time).
from sklearn import svm

# concatenate positive and negative features, with labels 1 (Santa) and 0 (not Santa)
trains = hist_train_p + hist_train_n
labels = [1] * len(hist_train_p) + [0] * len(hist_train_n)
clf = svm.SVC()
clf.fit(trains, labels)
That's all.
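Checking the classifier on the held-out 7 + 7 test images could look like the following sketch. The names hist_test_p and hist_test_n are my assumptions, not variables from the repository.

# Evaluate on test features built the same way as the training ones.
tests = hist_test_p + hist_test_n
test_labels = [1] * len(hist_test_p) + [0] * len(hist_test_n)
print(clf.score(tests, test_labels))   # mean accuracy on the test set
print(clf.predict(tests))              # per-image predictions (1 = Santa, 0 = not Santa)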
Bag of Visual Words
There was no decent OpenCV tutorial for this.
However, since it's an algorithm I have implemented myself before, I figured I could manage somehow in the worst case; after a bit of worrying, it turned out I could use the OpenCV implementation.
This article was helpful. https://hazm.at/mox/machine-learning/computer-vision/recipes/similar-image-retrieval.html
In addition, AKAZE is used for the feature points, because SIFT and SURF are protected by patents. (This topic will probably come up again in about three months, but I'll leave it for then.)
The procedure is as follows:

- Compute AKAZE keypoints and descriptors for the training images.
- Add the descriptors to BOWKMeansTrainer and create a dictionary.
- Only the descriptors are used at this point because the Visual Words, as many as the dictionary size, are generated by clustering the descriptors.
- A Bag of Visual Words is a histogram of how many times each Visual Word appears in an image; its dimensionality equals the dictionary size.
- Set the dictionary on BOWImgDescriptorExtractor.
- This gives you the Bag of Visual Words vector for each image.
- The rest is classifying the images with an SVM, just as before (see the sketch after this list).
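Putting those steps together, a minimal OpenCV sketch might look like this. The dictionary size, the file list train_files, and the choice of KAZE-style float descriptors are my assumptions (AKAZE's default binary MLDB descriptors can't be fed directly to the float-based k-means trainer), so the actual setup used for the video may differ.

import cv2 as cv
import numpy as np

DICT_SIZE = 64  # assumed dictionary size = dimensionality of the BoVW histogram

# AKAZE configured to output float (KAZE-style) descriptors so that
# BOWKMeansTrainer, which clusters CV_32F data, can consume them.
akaze = cv.AKAZE_create(descriptor_type=cv.AKAZE_DESCRIPTOR_KAZE)
bow_trainer = cv.BOWKMeansTrainer(DICT_SIZE)

# 1. Collect descriptors from the training images and cluster them into Visual Words.
for f in train_files:  # train_files: hypothetical list of training image paths
    img = cv.imread(f, cv.IMREAD_GRAYSCALE)
    _, desc = akaze.detectAndCompute(img, None)
    if desc is not None:
        bow_trainer.add(np.float32(desc))
vocabulary = bow_trainer.cluster()  # shape: (DICT_SIZE, descriptor_dim)

# 2. Hand the dictionary to BOWImgDescriptorExtractor.
bow_extractor = cv.BOWImgDescriptorExtractor(akaze, cv.BFMatcher(cv.NORM_L2))
bow_extractor.setVocabulary(vocabulary)

# 3. The BoVW histogram of an image is then one call away.
def bovw_from_file(imgfile):
    img = cv.imread(imgfile, cv.IMREAD_GRAYSCALE)
    kp = akaze.detect(img, None)
    return bow_extractor.compute(img, kp)[0]  # DICT_SIZE-dimensional, normalized

These vectors then go into svm.SVC() exactly like the color histograms did.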
YOLO v3
I used darknet. That's all.
By the way, I used VoTT to generate the annotation data. Since v2 can no longer export YOLO-format data, I had to use v1, which gave me a hard time because it was buggy (for some reason it dropped some annotated images).
Color histogram

In practice, with just a histogram + SVM, even a slight difference in brightness near a bin boundary pushes pixels into the adjacent bin, and in some cases the similarity came out extremely low. Some countermeasure is needed, such as letting the bins overlap slightly or treating adjacent bins as close to each other, but I didn't have time.
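As one possible countermeasure (a sketch of mine, not what the video used), the 8 x 8 x 8 histogram from cv.calcHist could be lightly blurred before flattening, so that counts near a bin boundary also leak into the neighboring bins:

import numpy as np
from scipy.ndimage import gaussian_filter

def soft_color_hist(hist_3d, sigma=0.5):
    # hist_3d: the raw 8x8x8 output of cv.calcHist, before flattening
    smoothed = gaussian_filter(hist_3d, sigma=sigma)  # spread counts to adjacent bins
    v = smoothed.flatten()[1:-1]                      # drop the (0,0,0) and (255,255,255) bins as before
    return v / np.linalg.norm(v)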
Because of this, for the illustration of Santa Hiro that is the identification target, when I first drew the Santa outfit in FF0000 it was no longer judged to be Santa. I adjusted the color slightly and changed it to DD0000.
Bag of Visual Words
At first I did this only in grayscale; I then built three BoVWs, one for each RGB channel, and concatenated them, but the results were almost unchanged.
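A rough sketch of that per-channel variant, assuming three BOWImgDescriptorExtractor instances built the same way as in the earlier sketch, one vocabulary per channel, and that each channel yields keypoints:

import cv2 as cv
import numpy as np

def rgb_bovw_from_file(imgfile, akaze, extractors):
    img = cv.imread(imgfile)
    hists = []
    for ch, ex in zip(cv.split(img), extractors):   # B, G, R single-channel images
        kp = akaze.detect(ch, None)
        hists.append(ex.compute(ch, kp)[0])         # one dictionary-size histogram per channel
    return np.concatenate(hists)                    # 3 x dictionary-size feature vector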
One of the reasons I hesitated over HOG was that there were many vector-style illustrations (Hiro Rinkai is also vector artwork), so I thought edge-based features would have an advantage.
As pointed out in the video, I think the poor classification simply came down to differences in drawing style.
YOLO v3
As the voice actress pointed out in the video, I wish I had separated things here from the start.
By the way, at first I tried to detect only one class, Santa, but that turned it into a humanoid-illustration detector that also picked up everything that was not Santa. I had already trained for about 30,000 epochs over 20 hours, but I threw that away, switched to two classes, and retrained for about 20,000 epochs over 15 hours. (In my experience, 20,000 epochs are enough for this level of discrimination.) It doesn't take much effort, though, because all I do is wait.
Here again, I realized that color didn't contribute as much as I expected.
Since this is introduced in a weekly program, there wasn't enough time to do it full justice, but whatever the outcome, it had to land as entertainment, so the script itself was written so that any result could work as a punch line. That said, we can't put out anything inaccurate or sloppy, so I tried to pack enough substance for a proper lecture into the 10-minute slot.
As you might expect, it's hard to deliver this density every week, so technical talks like this will probably be about once a month. We look forward to your continued support of Hiro Rinkai.