What kind of comments do people who have only ever watched GochiUsa (Is the Order a Rabbit?) make when they watch Kemono Friends, and can a machine recognize Chino-chan from comments alone?

I trained a deep-learning caption generation model on Nico Nico Douga comments and generated comments automatically. The related papers are: Show and Tell: A Neural Image Caption Generator (code), and Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.

Comment generation result

gif_1: trained on the GochiUsa video and its comments (train, dev)

gif_2: comments generated for the Kemono Friends video (test)

Part of the code is at the end.

What I did this time

I trained on a set of Nico Nico Douga video frames and their comments, then applied the model to a different video. No new techniques are used; treat it as a machine learning tutorial.

The general flow is like this.

1. Train on the comments of Is the Order a Rabbit? episode 1
2. Generate comments for episode 1 of Kemono Friends
3. Train an attention model and visualize it

Caption generation

Caption generation produces a descriptive sentence for an input image. The training data consists of images paired with sentences; datasets include COCO and Flickr. This time, we extract single frames from the video and generate a comment for each frame, using deep-learning-based caption generation.

Caption generation model

This time I use a CNN + RNN. Roughly speaking, the CNN extracts image features and the RNN extracts sentence features, and the correspondence between them is learned. For the history of caption generation, see "Automatic generation of image captions", Yoshitaka Ushiku, 2016.
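To make the structure concrete, here is a minimal sketch of a Show-and-Tell-style model written in today's tf.keras (the post itself used TensorFlow 0.12, and its actual code is not shown here). The dimensions follow the settings described later; feeding the image feature in as the initial LSTM state is just one common variant, not necessarily what the post did.

.py


# Illustrative sketch only, not the original TensorFlow 0.12 implementation.
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size = 30000   # vocabulary size used in this post
embed_dim = 256      # word embedding size (see "Learning environment")
hidden_dim = 256     # LSTM hidden size
feat_dim = 4096      # VGG-19 relu7 feature size

img_feat = layers.Input(shape=(feat_dim,), name='image_feature')
words = layers.Input(shape=(None,), dtype='int32', name='comment_tokens')

# Project the image feature and use it as the initial LSTM state (one common variant).
h0 = layers.Dense(hidden_dim, activation='tanh')(img_feat)
c0 = layers.Dense(hidden_dim, activation='tanh')(img_feat)

x = layers.Embedding(vocab_size, embed_dim)(words)
h = layers.LSTM(hidden_dim, return_sequences=True)(x, initial_state=[h0, c0])

# Predict the next word at every time step.
logits = layers.Dense(vocab_size)(h)
model = Model([img_feat, words], logits)
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))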

Dataset

Comments

I used comments from Nico Nico Douga, downloaded with Saccubus. I called Saccubus from Python and downloaded about 420,000 comments, going back from the newest comment to the oldest. I segmented them into words with MeCab, giving 75,356 distinct words in total. The 10 most frequent tokens and their counts are as follows.

.py


(full-width space) 207,947
Good 170,959
U 145,736
! 46,939
43,857
・ 31,119
26,608
25,683
Is 25,392
24,575
Shigeru 24,540

The most frequent token was the full-width space character '\xe3\x80\x80'. The 30 most frequent tokens judged to be nouns are as follows.

.py


Matsuzaki 24,228
Shigeru 24,165
Aso 23,962
Taro 22,832
( 17,850
) 17,638
P 13,952
Chino 13,812
~ 12,410
Here 11,916
Hmm 11,414
Rohi 11,324
Pro 11,298
♪ 10,916
Oo 9,677
Ishiba 9,463
Shigeru 9,244
Goku 8,981
- 8,664
O 8,038
Lao 7,966
I 7,775
Oh 7,537
Two 6,993
Go 6,130
Waste 6,099
Of 5,990
Cocoa 5,909
Shuzo 5,852
Matsuoka 5,680
w 5,637

Of these roughly 75,000 words, the 30,000 most frequent were used and the rest were treated as unknown words. The 30,000th word had a frequency of 2. Normally, words that appear fewer than 5 times in a dataset are treated as unknown words, but this time I wanted a larger vocabulary, so I lowered the threshold.
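A minimal sketch of how such a vocabulary might be built (the function names and the `words` argument, a flat list of all MeCab tokens, are mine, not the post's):

.py


from collections import Counter

VOCAB_SIZE = 30000

def build_vocab(words):
    # words: a flat list of all MeCab tokens from the ~420,000 comments
    counter = Counter(words)
    vocab = [w for w, _ in counter.most_common(VOCAB_SIZE)]
    word_to_id = {w: i for i, w in enumerate(vocab)}
    unk_id = len(vocab)          # id reserved for the unknown-word symbol
    return word_to_id, unk_id

def encode(tokens, word_to_id, unk_id):
    # Map every token outside the 30,000-word vocabulary to the unknown-word symbol.
    return [word_to_id.get(t, unk_id) for t in tokens]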

In addition, the top 10 frames by number of attached comments are as follows.

Number of comments Frame index Time in the video
1,482 0 0m0s
1,339 42,468 23m37s
622 10,531 5m51s
446 10,530 5m51s
342 10,532 5m51s
195 28,795 16m0s
188 8,075 4m29s
164 10,529 5m51s
121 25,169 13m59s
121 28,091 15m37s

The top two are the first and last frames. Some comments had a frame number slightly larger than the video's last frame, so I treated all of those as belonging to the last frame. Here are some frames with randomly sampled comments (the images are low quality).

Frame 0: The oldest person in Japan, Misao Okawa, 117 years old, 27 days old Up to 1.5 million plays 1772 Where to return

Frame 42,468: the world Bucchipa Cute

Frame 10,531: here now ! This ↑ this ↓

Frame 10,530: here here ! here !

Frame 10,532: here ! Here it is! This is ↑ this ↓!

Images

Each frame of the video was saved as an image and passed through a CNN in advance to extract features. The 24-minute video came to 42,469 frames, with an average of 10 comments per frame; 513 frames had no comments. For the CNN I used VGG-19: I resized the 640x480 frames to 224x224, fed them to the network, and used the 4,096-dimensional vector from the relu7 layer.
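As a rough sketch (not the post's code, which used TensorFlow 0.12), the same relu7-style features could be extracted today with Keras's VGG19, whose 'fc2' layer outputs the 4,096-dimensional ReLU-activated vector; the video file name below is a placeholder.

.py


import numpy as np
import imageio
import tensorflow as tf
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input

base = VGG19(weights='imagenet')
# Keras's 'fc2' layer (4,096-d, ReLU applied) corresponds to relu7.
relu7 = tf.keras.Model(base.input, base.get_layer('fc2').output)

vid = imageio.get_reader('gochiusa_ep1.mp4', 'ffmpeg')   # placeholder file name
feats = []
for frame in vid:
    img = tf.image.resize(frame, (224, 224)).numpy()      # 640x480 -> 224x224
    x = preprocess_input(img[np.newaxis].astype('float32'))
    feats.append(relu7.predict(x, verbose=0)[0])           # one 4,096-d vector per frame
feats = np.stack(feats)                                     # shape: (num_frames, 4096)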

Learning

The model learns frames and the comments attached to them. First, I split the GochiUsa data into train and dev: the frames were split randomly 9:1, and every comment attached to a train frame belongs to train.

Training iterates over comments: for example, if train has 100 frames and 500 comments, one epoch consists of all 500 comment-frame pairs.

This time, since dev is 1 frame in 10 of an anime video and many comments are duplicated, dev is effectively contained in train. It might have been better to alternate train and dev second by second, or to split the video into a first half and a second half.
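For reference, a sketch of the 9:1 random frame split described above (the function and its arguments, such as `comment_frames` giving each comment's frame index, are my own names, not the post's):

.py


import random

def split_frames(num_frames, comment_frames, dev_ratio=0.1, seed=0):
    # comment_frames: list giving, for each comment, the index of the frame it is attached to
    rng = random.Random(seed)
    frame_ids = list(range(num_frames))
    rng.shuffle(frame_ids)
    n_dev = int(len(frame_ids) * dev_ratio)
    dev_frames = set(frame_ids[:n_dev])        # 1 frame in 10 goes to dev
    train_frames = set(frame_ids[n_dev:])
    # Every comment follows the split of the frame it is attached to.
    train_pairs = [(f, i) for i, f in enumerate(comment_frames) if f in train_frames]
    dev_pairs = [(f, i) for i, f in enumerate(comment_frames) if f in dev_frames]
    return train_pairs, dev_pairs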

The number of comments and the amount of duplication in each split are as follows (after converting all words outside the 30,000-word vocabulary to the unknown-word symbol).

         all        train      dev
total    427,364    384,922    42,442
uniq     202,671    185,094    26,682

In addition, the number of comments shared between uniq(train) and uniq(dev) was 9,105.

Making the video

This time, instead of doing it in real time, I generated comments for each frame with the trained model and stitched them together into a video: generate comments for all target frames, collect them into one file, and draw them onto the images afterwards. With Saccubus you can automatically produce a commented video from a comment file and a video file, but I didn't know how to do that, so I did it in Python.

The generated comments were read in order from frame 0 and drawn onto the images. Each comment scrolls across the screen and leaves the frame after 4 seconds. For the GochiUsa video above, the number of comments to generate was determined from the distribution of comment counts in the dataset (which is cheating, admittedly).
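The scrolling itself boils down to a simple linear interpolation of the x coordinate; a sketch under my own assumptions (constants and names are mine, not the post's):

.py


FPS = 29.97            # frame rate used when re-encoding the video
SCROLL_SECONDS = 4.0   # a comment crosses the screen and leaves after 4 seconds

def comment_x(frame_idx, start_frame, frame_width, text_width):
    # Left edge of the comment text at a given frame: it enters from the right
    # edge and has fully scrolled out on the left after SCROLL_SECONDS.
    t = (frame_idx - start_frame) / FPS
    progress = t / SCROLL_SECONDS
    return int(frame_width - (frame_width + text_width) * progress)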

In addition, to increase the variety of comments, the first word is sampled stochastically from all words except symbols, according to the model's output probabilities, and each subsequent word is chosen as the highest-probability word excluding the unknown-word symbol.
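A sketch of that decoding scheme (the `step` function, which is assumed to return the next-word distribution and the RNN state, and the id arguments are hypothetical, not the post's API):

.py


import numpy as np

def generate_comment(step, image_feat, symbol_ids, unk_id, eos_id, max_len=20):
    # First word: sample from the model's distribution, with symbol tokens masked out.
    probs, state = step(image_feat, prev_word=None, state=None)
    probs = probs.copy()
    probs[symbol_ids] = 0.0
    probs /= probs.sum()
    word = int(np.random.choice(len(probs), p=probs))
    comment = [word]
    # Remaining words: greedy argmax, with the unknown-word symbol masked out.
    for _ in range(max_len - 1):
        probs, state = step(image_feat, prev_word=word, state=state)
        probs = probs.copy()
        probs[unk_id] = 0.0
        word = int(np.argmax(probs))
        if word == eos_id:
            break
        comment.append(word)
    return comment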

Applying the model to Kemono Friends

I generated comments for the Kemono Friends video with the model trained on the GochiUsa data.

jump_power.png There is no way the model could generate proper Kemono Friends vocabulary this way; only Kemono Friends-related words that happened to appear in the GochiUsa comments could come out.

gif_2

Watching the output, I found that the training data unexpectedly contained words like these, and such comments occasionally flow by. The top two examples are the same scene and the same comment, but in the lower one the words are separated by spaces. In the first place, since MeCab is used with its standard dictionary, slang is not segmented properly at all.

I expected that when a dark scene appeared in Kemono Friends, the model would generate the kind of comments attached to dark scenes in GochiUsa, like "Shigetenna", but the grammar broke down far more than I expected. The RNN seems to be confused by how different the input is from the training data.

Attention model

You can get more out of an RNN by using attention. In the Show, Attend and Tell setting, the model learns to focus on the position in the image associated with each word as it is generated. Simply put, whenever a word is generated the model learns to attend to some region of the image, and hopefully it keeps attending to similar positions for similar words.

What I want to see this time: after training, when a caption containing the word "Chino" is generated for a frame, the attention should fall on Chino in the image at the moment the word "Chino" is produced.

Image feature vector to use

To attend to a location in the image, we need something like spatial information, which the previous features lack, so the input data changes slightly. Instead of relu7, we use the conv5_3 layer of VGG-19, which is 14x14x512: the image is divided into a 14x14 grid of (overlapping) regions, and a 512-dimensional vector is extracted from each. You can also resize the image to 448x448 instead of 224x224 to get a 28x28x512 feature map. With this, when feeding the CNN features to the RNN, the model can attend to the region associated with the current word by putting large weights on the important cells of the 14x14 grid and small weights elsewhere.
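The soft attention used in Show, Attend and Tell can be sketched like this in plain NumPy (an illustration only; the weight matrices are placeholders for learned parameters, not the post's actual model):

.py


import numpy as np

def soft_attention(features, hidden, W_f, W_h, w):
    # features: (196, 512) vectors for the 14x14 regions, hidden: (256,) LSTM state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w   # one score per region, (196,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                   # attention weights, sum to 1
    context = alpha @ features                             # (512,) weighted sum of regions
    return context, alpha.reshape(14, 14)                  # alpha is what gets visualized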

Attention experiment results

att_3.PNG Here is a case where the sentence generated for an image of Chino-chan contains the word "Chino". For this image, the sentence "Chino-chan here is cute" was generated.

The white area is where the model is attending, and the attended position changes as the words are generated in order. Ideally, when the word "Chino" is generated, the region showing Chino should turn white.
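The white overlay can be produced by upsampling the 14x14 attention map to the frame size and brightening the attended regions; a rough PIL/NumPy sketch (not the post's actual visualization code):

.py


import numpy as np
from PIL import Image

def overlay_attention(frame, alpha):
    # frame: HxWx3 uint8 array, alpha: 14x14 attention weights.
    h, w = frame.shape[:2]
    mask = Image.fromarray((alpha / alpha.max() * 255).astype('uint8'))
    mask = np.array(mask.resize((w, h), Image.BILINEAR)) / 255.0   # upsample to frame size
    out = frame * 0.4 + 255 * 0.6 * mask[..., None]                # whiten attended regions
    return Image.fromarray(np.clip(out, 0, 255).astype('uint8'))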

In the end, the attention was applied only weakly and was not really useful overall. It doesn't happen in this example, but I felt that characters' faces tended to attract a little attention.

For Cocoa it looks like this: "I like the look of Cocoa here" att_1.PNG

att_2.PNG In the first image Cocoa seems to get the most attention, but in the second the attention is on Rize, so the model does not seem able to tell the characters apart.

I think there are several causes, such as the words in the dataset's comments not corresponding closely to objects in the images, and the data being such that frames can be told apart easily without attention.

Learning environment

Word embeddings: 30,000-word vocabulary x 256 dimensions
LSTM hidden layer: 256 dimensions
GPU: NVIDIA TITAN X x 1

The hidden layer is much smaller than usual. The model was a 1-layer LSTM with a 2-layer MLP for word prediction.

Learning time

model                          epochs to best valid   batch size   time per epoch
VGG relu7, no attention        19                     256          7m30s
VGG Conv5_3, with attention    25                     256          25m0s

Conclusion

Since the comments depend heavily on things other than what is shown in the image, simply training on them does not work well.

Impressions

Since the CNN was pre-trained on ImageNet, I don't know how well it can extract features from anime frames; fine-tuning it might have produced attention closer to what I expected. For that matter, something less heavyweight than VGG might have worked just as well. I wanted to learn the size and color of comments too, but gave up. Premium comments make up only a small fraction of the data, so training on them would require something like reweighting the loss, but in scenes where premium and normal comments barely differ in content, random generation according to their ratio might be enough. I also wondered whether the characters would get attention if I trained on all the episodes, or whether even one episode would be enough if I trained on the dialogue as well. It was surreal and amusing to watch Nico Nico comments scroll by on the console while this was running.

Software, libraries, etc. that I used

Get comments

Saccubus 1.66.3.11. I drove it from the command line via Python. It gave an error on Ubuntu 14.04, so I ran it on Windows 10.

.py


import subprocess

def download_comment(email, password, target_id, WaybackTime):
    # Call Saccubus on the command line to download the comments of one video,
    # going further back in time from the given WaybackTime.
    cmd = r'java -jar Saccubus1.66.3.11\saccubus\saccubus.jar %s %s %s %s @DLC' % (email, password, target_id, WaybackTime)
    subprocess.call(cmd.split())

Read comment file

Python's xml module. The file downloaded by Saccubus is XML, so I read it as follows. I specified the oldest date found as the next WaybackTime and went back further.

.py


import codecs
import xml.etree.ElementTree as ET

def read_xml(path):
    # The comment file downloaded by Saccubus is XML.
    return ET.fromstring(codecs.open(path, 'r+', 'utf-8').read())

def extract_data(root):
    # Collect the post date, video position (vpos) and text of each comment.
    date, vpos, text = [], [], []
    for r in root:
        if 'date' in r.attrib:
            date.append(int(r.attrib['date']))
            vpos.append(int(r.attrib['vpos']))
            text.append(r.text)
    return date, vpos, text

xml = read_xml(xml_path)
date, vpos, text = extract_data(xml)
oldest_date = min(date)

Word split

MeCab, called through the natto package.

.py


from natto import MeCab

mecab = MeCab()
for t in text:
    # Tokenize each comment; encoding to UTF-8 bytes was needed under Python 2.
    res_raw = mecab.parse(t.encode('utf-8'))

Video loading

Python imageio

.py


import imageio

# Open the video with the ffmpeg backend; idx is the index of the frame to read.
vid = imageio.get_reader(movie_path + movie_name, 'ffmpeg')
frame = vid.get_data(idx)

Export text to image

Python PIL

.py


from PIL import Image
from PIL import ImageFont
from PIL import ImageDraw

# Draw a comment string onto a frame.
img = Image.fromarray(frame)
draw = ImageDraw.Draw(img)
draw.text((x, y), text, font=font, fill=(0, 0, 0))

# Outlined characters: draw the shadow color at 8 offsets, then the fill color on top.
def write_text_with_outline(draw, x, y, text, font, fillcolor, shadowcolor):
    offset = [[-1, 0], [1, 0], [0, -1], [0, 1], [-1, -1], [1, -1], [-1, 1], [1, 1]]
    for _x, _y in offset:
        draw.text((x + _x, y + _y), text, font=font, fill=shadowcolor)
    draw.text((x, y), text, font=font, fill=fillcolor)

Image video conversion

ffmpeg. There are various ways to do this; I specified the first frame and the number of frames to write with -start_number and -frames. I think it was fine without the other options.

ffmpeg -y -f image2 -start_number 0 -loop 1 -r 29.97 -i 'out/frames/target/%010d.png' -pix_fmt yuv420p -frames 40000 out/movies/target.mp4

Machine learning library

TensorFlow 0.12

Reference paper

Show and Tell: A Neural Image Caption Generator, Vinyals et al., 2015

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015

Very deep convolutional networks for large-scale image recognition, Simonyan and Zisserman, 2014

Framing image description as a ranking task: Data, models and evaluation metrics, Hodosh et al., 2013

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Young et al., 2014

Microsoft COCO: Common Objects in Context, Lin et al., 2014 (web)
