I trained a deep-learning caption generation model on Nico Nico Douga comments and generated comments automatically. The related papers are: Show and Tell: A Neural Image Caption Generator (code), and Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Train on a Gochiusa video and its comments (train, dev)
Generate comments for a Kemono Friends video (test)
Part of the code is at the end.
I trained on a set of Nico Nico Douga videos and their comments and applied the model to a different video. Nothing here is new technology; I approached it like a machine learning tutorial.
The general flow is like this.
Caption generation produces a descriptive sentence for an input image. The training data consists of images paired with sentences; datasets include COCO and Flickr. Here, we extract single frames from the video and generate a comment for each frame, using deep-learning-based caption generation.
This time I use CNN + RNN. Roughly speaking, the CNN extracts features of the image and the RNN extracts features of the sentence, and the correspondence between them is learned. For the history of caption generation, see Automatic Generation of Image Captions, Yoshitaka Ushiku, 2016.
I used comments from Nico Nico Douga, downloaded with Saccubus. I called Saccubus from Python and downloaded about 420,000 comments, going back from the newest comment to the oldest. I segmented them into words with MeCab, giving 75,356 distinct words in total. The most frequent words and their counts are as follows.
.py
(full-width space) 207,947
Good 170,959
U 145,736
! 46,939
43,857
・ 31,119
26,608
25,683
Is 25,392
24,575
Shigeru 24,540
The most frequent token was the full-width space '\xe3\x80\x80'. The 30 most frequent words judged to be nouns are as follows.
.py
Matsuzaki 24,228
Shigeru 24,165
Aso 23,962
Taro 22,832
( 17,850
) 17,638
P 13,952
Chino 13,812
~ 12,410
Here 11,916
Hmm 11,414
Rohi 11,324
Pro 11,298
♪ 10,916
Oo 9,677
Ishiba 9,463
Shigeru 9,244
Goku 8,981
- 8,664
O 8,038
Lao 7,966
I 7,775
Oh 7,537
Two 6,993
Go 6,130
Waste 6,099
Of 5,990
Cocoa 5,909
Shuzo 5,852
Matsuoka 5,680
w 5,637
Of these roughly 70,000 words, the 30,000 most frequent were kept and the rest were treated as unknown words. The 30,000th word had a frequency of 2. Normally, words appearing fewer than 5 times in a dataset are treated as unknown, but this time I wanted a larger vocabulary, so I lowered the threshold.
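As a rough sketch, this vocabulary construction might look like the following, assuming the comments have already been tokenized with MeCab into lists of words (the unknown-word symbol and function names are mine, not from the original code).
.py
from collections import Counter

UNK = '<unk>'  # unknown-word symbol (name chosen here, not from the original)

def build_vocab(tokenized_comments, vocab_size=30000):
    # Keep the vocab_size most frequent words; everything else maps to UNK.
    counts = Counter(w for comment in tokenized_comments for w in comment)
    word_to_id = {w: i for i, (w, _) in enumerate(counts.most_common(vocab_size))}
    unk_id = word_to_id.setdefault(UNK, len(word_to_id))
    return word_to_id, unk_id

def encode(comment, word_to_id, unk_id):
    # Map a tokenized comment to word ids, replacing out-of-vocabulary words
    return [word_to_id.get(w, unk_id) for w in comment]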
In addition, the 10 frames with the most comments are as follows.
Number of comments | Frame index | Time in the video |
---|---|---|
1,482 | 0 | 0m0s |
1,339 | 42,468 | 23m37s |
622 | 10,531 | 5m51s |
446 | 10,530 | 5m51s |
342 | 10,532 | 5m51s |
195 | 28,795 | 16m0s |
188 | 8,075 | 4m29s |
164 | 10,529 | 5m51s |
121 | 25,169 | 13m59s |
121 | 28,091 | 15m37s |
The top two are the first and last frames. Some comments had a frame number slightly larger than the video's maximum frame, so I treated all of those as belonging to the last frame. Here are a few frames with randomly sampled comments (the images are rough).
0th frame The oldest person in Japan, Misao Okawa, 117 years old, 27 days old Up to 1.5 million plays 1772 Where to return
42,468th frame the world Bucchipa Cute
10,531st frame here now ! This ↑ this ↓
10,530th frame here here ! here !
10,532th frame here ! Here it is! This is ↑ this ↓!
Each frame of the video was converted to an image and passed through the CNN in advance to extract features: 42,469 frames for the 24-minute video, with an average of 10 comments per frame and 513 frames without comments. The CNN was VGG-19; I resized the 640x480 frames to 224x224 and used the 4,096-dimensional vector from the relu7 layer.
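The original feature extraction ran on TensorFlow 0.12; purely as an illustration, here is a sketch of the same relu7 extraction with the modern Keras VGG-19 (in Keras the 4,096-dimensional post-ReLU layer is named 'fc2'). This is not the author's code.
.py
import numpy as np
from PIL import Image
from tensorflow.keras.applications import VGG19
from tensorflow.keras.applications.vgg19 import preprocess_input
from tensorflow.keras.models import Model

# 'fc2' is the second 4,096-dimensional fully connected layer; its output is
# taken after ReLU, i.e. the relu7 activation used in this article.
base = VGG19(weights='imagenet')
relu7 = Model(inputs=base.input, outputs=base.get_layer('fc2').output)

def frame_feature(frame_rgb):
    # frame_rgb: an HxWx3 uint8 array, e.g. a 640x480 frame read with imageio
    img = Image.fromarray(frame_rgb).resize((224, 224))
    x = preprocess_input(np.asarray(img, dtype=np.float32)[None])
    return relu7.predict(x, verbose=0)[0]  # shape (4096,)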
The model learns each image together with the comments attached to it. First, the Gochiusa data was split into train and dev by randomly dividing the frames 9:1; every comment attached to a train frame belongs to train, and likewise for dev.
Training iterates over comments: if train contains 100 frames and 500 comments, one epoch consists of all 500 comment-frame pairs.
Note that since 1 in 10 frames of an anime episode goes to dev and many comments are duplicated, the content of dev is effectively contained in train. It might have been better to alternate train and dev every second, or to split the video into first and second halves.
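A minimal sketch of this split, assuming the comments are available as (frame index, comment text) pairs (function names are mine).
.py
import random

def split_frames(num_frames, dev_ratio=0.1, seed=0):
    # Randomly assign 1 in 10 frames to dev; the rest go to train.
    idx = list(range(num_frames))
    random.Random(seed).shuffle(idx)
    return set(idx[:int(num_frames * dev_ratio)])

def split_pairs(pairs, dev_frames):
    # Every comment follows its frame into train or dev.
    train = [(f, c) for f, c in pairs if f not in dev_frames]
    dev = [(f, c) for f, c in pairs if f in dev_frames]
    return train, dev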
The number of comments and the amount of duplication in each split are as follows (after converting every word outside the 30,000-word vocabulary to the unknown-word symbol).
 | all | train | dev |
---|---|---|---|
Total | 427,364 | 384,922 | 42,442 |
uniq | 202,671 | 185,094 | 26,682 |
In addition, 9,105 distinct comments were shared between uniq(train) and uniq(dev).
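That figure is just the size of the intersection of the two sets of unique comment strings; a one-line sketch (names are mine).
.py
def uniq_overlap(train_texts, dev_texts):
    # Number of distinct comment strings appearing in both splits (9,105 above)
    return len(set(train_texts) & set(dev_texts))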
This time, instead of processing in real time, I generated comments for each frame with the trained model and stitched them into a video: generate comments for all target frames, collect them into one file, and draw them onto the images afterwards. Saccubus can automatically produce a video with comments from a comment file and a video file, but I didn't know how to do that, so I wrote it in Python.
The generated comments are read in order from frame 0 and drawn onto the images. A comment that has been drawn scrolls across and leaves the frame after 4 seconds. For the Gochiusa video above, the number of comments to generate was decided from the distribution of comment counts in the dataset (which is admittedly cheating).
In addition, to increase the variety of comments, the first word is sampled from all words except symbols according to each word's frequency of appearance, and the second and subsequent words are chosen greedily (highest probability) from all words except the unknown word, as sketched below.
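A rough sketch of this generation rule, assuming word_freq holds the corpus counts indexed by word id, symbol_ids is the set of symbol-token ids, and probs is the model's output distribution over the next word (all names are mine).
.py
import numpy as np

def sample_first_word(word_freq, symbol_ids):
    # First word: sampled in proportion to corpus frequency, excluding symbols
    ids = np.array([i for i in range(len(word_freq)) if i not in symbol_ids])
    p = word_freq[ids].astype(np.float64)
    p /= p.sum()
    return int(np.random.choice(ids, p=p))

def next_word(probs, unk_id):
    # Later words: greedy choice over the model's distribution, excluding UNK
    probs = probs.copy()
    probs[unk_id] = 0.0
    return int(np.argmax(probs))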
I generated comments for the Kemono Friends video with the model trained on the Gochiusa data.
Of course the model cannot produce proper Kemono Friends vocabulary this way; it can only produce Kemono Friends-related words that happened to appear in the Gochiusa comments.
Looking into it, the training data did unexpectedly contain such words, and comments like that occasionally flow by. The top two examples are the same scene and the same comment, but the lower one is split by spaces. And since MeCab is used with its default settings, slang and the like are not segmented properly in the first place.
I expected that a dark scene would produce the comments attached to dark scenes in Gochiusa, such as "Shigetenna", but the grammar broke down more than I expected. The RNN seems to be confused by the difference from the training data.
Attention lets you get more out of the RNN. In the Show, Attend and Tell example, the model learns to focus on the position in the image associated with each word as it is generated. Simply put, every time it generates a word the model learns to attend to some location in the image, and ideally it keeps attending to similar locations.
What I want here is that, after training, when a caption containing the word "Chino" is generated for a frame, the attention should fall on Chino in the image at the moment the word "Chino" is produced.
To attend to a location in the image we need something like spatial information, which the previous setup lacks, so the input data changes slightly. Unlike before, we use the vectors from the conv5_3 layer of VGG-19, which are 14x14x512: the image is divided into a 14x14 grid of (overlapping) regions and a 512-dimensional vector is extracted from each. (Resizing the image to 448x448 instead of 224x224 gives 28x28x512 instead.) When the CNN vectors are fed to the RNN, the model can then attend to the region corresponding to the current word by putting large weights on the important parts of the 14x14 grid and small weights elsewhere, as in the sketch below.
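A minimal numpy sketch of this soft attention step; the small tanh scorer follows the Show, Attend and Tell formulation, but the variable names and shapes are my own illustration.
.py
import numpy as np

def soft_attention(features, h, W_f, W_h, v):
    # features: (196, 512) conv5_3 vectors for the 14x14 regions
    # h:        (256,)     current LSTM hidden state
    # W_f: (512, a), W_h: (256, a), v: (a,)  attention parameters
    scores = np.tanh(features @ W_f + h @ W_h) @ v   # (196,) one score per region
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # softmax over the 196 regions
    context = alpha @ features                       # (512,) weighted image vector
    return context, alpha                            # alpha is the "white" map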
Here is a generated sentence containing the word "Chino" for an image of Chino-chan. For this image, the sentence "Chino-chan here is cute" was generated.
The white region is the attended area, and the attended position changes as the words are generated in order. When the word "Chino" is generated, it is a success if the area showing Chino turns white.
In practice, the attention maps were underwhelming and not very useful overall. It is not the case in this example, but I felt that faces tended to attract a little more attention.
In the case of Cocoa, it looks like this: "I like the look of Cocoa here".
Cocoa seems to get more attention in the first image, but in the second the attention is on Rize, so the model does not seem able to tell the characters apart.
I think there are several causes, such as the words in the dataset's comments not corresponding closely to the objects in the images, and the data being such that frames are easy to classify.
Word vectors: 30,000-word vocabulary x 256 dimensions. LSTM hidden layer: 256 dimensions. GPU: one NVIDIA TITAN X.
The hidden layer is fairly small. The model was a 1-layer LSTM with a 2-layer MLP for word prediction, roughly as sketched below.
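The original model was implemented in TensorFlow 0.12; purely to illustrate the shapes described above (30,000 x 256 embedding, one 256-unit LSTM, a 2-layer MLP over the vocabulary), here is a Show-and-Tell-style sketch in modern Keras. It is not the author's code, and feeding the image vector in as a first pseudo-token is an assumption.
.py
from tensorflow.keras import layers, Model

VOCAB, EMB, HID = 30000, 256, 256

img_in = layers.Input(shape=(4096,))                  # relu7 image feature
txt_in = layers.Input(shape=(None,), dtype='int32')   # word ids of the comment

img_tok = layers.Reshape((1, EMB))(layers.Dense(EMB)(img_in))  # image as first token
wrd_emb = layers.Embedding(VOCAB, EMB)(txt_in)
seq = layers.Concatenate(axis=1)([img_tok, wrd_emb])

h = layers.LSTM(HID, return_sequences=True)(seq)      # 1-layer LSTM, 256 units
h = layers.Dense(HID, activation='relu')(h)           # 2-layer MLP ...
out = layers.Dense(VOCAB, activation='softmax')(h)    # ... predicting the next word

model = Model([img_in, txt_in], out)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')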
Model | Epochs to best validation score | Batch size | Time per epoch |
---|---|---|---|
VGG relu7, no attention | 19 epoch | 256 | 7m30s |
VGG Conv5_3, with attention | 25 epoch | 256 | 25m0s |
Since the comments depend heavily on things other than what is shown in the image, simply training on them does not work well.
Since the CNN was pretrained on ImageNet, I don't know how well it extracts features from anime frames; fine-tuning it might have made the attention behave more as expected. For that matter, something less rich than VGG might have worked just as well. I wanted to learn the size and color of comments too, but gave up: premium comments make up only a small fraction, so learning them would require tricks such as biasing the loss, and in scenes where premium and normal comments hardly differ, random generation according to their proportion may be enough. I also wondered whether each character would get attention if I trained on all of them, or whether even one would get attention if I trained on the lines. It was surreal and fun to watch Nico Nico comments scrolling by on the console while working on this.
Saccubus 1.66.3.11. I ran it from the command line via Python. It gave an error on Ubuntu 14.04, so I ran it on Windows 10.
.py
import subprocess

def download_comment(email, password, target_id, WaybackTime):
    # Call Saccubus on the command line to download the comments of one video
    # (raw string so the Windows backslashes are kept literally)
    cmd = r'java -jar Saccubus1.66.3.11\saccubus\saccubus.jar %s %s %s %s @DLC' % (
        email, password, target_id, WaybackTime)
    subprocess.call(cmd.split())
Python xml: the file downloaded by Saccubus is XML, so I read it as follows. I used the oldest date in each batch as the next WaybackTime and kept going back.
.py
import codecs
import xml.etree.ElementTree as ET

def read_xml(path):
    return ET.fromstring(codecs.open(path, 'r+', 'utf-8').read())

def extract_data(root):
    # Collect the post date, position in the video (vpos) and body of each comment
    date, vpos, text = [], [], []
    for r in root:
        if 'date' in r.attrib:
            date.append(int(r.attrib['date']))
            vpos.append(int(r.attrib['vpos']))
            text.append(r.text)
    return date, vpos, text

xml = read_xml(xml_path)
date, vpos, text = extract_data(xml)
oldest_date = min(date)
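The loop that walks back through the comment history might then look roughly like this; the output path, the credentials, and the assumption that Saccubus writes each batch to a fixed XML file are all placeholders, not the author's actual code.
.py
email, password, target_id = 'mail@example.com', 'password', 'sm0000000'  # placeholders
xml_path = 'comment.xml'   # where Saccubus is assumed to write the downloaded batch

all_comments = []
wayback = '0'              # assumed to mean "start from the newest comments"
for _ in range(200):       # arbitrary bound; stop when no more comments come back
    download_comment(email, password, target_id, wayback)
    date, vpos, text = extract_data(read_xml(xml_path))
    if not date:
        break
    all_comments.extend(zip(vpos, text))
    wayback = str(min(date))   # oldest date seen becomes the next WaybackTime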
MeCab: I called it from Python via natto-py.
.py
from natto import MeCab

mecab = MeCab()
for t in text:
    # natto returns MeCab's output as a string of "surface\tfeatures" lines
    res_raw = mecab.parse(t.encode('utf-8'))
Python imageio: read frames from the video.
.py
import imageio

# Open the video with the ffmpeg backend and grab the frame at index idx
vid = imageio.get_reader(movie_path + movie_name, 'ffmpeg')
frame = vid.get_data(idx)
Python PIL: draw the comments onto the frames.
.py
from PIL import Image
from PIL import ImageFont
from PIL import ImageDraw

# Any Japanese-capable TrueType font works; this path is just an example
font = ImageFont.truetype('msgothic.ttc', 20)

img = Image.fromarray(frame)
draw = ImageDraw.Draw(img)
draw.text((x, y), text, font=font, fill=(0, 0, 0))

# Outlined (bordered) characters, Nico Nico style: draw the text 8 times with a
# 1-pixel offset in the shadow color, then once on top in the fill color
def write_text_with_outline(draw, x, y, text, font, fillcolor, shadowcolor):
    offset = [[-1, 0], [1, 0], [0, -1], [0, 1], [-1, -1], [1, -1], [-1, 1], [1, 1]]
    for _x, _y in offset:
        draw.text((x + _x, y + _y), text, font=font, fill=shadowcolor)
    draw.text((x, y), text, font=font, fill=fillcolor)
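The scrolling itself could be done along these lines: each comment enters at the right edge and is fully off-screen 4 seconds later. The formula and the helper below are my own guess at the behaviour described earlier, not the author's code (draw.textlength needs Pillow 8 or later).
.py
FPS = 29.97
LIFETIME = 4.0 * FPS   # a comment stays on screen for 4 seconds
WIDTH = 640            # frame width before any resizing

def comment_x(frame_idx, born_idx, text_width):
    # Right-to-left scroll: x goes from WIDTH down to -text_width over LIFETIME frames
    t = (frame_idx - born_idx) / LIFETIME
    return int(WIDTH - t * (WIDTH + text_width))

# Example usage for one frame, with active_comments = [(born_idx, text, y), ...]:
# for born_idx, text, y in active_comments:
#     w = draw.textlength(text, font=font)
#     write_text_with_outline(draw, comment_x(idx, born_idx, w), y,
#                             text, font, (255, 255, 255), (0, 0, 0))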
ffmpeg: there are various ways to do this, but the first frame and the number of frames to write are specified with -start_number and -frames. The other options were probably not strictly necessary.
ffmpeg -y -f image2 -start_number 0 -loop 1 -r 29.97 -i 'out/frames/target/%010d.png' -pix_fmt yuv420p -frames 40000 out/movies/target.mp4
TensorFlow 0.12
Show and Tell: A Neural Image Caption Generator, Vinyals et al., 2015
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu et al., 2015
Very deep convolutional networks for large-scale image recognition, Simonyan and Zisserman, 2014
Framing image description as a ranking task: Data, models and evaluation metrics, Hodosh et al., 2013
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Young et al., 2014
Microsoft COCO: Common objects in context, Lin et al., 2014 (web)