I implemented image caption generation using Chainer. Given an input image, it generates a description of the image. The source code is available here: https://github.com/dsanno/chainer-image-caption
I used the algorithm from the following paper: Show and Tell: A Neural Image Caption Generator
Caption generation has already been implemented in Chainer by others, and I referred to that as well: Image caption generation by CNN and LSTM ~ Satoshi's Blog from Bloomington
The caption generation model used in the paper is roughly divided into three networks.
- ${\rm CNN}$ that converts an image into a vector. An existing image-classification model such as GoogleNet or [VGG_ILSVRC_19_layers](https://gist.github.com/ksimonyan/3785162f95cd2d5fee77#file-readme-md) is used.
- Word embedding $W_e$
- ${\rm LSTM}$ that takes a vector as input and outputs the occurrence probability of the next word
Implementing the model exactly as in the paper did not fit in GPU memory, so I modified it and implemented the following:
- ${\rm CNN}$ that converts an image into a vector (input: 224 x 224 x 3 dimensions, output: 4096 dimensions)
- Matrix $W_I$ that converts the image feature vector into the ${\rm LSTM}$ input (input: 4096 dimensions, output: 512 dimensions)
- Word embedding (word-to-vector conversion) $W_e$ (input: word ID, output: 512 dimensions)
- ${\rm LSTM}$ (input: 512 dimensions, output: 512 dimensions)
- $W_w$ that converts the ${\rm LSTM}$ output into word occurrence probabilities (input: 512 dimensions, output: vocabulary-size dimensions)
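As a concrete reference, the following is a minimal sketch of how this network could be written in Chainer. The class and method names are my own and do not necessarily match the linked repository.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class ImageCaption(chainer.Chain):
    """Image feature -> LSTM -> scores over the vocabulary."""

    def __init__(self, vocab_size, feature_size=4096, hidden_size=512):
        super(ImageCaption, self).__init__(
            w_i=L.Linear(feature_size, hidden_size),  # W_I: image feature -> LSTM input
            w_e=L.EmbedID(vocab_size, hidden_size),   # W_e: word ID -> vector
            lstm=L.LSTM(hidden_size, hidden_size),    # LSTM
            w_w=L.Linear(hidden_size, vocab_size),    # W_w: LSTM output -> word scores
        )

    def reset(self):
        self.lstm.reset_state()

    def input_image(self, feature):
        # Feed the image feature vector as the first LSTM input.
        self.lstm(F.dropout(self.w_i(feature)))

    def input_word(self, word):
        # Feed one word ID and return unnormalized scores for the next word.
        h = self.lstm(F.dropout(self.w_e(word)))
        return self.w_w(F.dropout(h))
```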
The explanation below is based on the model in the paper, but I think it is not difficult to translate it to the model I actually used.
The learning targets are $W_e$ and ${\rm LSTM}$; ${\rm CNN}$ uses its pretrained parameters as-is.
The training data are an image $I$ and a word sequence $\{S_t\}\ (t = 0 \ldots N)$, where $S_0$ is the sentence start symbol `<S>` and $S_N$ is the end symbol `</S>`. Training is done as follows.
In the paper, the negative log-likelihood

$$
L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)
$$

is used as the cost function, whereas my implementation uses softmax cross entropy. Also, the paper updates the parameters with plain SGD without momentum, while I used Adam (with the hyperparameters recommended in the Adam paper). I did try the negative log-likelihood with SGD as well, but training just converged more slowly with no apparent benefit, and I do not understand why the paper adopted it. As in the paper, I used dropout. The paper also mentions using an ensemble of models, but since the concrete method was not described, I did not implement it.
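Under these choices, one training step might look like the sketch below (softmax cross entropy summed over the sentence, Adam with its default hyperparameters). The `ImageCaption` interface and the data shapes are the assumptions from the sketch above, not the repository's actual code.

```python
import chainer.functions as F
from chainer import optimizers

def forward_loss(model, feature, sentence):
    # feature: (batch, 4096) float32 image features
    # sentence: (batch, N + 1) int32 word IDs; column 0 is <S>, the last word is </S>
    model.reset()
    model.input_image(feature)
    loss = 0
    for t in range(sentence.shape[1] - 1):
        y = model.input_word(sentence[:, t])
        # Cross entropy against the next word of the sentence.
        loss += F.softmax_cross_entropy(y, sentence[:, t + 1])
    return loss

model = ImageCaption(vocab_size=5000)  # vocabulary size here is hypothetical
optimizer = optimizers.Adam()  # Adam's defaults follow the recommendations in its paper
optimizer.setup(model)

# One update:
#   model.zerograds()
#   forward_loss(model, feature_batch, sentence_batch).backward()
#   optimizer.update()
```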
When generating a caption with a trained model, word occurrence probabilities are computed in order from the beginning of the sentence, and the word sequence with the highest product of word probabilities becomes the caption. Since examining every possible word sequence is infeasible, only the top $M$ candidate sequences are kept at each step (beam search). In this implementation, $M = 20$.
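A minimal sketch of such a beam search, assuming the hypothetical `ImageCaption` interface above and Chainer's v2-style configuration API to disable dropout at generation time:

```python
import numpy as np
import chainer
import chainer.functions as F

def beam_search(model, feature, bos_id, eos_id, beam_width=20, max_length=60):
    # Work in log space so the product of word probabilities becomes a sum.
    with chainer.no_backprop_mode(), chainer.using_config('train', False):
        model.reset()
        model.input_image(feature)
        # Each candidate: (log probability, word IDs so far, LSTM state after the last word).
        candidates = [(0.0, [bos_id], (model.lstm.c, model.lstm.h))]
        finished = []
        for _ in range(max_length):
            nexts = []
            for log_p, words, (c, h) in candidates:
                if words[-1] == eos_id:
                    finished.append((log_p, words))
                    continue
                model.lstm.c, model.lstm.h = c, h  # restore this candidate's LSTM state
                x = np.asarray([words[-1]], dtype=np.int32)
                p = F.softmax(model.input_word(x)).data[0]
                state = (model.lstm.c, model.lstm.h)
                for w in np.argsort(p)[::-1][:beam_width]:
                    nexts.append((log_p + float(np.log(p[w])), words + [int(w)], state))
            if not nexts:
                break  # every candidate has reached </S>
            candidates = sorted(nexts, key=lambda cand: cand[0], reverse=True)[:beam_width]
        finished.extend((lp, ws) for lp, ws, _ in candidates)
        return max(finished, key=lambda cand: cand[0])[1]
```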
For the training data, I used MSCOCO's annotated image dataset. However, instead of the data distributed by MSCOCO itself, I used the data distributed on the following site, which consists of feature vectors extracted from the images with VGG_ILSVRC_19_layers together with the tokenized annotation word sequences. Using this data saved both the work of extracting feature vectors from the images and the preprocessing of the annotations (splitting sentences into words).
Deep Visual-Semantic Alignments for Generating Image Descriptions
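For reference, here is a sketch of loading such preextracted data. The file names and keys (`vgg_feats.mat` holding a 4096 x N `feats` matrix, plus a JSON file of tokenized captions) are my assumption about the distributed format; adjust them to the actual download.

```python
import json
import numpy as np
from scipy.io import loadmat

# Assumed layout: 'feats' is 4096 x N, one column per image.
features = loadmat('vgg_feats.mat')['feats'].astype(np.float32).T  # -> (N, 4096)
with open('dataset.json') as f:
    dataset = json.load(f)

pairs = []  # (image index, token list) training pairs
for i, image in enumerate(dataset['images']):
    for sentence in image['sentences']:
        pairs.append((i, sentence['tokens']))
```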
According to the following site, MSCOCO's annotation data is apparently hard to handle because of severe notational inconsistency (sentences beginning with an uppercase or a lowercase letter, with or without a trailing period).
Of the words appearing in the training data, only those that appear five or more times were used; the rest were treated as unknown words during training.
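A minimal sketch of this vocabulary construction; the token names (`<S>`, `</S>`, `<unk>`) are my own choice:

```python
from collections import Counter

def build_vocab(tokenized_captions, min_count=5):
    # Words seen at least `min_count` times get their own ID;
    # everything else shares a single unknown-word ID.
    counts = Counter(w for tokens in tokenized_captions for w in tokens)
    vocab = {'<S>': 0, '</S>': 1, '<unk>': 2}
    for word, n in counts.items():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Wrap a caption with the start/end symbols and map words to IDs.
    unk = vocab['<unk>']
    return [vocab['<S>']] + [vocab.get(w, unk) for w in tokens] + [vocab['</S>']]
```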
There are metrics such as BLEU, METEOR, and CIDEr for evaluating generated captions, but I did not compute them this time.
I generated captions for public domain images downloaded from PublicDomainPictures.net. The top five generated strings for each image are shown.
Some were generated correctly, while others were clearly wrong.