I implemented image caption generation using Chainer. Given an input image, it generates a description of the image. The source code is available here: https://github.com/dsanno/chainer-image-caption
I used the algorithm from the following paper: Show and Tell: A Neural Image Caption Generator
Caption generation has already been implemented in Chainer by others, and I referred to that as well: Image caption generation by CNN and LSTM ~ Satoshi's Blog from Bloomington
The caption generation model used in the paper is roughly divided into three networks.
- ${\rm CNN}$ that converts an image into a vector. An existing image-classification model such as GoogleNet or [VGG_ILSVRC_19_layers](https://gist.github.com/ksimonyan/3785162f95cd2d5fee77#file-readme-md) is used.
- Word embedding $W_e$
- ${\rm LSTM}$ that takes a vector as input and outputs the occurrence probability of the next word
Implementing the model exactly as in the paper did not fit in GPU memory, so I modified it and implemented the following:
- ${\rm CNN}$ that converts an image into a vector (input: 224 x 224 x 3 dimensions, output: 4096 dimensions)
- Matrix $W_I$ that converts the image feature vector into the ${\rm LSTM}$ input (input: 4096 dimensions, output: 512 dimensions)
- Word embedding (word-to-vector conversion) $W_e$ (input: word ID, output: 512 dimensions)
- ${\rm LSTM}$ (input: 512 dimensions, output: 512 dimensions)
- $W_w$ that converts the ${\rm LSTM}$ output into word occurrence probabilities (input: 512 dimensions, output: vocabulary-size dimensions)
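As a concrete reference, the following is a minimal sketch of how this network could be written in Chainer. The class and method names are my own and do not necessarily match the linked repository.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class ImageCaption(chainer.Chain):
    """Image feature -> LSTM -> scores over the vocabulary."""

    def __init__(self, vocab_size, feature_size=4096, hidden_size=512):
        super(ImageCaption, self).__init__(
            w_i=L.Linear(feature_size, hidden_size),  # W_I: image feature -> LSTM input
            w_e=L.EmbedID(vocab_size, hidden_size),   # W_e: word ID -> vector
            lstm=L.LSTM(hidden_size, hidden_size),    # LSTM
            w_w=L.Linear(hidden_size, vocab_size),    # W_w: LSTM output -> word scores
        )

    def reset(self):
        self.lstm.reset_state()

    def input_image(self, feature):
        # Feed the image feature vector as the first LSTM input.
        self.lstm(F.dropout(self.w_i(feature)))

    def input_word(self, word):
        # Feed one word ID and return unnormalized scores for the next word.
        h = self.lstm(F.dropout(self.w_e(word)))
        return self.w_w(F.dropout(h))
```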
The explanation below is based on the model in the paper, but I think it is not difficult to translate it to the model I actually used.
The learning targets are $W_e$ and ${\rm LSTM}$; ${\rm CNN}$ uses its pretrained parameters as-is.
The training data are an image $I$ and a word sequence $\{S_t\}\ (t = 0 \ldots N)$, where $S_0$ is the sentence start symbol `<S>` and $S_N$ is the end symbol `</S>`. Training is done as follows.
In the paper, the negative log-likelihood

$$
L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)
$$

is used as the cost function, whereas my implementation uses softmax cross entropy. Also, the paper updates the parameters with plain SGD without momentum, while I used Adam (with the hyperparameters recommended in the Adam paper). I did try the negative log-likelihood with SGD as well, but training just converged more slowly with no apparent benefit, and I do not understand why the paper adopted it. As in the paper, I used dropout. The paper also mentions using an ensemble of models, but since the concrete method was not described, I did not implement it.
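Under these choices, one training step might look like the sketch below (softmax cross entropy summed over the sentence, Adam with its default hyperparameters). The `ImageCaption` interface and the data shapes are the assumptions from the sketch above, not the repository's actual code.

```python
import chainer.functions as F
from chainer import optimizers

def forward_loss(model, feature, sentence):
    # feature: (batch, 4096) float32 image features
    # sentence: (batch, N + 1) int32 word IDs; column 0 is <S>, the last word is </S>
    model.reset()
    model.input_image(feature)
    loss = 0
    for t in range(sentence.shape[1] - 1):
        y = model.input_word(sentence[:, t])
        # Cross entropy against the next word of the sentence.
        loss += F.softmax_cross_entropy(y, sentence[:, t + 1])
    return loss

model = ImageCaption(vocab_size=5000)  # vocabulary size here is hypothetical
optimizer = optimizers.Adam()  # Adam's defaults follow the recommendations in its paper
optimizer.setup(model)

# One update:
#   model.zerograds()
#   forward_loss(model, feature_batch, sentence_batch).backward()
#   optimizer.update()
```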
When generating a caption with a trained model, word occurrence probabilities are computed in order from the beginning of the sentence, and the word sequence with the highest product of word probabilities becomes the caption. Since examining every possible word sequence is infeasible, only the top $M$ candidate sequences are kept at each step (beam search). In this implementation, $M = 20$.
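A minimal sketch of such a beam search, assuming the hypothetical `ImageCaption` interface above and Chainer's v2-style configuration API to disable dropout at generation time:

```python
import numpy as np
import chainer
import chainer.functions as F

def beam_search(model, feature, bos_id, eos_id, beam_width=20, max_length=60):
    # Work in log space so the product of word probabilities becomes a sum.
    with chainer.no_backprop_mode(), chainer.using_config('train', False):
        model.reset()
        model.input_image(feature)
        # Each candidate: (log probability, word IDs so far, LSTM state after the last word).
        candidates = [(0.0, [bos_id], (model.lstm.c, model.lstm.h))]
        finished = []
        for _ in range(max_length):
            nexts = []
            for log_p, words, (c, h) in candidates:
                if words[-1] == eos_id:
                    finished.append((log_p, words))
                    continue
                model.lstm.c, model.lstm.h = c, h  # restore this candidate's LSTM state
                x = np.asarray([words[-1]], dtype=np.int32)
                p = F.softmax(model.input_word(x)).data[0]
                state = (model.lstm.c, model.lstm.h)
                for w in np.argsort(p)[::-1][:beam_width]:
                    nexts.append((log_p + float(np.log(p[w])), words + [int(w)], state))
            if not nexts:
                break  # every candidate has reached </S>
            candidates = sorted(nexts, key=lambda cand: cand[0], reverse=True)[:beam_width]
        finished.extend((lp, ws) for lp, ws, _ in candidates)
        return max(finished, key=lambda cand: cand[0])[1]
```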
For the training data, I used MSCOCO's annotated image dataset. However, instead of the data distributed by MSCOCO itself, I used the data distributed on the following site, which consists of feature vectors extracted from the images with VGG_ILSVRC_19_layers together with the tokenized annotation word sequences. Using this data saved both the work of extracting feature vectors from the images and the preprocessing of the annotations (splitting sentences into words).
Deep Visual-Semantic Alignments for Generating Image Descriptions
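For reference, here is a sketch of loading such preextracted data. The file names and keys (`vgg_feats.mat` holding a 4096 x N `feats` matrix, plus a JSON file of tokenized captions) are my assumption about the distributed format; adjust them to the actual download.

```python
import json
import numpy as np
from scipy.io import loadmat

# Assumed layout: 'feats' is 4096 x N, one column per image.
features = loadmat('vgg_feats.mat')['feats'].astype(np.float32).T  # -> (N, 4096)
with open('dataset.json') as f:
    dataset = json.load(f)

pairs = []  # (image index, token list) training pairs
for i, image in enumerate(dataset['images']):
    for sentence in image['sentences']:
        pairs.append((i, sentence['tokens']))
```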
According to the following site, MSCOCO's annotation data is apparently hard to handle because of severe notational inconsistency (sentences beginning with an uppercase or a lowercase letter, with or without a trailing period).
Of the words appearing in the training data, only those that appear five or more times were used; the rest were treated as unknown words during training.
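A minimal sketch of this vocabulary construction; the token names (`<S>`, `</S>`, `<unk>`) are my own choice:

```python
from collections import Counter

def build_vocab(tokenized_captions, min_count=5):
    # Words seen at least `min_count` times get their own ID;
    # everything else shares a single unknown-word ID.
    counts = Counter(w for tokens in tokenized_captions for w in tokens)
    vocab = {'<S>': 0, '</S>': 1, '<unk>': 2}
    for word, n in counts.items():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Wrap a caption with the start/end symbols and map words to IDs.
    unk = vocab['<unk>']
    return [vocab['<S>']] + [vocab.get(w, unk) for w in tokens] + [vocab['</S>']]
```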
There are metrics such as BLEU, METEOR, and CIDEr for evaluating generated captions, but I did not compute them this time.
I generated captions for public domain images downloaded from PublicDomainPictures.net. The top five generated strings for each image are shown.
Some were generated correctly, while others were clearly wrong.