Creating a fooling image for a caption generation model

Original paper: "Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images". To probe what DNNs actually learn, the authors generated images that humans cannot recognize but that a DNN classifies into a known class with 99% or higher confidence. For example, they look like this:

[figure: fooling images quoted from the paper]

Let's generate this kind of fooling image for a caption generation model as well.

Normal caption generation example

Feeding in an image of horses like this one:

[figure: input photo of horses with the three generated captions]

produces captions along these lines; the model appears to see two horse-like objects. The probability of a sentence is computed from the probabilities of its words, and the three most likely sentences are displayed. The smaller the number on the left, the better the sentence is judged to fit the image. (Concretely, it is the sign-inverted sum of the log-softmax of each word, divided by the number of words.)
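As a rough sketch of that scoring rule (the function and variable names here are mine, not from the original code):

```python
import numpy as np

def caption_score(word_log_probs):
    """Length-normalized negative log-likelihood of a caption.

    word_log_probs: one log-softmax value per generated word.
    Smaller scores mean the model considers the caption a better fit.
    """
    return -sum(word_log_probs) / len(word_log_probs)

# Hypothetical example: per-word probabilities for a 6-word caption
log_probs = np.log([0.6, 0.3, 0.8, 0.5, 0.7, 0.4])
print(caption_score(log_probs))  # ~0.65
```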

On the other hand, feeding in an image of random pixel values generates the following:

[figure: random-pixel image with the generated captions]

The output is still a sentence, but the score is large; in other words, the model cannot tell what is in the image.

Fooling image generation results

I was able to generate them reasonably well, for the time being:

[figure: the two generated fooling images]

Two images were generated. Neither is recognizable to a human, yet the model assigns a high probability to sentences about horses (i.e., the score is smaller than in the random-pixel example above).

Top: direct encoding, where each pixel of the image is itself a gene. Bottom: indirect encoding, where the pixels are given some spatial correlation. In the paper, indirect encoding produced patterns beautiful enough to be exhibited as art, but simply building an NN to introduce correlation did not work that well here. (Maybe the paper's results were just too good.)
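For reference, here is a minimal sketch of what such an indirect encoding can look like, in the spirit of the CPPNs used in the paper: the genome is the weights of a tiny network that maps pixel coordinates to intensities, so nearby pixels are naturally correlated. The network size, activations, and names are all my assumptions, not the exact setup from the paper or from this experiment.

```python
import numpy as np

H, W, HIDDEN = 64, 64, 16

def render(genome):
    # Split the flat genome into the two weight matrices.
    w1 = genome[: 2 * HIDDEN].reshape(2, HIDDEN)
    w2 = genome[2 * HIDDEN :].reshape(HIDDEN, 3)
    # Normalized (y, x) coordinates for every pixel.
    ys, xs = np.mgrid[0:H, 0:W] / np.array([H, W]).reshape(2, 1, 1)
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (H*W, 2)
    hidden = np.tanh(coords @ w1)
    rgb = 1 / (1 + np.exp(-(hidden @ w2)))                # sigmoid to [0, 1]
    return rgb.reshape(H, W, 3)

genome = np.random.randn(2 * HIDDEN + HIDDEN * 3)
image = render(genome)  # a spatially correlated pattern, not per-pixel noise
```

Mutating the genome deforms the whole pattern smoothly, which is exactly the correlation that direct per-pixel encoding lacks.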

How it was done

The image was evolved so that the probability of generating one target sentence becomes high. The top caption finally generated in the first example, "a couple of horses are standing in a field", was selected as the target, and the image was evolved to raise the probability of generating this sentence. Each generation produced eight new individuals and kept the eight best; with direct encoding, roughly 300 generations gave the result above.
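A minimal sketch of that loop, assuming a helper `caption_nll(image, target)` that returns the model's length-normalized negative log-likelihood for the target sentence (the helper, image size, and mutation noise are assumptions, not the original code):

```python
import numpy as np

POP, GENERATIONS, SIGMA = 8, 300, 0.1
target = "a couple of horses are standing in a field"

def evolve(caption_nll, shape=(224, 224, 3)):
    population = [np.random.rand(*shape) for _ in range(POP)]
    for _ in range(GENERATIONS):
        # Direct encoding: mutate the pixels of each parent directly.
        children = [np.clip(p + SIGMA * np.random.randn(*shape), 0, 1)
                    for p in population]
        # Keep the 8 fittest of parents + children (lower score is better).
        population = sorted(population + children,
                            key=lambda img: caption_nll(img, target))[:POP]
    return population[0]
```

This is a simple (8+8) evolution strategy: since lower scores win, selection steadily pushes the population toward images the model captions as horses.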

About the generative model

This time, the fooling images were generated for the caption generation model Show, Attend and Tell. The model's BLEU-1/2/3/4 scores on COCO were 0.689 / 0.503 / 0.359 / 0.255.
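For context, scores like these are BLEU with n-gram orders 1 through 4; a hedged illustration of how such numbers are typically computed (the toy sentences here are made up, and nltk is assumed to be available):

```python
from nltk.translate.bleu_score import corpus_bleu

# One candidate caption scored against its COCO-style reference(s).
references = [[["a", "couple", "of", "horses", "are",
                "standing", "in", "a", "field"]]]
candidate = [["two", "horses", "are", "standing", "in", "a", "field"]]

for n in range(1, 5):
    weights = tuple([1.0 / n] * n)  # uniform weights up to n-grams
    print(f"BLEU-{n}:", corpus_bleu(references, candidate, weights=weights))
```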

Summary

Using an evolutionary algorithm, we succeeded in generating fooling images that raise the probability of a target sentence under a caption generation model. It might be interesting to check whether these images also fool other models trained on the same CNN, or to evolve an image against multiple target sentences at once; try it if you feel like it.
