Load a Caffe model with Chainer and classify images

This article shows how to load a Caffe model with Chainer and classify images. The Chainer samples also include image classification, but they only output the recognition rate, so you cannot tell which image was classified into which category. The code here outputs the category name and score as the classification result. The source code is available here (a version of the code in this article organized into a class). If the article is hard to follow, please clone the repository.

Download the Caffe model

This time we use bvlc_googlenet as the model. It can classify images into 1000 categories. The bvlc_googlenet page has a link to the caffemodel file, so download it from there.

Generate a label file

A label file is needed to associate the category number in the classification result with a category name. The following script downloads the ImageNet-related files: https://github.com/BVLC/caffe/blob/master/data/ilsvrc12/get_ilsvrc_aux.sh The label file is generated by processing synset_words.txt, which is included in the caffe_ilsvrc12.tar.gz archive referenced in that script.

`synset_words.txt`


n01440764 tench, Tinca tinca
n01443537 goldfish, Carassius auratus
n01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias
n01491361 tiger shark, Galeocerdo cuvieri

Run the following commands:

wget http://dl.caffe.berkeleyvision.org/caffe_ilsvrc12.tar.gz
tar -xf caffe_ilsvrc12.tar.gz
sed -e 's/^[^ ]* //g' synset_words.txt > labels.txt

The label file is created.

`labels.txt`


tench, Tinca tinca
goldfish, Carassius auratus
great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias
tiger shark, Galeocerdo cuvieri
hammerhead, hammerhead shark

Two lines both read "crane", which is confusing, so change line 135 to "crane (bird)" and line 518 to "crane (machine)".
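The label-file steps above can also be done in Python. The following is a minimal sketch, not the method used in this article: the helper names `strip_synset_ids` and `rename_lines` are made up for illustration, and the two "crane" sample lines use placeholder IDs rather than real synset IDs.

```python
def strip_synset_ids(lines):
    """Drop the leading ID (e.g. 'n01440764') from each line."""
    labels = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split once on the first space: ID on the left, names on the right.
        _, _, names = line.partition(' ')
        labels.append(names)
    return labels


def rename_lines(labels, renames):
    """Return a copy of labels with the given 1-based line numbers replaced."""
    result = list(labels)
    for line_no, new_name in renames.items():
        result[line_no - 1] = new_name
    return result


sample = [
    'n01440764 tench, Tinca tinca',
    'n00000001 crane',  # placeholder ID, not a real synset ID
    'n00000002 crane',  # placeholder ID, not a real synset ID
]
labels = strip_synset_ids(sample)
labels = rename_lines(labels, {2: 'crane (bird)', 3: 'crane (machine)'})
print(labels)
```

For the full file, the equivalent call would be `rename_lines(labels, {135: 'crane (bird)', 518: 'crane (machine)'})`, using the line numbers given above.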

Convert the image to a NumPy array

Use Pillow to read the image, resize it, crop it, and then convert it to a NumPy array.

import numpy as np
from PIL import Image

# Input image size expected by the model
image_shape = (224, 224)

# Read the image and convert it to RGB
image = Image.open('sample.png').convert('RGB')

# Resize so the shorter side matches the input size, then center-crop.
# Integer division (//) keeps the sizes as ints on both Python 2 and 3.
image_w, image_h = image_shape
w, h = image.size
if w > h:
    shape = (image_w * w // h, image_h)
else:
    shape = (image_w, image_h * h // w)
x = (shape[0] - image_w) // 2
y = (shape[1] - image_h) // 2
image = image.resize(shape)
image = image.crop((x, y, x + image_w, y + image_h))
pixels = np.asarray(image).astype(np.float32)

# pixels is 3D with axes [Y coordinate, X coordinate, RGB].
# The input data must be 4D with axes [image index, BGR, Y coordinate, X coordinate],
# so rearrange the array.
# Convert from RGB to BGR
pixels = pixels[:, :, ::-1]

# Swap the axes to [BGR, Y coordinate, X coordinate]
pixels = pixels.transpose(2, 0, 1)

# Subtract the mean image (per-channel mean values in BGR order)
mean_image = np.ndarray((3, 224, 224), dtype=np.float32)
mean_image[0] = 103.939
mean_image[1] = 116.779
mean_image[2] = 123.68
pixels -= mean_image

# Add the image index axis to make it 4D
pixels = pixels.reshape((1,) + pixels.shape)
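The resize-and-crop arithmetic is the part of this step most sensitive to integer versus float division, so it may help to look at it in isolation. The sketch below uses a hypothetical helper, `resize_and_crop_box`, that is not part of the original code; it only computes the sizes and performs no image I/O.

```python
def resize_and_crop_box(w, h, target_w=224, target_h=224):
    """Return (resize_shape, crop_box) for a center crop.

    The shorter side is scaled to the target size, then the longer side
    is cropped equally from both ends. Assumes a square target, as in
    this article.
    """
    if w > h:
        shape = (target_w * w // h, target_h)
    else:
        shape = (target_w, target_h * h // w)
    x = (shape[0] - target_w) // 2
    y = (shape[1] - target_h) // 2
    return shape, (x, y, x + target_w, y + target_h)


# A 640x480 landscape image: height becomes 224, width is cropped.
print(resize_and_crop_box(640, 480))  # ((298, 224), (37, 0, 261, 224))
# A 480x640 portrait image: width becomes 224, height is cropped.
print(resize_and_crop_box(480, 640))  # ((224, 298), (0, 37, 224, 261))
```

The two returned tuples can be passed to Pillow's `image.resize(shape)` and `image.crop(box)` respectively.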

Load the caffemodel and classify the image

Load the caffemodel and use the array you just generated as input data.

import chainer
import chainer.functions as F
from chainer.functions import caffe

# Load the Caffe model
func = caffe.CaffeFunction('bvlc_googlenet.caffemodel')

# Get the output of layer 'loss3/classifier' and apply softmax
# (volatile= and train= are Chainer v1 arguments)
x = chainer.Variable(pixels, volatile=True)
y, = func(inputs={'data': x}, outputs=['loss3/classifier'], disable=['loss1/ave_pool', 'loss2/ave_pool'], train=False)
prediction = F.softmax(y)
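`F.softmax` turns the raw layer outputs into values that are non-negative and sum to 1, which is why they can be printed as percentages. As an illustration only, here is a plain-Python sketch of that computation for a single score vector; it is not Chainer's implementation, and the sample scores are made up.

```python
import math


def softmax(scores):
    """Softmax over a list of floats, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three categories
print(['{:.3f}'.format(p) for p in probs])
```

The largest raw score always maps to the largest probability, so the ranking of categories is the same before and after softmax.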

Output the result

Print the classification results.

# Read the labels, one per line
with open('labels.txt') as f:
    categories = [line.strip() for line in f]

# Pair each score with its label and sort in descending order of score
result = zip(prediction.data.reshape((prediction.data.size,)), categories)
result = sorted(result, reverse=True)

# Show the top 10 results
for i, (score, label) in enumerate(result[:10]):
    print('{:>3d} {:>6.2f}% {}'.format(i + 1, score * 100, label))
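Sorting all 1000 pairs works fine. If only the top few are needed, `heapq.nlargest` from the standard library is an alternative. A minimal sketch with made-up scores and labels (the helper name `top_k` is hypothetical):

```python
import heapq


def top_k(scores, labels, k=10):
    """Return the k (score, label) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(scores, labels))


scores = [0.05, 0.70, 0.25]            # made-up probabilities
labels = ['pier', 'mosque', 'stage']   # made-up label list
for i, (score, label) in enumerate(top_k(scores, labels, k=2)):
    print('{:>3d} {:>6.2f}% {}'.format(i + 1, score * 100, label))
```

Both approaches compare the tuples by score first, so they produce the same ordering.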

Recognition example

Classifying a landscape photo taken in Asakusa gave the following result. The top category is mosque. I would have liked it to recognize skyscrapers or towers, but those do not seem to be among the categories.

  1  38.85% mosque
  2   6.07% fire engine, fire truck
  3   5.15% traffic light, traffic signal, stoplight
  4   3.97% radio, wireless
  5   3.25% cinema, movie theater, movie theatre, movie house, picture palace
  6   2.14% pier
  7   2.01% limousine, limo
  8   1.92% stage
  9   1.89% trolleybus, trolley coach, trackless trolley
 10   1.61% crane (machine)

Conclusion

Several trained Caffe models are publicly available, and anyone can use them to classify images. This time only one image was input, but multiple images can be input at the same time. Loading the caffemodel takes time, so it is better to load it once and keep it in memory while feeding it images.
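To classify several images in one call, the preprocessed arrays can be joined along the first (image index) axis. The following is a sketch under the assumption that each image has already been converted to a `(1, 3, 224, 224)` float32 array as in the conversion step; zero-filled arrays stand in for real images, and `make_batch` is a hypothetical helper.

```python
import numpy as np


def make_batch(arrays):
    """Join per-image arrays of shape (1, 3, H, W) into one (N, 3, H, W) batch."""
    return np.concatenate(arrays, axis=0)


# Zero-filled placeholders standing in for three preprocessed images.
images = [np.zeros((1, 3, 224, 224), dtype=np.float32) for _ in range(3)]
batch = make_batch(images)
print(batch.shape)  # (3, 3, 224, 224)
```

The batch would then be wrapped in a `chainer.Variable` in place of the single-image array, and the softmax output would hold one row of scores per image, so the result-printing code would need to handle each row separately.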

Reference

Import Caffe model using Chainer and let it recognize images on Mac without CUDA