Deep learning has become commonplace, and a quick search turns up plenty of samples that recognize all sorts of things in images. The results are easy to understand and fun to look at, so I wanted to try recognizing something myself. But since everyone seems to classify animals or their favorite actresses, I wanted to do something a little different.
So **I used images of railroad cars as training data and tried to guess the series of the car in an input image with deep learning.** That said, the level is probably about that of a small child who has just become interested in trains.
↓ The goal is to feed in an image of a car like this and have the model answer "E231 series (500 subseries)".
Let's train a neural network that, given an **exterior photo of a railcar**, picks one of five classes! This time the targets are five types of JR East trains running in the Tokyo suburbs. A person can easily tell them apart by the color of the body stripe, but it may be harder for a computer, which first has to figure out which part of the image is even the train.
The image here is my own photo.
I referred to this article: Making an image recognition "○○ discriminator" with TensorFlow --Qiita
Automatically download the images that appear in Google Image Search.
pip install google_images_download
googleimagesdownload -k "yamanote line"
googleimagesdownload -k "Chuo Line Rapid"
googleimagesdownload -k E235
:
:
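If I recall the library's Python interface correctly, the same tool can also be driven from a script rather than the command line. This is only a rough sketch; the keyword and limit below are my own example values, not the ones actually used for this article.

```python
# Rough sketch of using google_images_download from Python instead of the CLI.
# Keyword and limit are example values only.
from google_images_download import google_images_download

downloader = google_images_download.googleimagesdownload()
downloader.download({"keywords": "yamanote line", "limit": 100})
```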
Of the collected images, use only photos that show the exterior of a car. The following kinds of images are excluded:
- Images with no train in them (route maps, station buildings only, etc.)
- Images showing multiple trains
- Interior photos
- Model trains
- CG images
I tried various keywords and eventually collected a little over 100 images per series, 540 images in total across the five classes. That is nowhere near enough data, but even this much was surprisingly hard to gather... changing the keywords mostly just returned the same images again.
Next, sort the collected images into one folder per class.
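As a minimal sketch of the layout I'm assuming here, the folder names match the `classes` list and `data_dir` that appear in train.py below:

```python
# Create one subfolder per class under ./images (the data_dir used in train.py).
# The collected photos are then sorted into these folders by hand.
from pathlib import Path

classes = ["E231-yamanote", "E233-chuo", "E233-keihintohoku", "E233-nanbu", "E235-yamanote"]
for c in classes:
    Path("./images", c).mkdir(parents=True, exist_ok=True)
```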
Now that the images are ready, it's time to train. I basically followed the contents of this article: I made a face recognition AI by fine-tuning VGG16 using a GPU --Qiita
Each of the five series has a completely different color scheme (there are two Yamanote Line types, but they look quite different), so this should be an easier task than telling actresses' faces apart. Still, with only about 500 images it is going to be tough. To cope with the small data set, we **fine-tune a pretrained VGG16 model**. The VGG16 model is available directly from TensorFlow (Keras); no separate package installation is needed. Keras: What are VGG16 and VGG19? --Qiita
VGG16 is a model trained for 1000-class image classification that has nothing to do with railroad cars, but its learned weights encode features that are useful for image recognition in general, so we replace only the layers near the output and retrain them for this task. That makes it possible to solve a classification problem unrelated to the original model's training data. Quite mysterious. The input is a 128 x 128 color image [^ 1]; after the VGG16 convolutional base, we attach a 256-unit fully connected layer, Dropout, and a 5-unit fully connected (output) layer. Only the newly added fully connected layers and the Conv2D-Conv2D-Conv2D block closest to VGG16's output are trained; the remaining Conv2D layers keep their pretrained weights.
[^ 1]: 150x150 as in the original article would have been fine, but VGG16 shrinks the image by a factor of 32 in each direction, so I decided to use a multiple of 32. The 224x224 used for the original VGG16 training ran out of memory (probably because TensorFlow didn't work well on Windows 10 for me, so it runs on Linux in a virtual machine).
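As a quick sanity check on that size choice (my own aside, not from the original article): VGG16's convolutional part halves the input five times, so a 128x128 image comes out as a 4x4x512 feature map.

```python
from tensorflow.keras.applications import VGG16

# include_top=False drops the classifier head; five pooling layers shrink 128 -> 4.
vgg16 = VGG16(include_top=False, weights="imagenet", input_shape=(128, 128, 3))
print(vgg16.output_shape)  # (None, 4, 4, 512)
```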
As introduced in various places, the input images are fed to training with random perturbations such as zooming and horizontal flipping via `ImageDataGenerator`. There are only about 500 original images, but **each epoch applies different perturbations, which effectively inflates the data set**. Python --About Keras ImageDataGenerator | teratail
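Just to illustrate that point (a toy check, not part of train.py): drawing the same files twice from an `ImageDataGenerator` produces differently perturbed pixels each time.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255, zoom_range=0.2, horizontal_flip=True)
gen = datagen.flow_from_directory("./images", target_size=(128, 128),
                                  batch_size=4, shuffle=False)
batch1, _ = next(gen)
gen.reset()            # rewind to the same files...
batch2, _ = next(gen)  # ...but the random zoom/flip differs, so the arrays differ
```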
train.py
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Flatten, Dropout
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.models import Model
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
# Training settings
batch_size = 32
epochs = 30

# Feature settings
# classes must match the subfolder names under data_dir
classes = ["E231-yamanote", "E233-chuo", "E233-keihintohoku", "E233-nanbu", "E235-yamanote"]
num_classes = len(classes)
img_width, img_height = 128, 128
feature_dim = (img_width, img_height, 3)

# File path
data_dir = "./images"
# === Image preparation ===
datagen = ImageDataGenerator(
    rescale=1.0 / 255,  # convert each pixel value to the range [0, 1]
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.1
)
train_generator = datagen.flow_from_directory(
    data_dir,
    subset="training",
    target_size=(img_width, img_height),
    color_mode="rgb",
    classes=classes,
    class_mode="categorical",
    batch_size=batch_size,
    shuffle=True)
validation_generator = datagen.flow_from_directory(
    data_dir,
    subset="validation",
    target_size=(img_width, img_height),
    color_mode="rgb",
    classes=classes,
    class_mode="categorical",
    batch_size=batch_size)
# Get the number of images and compute the number of mini-batches per epoch (ceiling division)
num_train_samples = train_generator.n
num_validation_samples = validation_generator.n
steps_per_epoch_train = (num_train_samples - 1) // batch_size + 1
steps_per_epoch_validation = (num_validation_samples - 1) // batch_size + 1
# === Model definition ===
# Start from the pretrained VGG16 model and retrain only the layers near the output;
# the parameters up to block4_pool are frozen
vgg16 = VGG16(include_top=False, weights="imagenet", input_shape=feature_dim)
for layer in vgg16.layers[:15]:
    layer.trainable = False

# Build the full model on top of the VGG16 base
layer_input = Input(shape=feature_dim)
layer_vgg16 = vgg16(layer_input)
layer_flat = Flatten()(layer_vgg16)
layer_fc = Dense(256, activation="relu")(layer_flat)
layer_dropout = Dropout(0.5)(layer_fc)
layer_output = Dense(num_classes, activation="softmax")(layer_dropout)
model = Model(layer_input, layer_output)
model.summary()
model.compile(loss="categorical_crossentropy",
              optimizer=SGD(lr=1e-3, momentum=0.9),
              metrics=["accuracy"])
# === Training ===
cp_cb = ModelCheckpoint(
    filepath="weights.{epoch:02d}-{loss:.4f}-{val_loss:.4f}.hdf5",
    monitor="val_loss",
    verbose=1,
    mode="auto")
reduce_lr_cb = ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.5,
    patience=1,
    verbose=1)
history = model.fit(
    train_generator,
    steps_per_epoch=steps_per_epoch_train,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=steps_per_epoch_validation,
    callbacks=[cp_cb, reduce_lr_cb])
# === Plot the accuracy over epochs ===
plt.plot(range(1, len(history.history["accuracy"]) + 1),
         history.history["accuracy"],
         label="acc", ls="-", marker="o")
plt.plot(range(1, len(history.history["val_accuracy"]) + 1),
         history.history["val_accuracy"],
         label="val_acc", ls="-", marker="x")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.legend(loc="best")
plt.savefig("accuracy.png")
plt.show()
After running 30 epochs, the accuracy on the training and validation data over time looks like this. I trained on a laptop with no GPU, running the CPU flat out (all 4 cores); one epoch took about a minute, so the whole run finished in roughly 30 minutes. The validation accuracy stops improving around epoch 10, but it reaches 94% on this 5-choice problem. Well done, considering how little data there is!
The `ModelCheckpoint` callback automatically saves the model at the end of each epoch. This time the model with the smallest validation loss was the one from epoch 17, `weights.17-0.1049-0.1158.hdf5`, so that is the one used for identification.
import numpy as np
print(np.argmin(history.history["val_loss"]) + 1)
# 17 (may differ from run to run)
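As a side note (my own suggestion, not what the script above does), `ModelCheckpoint` can also be told to keep only the best checkpoint, which saves hunting for it afterwards:

```python
from tensorflow.keras.callbacks import ModelCheckpoint

cp_cb = ModelCheckpoint(
    filepath="best_model.hdf5",   # hypothetical file name
    monitor="val_loss",
    save_best_only=True,          # overwrite only when val_loss improves
    verbose=1)
```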
I set the optimizer to `SGD`; with `Adam` and the like, training did not converge well.
This is probably related to the fine-tuning. For details, see the following article.
[TensorFlow] Optimizer also has Weight --Qiita
Let's actually run identification on the images shown at the beginning.
predict.py
import sys

def usage():
    print("Usage: {0} <input_filename>".format(sys.argv[0]), file=sys.stderr)
    exit(1)

# === Get the input image file name from the command-line argument ===
if len(sys.argv) != 2:
    usage()
input_filename = sys.argv[1]
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import image
# Feature settings (same values as in train.py)
classes = ["E231-yamanote", "E233-chuo", "E233-keihintohoku", "E233-nanbu", "E235-yamanote"]
num_classes = len(classes)
img_width, img_height = 128, 128
feature_dim = (img_width, img_height, 3)
# === Load the trained model ===
model = tf.keras.models.load_model("weights.17-0.1049-0.1158.hdf5")

# === Load the input image ===
img = image.load_img(input_filename, target_size=(img_height, img_width))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
# convert to the range [0, 1], as during training
x = x / 255.0

# Predict the vehicle type
pred = model.predict(x)[0]

# Show the results
for cls, prob in zip(classes, pred):
    print("{0:18}{1:8.4f}%".format(cls, prob * 100.0))
If you pass an image file name to `predict.py` as a command-line argument, it outputs the identification result.
python3 predict.py filename.jpg
All the input samples shown here are my own photos. Some of the images have been edited for posting here, but the unmodified originals were used for the actual training and identification.
E231-yamanote 99.9974%
E233-chuo 0.0000%
E233-keihintohoku 0.0000%
E233-nanbu 0.0004%
E235-yamanote 0.0021%
A correct answer with no room for complaint.
E231-yamanote 0.0023%
E233-chuo 97.3950%
E233-keihintohoku 0.0101%
E233-nanbu 2.5918%
E235-yamanote 0.0009%
This is no problem at all.
E231-yamanote 2.0006%
E233-chuo 0.9536%
E233-keihintohoku 34.9607%
E233-nanbu 6.5641%
E235-yamanote 55.5209%
The probability of the Yamanote Line E235 series came out higher. The conditions are poor and the photo is a side view, so perhaps that can't be helped.
Incidentally, the reason this image is a side view is that I simply didn't have a photo of a Keihin-Tohoku Line train taken from the front... (sweat)
E231-yamanote 0.1619%
E233-chuo 7.9535%
E233-keihintohoku 0.0309%
E233-nanbu 91.7263%
E235-yamanote 0.1273%
It gave a high probability to the correct Nambu Line, but it seems to have hesitated a little toward the Chuo Line Rapid. The two do have almost the same body shape and differ only in color, but by that logic I would have expected confusion with the Keihin-Tohoku Line as well.
E231-yamanote 0.0204%
E233-chuo 0.0000%
E233-keihintohoku 0.0027%
E233-nanbu 0.0002%
E235-yamanote 99.9767%
This is no problem.
E231-yamanote 0.2417%
E233-chuo 0.0204%
E233-keihintohoku 2.1286%
E233-nanbu 0.0338%
E235-yamanote 97.5755%
Actually, the third class (Keihin-Tohoku Line) is the correct answer, but the model seems to have taken it for the new Yamanote Line car. Why...
E231-yamanote 47.2513%
E233-chuo 0.0898%
E233-keihintohoku 0.4680%
E233-nanbu 6.5922%
E235-yamanote 45.5986%
The second class (Chuo Line Rapid) is the correct answer, but for some reason the model pushes the Yamanote Line. Maybe it simply favors the E235 for side views? The Nambu Line probability is also a bit higher; perhaps it reacted to the yellow sign at the far right (I have no idea whether that's really the case).
I trained a model that identifies five types of rolling stock using roughly 500 images of railroad cars collected via Google Image Search. By reusing part of a pretrained model (VGG16), I got a model that identifies them reasonably well after about 30 minutes of training, even on a PC without a GPU. It makes mistakes on some patterns, but I think it did well given the amount of data and computing resources. It was surprisingly easy to build, and a lot of fun.
To do this seriously, you would have to collect image data from various angles, and I think you would also need to crop out just the vehicle. For faces you can crop right away with OpenCV and the like, but for vehicles you would probably have to start from object-detection annotation.
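For comparison, the face case mentioned above really is just a few lines with OpenCV's bundled Haar cascade (a rough sketch with a hypothetical file name); nothing comparably off-the-shelf exists for railcars, hence the need for object-detection annotation.

```python
import cv2

img = cv2.imread("face_sample.jpg")  # hypothetical input file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
for i, (x, y, w, h) in enumerate(cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)):
    cv2.imwrite("face_{0}.jpg".format(i), img[y:y + h, x:x + w])  # save each cropped face
```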