You can download the EfficientDet model trained on a dataset created with the method described in this article from the GitHub repository (https://github.com/kosuke1701/pretrained-efficientdet-character-head).
We have also prepared a demo where you can try the trained model on Google Colaboratory, so feel free to play with it.
I wanted to detect faces in illustrations in order to improve the accuracy of the character illustration automatic classification tool I developed recently.
For face detection in anime, articles on an approach using OpenCV and an approach using YOLO have already been published. However, the accuracy of the OpenCV model seems limited, and the YOLO article does not appear to publish its trained model (my apologies if it does).
So this time I decided to train an object detection model myself. There was no existing dataset, however, so I started by annotating the data myself.
Since object detection is a relatively major research field, many annotation tools are publicly available.
This time I annotated with LabelImg, which I chose after consulting a comparison article of these tools.
Preparing for annotation is very easy: after setting up the required environment, you just create the class definition file `classes.txt` and put the images to be annotated into a single directory (for example, `image_dir`).
The class definition file simply lists one class name per line. In this case we only want to detect "face", so it looks like this:
classes.txt
face
To start the annotation tool, do the following.
python labelImg.py image_dir classes.txt
The tool's GUI is extremely intuitive: press the `w` key to start creating a Bounding Box, then click the two points corresponding to diagonally opposite corners.
This tool has a variety of useful shortcuts, but the ones I used most often were:
| Key | Function |
|---|---|
| w | Start creating a new Bounding Box. |
| d | Move to the next image. |
| a | Move to the previous image. |
| Ctrl+S | Save the annotation result. |
Annotation results can be saved in two formats, Pascal VOC and YOLO (the default is Pascal VOC).
For ease of management, I saved the annotation XML files in the same directory as the images. Thanks to that, when I closed the tool and resumed annotating later, the previous results were loaded automatically.
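For reference, each XML file that LabelImg writes in Pascal VOC format records the image file name, the image size, and one `object` entry per box with the class name and its `xmin`/`ymin`/`xmax`/`ymax` pixel coordinates. A minimal sketch of reading one such file (the path here is a placeholder):

```python
# Minimal sketch: read the boxes from one LabelImg (Pascal VOC) XML file.
import xml.etree.ElementTree as ET

root = ET.parse('image_dir/sample.xml').getroot()  # placeholder path
print(root.findtext('filename'),
      root.findtext('size/width'), root.findtext('size/height'))
for obj in root.iter('object'):
    name = obj.findtext('name')                    # class name, e.g. 'face'
    box = obj.find('bndbox')
    xmin, ymin, xmax, ymax = (int(float(box.findtext(tag)))
                              for tag in ('xmin', 'ymin', 'xmax', 'ymax'))
    print(name, xmin, ymin, xmax, ymax)
```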
In recent object detection research, the COCO 2017 dataset is often used as a benchmark, and many public implementations of papers take this format as input.
So this time I wrote preprocessing code that converts LabelImg annotation results to COCO format. The preprocessing code is available on GitHub.
When writing this preprocessing code, I referred to this article for the specifics of the format; if you are interested in the details, please have a look there.
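For reference, the COCO detection format is a single JSON file with three lists: `images`, `annotations` (boxes as `[x, y, width, height]` in pixels, with 1-indexed `category_id`), and `categories`. A minimal hand-written example for one image with one face (all values are placeholders):

```python
# Minimal sketch of the COCO detection annotation structure.
import json

coco = {
    'images': [
        {'id': 1, 'file_name': '000001.png', 'width': 800, 'height': 600},
    ],
    'annotations': [
        {'id': 1, 'image_id': 1, 'category_id': 1,
         'bbox': [120, 80, 200, 220],   # [x, y, width, height] in pixels
         'area': 200 * 220, 'iscrowd': 0},
    ],
    'categories': [
        {'id': 1, 'name': 'face'},      # category_id is 1-indexed
    ],
}

with open('instances_train.json', 'w') as f:
    json.dump(coco, f)
```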
The file structure of the final dataset looks like this (replace `project_name` with any name you like):
project_name
+- train
| +- <Training image>
+- val
| +- <Image for evaluation>
+- annotations
+- instances_train.json
+- instances_val.json
Execute the preprocessing code with the following commands. Here it is assumed that the XML files produced by LabelImg are saved under directories such as <DIR1> (you can specify multiple directories separated by spaces).
#Training data
python convert_to_coco.py \
--image-root-dir project_name/train \
--annotation-fn project_name/annotations/instances_train.json \
--copy-images --same-dir \
--annotation-dir <DIR1> <DIR2> ...
#Evaluation data (just change train to val)
python convert_to_coco.py \
--image-root-dir project_name/val \
--annotation-fn project_name/annotations/instances_val.json \
--copy-images --same-dir \
--annotation-dir <DIR1'> <DIR2'> ...
These commands assume that the images and their annotation XML files were prepared as described above.
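To sanity-check the converted annotations, you can load them with pycocotools (which the EfficientDet repository used below depends on anyway). A quick sketch:

```python
# Quick sanity check of the converted COCO annotations.
from pycocotools.coco import COCO

coco = COCO('project_name/annotations/instances_train.json')
print(len(coco.getImgIds()), 'images')
print(len(coco.getAnnIds()), 'annotations')
print(coco.loadCats(coco.getCatIds()))  # should list the 'face' category
```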
Several repositories publish implementations of EfficientDet. After some investigation, zylo117/Yet-Another-EfficientDet-Pytorch seemed the easiest to use, so this time I used it to train the model.
To train on your own dataset you need to, in addition to setting up the environment, (1) download the pre-trained parameters and (2) create a definition file for your dataset.
For the pre-trained EfficientDet parameters, download `efficientdet-d*.pth` from the GitHub repository.
d0 through d7 (probably) correspond to the model-size settings described in the paper: the smaller the number, the lighter the model, but also the lower its accuracy.
This time I downloaded `efficientdet-d0.pth` for the time being.
To train on your own dataset, you apparently need to create a yml settings file under the `projects` directory.
The items that need to be changed from the COCO dataset settings are the dataset name `project_name`, the list of class names to detect `obj_list`, and the number of GPUs to use `num_gpus`.
As for `anchors_scales` and `anchors_ratios`, I have not yet studied the detailed algorithms of object detection and EfficientDet, so I left them at the COCO dataset values this time.
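For reference, my rough understanding is that each anchor box's width and height come from multiplying a base size (the feature-map stride times a global anchor scale, 4.0 by default in this repository as far as I can tell) by one of the scales and by the ratio pair. A small sketch of that idea, which may differ from the repository's exact implementation:

```python
# Rough sketch of how anchors_scales and anchors_ratios expand into box sizes.
from itertools import product

anchor_scale = 4.0                      # global scale (repository default, as I understand it)
strides = [8, 16, 32, 64, 128]          # feature pyramid strides
scales = [2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)]
ratios = [(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]

for stride in strides:
    for scale, (rw, rh) in product(scales, ratios):
        base = anchor_scale * stride * scale
        print(f'stride={stride:3d} scale={scale:.2f} '
              f'anchor w x h = {base * rw:6.1f} x {base * rh:6.1f}')
```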
In my case, the definition file looks like this:
projects/coco_pixiv.yml
project_name: coco_pixiv # also the folder name of the dataset that under data_path folder
train_set: train
val_set: val
num_gpus: 1
# mean and std in RGB order, actually this part should remain unchanged as long as your dataset is similar to coco.
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
# this is coco anchors, change it if necessary
anchors_scales: '[2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)]'
anchors_ratios: '[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]'
# must match your dataset's category_id.
# category_id is one_indexed
obj_list: ['face']
To run the training, execute the following. The hyperparameters have no particular justification, so treat them as a reference only.
python train.py -c 0 -p project_name \
--batch_size 8 --lr 1e-5 --num_epochs 100 \
--load_weights efficientdet-d0.pth \
--data_path <directory that contains the project_name dataset created by the preprocessing code>
The `-c` option specifies the number corresponding to the d0 to d7 mentioned above. This number must therefore match the d* of the pre-trained parameters specified with `--load_weights`.
With a GTX 1070, training on about 900 images for 100 epochs took roughly one hour.
Run `efficientdet_test.py` to apply the trained model to an image and check how well it performs. However, the input image name and other settings are hard-coded in this script, so I modified it as shown below. With the modified script,
python efficientdet_test.py <number corresponding to d*> <model .pth file> <image file>
visualizes the actual detection results. **Press any key to close the displayed image window; closing it with the x button may cause it to freeze.**
efficientdet_test.py
# Author: Zylo117
"""
Simple Inference Script of EfficientDet-Pytorch
"""
import sys
import time
import torch
from torch.backends import cudnn
from matplotlib import colors
from backbone import EfficientDetBackbone
import cv2
import numpy as np
from efficientdet.utils import BBoxTransform, ClipBoxes
from utils.utils import preprocess, invert_affine, postprocess, STANDARD_COLORS, standard_to_bgr, get_index_label, plot_one_box
compound_coef = int(sys.argv[1])
force_input_size = None # set None to use default size
img_path = sys.argv[3]
# replace this part with your project's anchor config
anchor_ratios = [(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]
anchor_scales = [2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)]
threshold = 0.2
iou_threshold = 0.2
use_cuda = True
use_float16 = False
cudnn.fastest = True
cudnn.benchmark = True
obj_list = ['face']
color_list = standard_to_bgr(STANDARD_COLORS)
# tf bilinear interpolation is different from any other's, just make do
input_sizes = [512, 640, 768, 896, 1024, 1280, 1280, 1536]
input_size = input_sizes[compound_coef] if force_input_size is None else force_input_size
ori_imgs, framed_imgs, framed_metas = preprocess(img_path, max_size=input_size)
if use_cuda:
    x = torch.stack([torch.from_numpy(fi).cuda() for fi in framed_imgs], 0)
else:
    x = torch.stack([torch.from_numpy(fi) for fi in framed_imgs], 0)

x = x.to(torch.float32 if not use_float16 else torch.float16).permute(0, 3, 1, 2)
model = EfficientDetBackbone(compound_coef=compound_coef, num_classes=len(obj_list),
                             ratios=anchor_ratios, scales=anchor_scales)
model.load_state_dict(torch.load(sys.argv[2]))
model.requires_grad_(False)
model.eval()
if use_cuda:
    model = model.cuda()
if use_float16:
    model = model.half()
with torch.no_grad():
    features, regression, classification, anchors = model(x)

    regressBoxes = BBoxTransform()
    clipBoxes = ClipBoxes()

    out = postprocess(x,
                      anchors, regression, classification,
                      regressBoxes, clipBoxes,
                      threshold, iou_threshold)
def display(preds, imgs, imshow=True, imwrite=False):
    for i in range(len(imgs)):
        if len(preds[i]['rois']) == 0:
            continue

        for j in range(len(preds[i]['rois'])):
            # np.int was removed in recent NumPy versions, so plain int is used here
            x1, y1, x2, y2 = preds[i]['rois'][j].astype(int)
            obj = obj_list[preds[i]['class_ids'][j]]
            score = float(preds[i]['scores'][j])
            plot_one_box(imgs[i], [x1, y1, x2, y2], label=obj, score=score,
                         color=color_list[get_index_label(obj, obj_list)])

        if imshow:
            cv2.imshow('img', imgs[i])
            cv2.waitKey(0)

        if imwrite:
            # note: the 'test/' directory must already exist, otherwise cv2.imwrite fails silently
            cv2.imwrite(f'test/img_inferred_d{compound_coef}_this_repo_{i}.jpg', imgs[i])
out = invert_affine(framed_metas, out)
display(out, ori_imgs, imshow=True, imwrite=True)
print('running speed test...')
with torch.no_grad():
    print('test1: model inferring and postprocessing')
    print('inferring image for 10 times...')
    t1 = time.time()
    for _ in range(10):
        _, regression, classification, anchors = model(x)

        out = postprocess(x,
                          anchors, regression, classification,
                          regressBoxes, clipBoxes,
                          threshold, iou_threshold)
        out = invert_affine(framed_metas, out)

    t2 = time.time()
    tact_time = (t2 - t1) / 10
    print(f'{tact_time} seconds, {1 / tact_time} FPS, @batch_size 1')
This time I went through everything from annotating an object detection dataset to training a model. I have only annotated about 1,000 images so far, but judging by the performance of the resulting model, it seems it can reach a more practical level than I expected. The training itself was also unexpectedly quick at about an hour (although I have not verified that it converged properly), which was a relief.
In the future, I would like to increase the number of annotations to improve the accuracy of the illustration classification tool, which is the original purpose.