I achieved 99.78% accuracy on handwritten hiragana recognition with a textbook convolutional neural network (deep learning). The fact that it is strictly by-the-book (nothing original) should actually make it useful to readers, so I am writing it up on Qiita.
The source code is at https://github.com/yukoba/CnnJapaneseCharacter.
I was chatting with a friend on Facebook about handwritten hiragana recognition, and when I googled it, I found these two:
-"Since I touched Tensorflow for 2 months, I explained the convolutional neural network in an easy-to-understand manner with 95.04% of" handwritten hiragana "identification" http://qiita.com/tawago/items/931bea2ff6d56e32d693 --Stanford University Student Report "Recognizing Handwritten Japanese Characters Using Deep Convolutional Neural Networks" http://cs231n.stanford.edu/reports2016/262_Report.pdf
Both were written in March 2016, but the Stanford students' report came first.
The accuracies in the Stanford students' report were:

- Hiragana: 96.50%
- Katakana: 98.19%
- Kanji: 99.64%
As another friend analyzed, kanji gives the network more clues (more strokes), so it is expected to be easier. In this article, I will talk about raising hiragana, the lowest-accuracy category, to 99.78%.
For the data, everyone uses ETL8G from AIST's ETL Character Database, and I do too. If you would like to see examples of the handwritten characters, visit http://etlcdb.db.aist.go.jp/?page_id=2461.
Each sample is a 128x127 px, 4-bit grayscale image, with data from 160 writers. There are 72 hiragana classes.
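For reference, here is a sketch of how one ETL8G record can be parsed in Python. The field layout follows the format description on the ETL site and the reader snippets circulating for this dataset; treat the exact offsets as an assumption and check the official documentation.

```python
import struct
from PIL import Image

RECORD_SIZE = 8199  # bytes per ETL8G record

def read_record_etl8g(f):
    """Read one ETL8G record: metadata fields plus a 128x127, 4-bit image."""
    s = f.read(RECORD_SIZE)
    # 8128 bytes of image data = 128 * 127 pixels at 4 bits per pixel
    r = struct.unpack('>2H8sI4B4H2B30x8128s11x', s)
    img = Image.frombytes('F', (128, 127), r[14], 'bit', 4)
    return r[:-1], img.convert('L')  # (metadata, 8-bit grayscale image)
```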
For the basics of what a convolutional neural network (deep learning) is, see tawago's article at http://qiita.com/tawago/items/931bea2ff6d56e32d693. O'Reilly Japan's book "Deep Learning from Scratch: The theory and implementation of deep learning learned with Python" is also a good introduction (sorry, I have only skimmed it).
For the neural network library, this time I used:

- High-level: Keras https://keras.io/ja/
- Low-level backend: Theano http://deeplearning.net/software/theano/

I also wrote the code so that it works with TensorFlow as the backend.
The programming language is Python 3.
So, what did I change compared to tawago's 95.04% and the Stanford students' 96.50%? I am only doing the basics.
First of all, the Stanford students seem to have trained on a CPU, so the amount of computation was simply insufficient. I used a GPU on Amazon EC2 and increased the number of epochs (passes over the training data) from 40 to 400.
In machine learning, the data is split into training data and evaluation data. The model is trained on the training data; because stochastic gradient descent uses random numbers, the metrics jitter up and down as training proceeds. Separately from that, when the trained model is applied to the evaluation data, accuracy often improves up to a certain point and then starts getting worse. This is called overfitting.
Ideally, training should run for exactly as many epochs as it takes for overfitting to begin (early stopping). This time, overfitting seems to start at around 300 to 400 epochs (I have not verified this rigorously), so I set it to 400.
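For reference, Keras has an EarlyStopping callback that stops training automatically when the validation score stops improving. A minimal sketch (Keras 2 argument names; I simply fixed the epoch count at 400 instead of doing this), with `model` and the data arrays assumed to be defined:

```python
from keras.callbacks import EarlyStopping

# Stop when validation loss has not improved for 20 epochs
# (the patience value here is hypothetical).
early_stop = EarlyStopping(monitor='val_loss', patience=20)

# `model`, `X_train`, `y_train`, `X_test`, `y_test` are assumed to exist.
model.fit(X_train, y_train,
          batch_size=16,
          epochs=400,  # older Keras versions call this `nb_epoch`
          validation_data=(X_test, y_test),
          callbacks=[early_stop])
```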
The split between training data and evaluation data is 8:2. This follows the Stanford students.
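For example, with scikit-learn (my choice for illustration; the repo may split differently), an 8:2 split looks like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays with ETL8G-like shapes, just for illustration:
# 160 writers x 72 hiragana classes, each image flattened from 32x32.
X = np.random.rand(160 * 72, 32 * 32).astype('float32')
y = np.repeat(np.arange(72), 160)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```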
The model is what is commonly called "VGG style". It was introduced by researchers at the University of Oxford in September 2014 in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". A Keras sample based on it is https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py, and I modified that. The Stanford students also used a VGG-style model. tawago's architecture is unclear; perhaps it is VGG style too?
VGG style means a plain feed-forward neural network that repeats "convolution → convolution → max pooling".
Here is a brief explanation of convolution. For details, see the book "Deep Learning from Scratch."

- Convolution: for each point, take its neighborhood (e.g., 3x3), flatten it into a one-dimensional vector, and take the inner product with the parameters.
- Max pooling: for each point, take its neighborhood (e.g., 2x2) and output the maximum value within it.
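To make these two operations concrete, here is a naive NumPy sketch (for illustration only; Keras of course ships optimized implementations):

```python
import numpy as np

def conv2d(img, kernel):
    """Naive 2D convolution ("valid" padding): at each point, flatten the
    neighborhood and take the inner product with the kernel parameters."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.dot(img[i:i + kh, j:j + kw].ravel(), kernel.ravel())
    return out

def max_pool2d(img, size=2):
    """Naive max pooling: the maximum over each non-overlapping size x size block."""
    h, w = img.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = img[i * size:(i + 1) * size,
                            j * size:(j + 1) * size].max()
    return out

# Example: a 32x32 image, a 3x3 kernel, then 2x2 pooling.
img = np.random.rand(32, 32)
feat = conv2d(img, np.random.rand(3, 3))   # -> 30x30
pooled = max_pool2d(feat)                  # -> 15x15
```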
ETL8G contains data from only 160 people, so it is not a large dataset. In general, when data is scarce, complex models with many parameters do not work well, so I used the simple Keras sample as-is: "convolution → convolution → max pooling" repeated twice. A minimal sketch of the model follows below.
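Here is a sketch of that architecture (Keras 2 API, assuming channels-last input; the layer widths follow the cifar10_cnn.py sample, and the actual repo may differ in details):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

n_classes = 72  # number of hiragana classes in ETL8G

model = Sequential()
# Block 1: convolution -> convolution -> max pooling
model.add(Conv2D(32, (3, 3), padding='same', activation='relu',
                 input_shape=(32, 32, 1)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))

# Block 2: convolution -> convolution -> max pooling
model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))

# Classifier head
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])
```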
One way to improve generalization is to add noise during training only. These are the two techniques I used this time (dropout already appears in the model sketch above; a sketch of the image augmentation follows below):
- Dropout (with probability p, multiply the value by 1/p; with probability 1-p, zero it out)
- Rotating (±15 degrees) and zooming (0.8 to 1.2 times) the training images
The Stanford students also used dropout. I used p = 0.5 everywhere, i.e., a 50% chance of doubling the value and a 50% chance of zeroing it.
Rotation and zoom augmentation appeared in the Keras sample code, and I used it as well: it teaches the network that a character image is still the same character even when rotated or zoomed. This is also very effective. The Stanford students did not use it.
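In Keras this is done with ImageDataGenerator, as in the cifar10_cnn.py sample. A sketch with the ranges from this article (Keras 2 names; the exact arguments in my repo may differ slightly):

```python
from keras.preprocessing.image import ImageDataGenerator

# Randomly rotate by up to +/- 15 degrees and zoom between 0.8x and 1.2x,
# applied on the fly to the training images only.
datagen = ImageDataGenerator(rotation_range=15, zoom_range=[0.8, 1.2])

# `model`, `X_train`, etc. assumed as before.
model.fit_generator(datagen.flow(X_train, y_train, batch_size=16),
                    steps_per_epoch=len(X_train) // 16,
                    epochs=400,
                    validation_data=(X_test, y_test))
```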
- I reduced the images to 32x32. This is sufficient, and larger images just increase the amount of computation. The Stanford students used 64x64. Also, given the balance with the number of convolution layers, more pixels does not necessarily lead to better results.
- With Keras's defaults the initial weights did not work for this task and learning did not progress, so I initialized the weights from a normal distribution with standard deviation 0.1.
- For stochastic gradient descent, an aggressive optimizer like Adam did not work well, so at first I used the gentler AdaGrad; but AdaDelta, a variant of AdaGrad that does not require setting a learning rate, was better still, so that is what I used.
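As a sketch of those last two points (Keras 2 names; whether this matches the repo exactly is not guaranteed):

```python
from keras import initializers, optimizers
from keras.layers import Dense

# Weights drawn from a normal distribution with standard deviation 0.1,
# instead of the Keras default initializer.
init = initializers.RandomNormal(mean=0.0, stddev=0.1)
layer = Dense(512, activation='relu', kernel_initializer=init)

# AdaDelta: a variant of AdaGrad that needs no learning-rate setting.
opt = optimizers.Adadelta()
# model.compile(loss='categorical_crossentropy', optimizer=opt,
#               metrics=['accuracy'])
```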
So, just by doing things by the book, I got 99.78%. For handwritten digits, the best result reported on the MNIST dataset is 99.77% (http://yann.lecun.com/exdb/mnist/), so this is about the same level. I have not tried kanji or the other categories, but the Stanford students report 99.64% for kanji, so this result is slightly better than that.
With deep learning, handwritten characters can be recognized almost perfectly!