Notes on doing Japanese OCR with Python

This note summarizes the steps I took to do Japanese OCR from Python using the free Tesseract OCR engine.

Environment

Ubuntu 14.04
Python 2.7

Installation

Install tesseract.

Installation policy

There are two ways to install it:

Install with apt-get
Build and install from source

The version that apt-get installs is 3.03. To handle Japanese with tesseract, data trained on Japanese (jpn.traineddata) is required. You have to download this yourself, but the only one I could find online is for version 3.04. When I tried to use this data with 3.03, it did not work and I got this error:

read_params_file: parameter not found: allow_blob_division

You can also edit the traineddata so that it works with 3.03, as this person did, but the `combine_tessdata` command needed for that is not included in the apt-get installation. So, for now, if you want to handle Japanese you probably have to install from source.

Basically, install tesseract 3.04 by following the official compilation page.

https://github.com/tesseract-ocr/tesseract/wiki/Compiling

Dependency installation

$ sudo apt-get install autoconf automake libtool
$ sudo apt-get install libpng12-dev
$ sudo apt-get install libjpeg62-dev
$ sudo apt-get install libtiff4-dev
$ sudo apt-get install zlib1g-dev
$ sudo apt-get install libicu-dev      # (if you plan to make the training tools)
$ sudo apt-get install libpango1.0-dev # (if you plan to make the training tools)
$ sudo apt-get install libcairo2-dev   # (if you plan to make the training tools)

Leptonica installation

You need an image library called Leptonica. Download the source from the download page (http://www.leptonica.org/download.html) and extract it. tesseract 3.04 requires Leptonica 1.71 or later, so install the latest version, 1.73.

# Extract
$ gzip -dc leptonica-1.73.tar.gz | tar xvf -
$ cd leptonica-1.73

# Build and install
$ ./configure
$ make
$ sudo make install

tesseract installation

Basically, follow the official instructions.

Get the 3.04 source from here

# Unzip and move into the directory
$ unzip 3.04.zip
$ cd tesseract-3.04

# Add /usr/local/lib to the library search path
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

# Build and install
$ ./autogen.sh
$ ./configure
$ sudo make          # sudo was needed only here; without it Leptonica was not found
$ sudo make install
$ sudo ldconfig
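Since the build depends on the linker finding Leptonica, it can help to confirm that both shared libraries are visible afterwards. The following is only a rough sketch using the standard library's `ctypes.util.find_library`; the helper name `locate_libraries` is made up for this note, and the short library names `lept` and `tesseract` are assumed from the usual `liblept.so` and `libtesseract.so` file names.

```python
from ctypes.util import find_library

# Short names handed to the linker lookup; "lept" is assumed to be
# Leptonica's library name (liblept.so), "tesseract" is libtesseract.so.
LIBRARIES = ("lept", "tesseract")


def locate_libraries(names=LIBRARIES):
    """Map each library name to what find_library reports, or None."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    for name, path in sorted(locate_libraries().items()):
        print("{0}: {1}".format(name, path or "NOT FOUND"))
```

A `NOT FOUND` entry suggests that the library path setting or `sudo ldconfig` is worth checking again.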

Acquisition and setting of Japanese files

Download the Japanese data file, jpn.traineddata, from the language data sets and place it in this directory:

/usr/local/share/tessdata/

Then set the path for this data:

export TESSDATA_PREFIX="/usr/local/share/tessdata/tessdata/"
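A mismatch between this variable and the actual location of the data file is an easy mistake to make, so here is a small sketch that looks for the file before running OCR. The helper `find_traineddata` is hypothetical, the file is assumed to be named `jpn.traineddata`, and checking both the directory itself and a `tessdata` subdirectory is my own assumption to cover either way of setting the variable.

```python
import os


def find_traineddata(prefix, lang="jpn"):
    """Return the path of <lang>.traineddata under prefix, or None."""
    filename = lang + ".traineddata"
    # Assumption: look in the directory itself and in a tessdata subdirectory.
    candidates = [
        os.path.join(prefix, filename),
        os.path.join(prefix, "tessdata", filename),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX", "/usr/local/share/tessdata/")
    found = find_traineddata(prefix)
    print(found or "jpn.traineddata not found under " + prefix)
```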

Operation check

If the installation succeeded, you can run OCR from the command line. I will try Japanese OCR on this image.

tesseract ocr_test.png out -l jpn

This writes the result to a file called out.txt.

`out.txt`


Smile is the best!Reni Takagi

The small "ya" is recognized as the large "ya", but otherwise the text is recognized well. Perhaps this is difficult because languages like English have no concept of small characters of this kind?
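The same command-line check can be driven from Python before bringing in any wrapper library. This is a minimal sketch that simply shells out to the `tesseract` command shown above; the function names `build_command`, `run_ocr` and `read_result` are made up for illustration, and reading the output as UTF-8 is an assumption.

```python
import io
import subprocess


def build_command(image_path, output_base, lang="jpn"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base="out", lang="jpn"):
    """Run tesseract and return the path of the text file it writes."""
    subprocess.check_call(build_command(image_path, output_base, lang))
    return output_base + ".txt"


def read_result(path):
    """Read the recognized text, assuming the file is UTF-8 encoded."""
    with io.open(path, encoding="utf-8") as f:
        return f.read()
```

Calling `read_result(run_ocr("ocr_test.png"))` would then correspond to the command above followed by opening out.txt.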

Introduction of pyocr

We use a wrapper library called pyocr to use tesseract from Python.

To install it:

$ pip install pyocr

That's it.

However, pyocr does not support tesseract installed from source: running the following error.py as a test fails.

`error.py`


import pyocr
tools = pyocr.get_available_tools()

Traceback (most recent call last):
  File "error.py", line 12, in <module>
    tools = pyocr.get_available_tools()
  File "/usr/local/lib/python2.7/site-packages/pyocr/pyocr.py", line 74, in get_available_tools
    if tool.is_available():
  File "/usr/local/lib/python2.7/site-packages/pyocr/libtesseract/__init__.py", line 152, in is_available
    version = get_version()
  File "/usr/local/lib/python2.7/site-packages/pyocr/libtesseract/__init__.py", line 179, in get_version
    upd = int(version[2])
ValueError: invalid literal for int() with base 10: '02dev'

Reading the error, pyocr fails while trying to convert the string "02dev" to an int. The version installed from source is tesseract 3.04.02dev, and pyocr does not seem to expect a dev version. So I will change its source.

If you are using virtualenv, adjust the path of the file to change accordingly.

`/usr/local/lib/python2.7/site-packages/pyocr/libtesseract/__init__.py`


    if len(version) >= 3:
        upd = int(version[2].replace('dev', ''))
        # upd = int(version[2])

This should work.
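The one-line patch only strips the literal suffix "dev". As an illustration of a slightly more tolerant approach, the sketch below keeps just the leading digits of each version component. `parse_version` is a hypothetical helper written for this note, not a function that pyocr provides.

```python
import re


def parse_version(version_string):
    """Turn a string such as '3.04.02dev' into a tuple of ints."""
    parts = []
    for component in version_string.strip().split("."):
        match = re.match(r"\d+", component)
        if match is None:
            break  # stop at the first component with no leading digits
        parts.append(int(match.group(0)))
    return tuple(parts)


print(parse_version("3.04.02dev"))  # (3, 4, 2)
print(parse_version("3.03"))        # (3, 3)
```

Tuples compare element by element, so a check such as `parse_version(v) >= (3, 4)` also works for deciding whether the 3.04 data can be used.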

Try OCR

pyocr offers several kinds of OCR output, so I will try them, using this image.

Text

The simplest OCR. It reads the characters in the image and returns them as a string.

import sys
import pyocr
import pyocr.builders
import argparse
from PIL import Image

parser = argparse.ArgumentParser(description='tesseract ocr test')
parser.add_argument('image', help='image path')
args = parser.parse_args()

tools = pyocr.get_available_tools()

if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]

res = tool.image_to_string(Image.open(args.image),
                           lang="jpn",
                           builder=pyocr.builders.TextBuilder(tesseract_layout=6))

print(res)

Result

Lord of the machine training
Tess Wota
Next to the door screaming
rope
Ship Day^~Genus~Customary betting history ba "Mae

The result is terrible, probably because of difficult words.

WordBox

This returns the bounding box of each word. Let's visualize the result with OpenCV. (See here for installing OpenCV.)

import sys
import pyocr
import pyocr.builders
import argparse
import cv2
from PIL import Image

parser = argparse.ArgumentParser(description='tesseract ocr test')
parser.add_argument('image', help='image path')
args = parser.parse_args()


tools = pyocr.get_available_tools()

if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]


res = tool.image_to_string(Image.open(args.image),
                           lang="jpn",
                           builder=pyocr.builders.WordBoxBuilder(tesseract_layout=6))

# Draw each recognized word's bounding box on the image
out = cv2.imread(args.image)
for d in res:
    print(d.content)
    print(d.position)
    cv2.rectangle(out, d.position[0], d.position[1], (0, 0, 255), 2)

cv2.imshow('image',out)
cv2.waitKey(0)
cv2.destroyAllWindows()


Lord of the machine training
((226, 12), (412, 37))
Tess
((255, 138), (278, 148))
Wota
((283, 137), (326, 148))
door
((397, 149), (406, 159))
Screaming next door
((411, 149), (430, 159))
Historical training
((477, 148), (523, 159))
rope
((165, 170), (199, 181))
Ship Day
((115, 202), (156, 212))
^~Genus~
((210, 196), (247, 220))
Customary betting history
((297, 202), (343, 213))
Ba "Mae
((390, 203), (438, 212))

The detected regions are decent, but the recognized words are still terrible.

LineBox

WordBox worked word by word, whereas LineBox seems to group the words on the same line.

Only one part of the WordBox source changes: replace WordBoxBuilder with LineBoxBuilder.

res = tool.image_to_string(Image.open(args.image),
                           lang="jpn",
                           builder=pyocr.builders.LineBoxBuilder(tesseract_layout=6))

Result

Lord of the machine training
((226, 12), (412, 37))
Tess Wota
((255, 137), (326, 148))
Next to the door screaming
((397, 148), (523, 159))
rope
((165, 170), (199, 181))
Ship Day^~Genus~Customary betting history ba "Mae
((115, 196), (438, 220))

For this image there is no real need to group words by line, but it seems useful for text that spans multiple lines.
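Comparing the two outputs, each LineBox rectangle appears to be the smallest rectangle that encloses the word boxes on that line. The sketch below checks this relationship with coordinates copied from the WordBox output; `union_box` is a helper written for this note and says nothing about how tesseract actually groups words internally.

```python
def union_box(boxes):
    """Smallest ((x1, y1), (x2, y2)) rectangle enclosing every box given."""
    lefts = [top_left[0] for top_left, _ in boxes]
    tops = [top_left[1] for top_left, _ in boxes]
    rights = [bottom_right[0] for _, bottom_right in boxes]
    bottoms = [bottom_right[1] for _, bottom_right in boxes]
    return ((min(lefts), min(tops)), (max(rights), max(bottoms)))


# Word boxes copied from the WordBox output above.
second_line = [((255, 138), (278, 148)), ((283, 137), (326, 148))]
third_line = [((397, 149), (406, 159)), ((411, 149), (430, 159)),
              ((477, 148), (523, 159))]

print(union_box(second_line))  # ((255, 137), (326, 148))
print(union_box(third_line))   # ((397, 148), (523, 159))
```

Both printed rectangles match the positions reported by LineBoxBuilder for the same lines.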

About tesseract_layout

Each builder above was given tesseract_layout=6. This number seems to select how tesseract segments the page before recognition.

This person has summarized the values: http://tanaken-log.blogspot.jp/2012/08/imagemagick-tesseract.html

pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
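When trying several of these values, bare numbers are easy to mix up. One possible approach, shown here only as a sketch, is to give the commonly used modes readable names and validate the value before passing it to a builder. All names below are hypothetical and not part of pyocr or tesseract; the numbers come from the list above.

```python
# Hypothetical names for some of the page segmentation modes listed above.
AUTO = 3
SINGLE_COLUMN = 4
SINGLE_BLOCK = 6
SINGLE_LINE = 7
SINGLE_WORD = 8
SINGLE_CHAR = 10

VALID_MODES = range(0, 11)  # the list above defines modes 0 through 10


def check_layout(mode):
    """Return mode unchanged if it is one of the values 0-10, else raise."""
    if mode not in VALID_MODES:
        raise ValueError("tesseract_layout must be 0-10, got {0!r}".format(mode))
    return mode


# Intended use (requires pyocr, so it is left as a comment here):
#   builder = pyocr.builders.TextBuilder(
#       tesseract_layout=check_layout(SINGLE_LINE))
```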

About training data

As you can see, accuracy is poor with the existing Japanese data. Creating your own training data should give better results.

http://hadashi-gensan.hatenablog.com/entry/2014/01/15/135316

Bonus Google Cloud Vision

If you use TEXT_DETECTION of the Google Cloud Vision API, the result looks like this.


machine
Learning
of
flow
test
data
Preliminary
Measurement
vessel
Learning
result
Before
processing
teacher
data
Raw
data
machine
Learning
Parameters
L
Reason
A

As expected, the accuracy is good. If you do not need to make many requests and just want an easy solution, the Vision API is the better choice.
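For reference, a text detection request to the Vision API is a JSON body containing the base64-encoded image and the feature type TEXT_DETECTION. The sketch below only builds that body; the function name `build_request` is made up for this note, and sending it (including the API key or OAuth token the service requires) is deliberately left out.

```python
import base64
import json

# REST endpoint of the Vision API; credentials are needed to call it.
ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_request(image_bytes):
    """Build the JSON-serializable body for one TEXT_DETECTION request."""
    content = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [{
            "image": {"content": content},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of a real image file.
    body = build_request(b"placeholder image bytes")
    print(json.dumps(body, indent=2))
```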