This post summarizes the steps I took to do Japanese OCR from Python using the free Tesseract OCR engine.
Install tesseract.
How to install
There are two ways: apt-get or building from source. The version that a single apt-get command installs is 3.03. To handle Japanese, Tesseract needs data trained on Japanese (jpn.traineddata). You have to download this yourself, but the only one I could find online was for version 3.04. When I tried to use that data with 3.03, it did not work and I got this error:
read_params_file: parameter not found: allow_blob_division
You can also edit the traineddata to make it work with 3.03, as this person did, but the `combine_tessdata` command needed for that is not available in apt-get installations. So for now, if you want to do Japanese OCR, you probably have to install from source.
Basically, install Tesseract 3.04 by following the official compilation page.
https://github.com/tesseract-ocr/tesseract/wiki/Compiling
$ sudo apt-get install autoconf automake libtool
$ sudo apt-get install libpng12-dev
$ sudo apt-get install libjpeg62-dev
$ sudo apt-get install libtiff4-dev
$ sudo apt-get install zlib1g-dev
$ sudo apt-get install libicu-dev # (if you plan to make the training tools)
$ sudo apt-get install libpango1.0-dev # (if you plan to make the training tools)
$ sudo apt-get install libcairo2-dev # (if you plan to make the training tools)
You also need an image library called Leptonica. Download the source from the download page (http://www.leptonica.org/download.html) and extract it. Tesseract 3.04 requires at least Leptonica 1.71, so install the latest version, 1.73.
# Extract the archive
$ gzip -dc leptonica-1.73.tar.gz | tar xvf -
$ cd leptonica-1.73
# Configure, build, and install
$ ./configure
$ make
$ sudo make install
Basically, follow this.
Get the 3.04 source from here.
# Unzip and change into the directory
$ unzip 3.04.zip
$ cd tesseract-3.04
# Add /usr/local/lib to the library search path
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
#Installation
$ ./autogen.sh
$ ./configure
$ sudo make # I used sudo for this step only; without it, Leptonica could not be found.
$ sudo make install
$ sudo ldconfig
Download the Japanese jpn.traineddata from the language data set here and place it in this directory:
/usr/local/share/tessdata/
Then set TESSDATA_PREFIX. In Tesseract 3.x this variable points to the parent directory of tessdata:
export TESSDATA_PREFIX="/usr/local/share/"
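Before running OCR, it can help to confirm from Python that the data file is where Tesseract will look. The sketch below assumes the Tesseract 3.x convention that TESSDATA_PREFIX names the parent of the tessdata directory. The helper names `traineddata_path` and `has_traineddata` are hypothetical and not part of any library.

```python
import os


def traineddata_path(lang, prefix=None):
    """Return the path where Tesseract 3.x is assumed to look for <lang>.traineddata.

    Assumption: TESSDATA_PREFIX names the parent of the "tessdata" directory.
    """
    if prefix is None:
        prefix = os.environ.get("TESSDATA_PREFIX", "/usr/local/share/")
    return os.path.join(prefix, "tessdata", lang + ".traineddata")


def has_traineddata(lang, prefix=None):
    """Return True if the trained data file for `lang` exists at the expected path."""
    return os.path.isfile(traineddata_path(lang, prefix))
```

Calling `has_traineddata("jpn")` after the export should return True if the file was placed correctly.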
If the installation succeeded, you can run OCR from the command line. I will try Japanese OCR on this image.
tesseract ocr_test.png out -l jpn
This writes the result to a file called out.txt.
out.txt
Smile is the best!Reni Takagi
The small "ya" is recognized as the full-size "ya", but the text is mostly recognized correctly. Perhaps this is hard because the notion of lowercase letters mostly exists only in English?
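The same command can be launched from Python with the standard library, which is handy before bringing in a wrapper. This is a minimal sketch that assumes the `tesseract` binary is on the PATH; the function names are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="jpn"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path, output_base, lang="jpn"):
    """Run the tesseract CLI, which writes the text to <output_base>.txt."""
    subprocess.check_call(build_tesseract_command(image_path, output_base, lang))
    return output_base + ".txt"
```

For example, `run_tesseract("ocr_test.png", "out")` would run the command shown above and return the name `out.txt`.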
To use Tesseract from Python, I use a wrapper library called pyocr.
To install it:
$ pip install pyocr
That's it.
However, pyocr does not handle a Tesseract build installed from source, and the following error.py, which I ran as a test, fails.
error.py
import pyocr
tools = pyocr.get_available_tools()
Traceback (most recent call last):
  File "error.py", line 12, in <module>
    tools = pyocr.get_available_tools()
  File "/usr/local/lib/python2.7/site-packages/pyocr/pyocr.py", line 74, in get_available_tools
    if tool.is_available():
  File "/usr/local/lib/python2.7/site-packages/pyocr/libtesseract/__init__.py", line 152, in is_available
    version = get_version()
  File "/usr/local/lib/python2.7/site-packages/pyocr/libtesseract/__init__.py", line 179, in get_version
    upd = int(version[2])
ValueError: invalid literal for int() with base 10: '02dev'
Reading the error, pyocr fails while trying to convert the string "02dev" to an int. The version installed from source is tesseract 3.04.02dev, and pyocr does not seem to expect a dev build. So I will patch this source file.
If you are using virtualenv, adjust the path of the file to patch accordingly.
/usr/local/lib/python2.7/site-packages/pyocr/libtesseract/__init__.py
if len(version) >= 3:
    upd = int(version[2].replace('dev', ''))
    # upd = int(version[2])
This should work.
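The patch above only strips the literal suffix "dev". A slightly more tolerant way to read such version strings is to keep only the leading digits of each part. The following is a standalone sketch of that idea, not pyocr's actual code; `parse_version` is a hypothetical helper.

```python
import re


def parse_version(version_string):
    """Parse a version like "3.04.02dev" into a tuple of ints.

    Each dot-separated part contributes its leading digits; any suffix
    such as "dev" is ignored. Parts without leading digits count as 0.
    """
    parts = []
    for part in version_string.strip().split("."):
        match = re.match(r"\d+", part)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)
```

With this, `parse_version("3.04.02dev")` gives `(3, 4, 2)` instead of raising ValueError.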
pyocr offers several builders, so I will try each of them on this image.
Text
The simplest OCR. It reads the characters in the image and returns them as a string.
import argparse
import sys

import pyocr
import pyocr.builders
from PIL import Image

parser = argparse.ArgumentParser(description='tesseract ocr test')
parser.add_argument('image', help='image path')
args = parser.parse_args()

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]

# TextBuilder returns the recognized text as a single string
res = tool.image_to_string(Image.open(args.image),
                           lang="jpn",
                           builder=pyocr.builders.TextBuilder(tesseract_layout=6))
print(res)
result
Lord of the machine training
Tess Wota
Next to the door screaming
rope
Ship Day^~Genus~Customary betting history ba "Mae
The result is terrible, probably because the image contains difficult words.
WordBox
It returns a box for the position of each word. Let's visualize the result with OpenCV. (Install OpenCV from here.)
import argparse
import sys

import cv2
import pyocr
import pyocr.builders
from PIL import Image

parser = argparse.ArgumentParser(description='tesseract ocr test')
parser.add_argument('image', help='image path')
args = parser.parse_args()

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]

# WordBoxBuilder returns one box per recognized word
res = tool.image_to_string(Image.open(args.image),
                           lang="jpn",
                           builder=pyocr.builders.WordBoxBuilder(tesseract_layout=6))

# Draw each word's bounding box on the image
out = cv2.imread(args.image)
for d in res:
    print(d.content)
    print(d.position)
    cv2.rectangle(out, d.position[0], d.position[1], (0, 0, 255), 2)
cv2.imshow('image', out)
cv2.waitKey(0)
cv2.destroyAllWindows()
Lord of the machine training
((226, 12), (412, 37))
Tess
((255, 138), (278, 148))
Wota
((283, 137), (326, 148))
door
((397, 149), (406, 159))
Screaming next door
((411, 149), (430, 159))
Historical training
((477, 148), (523, 159))
rope
((165, 170), (199, 181))
Ship Day
((115, 202), (156, 212))
^~Genus~
((210, 196), (247, 220))
Customary betting history
((297, 202), (343, 213))
Ba "Mae
((390, 203), (438, 212))
The detected regions are decent, but the recognized words are still terrible.
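Each position printed above is a pair of corner points, ((x1, y1), (x2, y2)), which is why it can be passed straight to cv2.rectangle. As a small sketch of working with these values, the hypothetical helpers below compute box sizes and drop very short boxes, which might be noise. The threshold is an arbitrary assumption.

```python
def box_size(position):
    """Return (width, height) of a box given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = position
    return (x2 - x1, y2 - y1)


def filter_small_boxes(positions, min_height):
    """Keep only boxes that are at least `min_height` pixels tall."""
    return [p for p in positions if box_size(p)[1] >= min_height]


# The first three positions copied from the WordBox output above.
positions = [
    ((226, 12), (412, 37)),
    ((255, 138), (278, 148)),
    ((283, 137), (326, 148)),
]
sizes = [box_size(p) for p in positions]
```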
LineBox
WordBox worked word by word, but LineBox groups the words on the same line.
Only one part of the WordBox source changes: replace WordBoxBuilder with LineBoxBuilder.
res = tool.image_to_string(Image.open(args.image),
                           lang="jpn",
                           builder=pyocr.builders.LineBoxBuilder(tesseract_layout=6))
result
Lord of the machine training
((226, 12), (412, 37))
Tess Wota
((255, 137), (326, 148))
Next to the door screaming
((397, 148), (523, 159))
rope
((165, 170), (199, 181))
Ship Day^~Genus~Customary betting history ba "Mae
((115, 196), (438, 220))
For this image there is no real need to group words by line, but it looks useful for text that spans multiple lines.
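Comparing the two outputs, each line box looks like the smallest rectangle that encloses the word boxes on that line. The sketch below illustrates that relationship using the numbers printed above. It is only an illustration, not how Tesseract or pyocr compute line boxes internally, and `union_box` is a hypothetical helper.

```python
def union_box(boxes):
    """Return the smallest ((x1, y1), (x2, y2)) box that contains all `boxes`."""
    lefts = [b[0][0] for b in boxes]
    tops = [b[0][1] for b in boxes]
    rights = [b[1][0] for b in boxes]
    bottoms = [b[1][1] for b in boxes]
    return ((min(lefts), min(tops)), (max(rights), max(bottoms)))


# Word boxes for "Tess" and "Wota" from the WordBox output.
tess = ((255, 138), (278, 148))
wota = ((283, 137), (326, 148))
line = union_box([tess, wota])  # ((255, 137), (326, 148)), the "Tess Wota" line box
```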
Each builder above was given tesseract_layout=6. This number selects how Tesseract segments the page before OCR. This person has summarized the values: http://tanaken-log.blogspot.jp/2012/08/imagemagick-tesseract.html
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
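Since the best mode depends on the image, one option is to run the same image through a few modes and compare the text. The sketch below keeps the OCR call behind a callable so the comparison logic stands alone. `compare_layouts` and `run_ocr` are hypothetical names, and the commented wiring assumes the `tool` object and the TextBuilder call from the scripts in this post.

```python
# A few of the modes listed above, keyed by their number.
PAGE_SEG_MODES = {
    3: "Fully automatic page segmentation, but no OSD (default)",
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single text line",
}


def compare_layouts(run_ocr, layouts):
    """Call run_ocr(layout) for each layout number and collect {layout: text}."""
    return {layout: run_ocr(layout) for layout in layouts}


# Possible wiring with pyocr (assumes `tool`, `Image`, and `path` exist):
# run_ocr = lambda n: tool.image_to_string(
#     Image.open(path), lang="jpn",
#     builder=pyocr.builders.TextBuilder(tesseract_layout=n))
# results = compare_layouts(run_ocr, sorted(PAGE_SEG_MODES))
```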
As you can see, the accuracy is not good with the existing Japanese data. If you create the training data yourself, the results should improve.
http://hadashi-gensan.hatenablog.com/entry/2014/01/15/135316
With TEXT_DETECTION in the Google Cloud Vision API, the result looks like this.
machine
Learning
of
flow
test
data
Preliminary
Measurement
vessel
Learning
result
Before
processing
teacher
data
Raw
data
machine
Learning
Parameters
L
Reason
A
As expected, the accuracy is good. If you want easy processing and do not need to make many requests, the Vision API is the better choice.
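For reference, a text detection request to the Vision API is a JSON body containing the base64-encoded image and the TEXT_DETECTION feature type. The sketch below only builds that body. The body shape and the endpoint `https://vision.googleapis.com/v1/images:annotate` reflect my understanding of the public REST API rather than anything in this post, and authentication (API key or OAuth token) is left out. `build_text_detection_request` is a hypothetical helper.

```python
import base64
import json


def build_text_detection_request(image_bytes):
    """Build the JSON-serializable body for a Cloud Vision images:annotate call."""
    return {
        "requests": [
            {
                # The API expects the image content as base64 text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "TEXT_DETECTION"}],
            }
        ]
    }


# Placeholder bytes; in practice, read the image file in binary mode.
body = json.dumps(build_text_detection_request(b"fake image bytes"))
```

The resulting `body` string would be sent with an HTTP POST to the endpoint, and the recognized text comes back in the JSON response.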