Overview

I wasn't sure how hard it would be to build an OCR environment using only Anaconda, so I looked for an easy way to do it.

Environment

Windows 10, Anaconda, Python 3.6, Spyder 4.1.2

About tesseract and pyocr

After some investigation, I found that tesseract + pyocr is one way to do OCR from Python, so I decided to try it.

tesseract: An open-source OCR (optical character recognition) engine whose development has been sponsored by Google. Since v4.0 and later use an LSTM-based (machine learning) engine, the latest version seems like the best choice in terms of recognition rate.

pyocr: An OCR tool wrapper for Python. It supports tesseract.

Reference

Build an environment only with Anaconda and try Python + OCR
https://qiita.com/anzanshi/items/9ee94affecd74be33159

I used this article as a reference, but I got stuck a few times because my environment was different.

Installation of tesseract

There seem to be several ways to install it, but this time I will use Anaconda.

tesseract is available in the conda-forge repository:
https://anaconda.org/conda-forge/tesseract

Install it as-is (v4.1.1 as of April 14, 2020):

conda install -c conda-forge tesseract
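
To confirm that the install actually put the tesseract executable on the PATH of the active environment, a small check like the one below may help. This is only a minimal sketch: `tesseract_version` is a helper name made up for illustration, and it assumes nothing more than that the `tesseract` command accepts `--version`.

```python
import shutil
import subprocess


def tesseract_version(command="tesseract"):
    """Return the first line of `<command> --version`, or None if not found.

    `tesseract_version` is a hypothetical helper, not part of any library.
    """
    exe = shutil.which(command)
    if exe is None:
        return None
    # The version banner may go to stderr on some builds, so merge the streams.
    result = subprocess.run(
        [exe, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract was not found on PATH")
```

If this prints the "not found" message from inside Spyder, the environment that Spyder is running in is probably not the one tesseract was installed into.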

Install pyocr

pyocr is in a repository called brianjmcguirk, which I have rarely seen ...?
https://anaconda.org/brianjmcguirk/pyocr

Install this as-is too (currently v0.5):

conda install -c brianjmcguirk pyocr
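
Before running any OCR code, it can be worth checking that the packages are importable from the current environment. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name, and `PIL` is included because the sample code in the next section imports it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    for name in ("pyocr", "PIL"):
        status = "OK" if is_installed(name) else "missing"
        print("%s: %s" % (name, status))
```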

Try running the code on the official page

Following the article above, check the installation with the sample code from the official page.

from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.

The execution result is as follows.

`Execution result`

Will use tool 'Tesseract (sh)'
Available languages: eng, osd
Will use lang 'eng'

As the output shows, only English is available so far; Japanese cannot be recognized yet.
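
The sample simply takes `langs[0]`, which happens to be 'eng' here. A slightly more explicit choice could look like the sketch below. `pick_language` is a hypothetical helper, and the lists passed in stand for whatever `tool.get_available_languages()` returns.

```python
def pick_language(available, preferred=("jpn", "eng")):
    """Return the first preferred language that is available, else None."""
    for lang in preferred:
        if lang in available:
            return lang
    return None


if __name__ == "__main__":
    # With only the default data installed, "jpn" is skipped.
    print(pick_language(["eng", "osd"]))
    # Once Japanese data is available, it is chosen first.
    print(pick_language(["eng", "jpn", "osd"]))
```

Returning None instead of raising lets the caller decide whether a missing language is an error.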

Japanese OCR environment creation

Now let's do OCR in Japanese.

Download trained data

Download jpn.traineddata from the page below. The location seems to have changed from where it used to be, so it was a little hard to find.
https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md

Note that the data differs depending on the tesseract version! (I got this wrong once ...)

Put it in the right place

This is also a little troublesome ... In my environment, tesseract could read the file once I put it under

Anaconda3/envs/(environment name)/Library/bin/tessdata

(eng.traineddata and osd.traineddata are already there.)

There is also a tessdata directory directly under (environment name), but tesseract does not seem to read from that one.
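
Since more than one tessdata directory can exist, listing them together with the language files they contain makes it easier to see which one is in use (the one that already holds eng and osd). The following is a minimal sketch using only the standard library; `find_tessdata_dirs` is a hypothetical helper, and it assumes the script is run with the Python of the conda environment in question, so that `sys.prefix` points at that environment.

```python
import os
import sys

SUFFIX = ".traineddata"


def find_tessdata_dirs(prefix):
    """Return (directory, languages) pairs for every directory named
    'tessdata' under `prefix`. The language list may be empty."""
    found = []
    for root, _dirs, files in os.walk(prefix):
        if os.path.basename(root).lower() != "tessdata":
            continue
        langs = sorted(f[:-len(SUFFIX)] for f in files if f.endswith(SUFFIX))
        found.append((root, langs))
    return found


if __name__ == "__main__":
    # sys.prefix is the root of the environment this Python runs in.
    for directory, langs in find_tessdata_dirs(sys.prefix):
        print(directory, langs)
```

A directory that lists no languages at all is a likely candidate for the one that is not being read.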

Re-execute

Run the sample code from the official page again.

`Execution result`


Will use tool 'Tesseract (sh)'
Available languages: eng, jpn, osd
Will use lang 'eng'

"jpn" has been added properly. Next, let's read some Japanese text.

[Test image]

txt = tool.image_to_string(
    Image.open('test.png'),
    lang="jpn",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print(txt)

`Execution result`


raise TesseractError(status, errors)
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\r\n")

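An unexpected error occurred ... Judging from the message, this version of tesseract does not accept the single-dash "-psm" option that pyocr passes to it. Here too, an article by someone who hit a similar error was helpful:
https://xkage.com/python-ocr.html

In tesseract.py and builders.py, I rewrote "-psm" to "--psm", which resolved the error.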
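
I made the change by hand, but the same rewrite could be sketched as a small script. This is an illustration under assumptions, not the method from the referenced article: `fix_psm_flag` and `patch_file` are hypothetical helpers, and the regular expression only touches a "-psm" that is not already preceded by another dash, so running it twice is harmless.

```python
import re

# Match "-psm" only when it is not already preceded by another dash.
_OLD_FLAG = re.compile(r"(?<!-)-psm\b")


def fix_psm_flag(source_text):
    """Return `source_text` with every single-dash -psm turned into --psm."""
    return _OLD_FLAG.sub("--psm", source_text)


def patch_file(path):
    """Rewrite one file in place, keeping a .bak copy.

    Returns True if the file was changed, False if nothing needed fixing.
    """
    with open(path, encoding="utf-8") as f:
        original = f.read()
    patched = fix_psm_flag(original)
    if patched == original:
        return False
    with open(path + ".bak", "w", encoding="utf-8") as f:
        f.write(original)
    with open(path, "w", encoding="utf-8") as f:
        f.write(patched)
    return True
```

The two files live inside the installed pyocr package; one way to find that directory is `os.path.dirname(pyocr.__file__)`, after which `patch_file` can be called on tesseract.py and builders.py there.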

Run again

txt = tool.image_to_string(
    Image.open('test.png'),
    lang="jpn",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print(txt)

`Execution result`


Test test

It worked!

Summary

You can build the environment with Anaconda, but I got stuck quite a bit because there was less information than I expected. Anyway, now I'm going to play around with OCR as much as I like.

Supplement

There are also reports that pyocr cannot be used with Python 3.7, so it seems safer to create the environment with 3.6.

Create a Japanese OCR environment with Anaconda (tesseract + pyocr)
