I wanted to build an OCR environment using only Anaconda. I did not know how hard that would be, so I looked for an easy way.
Environment: Windows 10, Anaconda, Python 3.6, Spyder 4.1.2
After some investigation, tesseract + pyocr appeared to be a common way to do OCR from Python, so I decided to try that combination.
tesseract: An open-source OCR (optical character recognition) engine whose development has been sponsored by Google. Versions 4.0 and later use an LSTM-based (machine learning) recognizer, so for the sake of recognition accuracy the latest version seems to be the better choice.
pyocr: An OCR tool wrapper for Python. It supports tesseract.
Reference article: "Build an environment only with Anaconda and try Python + OCR" https://qiita.com/anzanshi/items/9ee94affecd74be33159
I used it as a reference, but I got stuck in a few places because my environment was different.
There seem to be several ways to install tesseract, but this time I will install it with Anaconda.
tesseract is available in the conda-forge repository: https://anaconda.org/conda-forge/tesseract
I simply installed it as is (v4.1.1 as of April 14, 2020).
conda install -c conda-forge tesseract
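Since the traineddata files used later depend on the tesseract version, it can help to confirm which version actually got installed. The following is a minimal sketch, not part of the original procedure: the helper names parse_tesseract_version and installed_tesseract_version are hypothetical, and it assumes the tesseract executable is on PATH in the active environment.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(version_output):
    """Return the version as a tuple of ints, or None if no version is found."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def installed_tesseract_version():
    """Run `tesseract --version` if the executable is on PATH, else return None."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some builds print the version to stderr
        universal_newlines=True,   # works on Python 3.6 (text= needs 3.7+)
    )
    return parse_tesseract_version(result.stdout)


if __name__ == "__main__":
    print("Tesseract version:", installed_tesseract_version())
```

If this prints None, the executable was probably not found on PATH, for example because a different conda environment is active.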
pyocr is in a repository called brianjmcguirk, which I had rarely seen before ...? https://anaconda.org/brianjmcguirk/pyocr
I installed this one as is, too (currently v0.5).
conda install -c brianjmcguirk pyocr
Following the article above, I checked the installation with the sample code from the official pyocr page.
from PIL import Image
import sys
import pyocr
import pyocr.builders
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
Running it gives the following.
Execution result
Will use tool 'Tesseract (sh)'
Available languages: eng, osd
Will use lang 'eng'
As the output shows, only English is available at this point; Japanese OCR is not yet possible.
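The sample code simply takes langs[0], and its own comments note that the languages are not sorted in any way. As a small sketch of choosing the language explicitly instead, the helper below (pick_language is a hypothetical name, and the preference order is my own assumption) returns the first preferred language that is actually installed.

```python
def pick_language(available, preferred=("jpn", "eng")):
    """Return the first preferred language that is available, else None."""
    for lang in preferred:
        if lang in available:
            return lang
    return None


# With the languages printed above, only English can be chosen.
langs = ["eng", "osd"]
lang = pick_language(langs)
print("Will use lang '%s'" % lang)
# -> Will use lang 'eng'
```

With this approach, "jpn" would be selected automatically once it shows up in the list returned by tool.get_available_languages().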
Now, let's do OCR in Japanese.
Download jpn.traineddata from the page below. The location seems to have changed from where it used to be, so it was a little hard to find. https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files.md
Note that the data files differ depending on the tesseract version! (I got this wrong once ...)
Where to put the file is also a little tricky ... In my environment, tesseract could read it once I placed it under Anaconda3/envs/(environment name)/Library/bin/tessdata (eng.traineddata and osd.traineddata were already there).
There is also a tessdata directory directly under (environment name), but files placed there do not seem to be read.
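Because more than one tessdata directory can exist in an environment, it may be useful to list where the traineddata files actually are. This is only a sketch: find_traineddata is a hypothetical helper, and it assumes the script runs with the Python of the target conda environment, so that sys.prefix points at Anaconda3/envs/(environment name).

```python
import sys
from pathlib import Path


def find_traineddata(root, lang):
    """Return sorted paths of every <lang>.traineddata file found under root."""
    return sorted(Path(root).rglob(lang + ".traineddata"))


if __name__ == "__main__":
    # sys.prefix is the root directory of the active Python environment.
    for lang in ("eng", "jpn"):
        hits = find_traineddata(sys.prefix, lang)
        print(lang, "->", [str(p) for p in hits] or "not found")
```

The directory that already contains eng.traineddata is the most likely place for jpn.traineddata as well, which matches what worked in my environment.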
Run the code on the official page again
Execution result
Will use tool 'Tesseract (sh)'
Available languages: eng, jpn, osd
Will use lang 'eng'
"jpn" has now been added properly. Next, let's read some Japanese text.
↓ Test image

txt = tool.image_to_string(
    Image.open('test.png'),
    lang="jpn",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print( txt )
Execution result
raise TesseractError(status, errors)
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\r\n")
An unexpected error occurred ... Here, too, an article by someone who hit a similar error was helpful: https://xkage.com/python-ocr.html
I was able to fix it by rewriting "-psm" to "--psm" in pyocr's tesseract.py and builders.py.
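The rewrite itself is a plain text substitution. The sketch below shows one way to express it so that an already correct "--psm" is left untouched. The function name patch_psm_flag is hypothetical, and the sample line is a made-up illustration, not a quote from pyocr's source.

```python
import re


def patch_psm_flag(source_text):
    """Replace a single-dash "-psm" with "--psm"; an existing "--psm" is left alone."""
    # (?<!-) ensures the dash is not already preceded by another dash.
    return re.sub(r"(?<!-)-psm\b", "--psm", source_text)


if __name__ == "__main__":
    # Hypothetical line for illustration only.
    before = 'tess_flags = ["-psm", str(tesseract_layout)]'
    print(patch_psm_flag(before))
    # -> tess_flags = ["--psm", str(tesseract_layout)]
```

Applying the function twice gives the same result as applying it once, so it is safe to run again on a file that has already been fixed. Keeping a backup of the original files before editing an installed package is a sensible precaution.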

txt = tool.image_to_string(
    Image.open('test.png'),
    lang="jpn",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print( txt )
Execution result
Test test
It worked!
You can build the environment with Anaconda alone, but I got stuck quite a bit because there was less information than I expected. Well, now I'm going to play around with OCR a lot.
There are also reports that pyocr cannot be used with Python 3.7, so it seems safer to create the environment with 3.6.