Overview

Read image data with Python and convert it to text data with OCR.

Environment

Python: This article uses Python 2.7.

Tesseract-ocr: An OCR engine. I used Tesseract-ocr 3.0.4. If you want to install it on Mac OS X, see also "Installing tesseract-ocr training tools on MacOS X".

pyocr: A wrapper for using an OCR engine from Python. Several similar wrappers exist; I tried a few and chose the one that made the code easiest to write. The repository is here.

$ pip install pyocr

PIL: A library for handling images in Python. I installed and used Pillow, the maintained fork of PIL.

$ pip install pillow

Code example

Based on the sample code, a script that reads the text from an image and prints it looks like this.

`pyocr_sample.py`


from PIL import Image
import sys
import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]

txt = tool.image_to_string(
    Image.open('sample.png'),
    lang="jpn+eng",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print(txt)

The tesseract_layout=6 argument is not described in the documentation, but it corresponds to the -psm 6 option of the tesseract command. This option tells the engine what kind of page layout to assume during analysis; 6 means a single uniform block of text. Choosing an appropriate value greatly improves the accuracy of box extraction. The default is -psm 3, which detects the layout automatically, but it is better to specify the layout explicitly when you know it.
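To see how much the layout assumption matters for a particular image, it can help to run the same image through several layouts and compare the output. The following is a minimal sketch, not part of pyocr: `ocr_with_layouts`, `make_builder`, and `PSM_DESCRIPTIONS` are hypothetical names, and the descriptions are paraphrased from the help text of the tesseract command. It assumes `tool` is one of the objects returned by `pyocr.get_available_tools()`.

```python
# Some page segmentation modes, paraphrased from the tesseract help text.
PSM_DESCRIPTIONS = {
    3: "fully automatic page segmentation, no OSD (default)",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    7: "single text line",
    8: "single word",
    10: "single character",
}


def ocr_with_layouts(tool, image, layouts, lang="jpn+eng", make_builder=None):
    """Run OCR once per layout and return {layout: recognized text}.

    make_builder is a hypothetical hook for supplying a different builder;
    by default a pyocr TextBuilder with the given tesseract_layout is used.
    """
    if make_builder is None:
        # Imported lazily so the helper itself does not require pyocr.
        import pyocr.builders

        def make_builder(layout):
            return pyocr.builders.TextBuilder(tesseract_layout=layout)

    results = {}
    for layout in layouts:
        results[layout] = tool.image_to_string(
            image, lang=lang, builder=make_builder(layout)
        )
    return results
```

Calling `ocr_with_layouts(tool, Image.open('sample.png'), [3, 6, 7])` would return one result per layout, so the outputs can be compared side by side before settling on a value.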

With the language data distributed by tesseract-ocr used as is, the results were poor enough to make me doubt whether the image was being read at all. With properly trained language data, however, the accuracy improves accordingly.
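Before judging accuracy, it is worth confirming that the language data named in `lang` is actually installed. The following is a minimal sketch: `missing_languages` is a hypothetical helper, not a pyocr function, and it assumes the list of installed languages comes from the tool's `get_available_languages()` method.

```python
def missing_languages(lang_spec, available):
    """Return the languages in a 'jpn+eng' style spec that are not installed."""
    requested = [name for name in lang_spec.split("+") if name]
    return [name for name in requested if name not in available]


# Usage with pyocr (assumes `tool` was obtained as in pyocr_sample.py):
#   missing = missing_languages("jpn+eng", tool.get_available_languages())
#   if missing:
#       print("Missing language data: " + ", ".join(missing))
```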

Extract text from images in Python
