tesseract-ocr for Python

Introduction: what I want to do

I want to extract Japanese text using OCR. The extracted text will then be used for various purposes.

Usage environment

- MacBook Pro (13-inch, Mid 2012)
- Processor: 2.5 GHz Intel Core i5
- Memory: 4 GB 1600 MHz DDR3
- OS: OS X El Capitan (version 10.11.4)

Installation reference:

- tesseract-ocr (Mac version)

You can install Tesseract with either MacPorts or Homebrew.

1. Installation

MacPorts

`Terminal`


sudo port install tesseract
# Replace <langcode> with the code of the language you want to process (English: eng, Japanese: jpn)
sudo port install tesseract-<langcode>

Homebrew

`Terminal`


brew install tesseract
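
After installing, it may be worth checking that the language data you need is actually available. Below is a minimal sketch in Python that calls `tesseract --list-langs` and parses its output. The helper names `parse_list_langs` and `installed_languages` are my own (hypothetical), and I am assuming the output is a header line followed by one language code per line.

```python
import shutil
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line ("List of available languages ...").
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_languages() -> List[str]:
    """Return installed language codes, or an empty list if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Assumption: depending on the release, the list may go to stdout or stderr,
    # so both streams are parsed. Warnings on stderr could show up as false entries.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)


if __name__ == "__main__":
    langs = installed_languages()
    print("jpn installed:", "jpn" in langs)
```

If `jpn` is missing from the list, the Japanese language package probably still needs to be installed.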

2. Run

This time, I prepared an image that mixes Japanese and English.

`Terminal`


tesseract test.png out -l eng+jpn
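
Since the goal is to use this from Python, here is a rough sketch of running the same command through `subprocess`. The function names `build_tesseract_command` and `run_ocr` are hypothetical helpers of my own, not part of Tesseract. I am assuming the default behavior where the recognized text is written to `<output base>.txt`.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_tesseract_command(image: str, out_base: str, langs: Sequence[str]) -> List[str]:
    """Build the argument list for: tesseract <image> <out_base> -l lang1+lang2."""
    if not langs:
        raise ValueError("at least one language code is required")
    # Multiple languages are joined with "+", as in "eng+jpn".
    return ["tesseract", image, out_base, "-l", "+".join(langs)]


def run_ocr(image: str, out_base: str, langs: Sequence[str]) -> str:
    """Run tesseract and return the recognized text read from <out_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(build_tesseract_command(image, out_base, langs), check=True)
    # Assumption: tesseract appends ".txt" to the output base name.
    with open(out_base + ".txt", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    # Same arguments as the terminal command above.
    print(build_tesseract_command("test.png", "out", ["eng", "jpn"]))
```

With Tesseract installed, `run_ocr("test.png", "out", ["eng", "jpn"])` should return the same text that ends up in `out.txt`.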

Input image size: 996 x 517

[Image: test.png]

Result

`Text output result`


tesseract—ocr for Python

Introduction ヽ What you want to do

I want to extract Japanese using OCR technology.
In addition, the extracted Japanese will be used for various purposes.

Reflections

I haven't tested this in detail, so I can't draw firm conclusions. The result will probably change depending on conditions such as resolution and blank space (including margins). If I need it someday, I'll verify it.

By the way, images containing only English or only Japanese gave quite good results.
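
If I do get around to verifying it, one simple way to put a number on "quite good" would be to compare the OCR output with the text that was actually in the image. The sketch below uses `difflib.SequenceMatcher` from the standard library. The `similarity` helper is hypothetical, and a character-level ratio is only one of several possible measures, not an established OCR benchmark.

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Return a 0.0-1.0 similarity ratio between the expected text and the OCR output."""
    return SequenceMatcher(None, expected, actual).ratio()


if __name__ == "__main__":
    expected = "tesseract-ocr for Python"
    # The title line as it appeared in the output above (hyphen recognized as U+2014).
    actual = "tesseract\u2014ocr for Python"
    print(f"similarity: {similarity(expected, actual):.3f}")
```

Running the same comparison on images with different resolutions or margins would show whether those conditions really matter.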
