tesseract-ocr for Python
I want to extract Japanese text from images using OCR. The extracted text will then be used for various purposes.
Environment
MacBook Pro (13-inch, Mid 2012)
Processor: 2.5 GHz Intel Core i5
Memory: 4 GB 1600 MHz DDR3
OS: OS X El Capitan (Ver. 10.11.4)
You can install Tesseract with either MacPorts or Homebrew.
Terminal
sudo port install tesseract
# Install the language package for each language you want to process, replacing '<langcode>' with its code (English: eng, Japanese: jpn)
sudo port install tesseract-<langcode>
Terminal
brew install tesseract
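Whichever installer you use, it helps to confirm that the language data you need is actually present before running OCR. The sketch below calls `tesseract --list-langs` from Python and parses the result. The helper names `parse_langs` and `installed_langs` are made up for this example, and the assumption that the header line starts with "List of" is based on typical Tesseract output, so adjust it if your build prints something different.

```python
import subprocess


def parse_langs(listing):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # Assumption: the header looks like "List of available languages (2):"
    # and every other non-empty line is a language code.
    return [line for line in lines if not line.startswith("List of")]


def installed_langs():
    """Run tesseract and return the installed language codes."""
    # Some builds print the listing to stderr, so both streams are merged.
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    return parse_langs(result.stdout)


if __name__ == "__main__":
    langs = installed_langs()
    for code in ("eng", "jpn"):
        print(code, "installed" if code in langs else "MISSING")
```

If `jpn` is reported as missing, the Japanese language package still needs to be installed before the `-l eng+jpn` option will work.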
For this test, I prepared an image (test.png) that mixes Japanese and English text.
Terminal
tesseract test.png out -l eng+jpn
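Since the goal is to use Tesseract from Python, the same command can be driven through the standard `subprocess` module. This is only a minimal sketch: `build_command` and `ocr_to_text` are names invented here, not part of any library, and it assumes the `tesseract` binary is on your PATH and that the output file is UTF-8.

```python
import subprocess
from pathlib import Path


def build_command(image, out_base, langs):
    """Build the command line: tesseract <image> <out_base> -l lang1+lang2."""
    return ["tesseract", str(image), str(out_base), "-l", "+".join(langs)]


def ocr_to_text(image, out_base="out", langs=("eng", "jpn")):
    """Run tesseract on the image and return the recognized text."""
    subprocess.run(build_command(image, out_base, langs), check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(str(out_base) + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    print(ocr_to_text("test.png"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.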
Text output result (out.txt)
tesseract—ocr for Python
Introduction ヽ What you want to do
I want to extract Japanese using OCR technology.
In addition, the extracted Japanese will be used for various purposes.
I haven't tested this in detail, so I can't say the results will always look like this. They will probably vary with conditions such as resolution and blank space (including margins). If I need it someday, I'll verify it.
By the way, the "English only" and "Japanese only" cases gave quite good results.
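One way to check this kind of observation yourself is to run the same image with different `-l` settings and compare the text. The sketch below assumes that "English only" and "Japanese only" refer to the language setting (they may equally mean images that contain only one language). `stdout_command` and `compare` are hypothetical helper names, and the example relies on Tesseract accepting `stdout` as the output base so that the text is printed instead of written to a file.

```python
import subprocess

# Language settings to compare: each language alone, then both together.
LANG_SETS = ["eng", "jpn", "eng+jpn"]


def stdout_command(image, lang):
    """Build a command that prints the recognized text to standard output."""
    return ["tesseract", str(image), "stdout", "-l", lang]


def compare(image, lang_sets=LANG_SETS):
    """Return a dict mapping each language setting to the recognized text."""
    results = {}
    for lang in lang_sets:
        proc = subprocess.run(
            stdout_command(image, lang),
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,  # hide progress messages
            check=True,
        )
        results[lang] = proc.stdout.decode("utf-8")
    return results


if __name__ == "__main__":
    for lang, text in compare("test.png").items():
        print("=== " + lang + " ===")
        print(text)
```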