Extract text from images in Python

Overview

This article shows how to load an image in Python and convert it to text with OCR.

Environment

Python: I used Python 2.7 for this article.

Tesseract-ocr: The OCR engine. I used Tesseract-ocr 3.0.4. If you are installing on MacOS X, see also "Installing tesseract-ocr training tools on MacOS X".

pyocr: A wrapper for using the OCR engine from Python. There are many similar wrappers, but after trying a few I chose this one because it makes the code easier to write. The repository is here.

$ pip install pyocr

PIL: A library for handling images in Python. I installed and used Pillow.

$ pip install pillow

Code example

The following code, based on the pyocr sample code, reads the text from an image and prints it.

pyocr_sample.py


from PIL import Image
import sys
import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]

txt = tool.image_to_string(
    Image.open('sample.png'),
    lang="jpn+eng",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)
print(txt)

The tesseract_layout=6 argument is not covered in the pyocr documentation, but it corresponds to the -psm 6 option of the tesseract command. This option selects the page layout that the analysis assumes; 6 treats the image as a single uniform block of text. Choosing an appropriate value can greatly improve the accuracy of text box extraction. The default, -psm 3, determines the layout automatically, but it is better to specify the layout explicitly when you know it.
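To make the correspondence with the command line concrete, here is a minimal sketch that builds the tesseract command matching a given tesseract_layout value. The helper name build_tesseract_command and the output base name "out" are hypothetical and used only for illustration. The -psm spelling is the one used by Tesseract 3.x; newer releases spell the flag --psm, so it is left as a parameter.

```python
def build_tesseract_command(image_path, output_base, lang="jpn+eng", psm=6,
                            psm_flag="-psm"):
    """Return the argument list for a tesseract call with an explicit layout.

    Hypothetical helper: psm plays the same role as tesseract_layout in
    pyocr.builders.TextBuilder. psm_flag is "-psm" for Tesseract 3.x.
    """
    return ["tesseract", image_path, output_base,
            "-l", lang, psm_flag, str(psm)]


# The command that corresponds to the pyocr call in the sample above
cmd = build_tesseract_command("sample.png", "out")
print(" ".join(cmd))  # prints: tesseract sample.png out -l jpn+eng -psm 6
```

Running the printed command by hand with a few different values is a quick way to see which layout suits an image before fixing tesseract_layout in the script.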

If you use the language data distributed with tesseract-ocr as is, the results are poor enough to make you doubt whether the text can be read at all. With properly trained language data, however, the accuracy improves accordingly.
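Whichever language data you use, it has to be installed where Tesseract can find it and selected through the lang argument. The following is a rough sketch for checking that before running OCR. The helper names missing_languages and installed_languages are hypothetical, and the sketch assumes the tool object provides get_available_languages(), as the pyocr tools do.

```python
def missing_languages(lang_spec, available):
    """Return the languages in a 'jpn+eng' style spec that are not installed."""
    requested = [code for code in lang_spec.split("+") if code]
    return [code for code in requested if code not in available]


def installed_languages():
    """Ask the first available pyocr tool which languages it can use."""
    import pyocr  # imported here so missing_languages() works without pyocr
    tools = pyocr.get_available_tools()
    return tools[0].get_available_languages() if tools else []


# Example with a hand-written list; in practice pass installed_languages()
print(missing_languages("jpn+eng", ["eng", "osd"]))  # prints ['jpn']
```

If the returned list is not empty, the named language data needs to be installed before the lang value in the sample will work.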
