I want to calculate the medal difference (net medals won or lost) from the graph images on a pachislot data site.
To do that I need the medal count printed on the graph image, so I read that number with OCR.
This is the kind of graph image in question.
The value I want is the number shown in the upper left (2410 in this image).
Environment: Tesseract (4.0 or later), PyOCR
Installation steps are omitted. See the reference links at the bottom of the page.
As a first attempt, I run OCR on the graph image as it is.
from PIL import Image
import pyocr
import pyocr.builders
import sys
file_path = 'File Path'
#Tool loading
tools = pyocr.get_available_tools()
#If you can't find the tool
if len(tools) == 0:
    print("No OCR tool found. Please install Tesseract.")
    sys.exit(1)
tool = tools[0]
#Image loading
img_org = Image.open(file_path)
# OCR
max_medals = tool.image_to_string(img_org, lang='jpn', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
print(f'max_medals:{max_medals}')
Execution result
-
No numbers were recognized at all.
After some investigation, it seems that digit OCR is more accurate with the English trained data, so I changed the language setting to English.
from PIL import Image
import pyocr
import pyocr.builders
import sys
file_path = 'File Path'
#Tool loading
tools = pyocr.get_available_tools()
#If you can't find the tool
if len(tools) == 0:
    print("No OCR tool found. Please install Tesseract.")
    sys.exit(1)
tool = tools[0]
#Image loading
img_org = Image.open(file_path)
# OCR
max_medals = tool.image_to_string(img_org, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
print(f'max_medals:{max_medals}')
Execution result
2410 1300160019.00
This time, some of the numbers in the image were recognized.
However, it also picked up parts I do not need, so I changed the code to crop out only the region I want to read and run OCR on that.
from PIL import Image
import pyocr
import pyocr.builders
import sys
file_path = 'File Path'
#Tool loading
tools = pyocr.get_available_tools()
#If you can't find the tool
if len(tools) == 0:
    print("No OCR tool found. Please install Tesseract.")
    sys.exit(1)
tool = tools[0]
#Image loading
img_org = Image.open(file_path)
#Cut out the number notation part
max_medals_img = img_org.crop((0, 0, 45, 15))
# OCR
max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
print(f'max_medals:{max_medals}')
Execution result
max_medals:2410
It went well!
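The four numbers passed to `crop` are pixel coordinates in the order (left, upper, right, lower), with the origin at the top-left corner of the image. Below is a minimal sketch of a helper that builds and checks such a box. The name `top_left_box` is hypothetical, and the 400x300 image size is only an assumed example, not the real size of the graph images.

```python
def top_left_box(width, height, image_size):
    """Return a Pillow-style crop box (left, upper, right, lower)
    anchored at the top-left corner of the image."""
    img_w, img_h = image_size
    # Reject boxes that are empty or larger than the image itself.
    if not (0 < width <= img_w and 0 < height <= img_h):
        raise ValueError(
            f"box {width}x{height} does not fit in image {img_w}x{img_h}"
        )
    return (0, 0, width, height)


# 45x15 is the crop size used in the article; (400, 300) is an assumed image size.
box = top_left_box(45, 15, image_size=(400, 300))
print(box)
```

The returned tuple can be passed directly to `img_org.crop(box)`.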
Since the code above worked, I increased the number of graph images to read and tried again.
from PIL import Image
import pyocr
import pyocr.builders
import sys
from glob import glob
file_path = 'File storage directory'
#Create file list to read
file_list = [file for file in glob(f'{file_path}*.png')]
for file_path in file_list:
    #Tool loading
    tools = pyocr.get_available_tools()
    #If you can't find the tool
    if len(tools) == 0:
        print("No OCR tool found. Please install Tesseract.")
        sys.exit(1)
    tool = tools[0]
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
    print(f'max_medals:{max_medals}')
Execution result
max_medals:2410
max_medals:
max_medals:490
max_medals:2717
max_medals:689
max_medals:504
max_medals:1013
max_medals:
max_medals:862
max_medals:979
max_medals:835
max_medals:1683
max_medals:1587
max_medals:1010
max_medals:7
max_medals:1586
max_medals:1653
max_medals:413
max_medals:1167
max_medals:527
Some images were not read correctly.
I had previously run OCR on images in a different format without any such failures, and those images were
"Background color: white, text color: black"
so I tried inverting the background and text colors here as well.
from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import sys
from glob import glob
file_path = 'File storage directory'
#Create file list to read
file_list = [file for file in glob(f'{file_path}*.png')]
for file_path in file_list:
    #Tool loading
    tools = pyocr.get_available_tools()
    #If you can't find the tool
    if len(tools) == 0:
        print("No OCR tool found. Please install Tesseract.")
        sys.exit(1)
    tool = tools[0]
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
    print(f'max_medals:{max_medals}')
Execution result
max_medals:2410
max_medals:440
max_medals:490
max_medals:2717
max_medals:689
max_medals:504
max_medals:1013
max_medals:791
max_medals:862
max_medals:979
max_medals:835
max_medals:1683
max_medals:1587
max_medals:1010
max_medals:1132
max_medals:1586
max_medals:1653
max_medals:413
max_medals:1167
max_medals:527
With this change, the images that had failed before were also recognized correctly.
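For an 8-bit RGB image, inversion replaces each channel value v with 255 - v, so white text on a dark background becomes dark text on a light background. The following pure-Python sketch shows the same arithmetic on a few hand-written pixels. The function name `invert_pixel` and the sample pixel values are made up for illustration and are not taken from the real images.

```python
def invert_pixel(rgb):
    """Invert one 8-bit RGB pixel: each channel v becomes 255 - v."""
    return tuple(255 - v for v in rgb)


# Made-up pixels: black background, white text, dark grey edge.
white_on_black = [(0, 0, 0), (255, 255, 255), (30, 30, 30)]
black_on_white = [invert_pixel(p) for p in white_on_black]
print(black_on_white)
# [(255, 255, 255), (0, 0, 0), (225, 225, 225)]
```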
I then ran this code on a larger set of sample images ...
Execution result
max_medals:1908.
max_medals:
max_medals:1000-
max_medals:10
As shown above, there are still rare cases where characters that are not in the image get mixed in, the number of digits is wrong, or the number is not recognized at all.
(7 out of 10,000 images)
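When checking thousands of results, it helps to flag suspicious readings automatically instead of looking through the output by eye. The sketch below sorts a reading into one of the failure types described above. The function name `classify` and the expected range of 3 to 4 digits are assumptions based on the sample output in this article, so adjust them to your own data.

```python
import re


def classify(reading, min_digits=3, max_digits=4):
    """Label an OCR reading as 'ok' or as one of the failure types."""
    if reading == "":
        return "empty"  # nothing was recognized
    if not re.fullmatch(r"[0-9]+", reading):
        return "stray characters"  # e.g. a trailing '.' or '-'
    if not (min_digits <= len(reading) <= max_digits):
        return "suspicious length"  # digits were probably dropped or added
    return "ok"


for reading in ["1908.", "", "1000-", "10", "2410"]:
    print(repr(reading), "->", classify(reading))
```

A length check like this can only flag a reading as suspicious. It cannot tell whether a reading with a plausible number of digits is actually correct.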
To improve accuracy further, I changed the OCR mode.
from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import sys
from glob import glob
file_path = 'File storage directory'
#Create file list to read
file_list = [file for file in glob(f'{file_path}*.png')]
for file_path in file_list:
    #Tool loading
    tools = pyocr.get_available_tools()
    #If you can't find the tool
    if len(tools) == 0:
        print("No OCR tool found. Please install Tesseract.")
        sys.exit(1)
    tool = tools[0]
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=8))
    print(f'max_medals:{max_medals}')
I changed tesseract_layout from 6 to 8, the mode that treats the whole image as a single word. Since OCR runs on a crop that contains only the number, this mode should be the best fit, and it was indeed more accurate.
(Failures dropped to about 4 out of 10,000)
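As far as I know, PyOCR hands `tesseract_layout` to Tesseract as its page segmentation mode (`--psm`). The small lookup table below is only a reading aid for the modes that matter here. The descriptions paraphrase the output of `tesseract --help-psm`, and the names `PSM_DESCRIPTIONS` and `describe_layout` are hypothetical.

```python
# Page segmentation modes relevant to this article.
# Descriptions paraphrase the output of `tesseract --help-psm`.
PSM_DESCRIPTIONS = {
    3: "fully automatic page segmentation, no OSD (Tesseract's default)",
    6: "assume a single uniform block of text",
    7: "treat the image as a single text line",
    8: "treat the image as a single word",
}


def describe_layout(tesseract_layout):
    """Return a short description of a page segmentation mode."""
    return PSM_DESCRIPTIONS.get(tesseract_layout, "not listed in this sketch")


for mode in (6, 8):
    print(f"tesseract_layout={mode}: {describe_layout(mode)}")
```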
However, some images were still misrecognized, so as a stopgap I added code that strips every character other than digits.
import re
from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import sys
from glob import glob
file_path = 'File storage directory'
#Create file list to read
file_list = [file for file in glob(f'{file_path}*.png')]
for file_path in file_list:
    #Tool loading
    tools = pyocr.get_available_tools()
    #If you can't find the tool
    if len(tools) == 0:
        print("No OCR tool found. Please install Tesseract.")
        sys.exit(1)
    tool = tools[0]
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=8))
    #Remove non-numeric characters
    max_medals = re.sub(r'\D', '', max_medals)
    print(f'max_medals:{max_medals}')
This handles the cases where characters that are not in the image, such as "-" and ".", get mixed into the result.
However, in rare cases the number itself was still not recognized at all, or the number of digits was wrong.
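To make the limits of this cleanup concrete, here is the same `re.sub(r'\D', '', ...)` call applied to the failed readings shown earlier. The wrapper name `digits_only` is hypothetical. Stray punctuation is removed, but an empty reading stays empty and a reading with missing digits passes through unchanged.

```python
import re


def digits_only(text):
    """Remove every character that is not a digit."""
    return re.sub(r"\D", "", text)


for raw in ["1908.", "1000-", "", "10"]:
    print(repr(raw), "->", repr(digits_only(raw)))
# '1908.' -> '1908'
# '1000-' -> '1000'
# '' -> ''
# '10' -> '10'
```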
Wondering what to do, I came up with several possible workarounds, for example:
Get the numbers at both the upper left and lower left of the image ↓ Compare the two ↓ Adopt the one that looks correct
I thought through a few such logic patterns, but the code would have been long and complicated, so I stepped back and reconsidered.
** "If the OCR recognition accuracy can be improved in the first place, there is no need to write troublesome code." **
Having arrived at this rather obvious idea, I tried various crop sizes in the OCR preprocessing step.
import re
from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import sys
from glob import glob
file_path = 'File storage directory'
#Create file list to read
file_list = [file for file in glob(f'{file_path}*.png')]
for file_path in file_list:
    #Tool loading
    tools = pyocr.get_available_tools()
    #If you can't find the tool
    if len(tools) == 0:
        print("No OCR tool found. Please install Tesseract.")
        sys.exit(1)
    tool = tools[0]
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 44, 14))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=8))
    #Remove non-numeric characters
    max_medals = re.sub(r'\D', '', max_medals)
    print(f'max_medals:{max_medals}')
After trying various sizes, this code reached a 100% recognition rate on the graph images I have!
In the end, finding the best crop size worked better than devising extra logic (laughs)
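The article does not show how the crop sizes were compared, so the following is only one possible way to organize such a search, assuming you have a set of images whose correct numbers are already known. The names `recognition_rate`, `best_box` and `fake_ocr` are hypothetical. `fake_ocr` is a stand-in with made-up behaviour so that the sketch runs without Tesseract. In real use you would pass a function that crops, inverts and runs `image_to_string`.

```python
def recognition_rate(ocr, samples, box):
    """Fraction of samples whose OCR reading for `box` equals the expected text."""
    hits = sum(1 for image, expected in samples if ocr(image, box) == expected)
    return hits / len(samples)


def best_box(ocr, samples, candidate_boxes):
    """Return the candidate crop box with the highest recognition rate."""
    return max(candidate_boxes, key=lambda box: recognition_rate(ocr, samples, box))


def fake_ocr(image, box):
    """Stand-in for a real OCR call (made-up behaviour, for the demo only):
    reads correctly for one box and adds a stray '.' for every other box."""
    return image["text"] if box == (0, 0, 44, 14) else image["text"] + "."


# Hypothetical labelled samples: (image, expected reading).
samples = [({"text": "2410"}, "2410"), ({"text": "490"}, "490")]
candidates = [(0, 0, w, h) for w in (44, 45, 46) for h in (14, 15, 16)]

winner = best_box(fake_ocr, samples, candidates)
print(winner, recognition_rate(fake_ocr, samples, winner))
```

Passing the OCR step in as a function keeps the search loop independent of Tesseract, which also makes it easy to rerun after changing other settings such as the layout mode.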
When character recognition is not working well, I suggest this order of priority:
** Suspect the image first > Review the settings > Add extra logic to compensate **
Working through the problem in that order should make it harder to get stuck.
Even though it only had to recognize digits this time, OCR turned out to be extremely accurate.
Character recognition with Python and Tesseract OCR
How to run OCR in Python