Purpose

I want to calculate the medal difference (medals won or lost) from the graph images on a pachislot data site.

To do that I need the medal count printed on the graph image, so I read that number with OCR.

The graph images look like this.

The value I want is the number shown in the upper left (2410 in this image).

What to prepare

・ Tesseract (4.0 or later)
・ PyOCR

Installation is not covered here. See the reference links at the bottom of the page.

Try OCR

As a first step, run OCR on the graph image as it is.

from PIL import Image
import pyocr
import pyocr.builders
import sys

file_path = 'File Path'
#Tool loading
tools = pyocr.get_available_tools()
#If you can't find the tool
if len(tools) == 0:
    print("No OCR tool found. Please install Tesseract.")
    sys.exit(1)
tool = tools[0]
#Image loading
img_org = Image.open(file_path)
# OCR
max_medals = tool.image_to_string(img_org, lang='jpn', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
print(f'max_medals：{max_medals}')

`Execution result`

No numbers were recognized.

After some investigation, it seems that OCR on digits is more accurate with the English trained data, so I changed the language setting to English.

Change language setting to English

from PIL import Image
import pyocr
import pyocr.builders
import sys

file_path = 'File Path'
#Tool loading
tools = pyocr.get_available_tools()
#If you can't find the tool
if len(tools) == 0:
    print("No OCR tool found. Please install Tesseract.")
    sys.exit(1)
tool = tools[0]
#Image loading
img_org = Image.open(file_path)
# OCR
max_medals = tool.image_to_string(img_org, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
print(f'max_medals：{max_medals}')

`Execution result`


2410 1300160019.00

This time, some of the numbers in the image were read.

However, unneeded parts were read as well, so I changed the code to crop just the part I want before running OCR.

OCR after cropping the target area

from PIL import Image
import pyocr
import pyocr.builders
import sys

file_path = 'File Path'
#Tool loading
tools = pyocr.get_available_tools()
#If you can't find the tool
if len(tools) == 0:
    print("No OCR tool found. Please install Tesseract.")
    sys.exit(1)
tool = tools[0]
#Image loading
img_org = Image.open(file_path)
#Cut out the number notation part
max_medals_img = img_org.crop((0, 0, 45, 15))
# OCR
max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
print(f'max_medals：{max_medals}')

`Execution result`


max_medals：2410

It went well!
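As a small illustration of what the crop box means, here is a sketch that runs without Tesseract. The 200x100 blank image is a made-up stand-in for a real graph image. The box passed to `crop` is `(left, upper, right, lower)` in pixels, so `(0, 0, 45, 15)` yields a 45x15 region from the top-left corner.

```python
from PIL import Image

# Stand-in for a downloaded graph image (the 200x100 size is an assumption).
img_org = Image.new('RGB', (200, 100), 'black')

# box = (left, upper, right, lower), measured from the top-left corner
max_medals_img = img_org.crop((0, 0, 45, 15))

print(max_medals_img.size)  # (45, 15)
```

Printing the size of the crop is a quick way to confirm that the box covers the area you intended before handing it to OCR.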

Improve accuracy

Since the previous code worked, I ran it again on a larger set of graph images.

from PIL import Image
import pyocr
import pyocr.builders
import sys
from glob import glob

file_path = 'File storage directory'
#Create file list to read
file_list = [file for file in glob(f'{file_path}*.png')]
for file_path in file_list:
    #Tool loading
    tools = pyocr.get_available_tools()
    #If you can't find the tool
    if len(tools) == 0:
        print("No OCR tool found. Please install Tesseract.")
        sys.exit(1)
    tool = tools[0]
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
    print(f'max_medals：{max_medals}')

`Execution result`


max_medals：2410
max_medals：
max_medals：490
max_medals：2717
max_medals：689
max_medals：504
max_medals：1013
max_medals：
max_medals：862
max_medals：979
max_medals：835
max_medals：1683
max_medals：1587
max_medals：1010
max_medals：7
max_medals：1586
max_medals：1653
max_medals：413
max_medals：1167
max_medals：527

Some images were not read correctly.

I had run OCR on images in a different format before without any misreads. Those images had a white background and black text.

So I tried inverting the background color and the text color.

Invert background color and text color

from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import sys
from glob import glob

file_path = 'File storage directory'
#Create file list to read
file_list = [file for file in glob(f'{file_path}*.png')]
for file_path in file_list:
    #Tool loading
    tools = pyocr.get_available_tools()
    #If you can't find the tool
    if len(tools) == 0:
        print("No OCR tool found. Please install Tesseract.")
        sys.exit(1)
    tool = tools[0]
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=6))
    print(f'max_medals：{max_medals}')

`Execution result`


max_medals：2410
max_medals：440
max_medals：490
max_medals：2717
max_medals：689
max_medals：504
max_medals：1013
max_medals：791
max_medals：862
max_medals：979
max_medals：835
max_medals：1683
max_medals：1587
max_medals：1010
max_medals：1132
max_medals：1586
max_medals：1653
max_medals：413
max_medals：1167
max_medals：527

The images that failed before are now recognized correctly.
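To make the inversion step concrete, here is a minimal sketch on a two-pixel stand-in image (not a real graph crop). `ImageOps.invert` maps each channel value v to 255 - v. The `convert('RGB')` call beforehand is presumably there because `invert` does not accept every image mode (for example, images with an alpha channel), which is an assumption about why the code above includes it.

```python
from PIL import Image, ImageOps

# Two-pixel stand-in for a cropped label: one white "text" pixel on black.
label = Image.new('RGBA', (2, 1), (0, 0, 0, 255))
label.putpixel((0, 0), (255, 255, 255, 255))

# Drop the alpha channel first, then flip every channel value v to 255 - v.
inverted = ImageOps.invert(label.convert('RGB'))

print(inverted.getpixel((0, 0)))  # (0, 0, 0): white text became black
print(inverted.getpixel((1, 0)))  # (255, 255, 255): black background became white
```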

I then ran this code on an even larger set of images ...

`Execution result`


max_medals：1908.
max_medals：
max_medals：1000-
max_medals：10

Even so, in rare cases characters that are not in the image were mixed in, the number of digits was wrong, or the number was not recognized at all.

(7 out of 10,000)
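When running over thousands of images, it helps to flag suspicious results automatically rather than reading the output by eye. The sketch below is one possible check, not the method used in this article: it treats any result that is not made up purely of digits as suspicious. The helper name `looks_clean` is hypothetical.

```python
import re

def looks_clean(raw):
    """Return True when the OCR output consists only of digits."""
    return re.fullmatch(r'\d+', raw) is not None

# Raw outputs quoted in the execution result above
raw_results = ['1908.', '', '1000-', '10']

suspicious = [raw for raw in raw_results if not looks_clean(raw)]
print(suspicious)                          # ['1908.', '', '1000-']
print(len(suspicious), 'of', len(raw_results))
```

Note that `'10'` passes this check even though it presumably has the wrong number of digits. A format check alone cannot catch that kind of misread.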

To improve accuracy further, I changed the OCR mode.

Change mode from 6 to 8 (mode that treats the image as a single word)

from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import sys
from glob import glob

file_path = 'File storage directory'
#Create file list to read
file_list = [file for file in glob(f'{file_path}*.png')]
for file_path in file_list:
    #Tool loading
    tools = pyocr.get_available_tools()
    #If you can't find the tool
    if len(tools) == 0:
        print("No OCR tool found. Please install Tesseract.")
        sys.exit(1)
    tool = tools[0]
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=8))
    print(f'max_medals：{max_medals}')

I switched to the mode that treats the whole image as a single word. Since OCR runs on a crop that contains only the number, this mode should be the best fit, and it did turn out to be more accurate.

(Reduced to about 4 out of 10,000)
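For reference, the `tesseract_layout` value corresponds, as far as I know, to Tesseract's page segmentation mode (`--psm`). The small lookup below lists the modes relevant here, with descriptions paraphrased from `tesseract --help-psm`. The dictionary itself is only an illustration and is not part of PyOCR.

```python
# Descriptions paraphrased from the output of `tesseract --help-psm`.
PSM_MODES = {
    6: 'Assume a single uniform block of text.',
    7: 'Treat the image as a single text line.',
    8: 'Treat the image as a single word.',
}

for mode in (6, 8):
    print(f'tesseract_layout={mode}: {PSM_MODES[mode]}')
```

Mode 7 is listed only because it is another candidate that may be worth trying for a one-line label. It was not tested in this article.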

However, some images were still misread, so as a stopgap I added code that strips every non-digit character.

Exclude non-numeric characters

import re
from PIL import Image, ImageOps
import pyocr
import pyocr.builders
import sys
from glob import glob

file_path = 'File storage directory'
#Create file list to read
file_list = [file for file in glob(f'{file_path}*.png')]
for file_path in file_list:
    #Tool loading
    tools = pyocr.get_available_tools()
    #If you can't find the tool
    if len(tools) == 0:
        print("No OCR tool found. Please install Tesseract.")
        sys.exit(1)
    tool = tools[0]
    #Image loading
    img_org = Image.open(file_path)
    #Cut out the number notation part
    max_medals_img = img_org.crop((0, 0, 45, 15))
    #Invert background color and text color (convert from white text to black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=8))
    #Remove non-numeric characters
    max_medals = re.sub(r'\D', '', max_medals)
    print(f'max_medals：{max_medals}')

This handles cases where characters that are not in the image, such as "-" and ".", get mixed into the result.
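Here is what the `re.sub(r'\D', '', ...)` step does to the misread strings shown earlier, as a sketch that runs without any OCR. The conversion to `int` at the end is an extra step assumed here for illustration. An empty result is mapped to `None` so that an unrecognized image is not silently treated as a number.

```python
import re

# Raw OCR outputs quoted earlier in this article
raw_results = ['1908.', '', '1000-', '10']

# \D matches any non-digit character, so only the digits survive
cleaned = [re.sub(r'\D', '', raw) for raw in raw_results]
print(cleaned)  # ['1908', '', '1000', '10']

# Convert to numbers, keeping "nothing recognized" distinguishable
values = [int(text) if text else None for text in cleaned]
print(values)   # [1908, None, 1000, 10]
```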

However, in rare cases the number still was not recognized at all, or the number of digits was wrong.

Wondering what to do, I came up with several possible countermeasures, for example:

Read the number labels in both the upper left and the lower left of the image ↓ Compare the two ↓ Adopt the one that looks valid

I worked out a few variations of this logic, but the code became long and complicated, so I stepped back to reconsider.
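The compare-and-adopt idea could look roughly like the sketch below. This is not the code from this article, which was abandoned. It assumes that the upper-left and lower-left labels are expected to show the same magnitude (for example 2410 and -2410), which the article does not state. The function name `pick_reading` is hypothetical.

```python
import re

def pick_reading(upper_raw, lower_raw):
    """Choose a value from two OCR readings, or None if they cannot be trusted.

    Assumption: both labels should contain the same digits once signs and
    stray characters are removed.
    """
    upper = re.sub(r'\D', '', upper_raw)
    lower = re.sub(r'\D', '', lower_raw)
    if upper and upper == lower:
        return upper            # both readings agree
    if upper and not lower:
        return upper            # only the upper label was recognized
    if lower and not upper:
        return lower            # only the lower label was recognized
    return None                 # both empty, or the readings disagree

print(pick_reading('2410', '-2410'))  # 2410
print(pick_reading('', '-440'))       # 440
print(pick_reading('1000-', '100'))   # None
```

Even this simplified version needs a decision for every combination of failures, which gives a sense of why the full logic grew long.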

**"In the first place, if the OCR recognition accuracy itself improves, there is no need to write troublesome code."**

Having arrived at this rather obvious idea, I went back to the OCR preprocessing and tried various crop sizes for the number label.

Final code

import os
import re
import sys
from glob import glob

from PIL import Image, ImageOps
import pyocr
import pyocr.builders

dir_path = 'File storage directory'

# Load the OCR tool once, before the loop
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found. Please install Tesseract.")
    sys.exit(1)
tool = tools[0]

# Build the list of files to read (works with or without a trailing slash)
file_list = glob(os.path.join(dir_path, '*.png'))
for file_path in file_list:
    # Load the image
    img_org = Image.open(file_path)
    # Crop the number label
    max_medals_img = img_org.crop((0, 0, 44, 14))
    # Invert background and text colors (white text becomes black text)
    max_medals_img = ImageOps.invert(max_medals_img.convert('RGB'))
    # OCR
    max_medals = tool.image_to_string(max_medals_img, lang='eng', builder=pyocr.builders.DigitBuilder(tesseract_layout=8))
    # Remove non-digit characters
    max_medals = re.sub(r'\D', '', max_medals)
    print(f'max_medals：{max_medals}')

After trying various sizes, this code reached a 100% recognition rate on the graph images I had on hand!

In the end, finding the right crop size worked better than devising extra logic (laughs).
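The article does not say how the crop sizes were compared. One way to make the trial and error systematic is a small grid search over candidate boxes, scored against images whose correct value is already known. The sketch below is hypothetical. `score_crop_boxes` is an illustrative helper, and `fake_recognize` is a stand-in so the example runs without Tesseract. In real use, `recognize` would crop, invert, run OCR, and strip non-digits.

```python
from itertools import product

def score_crop_boxes(samples, recognize, widths, heights):
    """Return (box, accuracy) pairs sorted from best to worst.

    samples   : list of (image, expected_digits) pairs with known answers
    recognize : callable(image, box) -> str
    """
    results = []
    for width, height in product(widths, heights):
        box = (0, 0, width, height)
        hits = sum(1 for image, expected in samples
                   if recognize(image, box) == expected)
        results.append((box, hits / len(samples)))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Stand-in recognizer: pretends OCR succeeds only for one particular box.
# Here each "image" is just the string that a perfect OCR would return.
def fake_recognize(image, box):
    return image if box == (0, 0, 44, 14) else ''

samples = [('2410', '2410'), ('440', '440')]
ranking = score_crop_boxes(samples, fake_recognize,
                           widths=[43, 44, 45], heights=[14, 15])
best_box, best_accuracy = ranking[0]
print(best_box, best_accuracy)  # (0, 0, 44, 14) 1.0
```

A search like this needs a set of images with hand-checked values, so it pays off mainly when the same graph format will be read many times.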

Conclusion

If characters are not being recognized well:

**Suspect the image first > Review the settings > Add extra logic to compensate**

I think you are less likely to get stuck if you work through things in this order of priority.

Even though this was digits-only recognition, OCR turned out to be remarkably accurate.

Reference link

・ Character recognition with Python and Tesseract OCR
・ How to run OCR in Python
