OCR (Optical Character Recognition) is a technology that reads printed and handwritten characters and converts them into text data.
OCR services exist for a wide range of documents, such as invoices, receipts, business cards, and driver's licenses. Using OCR reduces manual data entry, and by linking the results with other systems, the extracted data can be put to practical use.
Many companies provide OCR services, both for businesses and for individuals. One OCR service that individuals can use is the Google Vision API (hereinafter, Vision API), a very high-performance image analysis service provided by Google. (The free trial page is here: https://cloud.google.com/vision?hl=ja)
This time, I tried simple receipt OCR using the Vision API.
The environment is Google Colaboratory. The Python version is as follows.
import platform
print("python " + platform.python_version())
# python 3.6.9
Now let's write the code. First, import the libraries needed to display images.
import cv2
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Also prepare a sample receipt image. Let's display it.
img = cv2.imread(input_file) # input_file is the path of the image
plt.figure(figsize=[10,10])
plt.axis('off')
plt.imshow(img[:,:,::-1])
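One caution: cv2.imread does not raise an error on a bad path, it silently returns None, so a small guard like the following can save confusion (input_file as above):

img = cv2.imread(input_file)
if img is None:
    raise FileNotFoundError(f"could not read image: {input_file}")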
Now, let's OCR this receipt image using the Google Vision API.
The following preparations are necessary to use the Vision API; please proceed according to the official setup guide. You will need to install the client library and issue a service account key.
The client library is installed as follows (in a Colaboratory notebook, prefix the command with !):
pip install google-cloud-vision
Use the issued service account key to set the environment variable.
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = json_path # json_path is the path of the service account key
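Alternatively, instead of the environment variable, the key can be passed to the client directly. A sketch using the google-auth library (json_path as above):

from google.oauth2 import service_account
from google.cloud import vision

credentials = service_account.Credentials.from_service_account_file(json_path)
client = vision.ImageAnnotatorClient(credentials=credentials)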
Now let's send a request to the Vision API and get a response.
import io
from google.cloud import vision
from google.cloud.vision import types  # for google-cloud-vision < 2.0; in 2.x and later, use vision.Image instead

client = vision.ImageAnnotatorClient()

with io.open(input_file, 'rb') as image_file:
    content = image_file.read()
image = types.Image(content=content)
response = client.document_text_detection(image=image)
If this runs without error, the request reached the API and a response was returned.
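As a precaution, the client exposes any API error on the response, and it is worth checking before reading the results:

if response.error.message:
    raise Exception(response.error.message)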
This response contains the Vision API's OCR results: the recognized text, coordinate information, confidence scores, detected languages, and more. Here, let's check the recognized full text.
print(response.text_annotations[0].description)
SAVERSONICS Seven-Eleven Chiyoda store 8-8 Nibancho, Chiyoda-ku, Tokyo Phone: 03-1234-5678 Cash register #31 Tuesday, October 01, 2019 08:45 Staff 012 Receipt Hand-rolled rice ball spicy cod roe Coca-Cola 500ml Paradu Mini Nail PK03 Mobius One 50 yen stamp *130 *140 300 490 50 Subtotal (8% excluding tax) ¥270 Consumption tax, etc. (8%) ¥21 Subtotal (10% excluding tax) ¥300 Consumption tax, etc. (10%) ¥30 Subtotal (10% including tax) ¥490 Subtotal (tax exempt) ¥50 Total ¥1,161 (Tax rate 8% target ¥291) (Tax rate 10% target ¥820) (Including consumption tax, etc. 8% ¥21) (Including consumption tax, etc. 10% ¥74) Cashless return amount -22 nanaco payment ¥1,139 The purchase details are as above. nanaco number *******9999 Points this time 2P The [*] mark indicates items subject to the reduced tax rate.
You can see that the text is read with extremely high accuracy.
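Besides the plain text, the response also carries per-block confidence scores and detected languages. A minimal sketch of how to inspect them (the printed values are illustrative):

document = response.full_text_annotation
page = document.pages[0]
print(page.blocks[0].confidence)  # confidence of the first block, e.g. 0.98
for language in page.property.detected_languages:
    print(language.language_code, language.confidence)  # e.g. ja 0.97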
The Vision API divides the image into blocks, paragraphs, and so on, according to how the characters are grouped. Let's check each of these regions. First, define two helper functions.
from enum import Enum

class FeatureType(Enum):
    PAGE = 1
    BLOCK = 2
    PARA = 3
    WORD = 4
    SYMBOL = 5
def draw_boxes(input_file, bounds):
    """Draw a green quadrilateral on the image for each bounding box."""
    img = cv2.imread(input_file, cv2.IMREAD_COLOR)
    for bound in bounds:
        p1 = (bound.vertices[0].x, bound.vertices[0].y)  # top left
        p2 = (bound.vertices[1].x, bound.vertices[1].y)  # top right
        p3 = (bound.vertices[2].x, bound.vertices[2].y)  # bottom right
        p4 = (bound.vertices[3].x, bound.vertices[3].y)  # bottom left
        cv2.line(img, p1, p2, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
        cv2.line(img, p2, p3, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
        cv2.line(img, p3, p4, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
        cv2.line(img, p4, p1, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    return img
def get_document_bounds(response, feature):
    """Collect bounding boxes at the requested granularity (block, paragraph, word, or symbol)."""
    document = response.full_text_annotation
    bounds = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        if feature == FeatureType.SYMBOL:
                            bounds.append(symbol.bounding_box)
                    if feature == FeatureType.WORD:
                        bounds.append(word.bounding_box)
                if feature == FeatureType.PARA:
                    bounds.append(paragraph.bounding_box)
            if feature == FeatureType.BLOCK:
                bounds.append(block.bounding_box)
    return bounds
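Before drawing anything, a quick way to get a feel for each granularity is to count the regions returned at each level:

for feature in (FeatureType.BLOCK, FeatureType.PARA, FeatureType.WORD, FeatureType.SYMBOL):
    print(feature.name, len(get_document_bounds(response, feature)))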
Now let's draw each region on the image and display the results.
bounds = get_document_bounds(response, FeatureType.BLOCK)
img_block = draw_boxes(input_file, bounds)
bounds = get_document_bounds(response, FeatureType.PARA)
img_para = draw_boxes(input_file, bounds)
bounds = get_document_bounds(response, FeatureType.WORD)
img_word = draw_boxes(input_file, bounds)
bounds = get_document_bounds(response, FeatureType.SYMBOL)
img_symbol = draw_boxes(input_file, bounds)
plt.figure(figsize=[20,20])
plt.subplot(141);plt.imshow(img_block[:,:,::-1]);plt.title("img_block")
plt.subplot(142);plt.imshow(img_para[:,:,::-1]);plt.title("img_para")
plt.subplot(143);plt.imshow(img_word[:,:,::-1]);plt.title("img_word")
plt.subplot(144);plt.imshow(img_symbol[:,:,::-1]);plt.title("img_symbol")
We can confirm that the image is segmented into units of block, paragraph, word, and symbol.
As you can see, the Vision API segments the regions sensibly, but in some cases this works against us. For example, here the item name "hand-rolled rice ball spicy cod roe" and its price "*130" end up in different regions. By their nature, receipts organize information line by line, so let's split the text line by line instead.
How can we separate the lines? The Vision API returns coordinates for every character (the symbol bounding_box above), so sorting the characters top to bottom and then left to right should work. Below, we build a routine that groups characters into lines based on their coordinates.
def get_sorted_lines(response):
    """Group OCR characters into lines: sort by y, split on y jumps, then sort each line by x."""
    document = response.full_text_annotation
    bounds = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        x = symbol.bounding_box.vertices[0].x
                        y = symbol.bounding_box.vertices[0].y
                        text = symbol.text
                        bounds.append([x, y, text, symbol.bounding_box])
    bounds.sort(key=lambda b: b[1])  # sort all characters top to bottom
    old_y = -1
    line = []
    lines = []
    threshold = 1  # max y drift (in pixels) tolerated within one line
    for bound in bounds:
        y = bound[1]
        if old_y == -1:
            old_y = y
        elif old_y - threshold <= y <= old_y + threshold:
            old_y = y
        else:
            # y jumped beyond the threshold: flush the current line and start a new one
            old_y = y
            line.sort(key=lambda b: b[0])  # sort the finished line left to right
            lines.append(line)
            line = []
        line.append(bound)
    line.sort(key=lambda b: b[0])
    lines.append(line)
    return lines
Let's check it.
img = cv2.imread(input_file, cv2.IMREAD_COLOR)

lines = get_sorted_lines(response)

for line in lines:
    texts = [i[2] for i in line]
    texts = ''.join(texts)
    bounds = [i[3] for i in line]
    print(texts)
    # draw one box around the whole line: left edge from the first symbol, right edge from the last
    p1 = (bounds[0].vertices[0].x, bounds[0].vertices[0].y)    # top left
    p2 = (bounds[-1].vertices[1].x, bounds[-1].vertices[1].y)  # top right
    p3 = (bounds[-1].vertices[2].x, bounds[-1].vertices[2].y)  # bottom right
    p4 = (bounds[0].vertices[3].x, bounds[0].vertices[3].y)    # bottom left
    cv2.line(img, p1, p2, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.line(img, p2, p3, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.line(img, p3, p4, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.line(img, p4, p1, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
plt.figure(figsize=[10,10])
plt.axis('off')
plt.imshow(img[:,:,::-1]);plt.title("img_by_line")
Seven-Eleven SAVERSONICS Chiyoda store
8-8 Nibancho, Chiyoda-ku, Tokyo
Phone: 03-1234-5678
Cash register #31 Tuesday, October 01, 2019 08:45 Staff 012
Receipt
Hand-rolled rice ball Spicy cod roe *130
Coca-Cola 500ml *140
Paradu Mini Nail PK03300
Mobius One 490
50 yen stamp 50
Subtotal (8% excluding tax) ¥270
Consumption tax, etc. (8%) ¥21
Subtotal (10% excluding tax) ¥300
Consumption tax, etc. (10%) ¥30
Subtotal (10% including tax) ¥490
Subtotal (tax exempt) ¥50
Total ¥1,161
(Tax rate 8% target ¥291)
(Tax rate 10% target ¥820)
(Including consumption tax, etc. 8% ¥21)
(Including consumption tax, etc. 10% ¥74)
Cashless return amount -22
nanaco payment ¥1,139
The purchase details are as above.
nanaco number *******9999
Points this time 2P
The [*] mark indicates items subject to the reduced tax rate.
I was able to organize the text line by line.
With the strings organized this way, it becomes much easier to pick out the information you need. To extract it, text processing with regular expressions or natural language processing are both options.
This time, let's extract and structure information such as the "date", "phone number", and "total amount" using regular expressions. Note that the patterns below target the original Japanese receipt text (dates like 2019年10月01日, totals like 合計 ¥1,161); the receipt output above is shown in translation. For the regular expressions, see also these posts:
- I made a regular expression for "date" using Python
- I tried to make a regular expression for "time" using Python
- I tried to make a regular expression for "amount" using Python
import re

def get_matched_string(pattern, string):
    """Return the first substring matching pattern, or False if there is no match."""
    prog = re.compile(pattern)
    result = prog.search(string)
    if result:
        return result.group()
    return False
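A quick check of the helper's behavior (illustrative inputs):

print(get_matched_string(r'\d{1,2}:\d{2}', 'Register opened 08:45'))  # -> 08:45
print(get_matched_string(r'\d{1,2}:\d{2}', 'no time here'))           # -> False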
pattern_dict = {}
# dates such as 2019年10月01日, 2019/10/01, 2019-10-01
pattern_dict['date'] = r'[12]\d{3}[/\-年](0?[1-9]|1[0-2])[/\-月](0?[1-9]|[12][0-9]|3[01])日?'
# times such as 08:45 or 8時45分
pattern_dict['time'] = r'((0?|1)[0-9]|2[0-3])[:時][0-5][0-9]分?'
# phone numbers such as 03-1234-5678
pattern_dict['tel'] = r'0\d{1,3}-\d{1,4}-\d{4}'
# totals such as 合計 ¥1,161 at the end of a line
pattern_dict['total_price'] = r'合計 ¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)$'
for line in lines:
    texts = [i[2] for i in line]
    texts = ''.join(texts)
    for key, pattern in pattern_dict.items():
        matched_string = get_matched_string(pattern, texts)
        if matched_string:
            print(key, matched_string)
# tel 03-1234-5678
# date 2019年10月01日
# time 08:45
# total_price 合計 ¥1,161
I was able to extract the phone number, date, time, and total amount.
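To actually structure these values, one simple approach is to collect the first match for each key into a dict. A minimal sketch reusing lines, pattern_dict, and get_matched_string from above:

receipt = {}
for line in lines:
    text = ''.join(i[2] for i in line)
    for key, pattern in pattern_dict.items():
        matched = get_matched_string(pattern, text)
        if matched and key not in receipt:
            receipt[key] = matched  # keep only the first match per field
print(receipt)
# e.g. {'tel': '03-1234-5678', 'date': '2019年10月01日', 'time': '08:45', 'total_price': '合計 ¥1,161'}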
This time, I tried receipt OCR using the Vision API.
The Vision API provides highly accurate OCR, and it is a service that even individuals can use easily. By applying regular expressions or natural language processing to the OCR output, you can also extract exactly the information you want.
Why not try OCR on various documents yourself?