OCR (Optical Character Recognition) is a technology that reads printed and handwritten characters and converts them into text data.
OCR services exist for a wide range of documents, such as invoices, receipts, business cards, and driver's licenses. Using OCR reduces manual data entry, and by linking the results with other systems, the extracted data can be put to practical use.
Many companies provide OCR services, both for businesses and for individuals. One OCR service that individuals can use is the Google Vision API (hereinafter, Vision API), a very high-performance image analysis service provided by Google. (The free trial page is here: https://cloud.google.com/vision?hl=ja)
This time, I tried simple receipt OCR using the Vision API.
The environment is Google Colaboratory. The Python version is as follows.
import platform
print("python " + platform.python_version())
# python 3.6.9
Now let's write the code. First, import the libraries needed to display images.
import cv2
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Also prepare a sample receipt image. Let's display it.
img = cv2.imread(input_file) # input_file is the path of the image
plt.figure(figsize=[10,10])
plt.axis('off')
plt.imshow(img[:,:,::-1])
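One caution: cv2.imread does not raise an error on a bad path, it silently returns None, so a small guard like the following can save confusion (input_file as above):

img = cv2.imread(input_file)
if img is None:
    raise FileNotFoundError(f"could not read image: {input_file}")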
Now, let's OCR this receipt image using the Google Vision API.
The following preparations are necessary to use the Vision API; please proceed according to the official setup guide. You will need to install the client library and issue a service account key.
The client library is installed as follows (in a Colaboratory notebook, prefix the command with !):
pip install google-cloud-vision
Use the issued service account key to set the environment variable.
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = json_path # json_path is the path of the service account key
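Alternatively, instead of the environment variable, the key can be passed to the client directly. A sketch using the google-auth library (json_path as above):

from google.oauth2 import service_account
from google.cloud import vision

credentials = service_account.Credentials.from_service_account_file(json_path)
client = vision.ImageAnnotatorClient(credentials=credentials)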
Now let's send a request to the Vision API and get a response.
import io
from google.cloud import vision
from google.cloud.vision import types  # for google-cloud-vision < 2.0; in 2.x and later, use vision.Image instead

client = vision.ImageAnnotatorClient()

with io.open(input_file, 'rb') as image_file:
    content = image_file.read()
image = types.Image(content=content)
response = client.document_text_detection(image=image)
If this runs without error, the request reached the API and a response was returned.
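As a precaution, the client exposes any API error on the response, and it is worth checking before reading the results:

if response.error.message:
    raise Exception(response.error.message)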
This response contains the Vision API's OCR results: the recognized text, coordinate information, confidence scores, detected languages, and more. Here, let's check the recognized full text.
print(response.text_annotations[0].description)
SAVERSONICS Seven-Eleven Chiyoda store 8-8 Nibancho, Chiyoda-ku, Tokyo Phone: 03-1234-5678 Cash register #31 Tuesday, October 01, 2019 08:45 Staff 012 Receipt Hand-rolled rice ball spicy cod roe Coca-Cola 500ml Paradu Mini Nail PK03 Mobius One 50 yen stamp *130 *140 300 490 50 Subtotal (8% excluding tax) ¥270 Consumption tax, etc. (8%) ¥21 Subtotal (10% excluding tax) ¥300 Consumption tax, etc. (10%) ¥30 Subtotal (10% including tax) ¥490 Subtotal (tax exempt) ¥50 Total ¥1,161 (Tax rate 8% target ¥291) (Tax rate 10% target ¥820) (Including consumption tax, etc. 8% ¥21) (Including consumption tax, etc. 10% ¥74) Cashless return amount -22 nanaco payment ¥1,139 The purchase details are as above. nanaco number *******9999 Points this time 2P The [*] mark indicates items subject to the reduced tax rate.
You can see that the text is read with extremely high accuracy.
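Besides the plain text, the response also carries per-block confidence scores and detected languages. A minimal sketch of how to inspect them (the printed values are illustrative):

document = response.full_text_annotation
page = document.pages[0]
print(page.blocks[0].confidence)  # confidence of the first block, e.g. 0.98
for language in page.property.detected_languages:
    print(language.language_code, language.confidence)  # e.g. ja 0.97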
The Vision API divides the image into blocks, paragraphs, and so on, according to how the characters are grouped. Let's check each of these regions. First, define two helper functions.
from enum import Enum

class FeatureType(Enum):
    PAGE = 1
    BLOCK = 2
    PARA = 3
    WORD = 4
    SYMBOL = 5
def draw_boxes(input_file, bounds):
    """Draw a green quadrilateral on the image for each bounding box."""
    img = cv2.imread(input_file, cv2.IMREAD_COLOR)
    for bound in bounds:
        p1 = (bound.vertices[0].x, bound.vertices[0].y)  # top left
        p2 = (bound.vertices[1].x, bound.vertices[1].y)  # top right
        p3 = (bound.vertices[2].x, bound.vertices[2].y)  # bottom right
        p4 = (bound.vertices[3].x, bound.vertices[3].y)  # bottom left
        cv2.line(img, p1, p2, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
        cv2.line(img, p2, p3, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
        cv2.line(img, p3, p4, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
        cv2.line(img, p4, p1, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    return img
def get_document_bounds(response, feature):
    """Collect bounding boxes at the requested granularity (block, paragraph, word, or symbol)."""
    document = response.full_text_annotation
    bounds = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        if feature == FeatureType.SYMBOL:
                            bounds.append(symbol.bounding_box)
                    if feature == FeatureType.WORD:
                        bounds.append(word.bounding_box)
                if feature == FeatureType.PARA:
                    bounds.append(paragraph.bounding_box)
            if feature == FeatureType.BLOCK:
                bounds.append(block.bounding_box)
    return bounds
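Before drawing anything, a quick way to get a feel for each granularity is to count the regions returned at each level:

for feature in (FeatureType.BLOCK, FeatureType.PARA, FeatureType.WORD, FeatureType.SYMBOL):
    print(feature.name, len(get_document_bounds(response, feature)))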
Now let's draw each region on the image and display the results.
bounds = get_document_bounds(response, FeatureType.BLOCK)
img_block = draw_boxes(input_file, bounds)
bounds = get_document_bounds(response, FeatureType.PARA)
img_para = draw_boxes(input_file, bounds)
bounds = get_document_bounds(response, FeatureType.WORD)
img_word = draw_boxes(input_file, bounds)
bounds = get_document_bounds(response, FeatureType.SYMBOL)
img_symbol = draw_boxes(input_file, bounds)
plt.figure(figsize=[20,20])
plt.subplot(141);plt.imshow(img_block[:,:,::-1]);plt.title("img_block")
plt.subplot(142);plt.imshow(img_para[:,:,::-1]);plt.title("img_para")
plt.subplot(143);plt.imshow(img_word[:,:,::-1]);plt.title("img_word")
plt.subplot(144);plt.imshow(img_symbol[:,:,::-1]);plt.title("img_symbol")
We can confirm that the image is segmented into units of block, paragraph, word, and symbol.
As you can see, the Vision API segments the regions sensibly, but in some cases this works against us. For example, here the item name "hand-rolled rice ball spicy cod roe" and its price "*130" end up in different regions. By their nature, receipts organize information line by line, so let's split the text line by line instead.
How can we separate the lines? The Vision API returns coordinates for every character (the symbol bounding_box above), so sorting the characters top to bottom and then left to right should work. Below, we build a routine that groups characters into lines based on their coordinates.
def get_sorted_lines(response):
    """Group OCR characters into lines: sort by y, split on y jumps, then sort each line by x."""
    document = response.full_text_annotation
    bounds = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        x = symbol.bounding_box.vertices[0].x
                        y = symbol.bounding_box.vertices[0].y
                        text = symbol.text
                        bounds.append([x, y, text, symbol.bounding_box])
    bounds.sort(key=lambda b: b[1])  # sort all characters top to bottom
    old_y = -1
    line = []
    lines = []
    threshold = 1  # max y drift (in pixels) tolerated within one line
    for bound in bounds:
        y = bound[1]
        if old_y == -1:
            old_y = y
        elif old_y - threshold <= y <= old_y + threshold:
            old_y = y
        else:
            # y jumped beyond the threshold: flush the current line and start a new one
            old_y = y
            line.sort(key=lambda b: b[0])  # sort the finished line left to right
            lines.append(line)
            line = []
        line.append(bound)
    line.sort(key=lambda b: b[0])
    lines.append(line)
    return lines
Let's check it.
img = cv2.imread(input_file, cv2.IMREAD_COLOR)

lines = get_sorted_lines(response)

for line in lines:
    texts = [i[2] for i in line]
    texts = ''.join(texts)
    bounds = [i[3] for i in line]
    print(texts)
    # draw one box around the whole line: left edge from the first symbol, right edge from the last
    p1 = (bounds[0].vertices[0].x, bounds[0].vertices[0].y)    # top left
    p2 = (bounds[-1].vertices[1].x, bounds[-1].vertices[1].y)  # top right
    p3 = (bounds[-1].vertices[2].x, bounds[-1].vertices[2].y)  # bottom right
    p4 = (bounds[0].vertices[3].x, bounds[0].vertices[3].y)    # bottom left
    cv2.line(img, p1, p2, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.line(img, p2, p3, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.line(img, p3, p4, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.line(img, p4, p1, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
plt.figure(figsize=[10,10])
plt.axis('off')
plt.imshow(img[:,:,::-1]);plt.title("img_by_line")
Seven-Eleven SAVERSONICS Chiyoda store
8-8 Nibancho, Chiyoda-ku, Tokyo
Phone: 03-1234-5678
Cash register #31 Tuesday, October 01, 2019 08:45 Staff 012
Receipt
Hand-rolled rice ball Spicy cod roe *130
Coca-Cola 500ml *140
Paradu Mini Nail PK03300
Mobius One 490
50 yen stamp 50
Subtotal (8% excluding tax) ¥270
Consumption tax, etc. (8%) ¥21
Subtotal (10% excluding tax) ¥300
Consumption tax, etc. (10%) ¥30
Subtotal (10% including tax) ¥490
Subtotal (tax exempt) ¥50
Total ¥1,161
(Tax rate 8% target ¥291)
(Tax rate 10% target ¥820)
(Including consumption tax, etc. 8% ¥21)
(Including consumption tax, etc. 10% ¥74)
Cashless return amount -22
nanaco payment ¥1,139
The purchase details are as above.
nanaco number *******9999
Points this time 2P
The [*] mark indicates items subject to the reduced tax rate.
I was able to organize the text line by line.
With the strings organized this way, it becomes much easier to pick out the information you need. To extract it, text processing with regular expressions or natural language processing are both options.
This time, let's extract and structure information such as the "date", "phone number", and "total amount" using regular expressions. Note that the patterns below target the original Japanese receipt text (dates like 2019年10月01日, totals like 合計 ¥1,161); the receipt output above is shown in translation. For the regular expressions, see also these posts:
- I made a regular expression for "date" using Python
- I tried to make a regular expression for "time" using Python
- I tried to make a regular expression for "amount" using Python
import re

def get_matched_string(pattern, string):
    """Return the first substring matching pattern, or False if there is no match."""
    prog = re.compile(pattern)
    result = prog.search(string)
    if result:
        return result.group()
    return False
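A quick check of the helper's behavior (illustrative inputs):

print(get_matched_string(r'\d{1,2}:\d{2}', 'Register opened 08:45'))  # -> 08:45
print(get_matched_string(r'\d{1,2}:\d{2}', 'no time here'))           # -> False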
pattern_dict = {}
# dates such as 2019年10月01日, 2019/10/01, 2019-10-01
pattern_dict['date'] = r'[12]\d{3}[/\-年](0?[1-9]|1[0-2])[/\-月](0?[1-9]|[12][0-9]|3[01])日?'
# times such as 08:45 or 8時45分
pattern_dict['time'] = r'((0?|1)[0-9]|2[0-3])[:時][0-5][0-9]分?'
# phone numbers such as 03-1234-5678
pattern_dict['tel'] = r'0\d{1,3}-\d{1,4}-\d{4}'
# totals such as 合計 ¥1,161 at the end of a line
pattern_dict['total_price'] = r'合計 ¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)$'
for line in lines:
    texts = [i[2] for i in line]
    texts = ''.join(texts)
    for key, pattern in pattern_dict.items():
        matched_string = get_matched_string(pattern, texts)
        if matched_string:
            print(key, matched_string)
# tel 03-1234-5678
# date 2019年10月01日
# time 08:45
# total_price 合計 ¥1,161
I was able to extract the phone number, date, time, and total amount.
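To actually structure these values, one simple approach is to collect the first match for each key into a dict. A minimal sketch reusing lines, pattern_dict, and get_matched_string from above:

receipt = {}
for line in lines:
    text = ''.join(i[2] for i in line)
    for key, pattern in pattern_dict.items():
        matched = get_matched_string(pattern, text)
        if matched and key not in receipt:
            receipt[key] = matched  # keep only the first match per field
print(receipt)
# e.g. {'tel': '03-1234-5678', 'date': '2019年10月01日', 'time': '08:45', 'total_price': '合計 ¥1,161'}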
This time, I tried receipt OCR using the Vision API.
The Vision API provides highly accurate OCR, and it is a service that even individuals can use easily. By applying regular expressions or natural language processing to the OCR output, you can also extract exactly the information you want.
Why not try OCR on various documents yourself?