I tried "Receipt OCR" with Google Vision API

Introduction

OCR (Optical Character Recognition) is a technology that reads printed and handwritten characters and converts them into text data.

OCR services are available for a variety of documents, such as invoices, receipts, business cards, and driver's licenses. Using OCR reduces the burden of manual data entry, and by linking with other systems the extracted data can be put to practical use.

Vendors offer OCR services for both companies and individuals. One OCR service that individuals can easily use is the Google Vision API (hereinafter, Vision API), a very high-performance image analysis service provided by Google. (The free trial page is here: https://cloud.google.com/vision?hl=ja)

This time, I tried a simple receipt OCR using the Vision API.

Receipt OCR

Environment

The environment is Google Colaboratory. The Python version is as follows.

import platform
print("python " + platform.python_version())
# python 3.6.9

Let's display the image

Now let's write the code. First, import the library required to display the image.

import cv2
import matplotlib.pyplot as plt
%matplotlib inline

Prepare a sample receipt image as well, and display it.

img = cv2.imread(input_file) # input_file is the path of the image
plt.figure(figsize=[10,10])
plt.axis('off')
plt.imshow(img[:,:,::-1])

(Figure: the sample receipt image)

Vision API setup

Now, let's OCR this receipt image using the Google Vision API.

The following preparations are required to use the Vision API; proceed according to the official setup guide. You will need to install the client library and issue a service account key.

The installation of the client library is as follows.

pip install google-cloud-vision

Set the environment variable using the service account key you issued.

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = json_path # json_path is the path of the service account key

Send request to API / Get response

Now let's send a request to the Vision API and get a response.

import io

from google.cloud import vision
from google.cloud.vision import types
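# Note: google-cloud-vision < 2.0 is assumed here; in 2.0 and later the types
# module was removed, and you would write vision.Image(content=content) instead.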

client = vision.ImageAnnotatorClient()
with io.open(input_file, 'rb') as image_file:
    content = image_file.read()
image = types.Image(content=content)
response = client.document_text_detection(image=image)

If this runs without error, the request was sent to the API and a response was obtained.

This response contains the OCR results from the Vision API: the recognized text, coordinate information, confidence scores, detected language, and so on. First, let's check the full recognized text.

print(response.text_annotations[0].description)
SAVERSONICS
Seven-Eleven
Chiyoda store
8-8 Nibancho, Chiyoda-ku, Tokyo
Phone: 03-1234-5678
Cash register # 31
Tuesday, October 01, 2019 08:45 Responsibility 012
Receipt
Hand-rolled rice ball spicy cod roe
Coca-Cola 500ml
Paradu Mini Nail PK03
Mobius One
50 yen stamp
*130
*140
300
490、
50 years
Subtotal (8% excluding tax)
¥270
Consumption tax, etc. (8%)
¥21
Subtotal (10% excluding tax)
¥300
Consumption tax, etc. (10%)
¥30
Subtotal (10% including tax)
¥490
Subtotal (tax exempt)
¥50
Total ¥ 1,161
(Tax rate 8% target
¥291)
(Tax rate 10% target
¥820)
(Including consumption tax, etc. 8%
¥21)
(Internal consumption tax, etc. 10%
¥74)
Cashless return amount
-22
nanaco payment
¥1,139
The purchase details are as above.
nanaco number
*******9999
This time point
2P
The [*] mark is subject to the reduced tax rate.

You can see that the text is read with extremely high accuracy.
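The confidence scores and detected languages mentioned above can be inspected as well. A minimal sketch, assuming the response object from above and the TextAnnotation structure of the same client library:

page = response.full_text_annotation.pages[0]
# Languages detected on the page, with their confidence
for lang in page.property.detected_languages:
    print("language:", lang.language_code, "confidence:", lang.confidence)
# Per-block recognition confidence
for i, block in enumerate(page.blocks):
    print("block", i, "confidence:", block.confidence)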

The Vision API divides the image into blocks, paragraphs, and so on, based on how the characters are grouped. Let's check each of these regions. First, define the following helper functions.

from enum import Enum

class FeatureType(Enum):
    PAGE = 1
    BLOCK = 2
    PARA = 3
    WORD = 4
    SYMBOL = 5

def draw_boxes(input_file, bounds):
    img = cv2.imread(input_file, cv2.IMREAD_COLOR)
    for bound in bounds:
        p1 = (bound.vertices[0].x, bound.vertices[0].y)  # top left
        p2 = (bound.vertices[1].x, bound.vertices[1].y)  # top right
        p3 = (bound.vertices[2].x, bound.vertices[2].y)  # bottom right
        p4 = (bound.vertices[3].x, bound.vertices[3].y)  # bottom left
        cv2.line(img, p1, p2, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
        cv2.line(img, p2, p3, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
        cv2.line(img, p3, p4, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
        cv2.line(img, p4, p1, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    return img

def get_document_bounds(response, feature):
    document = response.full_text_annotation
    bounds = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        if feature == FeatureType.SYMBOL:
                            bounds.append(symbol.bounding_box)
                    if feature == FeatureType.WORD:
                        bounds.append(word.bounding_box)
                if feature == FeatureType.PARA:
                    bounds.append(paragraph.bounding_box)
            if feature == FeatureType.BLOCK:
                bounds.append(block.bounding_box)
    return bounds

Now, let's write each area on the image and display it.

bounds = get_document_bounds(response, FeatureType.BLOCK)
img_block = draw_boxes(input_file, bounds)

bounds = get_document_bounds(response, FeatureType.PARA)
img_para = draw_boxes(input_file, bounds)

bounds = get_document_bounds(response, FeatureType.WORD)
img_word = draw_boxes(input_file, bounds)

bounds = get_document_bounds(response, FeatureType.SYMBOL)
img_symbol = draw_boxes(input_file, bounds)

plt.figure(figsize=[20,20])
plt.subplot(141);plt.imshow(img_block[:,:,::-1]);plt.title("img_block")
plt.subplot(142);plt.imshow(img_para[:,:,::-1]);plt.title("img_para")
plt.subplot(143);plt.imshow(img_word[:,:,::-1]);plt.title("img_word")
plt.subplot(144);plt.imshow(img_symbol[:,:,::-1]);plt.title("img_symbol")

(Figure: the detected regions drawn on the receipt for each unit: img_block, img_para, img_word, img_symbol)

We can confirm that the image is divided into units of various sizes: block, paragraph, word, and symbol.
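As a quick sanity check, the same hierarchy can be walked to count how many units of each type were detected. A minimal sketch using the response from above:

# Count detected units at each level of the full_text_annotation hierarchy
document = response.full_text_annotation
blocks = [b for page in document.pages for b in page.blocks]
paras = [p for b in blocks for p in b.paragraphs]
words = [w for p in paras for w in p.words]
symbols = [s for w in words for s in w.symbols]
print(len(blocks), len(paras), len(words), len(symbols))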

Text formatting

As you can see, the Vision API divides the regions nicely, but in some cases this can be a disadvantage. For example, here the item "hand-rolled rice ball spicy cod roe" and its price "*130" end up in separate regions. By nature, receipts organize information line by line, so let's consider splitting the text into lines.

How can we split it into lines? The Vision API provides coordinate information for each character (the symbol bounding_box above), so sorting by coordinate values, top to bottom and then left to right, should work. Below, we create a process that groups characters into lines according to their coordinates.

def get_sorted_lines(response):
    """Sort symbols top to bottom, group them into lines, then sort each line left to right."""
    document = response.full_text_annotation
    bounds = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        x = symbol.bounding_box.vertices[0].x
                        y = symbol.bounding_box.vertices[0].y
                        text = symbol.text
                        bounds.append([x, y, text, symbol.bounding_box])
    bounds.sort(key=lambda b: b[1])  # sort by y (top to bottom)
    old_y = -1
    line = []
    lines = []
    threshold = 1
    for bound in bounds:
        x = bound[0]
        y = bound[1]
        if old_y == -1:
            old_y = y
        elif old_y - threshold <= y <= old_y + threshold:
            old_y = y
        else:
            # y jumped beyond the threshold: close the current line and start a new one
            old_y = -1
            line.sort(key=lambda b: b[0])  # sort the finished line by x (left to right)
            lines.append(line)
            line = []
        line.append(bound)
    line.sort(key=lambda b: b[0])
    lines.append(line)
    return lines

Let's check it.

img = cv2.imread(input_file, cv2.IMREAD_COLOR)

lines = get_sorted_lines(response)
for line in lines:
    texts = [i[2] for i in line]
    texts = ''.join(texts)
    bounds = [i[3] for i in line]
    print(texts)
    # Draw one box per line, spanning from the first symbol to the last
    p1 = (bounds[0].vertices[0].x, bounds[0].vertices[0].y)    # top left
    p2 = (bounds[-1].vertices[1].x, bounds[-1].vertices[1].y)  # top right
    p3 = (bounds[-1].vertices[2].x, bounds[-1].vertices[2].y)  # bottom right
    p4 = (bounds[0].vertices[3].x, bounds[0].vertices[3].y)    # bottom left
    cv2.line(img, p1, p2, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.line(img, p2, p3, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.line(img, p3, p4, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.line(img, p4, p1, (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)

plt.figure(figsize=[10,10])
plt.axis('off')
plt.imshow(img[:,:,::-1]);plt.title("img_by_line")
Seven-Eleven
SAVERSONICS
Chiyoda store
8-8 Nibancho, Chiyoda-ku, Tokyo
Phone: 03-1234-5678 Cashier # 31
Tuesday, October 01, 2019 08:45 Responsibility 012
Receipt
Hand-rolled rice ball Spicy cod roe * 130
Coca-Cola 500ml * 140
Paradu Mini Nail PK03300
Mobius One 490,
50 yen stamp 50 years
Subtotal (8% excluding tax) ¥ 270
Consumption tax, etc. (8%)
¥21
Subtotal (10% excluding tax) ¥ 300
Consumption tax, etc. (10%) ¥ 30
Subtotal (10% including tax) ¥ 490
Subtotal (tax exempt) ¥ 50
Total ¥ 1,161
(Tax rate 8% target
¥291)
(Tax rate 10% target ¥ 820)
(Including consumption tax, etc. 8% ¥ 21)
(Including consumption tax, etc. 10% ¥ 74)
Cashless return amount-22
nanaco payment ¥ 1,139
The purchase details are as above.
nanaco number ******* 9999
This time point 2P
The [*] mark is subject to the reduced tax rate.

I was able to organize it line by line.

Text structuring

Text formatting organized the character strings line by line, which makes it easier to retrieve the information we need. To extract that information, we can consider text processing with regular expressions or natural language processing.

This time, let's extract and structure information such as the date, phone number, and total amount using regular expressions. See also the following articles on regular expressions.

- I made a regular expression for "date" using Python
- I tried to make a regular expression for "time" using Python
- I tried to make a regular expression for "amount" using Python

import re

def get_matched_string(pattern, string):
    prog = re.compile(pattern)
    result = prog.search(string)
    if result:
        return result.group()
    else:
        return False

pattern_dict = {}
# The patterns below target the receipt strings shown above
pattern_dict['date'] = r'(January|February|March|April|May|June|July|August|September|October|November|December) (0?[1-9]|[12][0-9]|3[01]), [12]\d{3}'
pattern_dict['time'] = r'([01]?[0-9]|2[0-3]):[0-5][0-9]'
pattern_dict['tel'] = r'0\d{1,3}-\d{1,4}-\d{4}'
pattern_dict['total_price'] = r'Total ¥ ?(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)$'

for line in lines:
    texts = ''.join(i[2] for i in line)
    for key, pattern in pattern_dict.items():
        matched_string = get_matched_string(pattern, texts)
        if matched_string:
            print(key, matched_string)

# tel 03-1234-5678
# date October 01, 2019
# time 08:45
# total_price Total ¥ 1,161

I was able to extract the phone number, date, time, and total amount.
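Finally, the extracted fields can be gathered into a single structure. A minimal sketch, reusing lines, pattern_dict, and get_matched_string from above; the receipt dict and its keys are illustrative, not part of the original code:

import json

# Collect the first match for each field into one dict
receipt = {}
for line in lines:
    texts = ''.join(i[2] for i in line)
    for key, pattern in pattern_dict.items():
        matched_string = get_matched_string(pattern, texts)
        if matched_string and key not in receipt:
            receipt[key] = matched_string

print(json.dumps(receipt, ensure_ascii=False, indent=2))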

Summary

This time, I tried receipt OCR using the Vision API.

The Vision API provides a very high-precision OCR function, and it is a service that even individuals can easily use. By applying regular expressions or natural language processing to the OCR result text, you can also extract the specific information you want.

Why don't you try OCR with various documents?
