OCR (Optical Character Recognition) is a technology that reads printed and handwritten characters and converts them into text data.
OCR services exist for a variety of documents, such as invoices, receipts, business cards, and driver's licenses. Using OCR reduces the effort of manual data entry, and by linking the results with other systems, the data can be put to effective use.
The OCR services offered by various companies include plans for both businesses and individuals. One OCR that individuals can use is the Google Vision API (hereinafter referred to as the Vision API), a very high-performance image analysis service provided by Google. (The free trial page is here: https://cloud.google.com/vision?hl=ja)
This time, I tried simple driver's license OCR using the Vision API.
The environment is Google Colaboratory. The Python version is as follows.
import platform
print("python " + platform.python_version())
# python 3.6.9
Now let's write the code. First, import the libraries required to display the image.
import cv2
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
I have also prepared a sample image of a driver's license. Let's display it.
img = cv2.imread(input_file) # input_file is the path of the image
plt.figure(figsize=[10,10])
plt.axis('off')
plt.imshow(img[:,:,::-1]) # Convert OpenCV's BGR channel order to RGB for display
Now, let's send this license image to the Vision API and run OCR.
First, make the necessary preparations for using the Vision API: you need to install the client library and issue a service account key (the official setup documentation walks through both). The client library is installed as follows.
pip install google-cloud-vision
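Note that on Google Colaboratory, shell commands are prefixed with !, so in a notebook cell the same install is run as:
!pip install google-cloud-vision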
Set the environment variable so that it points to the service account key you issued.
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = json_path # json_path is the path of the service account key
Now let's perform text detection by OCR.
This time, we will use the Vision API's DOCUMENT_TEXT_DETECTION feature for text detection. See the official documentation for more information on DOCUMENT_TEXT_DETECTION.
Now let's send a request to the Vision API and get a response.
import io
from google.cloud import vision
from google.cloud.vision import types
client = vision.ImageAnnotatorClient()
with io.open(input_file, 'rb') as image_file:
    content = image_file.read()
image = types.Image(content=content)
response = client.document_text_detection(image=image)  # Text detection
If this executes without error, the request was sent to the API successfully and a response was received.
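Strictly speaking, the API can also report a failure inside an otherwise successful response, so it is worth checking the response's error field as well. A minimal check might look like this:
if response.error.message:
    # The request reached the API, but the annotation itself failed.
    raise RuntimeError('Vision API error: {}'.format(response.error.message))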
This response contains the Vision API's OCR results: the recognized characters, their coordinates, confidence scores, language information, and more. First, let's check the full recognized text, displayed alongside the image.
print(response.text_annotations[0].description)
Name Sun Book Hanako Born May 1, 1986) Address Tokyo 2-1-2 Kasumi, Chiyoda-ku Granted May 07, 2001 12345 12024 (Imawa 06) June 01 Inactive Glasses etc. License Conditions, etc. Sample Excellent Number | No. 012345678900 | --April 01, 2003 In the middle Others June 01, 2005 (Three kinds August 01, 2017 Driver's license Type Large small special Medium-sized moped Ichiten Tentoku Fuji Great self-reliance Fuou Hiki Chuni 00000 Public Safety Commission KA | || Q00 |
I was able to confirm the reading result.
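Incidentally, the confidence scores and language information mentioned above can also be read from full_text_annotation. Below is a minimal sketch that prints the detected languages and the per-word confidence (the field names follow the Vision API response structure):
document = response.full_text_annotation
for page in document.pages:
    # Languages detected on the page, each with its own confidence value.
    for lang in page.property.detected_languages:
        print('language: {} ({:.2f})'.format(lang.language_code, lang.confidence))
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                text = ''.join(symbol.text for symbol in word.symbols)
                # Each word carries an OCR confidence between 0 and 1.
                print('{} : {:.2f}'.format(text, word.confidence))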
The Vision API also returns coordinate information for each character. Let's draw each character's bounding box on the image and check it.
document = response.full_text_annotation
img_symbol = img.copy()
for page in document.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                for symbol in word.symbols:
                    # vertices[0] is the top-left corner, vertices[2] the bottom-right.
                    bounding_box = symbol.bounding_box
                    xmin = bounding_box.vertices[0].x
                    ymin = bounding_box.vertices[0].y
                    xmax = bounding_box.vertices[2].x
                    ymax = bounding_box.vertices[2].y
                    cv2.rectangle(img_symbol, (xmin, ymin), (xmax, ymax), (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
plt.figure(figsize=[10,10])
plt.imshow(img_symbol[:,:,::-1]);plt.title("img_symbol")
On a driver's license, the location of each item, such as name, date of birth, and address, is fixed. In OCR industry terminology, a document where it is known in advance what is written where is called **standard**, and OCR of standard documents is called **standard OCR**. On the other hand, documents such as receipts, business cards, and invoices, where it is uncertain what is written where, are called **atypical**, and their OCR is called **atypical OCR**.
With standard OCR, you can create a template. By specifying an area for each item in the template and extracting the OCR results contained in that area, the reading result can be output item by item.
Now let's create a template. This time, we will use the annotation tool labelImg. Annotation means attaching information to a piece of data; here, it means labeling each framed area as "name", "date of birth", and so on. labelImg saves the annotation result as an XML file.
The following is an example of the resulting XML annotation file.
<annotation>
  <folder>Downloads</folder>
  <filename>drivers_license.jpg</filename>
  <path>/path/to/jpg_file</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>681</width>
    <height>432</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>name</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>78</xmin>
      <ymin>26</ymin>
      <xmax>428</xmax>
      <ymax>58</ymax>
    </bndbox>
  </object>
  <object>
    <name>birthday</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>428</xmin>
      <ymin>27</ymin>
      <xmax>652</xmax>
      <ymax>58</ymax>
    </bndbox>
  </object>
  <!-- Omission -->
</annotation>
Now, let's read the above XML file. For confirmation, we will draw the template frames and labels on the image.
import xml.etree.ElementTree as ET
tree = ET.parse(input_xml) # input_xml is the xml path
root = tree.getroot()
img_labeled = img.copy()
for obj in root.findall("./object"):
    name = obj.find('name').text
    xmin = obj.find('bndbox').find('xmin').text
    ymin = obj.find('bndbox').find('ymin').text
    xmax = obj.find('bndbox').find('xmax').text
    ymax = obj.find('bndbox').find('ymax').text
    xmin, ymin, xmax, ymax = int(xmin), int(ymin), int(xmax), int(ymax)
    cv2.rectangle(img_labeled, (xmin, ymin), (xmax, ymax), (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.putText(img_labeled, name, (xmin, ymin), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), thickness=1)
plt.figure(figsize=[10,10])
plt.imshow(img_labeled[:,:,::-1]);plt.title("img_labeled")
We can confirm that the template information was set correctly: each item to be read, such as name and date of birth, is enclosed in a labeled frame created with labelImg.
Now let's match the template against the OCR results. Character strings that fall inside a template frame are classified as the result for that item. The template matching result is displayed alongside the image.
text_infos = []
document = response.full_text_annotation
for page in document.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                for symbol in word.symbols:
                    bounding_box = symbol.bounding_box
                    xmin = bounding_box.vertices[0].x
                    ymin = bounding_box.vertices[0].y
                    xmax = bounding_box.vertices[2].x
                    ymax = bounding_box.vertices[2].y
                    xcenter = (xmin + xmax) / 2
                    ycenter = (ymin + ymax) / 2
                    text = symbol.text
                    text_infos.append([text, xcenter, ycenter])
result_dict = {}
for obj in root.findall("./object"):
    name = obj.find('name').text
    xmin = obj.find('bndbox').find('xmin').text
    ymin = obj.find('bndbox').find('ymin').text
    xmax = obj.find('bndbox').find('xmax').text
    ymax = obj.find('bndbox').find('ymax').text
    xmin, ymin, xmax, ymax = int(xmin), int(ymin), int(xmax), int(ymax)
    texts = ''
    for text_info in text_infos:
        text = text_info[0]
        xcenter = text_info[1]
        ycenter = text_info[2]
        # A character belongs to an item if its center falls inside the item's frame.
        if xmin <= xcenter <= xmax and ymin <= ycenter <= ymax:
            texts += text
    result_dict[name] = texts
for k, v in result_dict.items():
    print('{} : {}'.format(k, v))
# name : Hanako Nihon
# birthday : Born May 1, 1986
# address : 2-1-2 Kasumi, Chiyoda-ku, Tokyo
# date of issue : May 07, 2001 12345
# expiration date : 2024 (Imawa 06) June 01 Inactive
# number : No. 012345678900
# drivers license : driver's license
# Public Safety Commission : 00000 Public Safety Commission |
As a result of template matching, it was confirmed that the OCR results could be classified by item.
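Since result_dict is a plain Python dictionary, handing the classified results to another system, as mentioned at the beginning, is straightforward. As a minimal sketch (the file name ocr_result.json is just an example):
import json

# Write the per-item results as UTF-8 JSON for downstream systems.
with open('ocr_result.json', 'w', encoding='utf-8') as f:
    json.dump(result_dict, f, ensure_ascii=False, indent=2)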
So far, we have looked at text detection with OCR.
By the way, when verifying an image of an ID document such as a driver's license, you may want to check the face photo as well. The Vision API has various image analysis functions besides OCR, and face detection is one of them. For more information on Vision API face detection, see here (https://cloud.google.com/vision/docs/detecting-faces?hl=ja).
Now let's perform face detection with the Vision API as well. As with text detection, we send a request and receive a response.
import io
from google.cloud import vision
from google.cloud.vision import types  # Needed for types.Image below

client = vision.ImageAnnotatorClient()
with io.open(input_file, 'rb') as image_file:
    content = image_file.read()
image = types.Image(content=content)
response2 = client.face_detection(image=image)  # Face detection
This response2 contains the Vision API's face detection results: the coordinates of each detected face, facial landmarks, confidence scores, emotion likelihoods (anger, joy, and so on), and more.
Now, let's display the coordinates of the detected face.
faces = response2.face_annotations
img_face = img.copy()
for face in faces:
    # bounding_poly covers the whole head; fd_bounding_poly is a tighter box around the skin area.
    bounding_poly = face.bounding_poly
    fd_bounding_poly = face.fd_bounding_poly
    xmin = bounding_poly.vertices[0].x
    ymin = bounding_poly.vertices[0].y
    xmax = bounding_poly.vertices[2].x
    ymax = bounding_poly.vertices[2].y
    cv2.rectangle(img_face, (xmin, ymin), (xmax, ymax), (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.putText(img_face, 'bounding_poly', (xmin, ymin), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), thickness=1)
    xmin = fd_bounding_poly.vertices[0].x
    ymin = fd_bounding_poly.vertices[0].y
    xmax = fd_bounding_poly.vertices[2].x
    ymax = fd_bounding_poly.vertices[2].y
    cv2.rectangle(img_face, (xmin, ymin), (xmax, ymax), (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.putText(img_face, 'fd_bounding_poly', (xmin, ymin), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), thickness=1)
plt.figure(figsize=[10,10])
plt.imshow(img_face[:,:,::-1]);plt.title("img_face")
We can confirm that the face was detected.
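The face annotations also include the emotion likelihoods mentioned above, reported as enum values ranging from UNKNOWN to VERY_LIKELY. A minimal sketch for printing them:
for face in faces:
    # Each likelihood is an enum value from UNKNOWN (0) to VERY_LIKELY (5).
    print('joy      : {}'.format(face.joy_likelihood))
    print('anger    : {}'.format(face.anger_likelihood))
    print('sorrow   : {}'.format(face.sorrow_likelihood))
    print('surprise : {}'.format(face.surprise_likelihood))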
Next, let's look at the detection confidence. You can set a threshold in advance and treat a face as detected only when the confidence exceeds it. This lets you filter out blurry images or images that cannot be recognized as a face photo, keeping only reliable ones.
for face in faces:
    detection_confidence = face.detection_confidence
    if detection_confidence > 0.90:
        print('Face detected')
        print('detection_confidence : ' + str(detection_confidence))
# Face detected
# detection_confidence : 0.953563392162323
Above, I set the threshold to 0.90 and judged whether the image is reliable as a face photo. The confidence this time is about 0.95, which is high enough to trust the image as a face photo.
To wrap up: this time, I tried driver's license OCR using the Vision API.
First, we performed text detection. We then created a template with labelImg and, by matching the OCR results against the template, classified the reading results by item, making use of the per-character coordinate information included in the Vision API response. For a standard document such as a driver's license, creating a template lets you output the result for each item you want to read.
We also performed face detection. This time we used only the detected face coordinates, but the response also includes landmark coordinates and emotion likelihoods. It would be interesting to try face detection on photos with various facial expressions.
The Vision API is a tool that can perform a wide range of image analysis. Beyond the text detection and face detection introduced here, why not give its other features a try?