OCR (Optical Character Recognition) is a technology that reads printed and handwritten characters and converts them into text data.
OCR services exist for a variety of documents, such as invoices, receipts, business cards, and driver's licenses. Using OCR reduces the effort of manual data entry, and by linking the results with other systems, the data can be put to effective use.
The OCR services offered by various companies include plans for both businesses and individuals. One OCR that individuals can use is the Google Vision API (hereinafter referred to as the Vision API), a very high-performance image analysis service provided by Google. (The free trial page is here: https://cloud.google.com/vision?hl=ja)
This time, I tried simple driver's license OCR using the Vision API.
The environment is Google Colaboratory. The Python version is as follows.
import platform
print("python " + platform.python_version())
# python 3.6.9
Now let's write the code. First, import the libraries required to display the image.
import cv2
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
I have also prepared a sample image of a driver's license. Let's display it.
img = cv2.imread(input_file) # input_file is the path of the image
plt.figure(figsize=[10,10])
plt.axis('off')
plt.imshow(img[:,:,::-1]) # Convert OpenCV's BGR channel order to RGB for display
Now, let's send this license image to the Vision API and run OCR.
First, make the necessary preparations for using the Vision API: you need to install the client library and issue a service account key (the official setup documentation walks through both). The client library is installed as follows.
pip install google-cloud-vision
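Note that on Google Colaboratory, shell commands are prefixed with !, so in a notebook cell the same install is run as:
!pip install google-cloud-vision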
Set the environment variable so that it points to the service account key you issued.
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = json_path # json_path is the path of the service account key
Now let's perform text detection by OCR.
This time, we will use the Vision API's DOCUMENT_TEXT_DETECTION feature for text detection. See the official documentation for more information on DOCUMENT_TEXT_DETECTION.
Now let's send a request to the Vision API and get a response.
import io
from google.cloud import vision
from google.cloud.vision import types
client = vision.ImageAnnotatorClient()
with io.open(input_file, 'rb') as image_file:
    content = image_file.read()
image = types.Image(content=content)
response = client.document_text_detection(image=image)  # Text detection
If this executes without error, the request was sent to the API successfully and a response was received.
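Strictly speaking, the API can also report a failure inside an otherwise successful response, so it is worth checking the response's error field as well. A minimal check might look like this:
if response.error.message:
    # The request reached the API, but the annotation itself failed.
    raise RuntimeError('Vision API error: {}'.format(response.error.message))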
This response contains the Vision API's OCR results: the recognized characters, their coordinates, confidence scores, language information, and more. First, let's check the full recognized text, displayed alongside the image.
print(response.text_annotations[0].description)
Name Sun Book Hanako Born May 1, 1986) Address Tokyo 2-1-2 Kasumi, Chiyoda-ku Granted May 07, 2001 12345 12024 (Imawa 06) June 01 Inactive Glasses etc. License Conditions, etc. Sample Excellent Number | No. 012345678900 | --April 01, 2003 In the middle Others June 01, 2005 (Three kinds August 01, 2017 Driver's license Type Large small special Medium-sized moped Ichiten Tentoku Fuji Great self-reliance Fuou Hiki Chuni 00000 Public Safety Commission KA | || Q00 |
I was able to confirm the reading result.
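Incidentally, the confidence scores and language information mentioned above can also be read from full_text_annotation. Below is a minimal sketch that prints the detected languages and the per-word confidence (the field names follow the Vision API response structure):
document = response.full_text_annotation
for page in document.pages:
    # Languages detected on the page, each with its own confidence value.
    for lang in page.property.detected_languages:
        print('language: {} ({:.2f})'.format(lang.language_code, lang.confidence))
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                text = ''.join(symbol.text for symbol in word.symbols)
                # Each word carries an OCR confidence between 0 and 1.
                print('{} : {:.2f}'.format(text, word.confidence))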
The Vision API also returns coordinate information for each character. Let's draw each character's bounding box on the image and check it.
document = response.full_text_annotation
img_symbol = img.copy()
for page in document.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                for symbol in word.symbols:
                    # vertices[0] is the top-left corner, vertices[2] the bottom-right.
                    bounding_box = symbol.bounding_box
                    xmin = bounding_box.vertices[0].x
                    ymin = bounding_box.vertices[0].y
                    xmax = bounding_box.vertices[2].x
                    ymax = bounding_box.vertices[2].y
                    cv2.rectangle(img_symbol, (xmin, ymin), (xmax, ymax), (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
plt.figure(figsize=[10,10])
plt.imshow(img_symbol[:,:,::-1]);plt.title("img_symbol")
On a driver's license, the location of each item, such as name, date of birth, and address, is fixed. In OCR industry terminology, a document where it is known in advance what is written where is called **standard**, and OCR of standard documents is called **standard OCR**. On the other hand, documents such as receipts, business cards, and invoices, where it is uncertain what is written where, are called **atypical**, and their OCR is called **atypical OCR**.
With standard OCR, you can create a template. By specifying an area for each item in the template and extracting the OCR results contained in that area, the reading result can be output item by item.
Now let's create a template. This time, we will use the annotation tool labelImg. Annotation means attaching information to a piece of data; here, it means labeling each framed area as "name", "date of birth", and so on. labelImg saves the annotation result as an XML file.
The following is an example of the resulting XML annotation file.
<annotation>
  <folder>Downloads</folder>
  <filename>drivers_license.jpg</filename>
  <path>/path/to/jpg_file</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>681</width>
    <height>432</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>name</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>78</xmin>
      <ymin>26</ymin>
      <xmax>428</xmax>
      <ymax>58</ymax>
    </bndbox>
  </object>
  <object>
    <name>birthday</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>428</xmin>
      <ymin>27</ymin>
      <xmax>652</xmax>
      <ymax>58</ymax>
    </bndbox>
  </object>
  <!-- Omission -->
</annotation>
Now, let's read the above XML file. For confirmation, we will draw the template frames and labels on the image.
import xml.etree.ElementTree as ET
tree = ET.parse(input_xml) # input_xml is the xml path
root = tree.getroot()
img_labeled = img.copy()
for obj in root.findall("./object"):
    name = obj.find('name').text
    xmin = obj.find('bndbox').find('xmin').text
    ymin = obj.find('bndbox').find('ymin').text
    xmax = obj.find('bndbox').find('xmax').text
    ymax = obj.find('bndbox').find('ymax').text
    xmin, ymin, xmax, ymax = int(xmin), int(ymin), int(xmax), int(ymax)
    cv2.rectangle(img_labeled, (xmin, ymin), (xmax, ymax), (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.putText(img_labeled, name, (xmin, ymin), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), thickness=1)
plt.figure(figsize=[10,10])
plt.imshow(img_labeled[:,:,::-1]);plt.title("img_labeled")
We can confirm that the template information was set correctly: each item to be read, such as name and date of birth, is enclosed in a labeled frame created with labelImg.
Now let's match the template against the OCR results. Character strings that fall inside a template frame are classified as the result for that item. The template matching result is displayed alongside the image.
text_infos = []
document = response.full_text_annotation
for page in document.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                for symbol in word.symbols:
                    bounding_box = symbol.bounding_box
                    xmin = bounding_box.vertices[0].x
                    ymin = bounding_box.vertices[0].y
                    xmax = bounding_box.vertices[2].x
                    ymax = bounding_box.vertices[2].y
                    xcenter = (xmin + xmax) / 2
                    ycenter = (ymin + ymax) / 2
                    text = symbol.text
                    text_infos.append([text, xcenter, ycenter])
result_dict = {}
for obj in root.findall("./object"):
    name = obj.find('name').text
    xmin = obj.find('bndbox').find('xmin').text
    ymin = obj.find('bndbox').find('ymin').text
    xmax = obj.find('bndbox').find('xmax').text
    ymax = obj.find('bndbox').find('ymax').text
    xmin, ymin, xmax, ymax = int(xmin), int(ymin), int(xmax), int(ymax)
    texts = ''
    for text_info in text_infos:
        text = text_info[0]
        xcenter = text_info[1]
        ycenter = text_info[2]
        # A character belongs to an item if its center falls inside the item's frame.
        if xmin <= xcenter <= xmax and ymin <= ycenter <= ymax:
            texts += text
    result_dict[name] = texts
for k, v in result_dict.items():
    print('{} : {}'.format(k, v))
# name : Hanako Nihon
# birthday : Born May 1, 1986
# address : 2-1-2 Kasumi, Chiyoda-ku, Tokyo
# date of issue : May 07, 2001 12345
# expiration date : 2024 (Imawa 06) June 01 Inactive
# number : No. 012345678900
# drivers license : driver's license
# Public Safety Commission : 00000 Public Safety Commission |
As a result of template matching, it was confirmed that the OCR results could be classified by item.
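Since result_dict is a plain Python dictionary, handing the classified results to another system, as mentioned at the beginning, is straightforward. As a minimal sketch (the file name ocr_result.json is just an example):
import json

# Write the per-item results as UTF-8 JSON for downstream systems.
with open('ocr_result.json', 'w', encoding='utf-8') as f:
    json.dump(result_dict, f, ensure_ascii=False, indent=2)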
So far, we have looked at text detection with OCR.
By the way, when verifying an image of an ID document such as a driver's license, you may want to check the face photo as well. The Vision API has various image analysis functions besides OCR, and face detection is one of them. For more information on Vision API face detection, see here (https://cloud.google.com/vision/docs/detecting-faces?hl=ja).
Now let's perform face detection with the Vision API as well. As with text detection, we send a request and receive a response.
import io
from google.cloud import vision
from google.cloud.vision import types  # Needed for types.Image below

client = vision.ImageAnnotatorClient()
with io.open(input_file, 'rb') as image_file:
    content = image_file.read()
image = types.Image(content=content)
response2 = client.face_detection(image=image)  # Face detection
This response2 contains the Vision API's face detection results: the coordinates of each detected face, facial landmarks, confidence scores, emotion likelihoods (anger, joy, and so on), and more.
Now, let's display the coordinates of the detected face.
faces = response2.face_annotations
img_face = img.copy()
for face in faces:
    # bounding_poly covers the whole head; fd_bounding_poly is a tighter box around the skin area.
    bounding_poly = face.bounding_poly
    fd_bounding_poly = face.fd_bounding_poly
    xmin = bounding_poly.vertices[0].x
    ymin = bounding_poly.vertices[0].y
    xmax = bounding_poly.vertices[2].x
    ymax = bounding_poly.vertices[2].y
    cv2.rectangle(img_face, (xmin, ymin), (xmax, ymax), (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.putText(img_face, 'bounding_poly', (xmin, ymin), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), thickness=1)
    xmin = fd_bounding_poly.vertices[0].x
    ymin = fd_bounding_poly.vertices[0].y
    xmax = fd_bounding_poly.vertices[2].x
    ymax = fd_bounding_poly.vertices[2].y
    cv2.rectangle(img_face, (xmin, ymin), (xmax, ymax), (0, 255, 0), thickness=1, lineType=cv2.LINE_AA)
    cv2.putText(img_face, 'fd_bounding_poly', (xmin, ymin), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), thickness=1)
plt.figure(figsize=[10,10])
plt.imshow(img_face[:,:,::-1]);plt.title("img_face")
We can confirm that the face was detected.
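The face annotations also include the emotion likelihoods mentioned above, reported as enum values ranging from UNKNOWN to VERY_LIKELY. A minimal sketch for printing them:
for face in faces:
    # Each likelihood is an enum value from UNKNOWN (0) to VERY_LIKELY (5).
    print('joy      : {}'.format(face.joy_likelihood))
    print('anger    : {}'.format(face.anger_likelihood))
    print('sorrow   : {}'.format(face.sorrow_likelihood))
    print('surprise : {}'.format(face.surprise_likelihood))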
Next, let's look at the detection confidence. You can set a threshold in advance and treat a face as detected only when the confidence exceeds it. This lets you filter out blurry images or images that cannot be recognized as a face photo, keeping only reliable ones.
for face in faces:
    detection_confidence = face.detection_confidence
    if detection_confidence > 0.90:
        print('Face detected')
        print('detection_confidence : ' + str(detection_confidence))
# Face detected
# detection_confidence : 0.953563392162323
Above, I set the threshold to 0.90 and judged whether the image is reliable as a face photo. The confidence this time is about 0.95, which is high enough to trust the image as a face photo.
To wrap up: this time, I tried driver's license OCR using the Vision API.
First, we performed text detection. We then created a template with labelImg and, by matching the OCR results against the template, classified the reading results by item, making use of the per-character coordinate information included in the Vision API response. For a standard document such as a driver's license, creating a template lets you output the result for each item you want to read.
We also performed face detection. This time we used only the detected face coordinates, but the response also includes landmark coordinates and emotion likelihoods. It would be interesting to try face detection on photos with various facial expressions.
The Vision API is a tool that can perform a wide range of image analysis. Beyond the text detection and face detection introduced here, why not give its other features a try?