This is day 21 of the Ateam cyma Advent Calendar 2019! I'm @shimura_atsushi, a Cyma engineer in the EC Business Headquarters at Ateam Co., Ltd., and this is my second post in the calendar.
My first post, "Challenge the challenges of Cyma using the OCR service of Google Cloud Platform", was a first attempt at tackling one of Cyma's operational problems. In this second post, I go a step further toward automating the checking of delivery notes.
In the previous post I tried simple OCR using GCP services. However, many of the delivery notes currently used by Cyma have complicated layouts, and simply applying OCR to the whole page gives low transcription accuracy. Even when the transcription succeeds, the resulting text is unlabeled, so it is hard to reuse.
Based on that lesson, this time I focus on "preparing an image that is easy to OCR" and build the image preprocessing that runs before OCR.
Since this is a continuation of the previous post, the title stays the same, but this time the Python implementation is the main topic and there is little Google Cloud Platform content. Please forgive me.
On the recommendation of @NamedPython, a Cyma engineer, I use Python, which has a rich set of image processing libraries.
- Development machine: MacBook Pro 15-inch
I'll go through the setup quickly.
- Install Python
  - pyenv
    - Manages the installed versions of Python
  - python 3.8.0
    - The latest version at the time of writing
  - pip
    - Python's package management tool
    - It should come bundled when you install Python
- Install pdf2image
  - Used to convert PDF to PNG or JPEG
- Install poppler
  - Used by pdf2image for PDF conversion
- Install pillow
  - Used for image processing, mainly cropping
- Install opencv
  - Used for image processing, mainly binarization
brew install pyenv
pyenv install --list  # Check the installable versions
pyenv install 3.8.0
pip3 install pdf2image
brew install poppler
pip install pillow
pip install opencv-python  # the PyPI package for OpenCV is opencv-python; it is imported as cv2
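As a quick sanity check after installation, a small script along the following lines can confirm that the libraries are importable. This is only a sketch: the helper names `missing_packages` and `poppler_available` are made up for illustration, and the poppler check assumes that poppler's `pdftoppm` command is on the PATH, which is the command pdf2image calls by default.

```python
import shutil
from importlib.util import find_spec

# pip package name -> module name used in "import ..."
# (pillow is imported as PIL, opencv-python as cv2)
REQUIRED_MODULES = {
    "pdf2image": "pdf2image",
    "pillow": "PIL",
    "opencv-python": "cv2",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


def poppler_available():
    """Return True if poppler's pdftoppm command is found on the PATH."""
    return shutil.which("pdftoppm") is not None


missing = missing_packages()
print("missing packages:", ", ".join(missing) if missing else "none")
print("poppler found:", poppler_available())
```

If a package is reported as missing, check that `pip` belongs to the Python version selected with pyenv.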
I scan the delivery notes on the multifunction printer at the head office; each scan arrives as a PDF attached to an e-mail sent to the registered address.
Since the scanned data is in PDF format, it has to be converted to image data first.
The script below converts every PDF file stored in a given directory to image data.
All it takes is importing pdf2image and passing the path of the file to convert to the convert_from_path function.
pdf2png.py
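from pathlib import Path

from pdf2image import convert_from_path

pdf_dir = Path('./img/pdf')
pdf_list = sorted(pdf_dir.glob('*.pdf'))  # only PDF files, in a stable order
print(pdf_list)

for i, pdf_file_path in enumerate(pdf_list):
    # convert_from_path returns one PIL image per page
    images = convert_from_path(str(pdf_file_path))
    for image in images:
        # Note: every page is saved under the same name, so for a
        # multi-page PDF only the last page is kept.
        image.save('./img/png/{}.png'.format(i), 'png')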
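pdf2png.py saves every page of a PDF under the same `{i}.png` name, which is fine for single-page scans but overwrites earlier pages of a multi-page PDF. If multi-page scans ever show up, per-page file names could be built roughly as follows. The helper `page_output_path` and the `_p{page}` suffix are assumptions for illustration, not part of the original script.

```python
from pathlib import Path


def page_output_path(out_dir, pdf_index, page_number, page_count):
    """Build a PNG path for one page of a converted PDF.

    Single-page PDFs keep the '{index}.png' name used by pdf2png.py;
    multi-page PDFs get a '_p{page}' suffix so pages do not overwrite each other.
    """
    if page_count == 1:
        name = f"{pdf_index}.png"
    else:
        name = f"{pdf_index}_p{page_number}.png"
    return Path(out_dir) / name


print(page_output_path("./img/png", 0, 1, 1))
print(page_output_path("./img/png", 3, 2, 4))
```

Inside the conversion loop this would be used as `image.save(page_output_path('./img/png', i, n, len(images)), 'png')`, with `n` coming from `enumerate(images, start=1)`.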
This step is the heart of the OCR work. Based on the lesson from last time, it cuts out the necessary parts of a complicated delivery note and labels them.
The format of a delivery note is basically the same for each supplier (with different patterns for bicycles and parts), so I prepare one JSON settings file per delivery note format that holds the coordinates needed for cropping.
The necessary information on the delivery note is
- Supplier name
so the settings file holds the coordinates of the places where these are written.
shiiresaki_setting.json
{
  "wholesaler_id": 2,
  "warehouse": {
    "x": 10,
    "y": 10,
    "height": 50,
    "width": 100
  },
  "date": {
    "x": 20,
    "y": 20,
    "height": 50,
    "width": 100
  },
  "product": {
    "x": 30,
    "y": 30,
    "height": 150,
    "width": 200
  },
  "figure": {
    "x": 40,
    "y": 40,
    "height": 200,
    "width": 250
  },
  "price": {
    "x": 50,
    "y": 50,
    "height": 200,
    "width": 250
  }
}
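Each region in the settings file is stored as x, y, width and height, while Pillow's `crop()` expects a (left, upper, right, lower) box. The conversion, plus a check that the box stays inside the scanned image, might look like this. The function names and the image size used in the example are assumptions for illustration only.

```python
import json

# A shortened copy of the settings file shown above
SAMPLE_SETTING = """
{
  "wholesaler_id": 2,
  "warehouse": {"x": 10, "y": 10, "height": 50, "width": 100},
  "figure": {"x": 40, "y": 40, "height": 200, "width": 250}
}
"""


def region_to_box(region):
    """Convert an x/y/width/height region to a (left, upper, right, lower) box."""
    left, upper = region["x"], region["y"]
    return (left, upper, left + region["width"], upper + region["height"])


def box_fits(box, image_size):
    """Return True if the box lies entirely inside an image of (width, height)."""
    left, upper, right, lower = box
    width, height = image_size
    return 0 <= left < right <= width and 0 <= upper < lower <= height


setting = json.loads(SAMPLE_SETTING)
warehouse_box = region_to_box(setting["warehouse"])
print(warehouse_box)  # (10, 10, 110, 60)
print(box_fits(warehouse_box, (1000, 1400)))  # hypothetical image size
```

Checking the box before cropping helps catch a settings file that was written for a different scan resolution.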
crop4image.py
import sys

from PIL import Image

# ProductSetting is defined in productsetting.py
from productsetting import ProductSetting

# Usage: python crop4image.py <wholesaler_id>
args = sys.argv
p = ProductSetting(args[1])
image = Image.open('img/png/{wholesaler_id}.png'.format(wholesaler_id=p.wholesaler_id))

# Pillow's crop() takes a (left, upper, right, lower) box
rect = (
    p.warehouse['x'],
    p.warehouse['y'],
    p.warehouse['x'] + p.warehouse['width'],
    p.warehouse['y'] + p.warehouse['height']
)
print(rect)
cropped_image = image.crop(rect)
cropped_image.save('{wholesaler_id}.png'.format(wholesaler_id=p.wholesaler_id))
productsetting.py
import json


class ProductSetting:
    CONFIG_SETTING_FILE_BASE_FORMAT = './settings/product/{wholesaler_id}.json'

    def __init__(self, wholesaler):
        config_file_path = self.CONFIG_SETTING_FILE_BASE_FORMAT.format(wholesaler_id=wholesaler)
        # Use a context manager so the settings file is closed after loading
        with open(config_file_path, 'r') as config_file:
            config = json.load(config_file)
        self.wholesaler_id = config['wholesaler_id']
        self.warehouse = {
            'x': config['warehouse']['x'],
            'y': config['warehouse']['y'],
            'height': config['warehouse']['height'],
            'width': config['warehouse']['width']
        }
        self.product = {
            'x': config['product']['x'],
            'y': config['product']['y'],
            'height': config['product']['height'],
            'width': config['product']['width']
        }
        self.date = {
            'x': config['date']['x'],
            'y': config['date']['y'],
            'height': config['date']['height'],
            'width': config['date']['width']
        }
        self.figure = {
            'x': config['figure']['x'],
            'y': config['figure']['y'],
            'height': config['figure']['height'],
            'width': config['figure']['width']
        }
Running this script on a scanned delivery note image,
I was able to crop it at the coordinates specified in the settings file.
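crop4image.py crops only the warehouse region. To get one labelled image per field, the same crop could be repeated for every region in the settings, along the lines of the sketch below. The function `crop_labelled_regions` and the `FakeImage` stand-in are hypothetical; the stand-in exists only so the example runs without Pillow installed.

```python
def crop_labelled_regions(image, regions):
    """Crop every named region and return {label: cropped_image}.

    `image` is any object with a Pillow-style crop((left, upper, right, lower))
    method; `regions` maps a label to an x/y/width/height dict.
    """
    crops = {}
    for label, region in regions.items():
        box = (
            region["x"],
            region["y"],
            region["x"] + region["width"],
            region["y"] + region["height"],
        )
        crops[label] = image.crop(box)
    return crops


class FakeImage:
    """Stand-in for PIL.Image.Image so the sketch runs without Pillow."""

    def crop(self, box):
        return box  # a real image would return the cropped picture


regions = {
    "warehouse": {"x": 10, "y": 10, "height": 50, "width": 100},
    "date": {"x": 20, "y": 20, "height": 50, "width": 100},
}
crops = crop_labelled_regions(FakeImage(), regions)
for label, crop in crops.items():
    print(label, crop)
```

With Pillow, `image` would be the result of `Image.open(...)`, and each crop could be saved under a name that includes its label, so the OCR output for that file is already labelled.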
Next, to improve OCR accuracy on the cropped image, I binarize it so that the characters stand out.
The binarization program, written with opencv, is as simple as this:
deeply_character.py
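import cv2

# Read the cropped image as grayscale (flag 0 is cv2.IMREAD_GRAYSCALE)
img = cv2.imread('./result/png/1013/buyoption_1013.png', cv2.IMREAD_GRAYSCALE)
threshold = 100  # Threshold
# Pixels brighter than the threshold become 255 (white); the rest become 0 (black)
ret, img_thresh = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
cv2.imwrite('./result/deeply/test/buyoption_1013.png', img_thresh)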
The image I cropped earlier
was binarized like this.
The sample is not a good one, so the benefit is hard to see.
Let me try again with an image that looks harder to read.
After adjusting the threshold and binarizing it...
Wow! The image is much clearer.
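In this post the threshold is adjusted by hand. As an alternative that is not used here, Otsu's method picks a threshold automatically by maximizing the variance between the dark and bright pixel groups. Below is a rough pure-Python sketch of the idea on a flat list of 0-255 grayscale values; the function names and the sample pixels are made up for illustration. OpenCV offers the same technique through the `cv2.THRESH_OTSU` flag of `cv2.threshold`.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    background_count, background_sum = 0, 0
    for level in range(256):
        # Background = pixels <= level, foreground = pixels > level
        background_count += hist[level]
        background_sum += level * hist[level]
        foreground_count = total - background_count
        if background_count == 0 or foreground_count == 0:
            continue
        mean_bg = background_sum / background_count
        mean_fg = (total_sum - background_sum) / foreground_count
        variance = background_count * foreground_count * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels, threshold, max_value=255):
    """Same rule as cv2.THRESH_BINARY: value > threshold -> max_value, else 0."""
    return [max_value if value > threshold else 0 for value in pixels]


sample = [30, 40, 50, 200, 210, 220]  # dark text pixels and bright paper pixels
threshold = otsu_threshold(sample)
print(threshold, binarize(sample, threshold))
```

Whether an automatic threshold beats a hand-tuned one depends on the scan; on unevenly lit pages a single global threshold of either kind may not be enough.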
Let's feed this image to the GCP OCR I built last time. As a result...
it was transcribed like this. Thinking about it carefully, the "delivery date" label itself is noise, so it could have been cropped out as well. Still, with this level of accuracy, the transcribed text looks reusable for checking delivery notes.
This time, as preprocessing for OCR, I
- cropped only the necessary parts
- sharpened the cropped image by binarizing it
to find out how to create favorable conditions for OCR.
When this effort to make paperwork more efficient came up, I said "I'll try it!" within the division, but when I saw the actual delivery notes I was worried about whether it could be automated at all. As a result of this work, I feel that applying OCR after removing noise by cropping and sharpening by binarization improves accuracy enough to make automation realistic.
Over two Advent Calendar posts, I have started to take on the challenge of automating delivery note checking at Cyma with OCR-centered technology. Going forward, I would like to move ahead with the system implementation and work toward operations that involve the factories.
How was day 21 of the Ateam cyma Advent Calendar 2019? On day 22, Cyma's designer @ryo_cy will talk about CSS design using BEM, so stay tuned!
Ateam Co., Ltd. is looking for colleagues with a strong spirit of challenge to work with.
If you are an engineer and are interested, please see cyma's Qiita Jobs.
For other occupations, see Ateam Group Recruitment Site.