This article is a continuation of the previous one. Around the time I wrote that article, I noticed that my company's Advent Calendar has an AI theme, so I would like to put the closing remark from that article, "it may be more accurate if you use AWS Textract," to the test by actually using the service.
I would also like to try a code review with the AWS service CodeGuru.
Textract is a service that makes OCR very easy. It currently seems to be available only in the Paris (eu-west-3), London (eu-west-2), Singapore (ap-southeast-1), and Mumbai (ap-south-1) regions, and it can be used from the console or via the SDK. First, let's try it from the console.
Now, I want to check how accurate it is, so let's give it a try.
This is the console screen.
Looking at the accuracy on the sample, it looks pretty good. I think it's impressive that it can read this much even though the text is handwritten.
In the part that says "better app", the "tt" looks like an "H" and the "a" looks like an "o", yet I think it reads them correctly.
Now, let's feed it a Ring Fit image and see what happens. Just drag and drop the image and it runs OCR.
Hmm, it doesn't seem to support Japanese... However, the numerical values and other parts are read properly. With proper post-processing, it may be possible to create more accurate data than last time.
Now I would like to use Textract from Python.
Install awscli and boto3 so that it can be used from Python.
console
pip install awscli
pip install boto3
Configure the IAM user that the AWS CLI will use. Enter the access key and secret access key issued when you created the user in IAM.
console
aws configure
AWS Access Key ID [None]: your access key
AWS Secret Access Key [None]: your secret access key
Default region name [None]: your region
Default output format [None]: your format
You may not need to set the region and format.
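If you want to confirm that the credentials are picked up correctly, you can check which IAM identity the CLI is acting as:

console
aws sts get-caller-identity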
The code is based on the documentation (Boto3 Docs 1.16.37 - Textract). The contents are as follows.
textract.py
import boto3

# Amazon Textract client
textract = boto3.client('textract', region_name="ap-southeast-1")

# read image to bytes
with open('get_data/2020-09-28.png', 'rb') as f:
    data = f.read()

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'Bytes': data
    }
)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')
Let's actually execute it. It took about 2 seconds in my environment.
console
python .\textract.py
R
Oti
10+29
38.40kcal
0.89km
Oxian
Looking at the output, it seems to match the result we got from the console.
Above, I extracted the text from the response and printed it as in the documentation, but what does the response actually contain? Let's check its contents.
response.json
{
  "DocumentMetadata": {
    "Pages": 1
  },
  "Blocks": [
    {
      "BlockType": "PAGE",
      "Geometry": {
        "BoundingBox": {
          "Width": 1.0,
          "Height": 0.9992592334747314,
          "Left": 0.0,
          "Top": 0.0
        },
        "Polygon": [
          {
            "X": 6.888638380355447e-17,
            "Y": 0.0
          },
          {
            "X": 1.0,
            "Y": 0.0
          },
          {
            "X": 1.0,
            "Y": 0.9992592334747314
          },
          {
            "X": 0.0,
            "Y": 0.9992592334747314
          }
        ]
      },
      "Id": "33a0a9cd-0569-44ed-9f0f-7e88ede1d3d3",
      "Relationships": [
        {
          "Type": "CHILD",
          "Ids": [
            "b9b8fd8e-1f13-4b9a-8bfa-8c8ca4750ae0",
            "3b71c094-0bac-496e-9e26-1d311b89a66c",
            "366cdb0a-5d10-4f64-b88b-c1ad79013fc2",
            "232492f4-3137-49df-ad21-0369622cc56e",
            "738b30df-4472-4a25-90fe-eaed85e74566",
            "a73953ed-6038-49fb-af64-bad77e0d1e8f"
          ]
        }
      ]
    },
    {
      "BlockType": "LINE",
      "Confidence": 87.06179809570312,
      "Text": "R",
      "Geometry": {
        "BoundingBox": {
          "Width": 0.008603394031524658,
          "Height": 0.018224462866783142,
          "Left": 0.7822862863540649,
          "Top": 0.1344471424818039
        },
        "Polygon": [
          {
            "X": 0.7822862863540649,
            "Y": 0.1344471424818039
          },
          {
            "X": 0.7908896803855896,
            "Y": 0.1344471424818039
          },
          {
            "X": 0.7908896803855896,
            "Y": 0.15267160534858704
          },
          {
            "X": 0.7822862863540649,
            "Y": 0.15267160534858704
          }
        ]
      },
      "Id": "b9b8fd8e-1f13-4b9a-8bfa-8c8ca4750ae0",
      "Relationships": [
        {
          "Type": "CHILD",
          "Ids": [
            "1efd9875-d6a4-45e4-8fb4-63e68c668ff1"
          ]
        }
      ]
    },
    ...
    {
      "BlockType": "WORD",
      "Confidence": 87.06179809570312,
      "Text": "R",
      "TextType": "PRINTED",
      "Geometry": {
        "BoundingBox": {
          "Width": 0.008603399619460106,
          "Height": 0.018224479630589485,
          "Left": 0.7822862863540649,
          "Top": 0.1344471424818039
        },
        "Polygon": [
          {
            "X": 0.7822862863540649,
            "Y": 0.1344471424818039
          },
          {
            "X": 0.7908896803855896,
            "Y": 0.1344471424818039
          },
          {
            "X": 0.7908896803855896,
            "Y": 0.15267162024974823
          },
          {
            "X": 0.7822862863540649,
            "Y": 0.15267162024974823
          }
        ]
      },
      "Id": "1efd9875-d6a4-45e4-8fb4-63e68c668ff1"
    },
    {
      "BlockType": "WORD",
      "Confidence": 37.553348541259766,
      "Text": "Oti",
      "TextType": "HANDWRITING",
      "Geometry": {
        "BoundingBox": {
          "Width": 0.03588677942752838,
          "Height": 0.031930990517139435,
          "Left": 0.4896482229232788,
          "Top": 0.2779926359653473
        },
        "Polygon": [
          {
            "X": 0.4896482229232788,
            "Y": 0.2779926359653473
          },
          {
            "X": 0.525534987449646,
            "Y": 0.2779926359653473
          },
          {
            "X": 0.525534987449646,
            "Y": 0.30992361903190613
          },
          {
            "X": 0.4896482229232788,
            "Y": 0.30992361903190613
          }
        ]
      },
      "Id": "4e07e16b-f78b-4564-bb30-c0e48f6610c6"
    },
    ...
  ],
  "DetectDocumentTextModelVersion": "1.0",
  "ResponseMetadata": {
    "RequestId": "87f05420-f6d9-4e67-911e-64deadd207fb",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "87f05420-f6d9-4e67-911e-64deadd207fb",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "6693",
      "date": "Thu, 17 Dec 2020 00:36:14 GMT"
    },
    "RetryAttempts": 0
  }
}
The above is the actual content. Let's go through it alongside the documentation.
| key | val |
|---|---|
| DocumentMetadata | Document metadata. This time it returns 1 page. |
| Blocks | The items detected and analyzed. The OCR results are in here. |
| BlockType | The type of recognized text item. There are several types; I will summarize only the ones that appeared this time. PAGE: a list of the LINE block objects detected on the page; it holds the IDs of the recognized child blocks. WORD: a detected word; whether it is handwriting or printed is also recorded. LINE: a string of tab-delimited, contiguous detected words; it holds a sentence-like run of text. |
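To get a feel for the response's structure, here is a small sketch that tallies how many blocks of each type it contains (assuming the response was saved to j.json, as is done later in this article; the output comment is illustrative):

import json
from collections import Counter

# Count how many blocks of each BlockType the Textract response contains
with open("j.json") as f:
    data = json.load(f)

print(Counter(item["BlockType"] for item in data["Blocks"]))
# e.g. Counter({'WORD': 6, 'LINE': 6, 'PAGE': 1})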
Do we really need this much data? This time, I think the right approach is to extract what we need from the entries in Blocks whose BlockType is WORD. So how do we pull them out?
Looking at the returned values, the position of each piece of detected text is included. All Ring Fit results share the same screen layout, so the regions to read should be about the same each time. The Ring Fit values are aligned to the bottom right, so their bottom-right coordinates should be roughly constant. Therefore, I would like to pick up the data near specific coordinates.
The steps are as follows. I match on specific coordinates, but allow an error of 0.01 to account for slight misalignment. The JSON loaded at runtime is the Textract response data shown above.
textract.py
import json

# Format into data containing only the text and its lower-right coordinates
def get_word(data: dict) -> list:
    words = []
    for item in data["Blocks"]:
        if item["BlockType"] == "WORD":
            words.append({
                "word": item["Text"],
                "right_bottom": item["Geometry"]["Polygon"][2]
            })
    return words

# Judge whether the lower-right coordinates are near a specific point
# (a misalignment of up to 0.01 is allowed)
def point_check(x: float, y: float) -> str:
    origin_point = {
        "time": {"x": 0.71, "y": 0.46},
        "kcal": {"x": 0.73, "y": 0.63},
        "km": {"x": 0.73, "y": 0.78}
    }
    for k, v in origin_point.items():
        if abs(x - v["x"]) < 0.01 and abs(y - v["y"]) < 0.01:
            return k

def get_point_data(data: dict) -> dict:
    prepro_data = get_word(data)
    some_data = {}
    for v in prepro_data:
        tmp = point_check(v["right_bottom"]["X"], v["right_bottom"]["Y"])
        if tmp:
            some_data[tmp] = v["word"]
    return some_data

if __name__ == '__main__':
    with open("j.json") as f:
        data = json.load(f)
    d = get_point_data(data)
    print(d)
When I run it ...
console
python .\textract.py
{'time': '10+29', 'kcal': '38.40kcal', 'km': '0.89km'}
It looks like the values are picked up properly.
Next, since I have several Ring Fit images, I would like to try them all. I ran the image-loading part multiple times and checked the OCR results across the images. (Code omitted.) The results are below.
res_list.json
[
  {
    "time": "27",
    "kcal": "48kcal",
    "km": "0.71km"
  },
  {
    "time": "11>12*",
    "kcal": "37.79kcal",
    "km": "O.65km"
  },
  {
    "kcal": "36.62kcal",
    "km": "0.23km"
  },
  ...
]
Some entries confuse 0 with O, and some time values could not be read, but on the whole the reading is good. (The time data includes Japanese, so that can't be helped.) I would like to fill missing time values with 0, and for the rest use only the first two digits. The post-processing also replaces o/O with 0. Outliers in the time data (40 minutes or more) are divided by 10, because the time values may not be parsed cleanly. I also added date data so it can be used for the previous article's graph creation (outside the code shown).
textract.py
import re

def post_processing(word_point_list: list):
    for data in word_point_list:
        if "time" not in data:
            data["time"] = "0"
        # Keep only digits, and use at most the first two of them
        re_data = re.sub('[^0-9]', '', data["time"])
        if len(re_data) < 2:
            re_data = re_data[:1]
        else:
            re_data = re_data[:2]
        if not re_data:  # no digits could be extracted at all
            re_data = "0"
        # Treat values of 40 minutes or more as outliers and divide by 10
        data["time"] = float(re_data) if float(re_data) < 40 else float(re_data) / 10
        data["kcal"] = float(data["kcal"].replace("o", "0").replace("O", "0").replace("k", "").replace("c", "").replace("a", "").replace("l", ""))
        data["km"] = float(data["km"].replace("o", "0").replace("O", "0").replace("k", "").replace("m", ""))
    return word_point_list
When I run it with this ...
res_list.json
[
  {
    "time": 27.0,
    "kcal": 48.0,
    "km": 0.71,
    "date": "2020-11-09.png"
  },
  {
    "time": 11.0,
    "kcal": 37.79,
    "km": 0.65,
    "date": "2020-11-15.png"
  },
  {
    "kcal": 36.62,
    "km": 0.23,
    "date": "2020-11-16.png",
    "time": 0.0
  },
  ...
]
It looks like it's clean!
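As a quick sanity check, here is post_processing applied to one of the raw entries shown earlier (the expected output in the comment follows the rules described above):

# One of the raw OCR entries from res_list.json above
sample = [{"time": "11>12*", "kcal": "37.79kcal", "km": "O.65km"}]
print(post_processing(sample))
# -> [{'time': 11.0, 'kcal': 37.79, 'km': 0.65}]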
Next, I would like to run OCR on the images, apply the post-processing, and then draw the graph. Please refer to the previous article for the functions used here.
ocr_and_graph.py
import json
import os

from src.textract import do_ocr, get_point_data, post_processing
from src.graph import create_graph

IMPORT_FILE_PATH = "output/ocr_result.json"
OUTPUT_FILE_PATH = "output/graph2.png"

if __name__ == "__main__":
    # Create data from the downloaded image files and output it
    data = do_ocr("./get_data")
    word_point_list = []
    for word_dict in data:
        word_point_list.append(get_point_data(word_dict))
    word_point_list = post_processing(word_point_list)
    # Make sure the output directory exists before writing into it
    os.makedirs(os.path.dirname(IMPORT_FILE_PATH), exist_ok=True)
    with open("./output/j.json", "w") as f:
        json.dump(word_point_list, f)
    with open(IMPORT_FILE_PATH, "w") as f:
        json.dump(word_point_list, f, indent=2)
    create_graph(IMPORT_FILE_PATH, OUTPUT_FILE_PATH)
I would like to compare the graph created this time with the one created last time. You can see that the outliers in time and kcal have decreased compared to before. Outliers still appear in the time data, so it may be better to handle them with preprocessing or by changing the game's language setting. The kcal data, however, is almost all correct, so it is already useful. On top of that, this accuracy was achieved without any image preprocessing, so I found Textract very easy to use.
(Graph images: last time vs. this time)
This is the end of the main subject.
AWS has a service called CodeGuru. It is a service that reviews code, and since Python is now a supported language, I would like to try it. First, link the code you want reviewed; I did this from GitHub.
After adding it, select the repository and branch you want to analyze from "Create repository analysis". The run took several minutes (around 10, maybe).
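The same analysis can also be kicked off from the SDK. Below is a minimal sketch using boto3's CodeGuru Reviewer client, assuming the repository has already been associated with CodeGuru; the review name, association ARN, and branch name are placeholders for your own values:

import boto3

# CodeGuru Reviewer client (a sketch; ARN and names below are placeholders)
codeguru = boto3.client('codeguru-reviewer')

# Trigger a full repository analysis on the chosen branch
response = codeguru.create_code_review(
    Name="ring-fit-analysis-1",
    RepositoryAssociationArn="arn:aws:codeguru-reviewer:<region>:<account>:association:<id>",
    Type={
        "RepositoryAnalysis": {
            "RepositoryHead": {"BranchName": "main"}
        }
    }
)
print(response["CodeReview"]["State"])  # e.g. "Pending" while the analysis runs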
Let's look at the execution results; here I will only cover the first finding. Apparently the exception handling is too broad, and it would be better to write it concretely. Sure enough, when I go look at the code, only a bare except is specified, as shown below, without naming the exception.
create_fit_data.py
# Download from the acquired image URLs (the file name is the tweet date and time)
for data in image_url_list:
    try:
        os.mkdir("get_data")
    except:
        pass
    dst_path = f"get_data/{data['created_at'].strftime('%Y-%m-%d')}.png"
    download_file(data['img_url'], dst_path)
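For reference, a minimal sketch of how this finding could be addressed, either by naming the specific exception we expect or by avoiding it altogether:

import os

# Catch only the specific, expected exception...
try:
    os.mkdir("get_data")
except FileExistsError:
    pass

# ...or sidestep the exception entirely
os.makedirs("get_data", exist_ok=True)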
Used this way, CodeGuru seems helpful for spotting places where bugs are likely to occur and for making debugging easier. Python is used in many places, so there should be plenty of opportunities to use it. If you are developing as a team or want to write solid code, CodeGuru may be worth adopting.
This time I used Textract and CodeGuru to redo something like last time's work. Textract was free for up to 1,000 pages a month for the first 3 months, so even after several trial runs I was able to build this at no cost. That's very helpful when you're just starting out.
CodeGuru is also free for the first 3 months; after that it seems to cost $0.50 per 100 lines of code, up to 1,500,000 lines analyzed per month.
By the way, the code I wrote this time was about 250 lines, and the number of lines reviewed was shown as 187. Perhaps it reads only the parts it needs.
I do wish Textract supported Japanese... things would be much easier if it did, and I'm looking forward to what comes next!