This article is a continuation of the previous one. Around the time I wrote that article, I noticed that my company's Advent Calendar has an AI theme, so I would like to put the closing remark from that article, "it may be more accurate if you use AWS Textract," to the test by actually using the service.
I would also like to try a code review with the AWS service CodeGuru.
Textract is a service that makes OCR very easy. It currently seems to be available only in the Paris (eu-west-3), London (eu-west-2), Singapore (ap-southeast-1), and Mumbai (ap-south-1) regions, and it can be used from the console or via the SDK. First, let's try it from the console.
Now, I want to check how accurate it is, so let's give it a try.
This is the console screen.
Looking at the accuracy on the sample, it looks pretty good. I think it's impressive that it can read this much even though the text is handwritten.
In the part that says "better app", the "tt" looks like an "H" and the "a" looks like an "o", yet I think it reads them correctly.
Now, let's feed it a Ring Fit image and see what happens. Just drag and drop the image and it runs OCR.
Hmm, it doesn't seem to support Japanese... However, the numerical values and other parts are read properly. With proper post-processing, it may be possible to create more accurate data than last time.
Now I would like to use Textract from Python.
Install awscli and boto3 so that it can be used from Python.
console
pip install awscli
pip install boto3
Configure the IAM user that the AWS CLI will use. Enter the access key and secret access key issued when you created the user in IAM.
console
aws configure
AWS Access Key ID [None]: your access key
AWS Secret Access Key [None]: your secret access key
Default region name [None]: your region
Default output format [None]: your format
You may not need to set the region and format.
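If you want to confirm that the credentials are picked up correctly, you can check which IAM identity the CLI is acting as:

console
aws sts get-caller-identity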
The code is based on the documentation (Boto3 Docs 1.16.37 - Textract). The contents are as follows.
textract.py
import boto3

# Amazon Textract client
textract = boto3.client('textract', region_name="ap-southeast-1")

# read image to bytes
with open('get_data/2020-09-28.png', 'rb') as f:
    data = f.read()

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'Bytes': data
    }
)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')
Let's actually execute it. It took about 2 seconds in my environment.
console
python .\textract.py
R
Oti
10+29
38.40kcal
0.89km
Oxian
Looking at the output, it seems to match the result we got from the console.
Above, I extracted the text from the response and printed it as in the documentation, but what does the response actually contain? Let's check its contents.
response.json
{
  "DocumentMetadata": {
    "Pages": 1
  },
  "Blocks": [
    {
      "BlockType": "PAGE",
      "Geometry": {
        "BoundingBox": {
          "Width": 1.0,
          "Height": 0.9992592334747314,
          "Left": 0.0,
          "Top": 0.0
        },
        "Polygon": [
          {
            "X": 6.888638380355447e-17,
            "Y": 0.0
          },
          {
            "X": 1.0,
            "Y": 0.0
          },
          {
            "X": 1.0,
            "Y": 0.9992592334747314
          },
          {
            "X": 0.0,
            "Y": 0.9992592334747314
          }
        ]
      },
      "Id": "33a0a9cd-0569-44ed-9f0f-7e88ede1d3d3",
      "Relationships": [
        {
          "Type": "CHILD",
          "Ids": [
            "b9b8fd8e-1f13-4b9a-8bfa-8c8ca4750ae0",
            "3b71c094-0bac-496e-9e26-1d311b89a66c",
            "366cdb0a-5d10-4f64-b88b-c1ad79013fc2",
            "232492f4-3137-49df-ad21-0369622cc56e",
            "738b30df-4472-4a25-90fe-eaed85e74566",
            "a73953ed-6038-49fb-af64-bad77e0d1e8f"
          ]
        }
      ]
    },
    {
      "BlockType": "LINE",
      "Confidence": 87.06179809570312,
      "Text": "R",
      "Geometry": {
        "BoundingBox": {
          "Width": 0.008603394031524658,
          "Height": 0.018224462866783142,
          "Left": 0.7822862863540649,
          "Top": 0.1344471424818039
        },
        "Polygon": [
          {
            "X": 0.7822862863540649,
            "Y": 0.1344471424818039
          },
          {
            "X": 0.7908896803855896,
            "Y": 0.1344471424818039
          },
          {
            "X": 0.7908896803855896,
            "Y": 0.15267160534858704
          },
          {
            "X": 0.7822862863540649,
            "Y": 0.15267160534858704
          }
        ]
      },
      "Id": "b9b8fd8e-1f13-4b9a-8bfa-8c8ca4750ae0",
      "Relationships": [
        {
          "Type": "CHILD",
          "Ids": [
            "1efd9875-d6a4-45e4-8fb4-63e68c668ff1"
          ]
        }
      ]
    },
    ...
    {
      "BlockType": "WORD",
      "Confidence": 87.06179809570312,
      "Text": "R",
      "TextType": "PRINTED",
      "Geometry": {
        "BoundingBox": {
          "Width": 0.008603399619460106,
          "Height": 0.018224479630589485,
          "Left": 0.7822862863540649,
          "Top": 0.1344471424818039
        },
        "Polygon": [
          {
            "X": 0.7822862863540649,
            "Y": 0.1344471424818039
          },
          {
            "X": 0.7908896803855896,
            "Y": 0.1344471424818039
          },
          {
            "X": 0.7908896803855896,
            "Y": 0.15267162024974823
          },
          {
            "X": 0.7822862863540649,
            "Y": 0.15267162024974823
          }
        ]
      },
      "Id": "1efd9875-d6a4-45e4-8fb4-63e68c668ff1"
    },
    {
      "BlockType": "WORD",
      "Confidence": 37.553348541259766,
      "Text": "Oti",
      "TextType": "HANDWRITING",
      "Geometry": {
        "BoundingBox": {
          "Width": 0.03588677942752838,
          "Height": 0.031930990517139435,
          "Left": 0.4896482229232788,
          "Top": 0.2779926359653473
        },
        "Polygon": [
          {
            "X": 0.4896482229232788,
            "Y": 0.2779926359653473
          },
          {
            "X": 0.525534987449646,
            "Y": 0.2779926359653473
          },
          {
            "X": 0.525534987449646,
            "Y": 0.30992361903190613
          },
          {
            "X": 0.4896482229232788,
            "Y": 0.30992361903190613
          }
        ]
      },
      "Id": "4e07e16b-f78b-4564-bb30-c0e48f6610c6"
    },
    ...
  ],
  "DetectDocumentTextModelVersion": "1.0",
  "ResponseMetadata": {
    "RequestId": "87f05420-f6d9-4e67-911e-64deadd207fb",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "87f05420-f6d9-4e67-911e-64deadd207fb",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "6693",
      "date": "Thu, 17 Dec 2020 00:36:14 GMT"
    },
    "RetryAttempts": 0
  }
}
The above is the actual content. Let's go through it alongside the documentation.
| key | val |
|---|---|
| DocumentMetadata | Document metadata. This time it returns 1 page. |
| Blocks | The items detected and analyzed. The OCR results are in here. |
| BlockType | The type of recognized text item. There are several types; I will summarize only the ones that appeared this time. PAGE: a list of the LINE block objects detected on the page; it holds the IDs of the recognized child blocks. WORD: a detected word; whether it is handwriting or printed is also recorded. LINE: a string of tab-delimited, contiguous detected words; it holds a sentence-like run of text. |
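To get a feel for the response's structure, here is a small sketch that tallies how many blocks of each type it contains (assuming the response was saved to j.json, as is done later in this article; the output comment is illustrative):

import json
from collections import Counter

# Count how many blocks of each BlockType the Textract response contains
with open("j.json") as f:
    data = json.load(f)

print(Counter(item["BlockType"] for item in data["Blocks"]))
# e.g. Counter({'WORD': 6, 'LINE': 6, 'PAGE': 1})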
Do we really need this much data? This time, I think the right approach is to extract what we need from the entries in Blocks whose BlockType is WORD. So how do we pull them out?
Looking at the returned values, the position of each piece of detected text is included. All Ring Fit results share the same screen layout, so the regions to read should be about the same each time. The Ring Fit values are aligned to the bottom right, so their bottom-right coordinates should be roughly constant. Therefore, I would like to pick up the data near specific coordinates.
The steps are as follows. I match on specific coordinates, but allow an error of 0.01 to account for slight misalignment. The JSON loaded at runtime is the Textract response data shown above.
textract.py
import json

# Format into data containing only the text and its lower-right coordinates
def get_word(data: dict) -> list:
    words = []
    for item in data["Blocks"]:
        if item["BlockType"] == "WORD":
            words.append({
                "word": item["Text"],
                "right_bottom": item["Geometry"]["Polygon"][2]
            })
    return words

# Judge whether the lower-right coordinates are near a specific point
# (a misalignment of up to 0.01 is allowed)
def point_check(x: float, y: float) -> str:
    origin_point = {
        "time": {"x": 0.71, "y": 0.46},
        "kcal": {"x": 0.73, "y": 0.63},
        "km": {"x": 0.73, "y": 0.78}
    }
    for k, v in origin_point.items():
        if abs(x - v["x"]) < 0.01 and abs(y - v["y"]) < 0.01:
            return k

def get_point_data(data: dict) -> dict:
    prepro_data = get_word(data)
    some_data = {}
    for v in prepro_data:
        tmp = point_check(v["right_bottom"]["X"], v["right_bottom"]["Y"])
        if tmp:
            some_data[tmp] = v["word"]
    return some_data

if __name__ == '__main__':
    with open("j.json") as f:
        data = json.load(f)
    d = get_point_data(data)
    print(d)
When I run it ...
console
python .\textract.py
{'time': '10+29', 'kcal': '38.40kcal', 'km': '0.89km'}
It looks like the values are picked up properly.
Next, since I have several Ring Fit images, I would like to try them all. I ran the image-loading part multiple times and checked the OCR results across the images. (Code omitted.) The results are below.
res_list.json
[
  {
    "time": "27",
    "kcal": "48kcal",
    "km": "0.71km"
  },
  {
    "time": "11>12*",
    "kcal": "37.79kcal",
    "km": "O.65km"
  },
  {
    "kcal": "36.62kcal",
    "km": "0.23km"
  },
  ...
]
Some entries confuse 0 with O, and some time values could not be read, but on the whole the reading is good. (The time data includes Japanese, so that can't be helped.) I would like to fill missing time values with 0, and for the rest use only the first two digits. The post-processing also replaces o/O with 0. Outliers in the time data (40 minutes or more) are divided by 10, because the time values may not be parsed cleanly. I also added date data so it can be used for the previous article's graph creation (outside the code shown).
textract.py
import re

def post_processing(word_point_list: list):
    for data in word_point_list:
        if "time" not in data:
            data["time"] = "0"
        # Keep only digits, and use at most the first two of them
        re_data = re.sub('[^0-9]', '', data["time"])
        if len(re_data) < 2:
            re_data = re_data[:1]
        else:
            re_data = re_data[:2]
        if not re_data:  # no digits could be extracted at all
            re_data = "0"
        # Treat values of 40 minutes or more as outliers and divide by 10
        data["time"] = float(re_data) if float(re_data) < 40 else float(re_data) / 10
        data["kcal"] = float(data["kcal"].replace("o", "0").replace("O", "0").replace("k", "").replace("c", "").replace("a", "").replace("l", ""))
        data["km"] = float(data["km"].replace("o", "0").replace("O", "0").replace("k", "").replace("m", ""))
    return word_point_list
When I run it with this ...
res_list.json
[
  {
    "time": 27.0,
    "kcal": 48.0,
    "km": 0.71,
    "date": "2020-11-09.png"
  },
  {
    "time": 11.0,
    "kcal": 37.79,
    "km": 0.65,
    "date": "2020-11-15.png"
  },
  {
    "kcal": 36.62,
    "km": 0.23,
    "date": "2020-11-16.png",
    "time": 0.0
  },
  ...
]
It looks like it's clean!
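As a quick sanity check, here is post_processing applied to one of the raw entries shown earlier (the expected output in the comment follows the rules described above):

# One of the raw OCR entries from res_list.json above
sample = [{"time": "11>12*", "kcal": "37.79kcal", "km": "O.65km"}]
print(post_processing(sample))
# -> [{'time': 11.0, 'kcal': 37.79, 'km': 0.65}]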
Next, I would like to run OCR on the images, apply the post-processing, and then draw the graph. Please refer to the previous article for the functions used here.
ocr_and_graph.py
import json
import os

from src.textract import do_ocr, get_point_data, post_processing
from src.graph import create_graph

IMPORT_FILE_PATH = "output/ocr_result.json"
OUTPUT_FILE_PATH = "output/graph2.png"

if __name__ == "__main__":
    # Create data from the downloaded image files and output it
    data = do_ocr("./get_data")
    word_point_list = []
    for word_dict in data:
        word_point_list.append(get_point_data(word_dict))
    word_point_list = post_processing(word_point_list)
    # Make sure the output directory exists before writing into it
    os.makedirs(os.path.dirname(IMPORT_FILE_PATH), exist_ok=True)
    with open("./output/j.json", "w") as f:
        json.dump(word_point_list, f)
    with open(IMPORT_FILE_PATH, "w") as f:
        json.dump(word_point_list, f, indent=2)
    create_graph(IMPORT_FILE_PATH, OUTPUT_FILE_PATH)
I would like to compare the graph created this time with the one created last time. You can see that the outliers in time and kcal have decreased compared to before. Outliers still appear in the time data, so it may be better to handle them with preprocessing or by changing the game's language setting. The kcal data, however, is almost all correct, so it is already useful. On top of that, this accuracy was achieved without any image preprocessing, so I found Textract very easy to use.
(Graph images: last time vs. this time)
This is the end of the main subject.
AWS has a service called CodeGuru. It is a service that reviews code, and since Python is now a supported language, I would like to try it. First, link the code you want reviewed; I did this from GitHub.
After adding it, select the repository and branch you want to analyze from "Create repository analysis". The run took several minutes (around 10, maybe).
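The same analysis can also be kicked off from the SDK. Below is a minimal sketch using boto3's CodeGuru Reviewer client, assuming the repository has already been associated with CodeGuru; the review name, association ARN, and branch name are placeholders for your own values:

import boto3

# CodeGuru Reviewer client (a sketch; ARN and names below are placeholders)
codeguru = boto3.client('codeguru-reviewer')

# Trigger a full repository analysis on the chosen branch
response = codeguru.create_code_review(
    Name="ring-fit-analysis-1",
    RepositoryAssociationArn="arn:aws:codeguru-reviewer:<region>:<account>:association:<id>",
    Type={
        "RepositoryAnalysis": {
            "RepositoryHead": {"BranchName": "main"}
        }
    }
)
print(response["CodeReview"]["State"])  # e.g. "Pending" while the analysis runs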
Let's look at the execution results; here I will only cover the first finding. Apparently the exception handling is too broad, and it would be better to write it concretely. Sure enough, when I go look at the code, only a bare except is specified, as shown below, without naming the exception.
create_fit_data.py
# Download from the acquired image URLs (the file name is the tweet date and time)
for data in image_url_list:
    try:
        os.mkdir("get_data")
    except:
        pass
    dst_path = f"get_data/{data['created_at'].strftime('%Y-%m-%d')}.png"
    download_file(data['img_url'], dst_path)
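For reference, a minimal sketch of how this finding could be addressed, either by naming the specific exception we expect or by avoiding it altogether:

import os

# Catch only the specific, expected exception...
try:
    os.mkdir("get_data")
except FileExistsError:
    pass

# ...or sidestep the exception entirely
os.makedirs("get_data", exist_ok=True)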
Used this way, CodeGuru seems helpful for spotting places where bugs are likely to occur and for making debugging easier. Python is used in many places, so there should be plenty of opportunities to use it. If you are developing as a team or want to write solid code, CodeGuru may be worth adopting.
This time I used Textract and CodeGuru to redo something like last time's work. Textract was free for up to 1,000 pages a month for the first 3 months, so even after several trial runs I was able to build this at no cost. That's very helpful when you're just starting out.
CodeGuru is also free for the first 3 months; after that it seems to cost $0.50 per 100 lines of code, up to 1,500,000 lines analyzed per month.
By the way, the code I wrote this time was about 250 lines, and the number of lines reviewed was shown as 187. Perhaps it reads only the parts it needs.
I do wish Textract supported Japanese... things would be much easier if it did, and I'm looking forward to what comes next!