Azure Form Recognizer is a service for reading forms: https://azure.microsoft.com/ja-jp/services/cognitive-services/form-recognizer/

It does an excellent job of reading a form and extracting the target data. Since it also has an API, I wrote a Python script that can process multiple forms at once: https://github.com/yosukearaiMS13/formrecognizerbatch/blob/master/fy.py

The contents of the script and how to use it are explained below.

The contents of the script

The script extends the sample in the documentation: https://docs.microsoft.com/ja-jp/azure/cognitive-services/form-recognizer/quickstarts/python-labeled-data?tabs=v2-0

The script consists of four sections: https://github.com/yosukearaiMS13/formrecognizerbatch/blob/master/fy.py

`fr.py`



# Configurations: various setting parameters

# Post analysis target PDF section
## Post all the data to be analyzed to Form Recognizer at once

# Get analyze results section
## Get the analysis results (including the extracted data) for the data posted earlier

# CSV output section for the extraction results
## Output the extraction results, removing extra whitespace and replacing unreliable extracted values
## (if the confidence is below the threshold, the extracted value is not adopted and the confidence is output in [] instead)

The get analyze results section and the CSV output section parse the JSON returned by Form Recognizer. For the JSON format, see https://github.com/Azure-Samples/cognitive-services-REST-api-samples/blob/master/curl/form-recognizer/Invoice_1.pdf.ocr.json

The format of the output CSV is as follows:

- First column: name of the form file that was analyzed
- Second and subsequent columns: all labels (tags) set in the analysis model and the corresponding extracted values
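The row format and the threshold rule can be sketched as follows. This is a minimal illustration, not the code in fr.py: `build_row`, the `file` header name, and the sample data are hypothetical, and the sketch assumes each label maps to an object with `text` and `confidence` keys (or to null when nothing was extracted).

```python
import csv
import io


def build_row(file_name, fields, labels, threshold):
    """Return one CSV row: the file name, then one cell per label."""
    row = [file_name]
    for label in labels:
        field = fields.get(label)
        if not field:
            # Label missing or null in the response: leave the cell empty.
            row.append("")
            continue
        confidence = field.get("confidence", 0.0)
        if confidence < threshold:
            # Below the threshold: output the confidence in [] instead of the value.
            row.append(f"[{confidence}]")
        else:
            # Collapse runs of whitespace in the extracted text.
            row.append(" ".join(str(field.get("text", "")).split()))
    return row


# Hypothetical sample, shaped like the "fields" object of one analyzed document.
sample_fields = {
    "Total": {"text": " $1,234.00 ", "confidence": 0.98},
    "Vendor": {"text": "Contoso  Ltd", "confidence": 0.42},
    "DueDate": None,
}
labels = ["DueDate", "Total", "Vendor"]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["file"] + labels)  # header: file name column, then every label
writer.writerow(build_row("invoice_1.pdf", sample_fields, labels, 0.9))
print(buffer.getvalue())
```

Writing the confidence in brackets keeps the column layout stable, so a reviewer can find the cells that need a manual check without losing track of which label they belong to.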

The APIs used in each section are as follows:

- Post analysis target PDF: Analyze Form
- Get analyze results: Get Analyze Form Result
- CSV output section: Get Custom Model
  - Gets all the labels defined in the model through this API and uses them as the header values of the CSV
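The way these three APIs fit together can be sketched as follows. This is a rough outline under assumptions, not the code in fr.py: the function names are hypothetical, the URL paths and response keys (`Operation-Location`, `status`, `trainResult.fields[].fieldName`) follow my understanding of the v2.0 REST API, and the standard-library `urllib` is used here only to keep the sketch dependency-free.

```python
import json
import time
import urllib.request

API_VERSION = "v2.0"  # assumption: other versions use a different path


def model_url(endpoint, model_id):
    """URL of the custom model (used by Get Custom Model)."""
    return f"{endpoint.rstrip('/')}/formrecognizer/{API_VERSION}/custom/models/{model_id}"


def analyze_url(endpoint, model_id):
    """URL that accepts a form for analysis (Analyze Form)."""
    return model_url(endpoint, model_id) + "/analyze"


def submit_form(endpoint, apim_key, model_id, pdf_bytes):
    """POST one PDF and return the URL to poll for its result."""
    request = urllib.request.Request(
        analyze_url(endpoint, model_id),
        data=pdf_bytes,
        headers={
            "Content-Type": "application/pdf",
            "Ocp-Apim-Subscription-Key": apim_key,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


def get_json(url, apim_key):
    """GET a URL with the subscription key and parse the JSON body."""
    request = urllib.request.Request(
        url, headers={"Ocp-Apim-Subscription-Key": apim_key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_result(fetch, max_tries=10, wait_sec=5):
    """Call fetch() until the analysis has succeeded or failed."""
    for _ in range(max_tries):
        result = fetch()
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(wait_sec)
    raise TimeoutError("analysis did not finish in time")


def header_labels(model_json):
    """Label names from a Get Custom Model response, for the CSV header."""
    fields = model_json.get("trainResult", {}).get("fields", [])
    return [field["fieldName"] for field in fields]
```

A batch run would call `submit_form` for every file first and collect the returned URLs, then call `wait_for_result(lambda: get_json(url, apim_key))` for each one. Posting everything before polling lets the service work on all forms while the script waits.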

How to use the script

1. Environment

Windows 10 Enterprise, Python 3.8.5; any IDE will do

2. Data extraction preparation

(* For the steps from the prerequisite work through data extraction preparation 1, this Qiita article is also helpful.)

- Prerequisite work: do the following first
  - [Create Form Recognizer Resource](https://docs.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/label-tool?tabs=v2-0#create-a-form-recognizer-resource)
  - Create an Azure blob (create a storage account -> create a container)
  - [Azure blob settings (Create Shared Access Signature)](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-storage-explorer#work-with-shared-access-signatures) (the Storage Explorer menu in the Azure Portal is convenient for this)

(Check all permissions fields)![Image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/327770/aaae5370-fc5e-688e-b454-39a60518978e.png)
(The generated URL value will be used later, so save it)![1.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/327770/705f3739-b5be-ddbe-0001-915d953131b5.png)

- Data extraction preparation 1 (done only the first time)
  - Store the training data for model creation in the Azure blob: place at least 5 files (invoice_1 to 5.pdf in this case) in the following layout (xx.json is a file created later, so ignore it here)
  - Labeling (tagging) tool settings:
    - The labeling tool is here: https://fott.azurewebsites.net/
    - Labeling procedure: follow the steps in the document below, from connecting to the labeling tool through creating a project. https://docs.microsoft.com/ja-jp/azure/cognitive-services/form-recognizer/quickstarts/label-tool?tabs=v2-0#connect-to-the-sample-labeling-tool
  - Python script settings #1

`fr.py`


## Configurations
endpoint = r"https://xxxxx.cognitiveservices.azure.com/"
apim_key = "xxxxx"
model_id = "xxxxx"
sourceDir = r"C:\xxxxx\*"
confidence_setting = 0.9  # 0 to 1. Extracted values with confidence below this are not adopted

- endpoint: the Form Recognizer endpoint
- apim_key: Form Recognizer key 1 or 2 ![Image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/327770/21e50af5-135e-a847-6040-233601fdfbbf.png)
- sourceDir: the full path of the location of the form files to be analyzed
- confidence_setting: a value from 0 to 1 (* by the script's design, if the confidence is below this value, the extracted value is not adopted and the confidence value is output in [] instead)
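Two of these settings are easy to get wrong, so a small check before a run may help. The sketch below is hypothetical and not part of fr.py: it assumes that the trailing `*` in `sourceDir` is a glob pattern, and `list_forms` and `check_confidence_setting` are illustrative names.

```python
import glob
import os


def list_forms(source_dir_pattern):
    """Return the files matched by the glob pattern, sorted by name."""
    matches = glob.glob(source_dir_pattern)
    # Keep regular files only; subfolders matched by "*" are skipped.
    return sorted(path for path in matches if os.path.isfile(path))


def check_confidence_setting(value):
    """Return the value unchanged if it lies in the range 0 to 1."""
    if not 0 <= value <= 1:
        raise ValueError(f"confidence_setting must be between 0 and 1, got {value}")
    return value


if __name__ == "__main__":
    sourceDir = r"C:\xxxxx\*"  # placeholder, as in the configuration above
    confidence_setting = check_confidence_setting(0.9)
    forms = list_forms(sourceDir)
    print(f"{len(forms)} form file(s) found, threshold {confidence_setting}")
```

An empty list from `list_forms` usually means the path is wrong or the `*` is missing, which is cheaper to find out before any form is posted.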

- Data extraction preparation 2 (done every time a label is added or modified)
  - Label the loaded training data (forms) with the labeling (tagging) tool (https://fott.azurewebsites.net/). When you are done, train and generate a model.
    - Follow the steps in the document below, from labeling the forms through training the custom model: https://docs.microsoft.com/ja-jp/azure/cognitive-services/form-recognizer/quickstarts/label-tool?tabs=v2-0#label-your-forms
    - The procedure in [this Qiita article](https://qiita.com/komiyasa/items/afee82f7baddcd820251#%E3%82%AB%E3%82%B9%E3%82%BF%E3%83%9E%E3%82%A4%E3%82%BA%E3%82%92%E5%AE%9F%E8%A1%8C%E3%81%99%E3%82%8B) is also helpful.
    - After training, a Model ID is generated (below). You will use this value later. ![Image.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F149921%2Ffa4048dc-c641-38e0-6ad9-d1a6e2b291bf.png?ixlib=rb-1.2.2&auto=format&gif-q=60&q=75&s=c73deb1bac3ffc3074b2b6cbd14d359e)
  - Python script settings #2: model_id

`fr.py`


## Configurations
endpoint = r"https://xxxxx.cognitiveservices.azure.com/"
apim_key = "xxxxx"
model_id = "xxxxx"
sourceDir = r"C:\xxxxx\*"
confidence_setting = 0.9  # 0 to 1. Extracted values with confidence below this are not adopted

- model_id: set the Model ID obtained above

3. Data extraction

- Place the form files to be analyzed in sourceDir
- Run fr.py
- The data extraction result CSV is output to the same folder as the script

4. Constraints, etc.

- As for the file format of the training and analysis target forms, I have only tried PDF
- The script is based on v2.0 of Form Recognizer, the current version. To use it with other versions, I expect you will need to change the API URL as appropriate and handle any changes in the JSON format returned by Form Recognizer.

Batch processing forms with Azure Form Recognizer
