Azure Form Recognizer is a service that reads forms and extracts the target data from them: https://azure.microsoft.com/ja-jp/services/cognitive-services/form-recognizer/
It does an excellent job of reading forms, and it also has an API, so I wrote a Python script that can process multiple forms in one run: https://github.com/yosukearaiMS13/formrecognizerbatch/blob/master/fy.py
The contents of the script and how to use it are explained below.
The script extends the sample in the documentation: https://docs.microsoft.com/ja-jp/azure/cognitive-services/form-recognizer/quickstarts/python-labeled-data?tabs=v2-0
It consists of four sections: https://github.com/yosukearaiMS13/formrecognizerbatch/blob/master/fy.py
fr.py
# Configurations: various setting parameters
# Post analysis target PDF section
## Post all the data to be analyzed to Form Recognizer in one pass
# Get analyze results section
## Get the analysis results (including the extracted data) for the data posted earlier
# CSV output section for the extraction results
## Output the extraction results, removing extra whitespace and replacing unreliable extracted values
## (if the confidence is below the threshold, the extracted value is not adopted and the confidence is output in [] instead)
The "Get analyze results" and "CSV output" sections parse the JSON returned by Form Recognizer. For the JSON format, see https://github.com/Azure-Samples/cognitive-services-REST-api-samples/blob/master/curl/form-recognizer/Invoice_1.pdf.ocr.json
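To make the parsing step concrete, here is a minimal sketch of pulling labels, extracted text, and confidence out of a result. It is not the code in fr.py. It assumes the response layout used by the v2.0 labeled-data quickstart (`analyzeResult` → `documentResults` → `fields`, with `text` and `confidence` per field); the function name `extract_fields` and the sample data are made up for illustration.

```python
from typing import Any, Dict, Optional, Tuple

# (extracted text, confidence) for one label; either may be missing.
Field = Tuple[Optional[str], Optional[float]]


def extract_fields(result: Dict[str, Any]) -> Dict[str, Field]:
    """Map each label to (text, confidence).

    Assumption: the v2.0 layout analyzeResult.documentResults[].fields.
    """
    extracted: Dict[str, Field] = {}
    analyze_result = result.get("analyzeResult", {})
    for document in analyze_result.get("documentResults", []):
        for label, field in document.get("fields", {}).items():
            if field is None:
                # Defensive: treat a null field as "nothing extracted".
                extracted[label] = (None, None)
            else:
                extracted[label] = (field.get("text"), field.get("confidence"))
    return extracted


# Hypothetical, heavily trimmed response used only for illustration.
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "documentResults": [
            {"fields": {"Total": {"text": "$ 100", "confidence": 0.95}, "Date": None}}
        ]
    },
}

fields = extract_fields(sample)
```

Returning plain tuples keeps the later CSV step independent of the JSON layout, so a change in the response format would only touch this one function.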
The output CSV has the following format:
- First column: file name of the analyzed form
- Second and subsequent columns: all labels (tags) set in the analysis model and the corresponding extracted values
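The CSV layout and the threshold rule could be sketched roughly as follows. This is an illustration, not the script's actual code: the header row, the collapsing of whitespace runs with `split()`, and the names `cell_value`, `build_rows`, and `to_csv` are all assumptions.

```python
import csv
import io
from typing import Dict, List, Optional, Tuple

# (extracted text, confidence) for one label; either may be missing.
Field = Tuple[Optional[str], Optional[float]]


def cell_value(field: Field, threshold: float) -> str:
    """Return the text for one CSV cell."""
    text, confidence = field
    if text is None or confidence is None:
        return ""
    if confidence < threshold:
        # Below the threshold: output the confidence in [] instead of the value.
        return f"[{confidence}]"
    # Assumption: "extra whitespace" means leading/trailing and repeated spaces.
    return " ".join(text.split())


def build_rows(results: Dict[str, Dict[str, Field]],
               labels: List[str], threshold: float) -> List[List[str]]:
    """First column is the file name, the remaining columns follow `labels`."""
    rows = [["filename"] + labels]  # header row is an assumption
    for filename, fields in results.items():
        cells = [cell_value(fields.get(label, (None, None)), threshold)
                 for label in labels]
        rows.append([filename] + cells)
    return rows


def to_csv(rows: List[List[str]]) -> str:
    """Serialize the rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical extraction results for one form.
results = {"invoice_6.pdf": {"Total": ("  $ 100  ", 0.95),
                             "Date": ("2020/01/01", 0.5)}}
rows = build_rows(results, ["Total", "Date"], threshold=0.9)
```

With a threshold of 0.9, the `Date` cell in this example would hold `[0.5]` rather than the extracted date.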
The API used in each section is as follows:
- Post analysis target PDF: Analyze Form
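As a rough sketch of what calling Analyze Form looks like, the snippet below builds the POST request and then polls for the result. It assumes the v2.0 REST path `/formrecognizer/v2.0/custom/models/{model_id}/analyze`, the `Ocp-Apim-Subscription-Key` header, and the `Operation-Location` response header, as in the v2.0 quickstart. It uses the standard library's `urllib` only so the example is self-contained; the function names are hypothetical and not taken from fr.py.

```python
import json
import time
import urllib.request


def build_analyze_request(endpoint: str, apim_key: str, model_id: str,
                          pdf_bytes: bytes,
                          api_version: str = "v2.0") -> urllib.request.Request:
    """Build (but do not send) the Analyze Form POST request for one PDF."""
    url = (f"{endpoint.rstrip('/')}/formrecognizer/{api_version}"
           f"/custom/models/{model_id}/analyze")
    headers = {
        "Content-Type": "application/pdf",
        "Ocp-Apim-Subscription-Key": apim_key,
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers,
                                  method="POST")


def post_for_analysis(request: urllib.request.Request) -> str:
    """Send the request and return the URL to poll for the result."""
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


def wait_for_result(operation_url: str, apim_key: str,
                    interval_sec: float = 5.0, max_tries: int = 20) -> dict:
    """Poll until the analysis reports 'succeeded' or 'failed'."""
    request = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": apim_key})
    for _ in range(max_tries):
        with urllib.request.urlopen(request) as response:
            body = json.load(response)
        if body.get("status") in ("succeeded", "failed"):
            return body
        time.sleep(interval_sec)
    raise TimeoutError("analysis did not finish in time")


# Building a request does not touch the network; the values are placeholders.
example = build_analyze_request(
    "https://xxxxx.cognitiveservices.azure.com/", "xxxxx", "xxxxx", b"%PDF")
```

Posting every file first and collecting the returned URLs, then polling them afterwards, matches the "post everything, then get the results" structure described above. The `api_version` parameter is only a convenience for trying a different URL version.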
Environment: Windows 10 Enterprise, Python 3.8.5; any IDE will do.
(Note: for the steps from the prerequisite work through data extraction preparation 1, this Qiita article is also helpful.)
- Prerequisite work: do the following first
  - [Create Form Recognizer Resource](https://docs.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/label-tool?tabs=v2-0#create-a-form-recognizer-resource)
  - Create an Azure blob (create a storage account, then create a container)
  - [Azure blob settings (create a Shared Access Signature)](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-storage-explorer#work-with-shared-access-signatures) (the Storage Explorer menu in the Azure Portal is convenient for this)
- Data extraction preparation 1 (done only the first time)
  - Store the training data for model creation in the Azure blob: place at least 5 files (invoice_1.pdf to invoice_5.pdf in this case) as shown below (the xx.json files are created later, so ignore them here)
  - Label (tagging) tool settings:
fr.py
## Configurations
endpoint = r"https://xxxxx.cognitiveservices.azure.com/"
apim_key = "xxxxx"
model_id = "xxxxx"
sourceDir = r"C:\xxxxx\*"
confidence_setting = 0.9 # 0 to 1; an extracted value is not adopted if its confidence is below this value
- endpoint: Form Recognizer endpoint
- apim_key: Form Recognizer key 1 or key 2

  ![Image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/327770/21e50af5-135e-a847-6040-233601fdfbbf.png)
- sourceDir: full path to the location of the form files to be analyzed
- confidence_setting: a value from 0 to 1 (by the script's design, if the confidence is below this value, the extracted value is not adopted and the confidence score is output in [] instead)
- Data extraction preparation 2 (done every time a label is added or modified)
  - Label the loaded training data (forms) with the labeling (tagging) tool (https://fott.azurewebsites.net/). When you are done, train to generate a model.
fr.py
## Configurations
endpoint = r"https://xxxxx.cognitiveservices.azure.com/"
apim_key = "xxxxx"
model_id = "xxxxx"
sourceDir = r"C:\xxxxx\*"
confidence_setting = 0.9 # 0 to 1; an extracted value is not adopted if its confidence is below this value
- model_id: set the Model ID obtained above
- Place the form files to be analyzed in sourceDir
- Run fr.py
- The data extraction result CSV is output to the same folder as the script
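For reference, picking up the files in sourceDir and choosing an output path next to the script could look like the sketch below. It assumes that sourceDir is a glob pattern (it ends in `*` in the configuration shown earlier). The output file name `result.csv` and both function names are placeholders, not what fr.py actually uses.

```python
import glob
import os
from typing import List


def list_form_files(source_dir_pattern: str) -> List[str]:
    """Return the files matched by a pattern such as r"C:\\xxxxx\\*"."""
    matches = glob.glob(source_dir_pattern)
    # Skip sub-folders; sort so the CSV rows come out in a stable order.
    return sorted(path for path in matches if os.path.isfile(path))


def output_csv_path(script_path: str, name: str = "result.csv") -> str:
    """Place the CSV in the same folder as the script (name is hypothetical)."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, name)


# Example: where the CSV would go for a script called fr.py.
csv_path = output_csv_path("fr.py")
```

Inside the real script, `__file__` would be the natural argument to pass as `script_path`.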
- As for the file format of the training and analysis target forms, I have only tried PDF.
- The script is based on v2.0, the current version of Form Recognizer. To use it with other versions, I think you will need to change the API URL as appropriate and adapt to any changes in the JSON format returned by Form Recognizer.