This article describes how to OCR PDF files (convert them to Google Docs) with Python in a Google Colab environment.
Google Drive can convert a PDF into a Google Docs file through OCR. This article shows how to drive that conversion from Python code.
In particular, I could not find any information about the full-width (double-byte) file name problem that appears in step 4 (the PDF conversion), so I want to share what I learned for anyone struggling with the same issue.
Environment: Google Colaboratory (Colab)
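This is the final source code. Processing proceeds in four steps: (1) get a Service object for operating Drive, (2) collect the names of files that have already been converted, (3) build the list of PDFs to convert, and (4) convert the PDFs.
Details are described later.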
def full_to_half(val):
    """
    Convert full-width characters to half-width
    * Works around the problem that alphabetic characters in the file name become full-width after OCR
    """
    return val.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))
import os
import glob
from google.colab import auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload
#Authentication
auth.authenticate_user()
#Get Service to operate Drive
drive_service = build('drive', 'v3')
#Local path mounted on Colab
input_path = 'drive/My Drive/PDF/INPUT' #Input (PDF) directory path
output_path = 'drive/My Drive/PDF/OUTPUT' #Output destination directory path
#####
#Already converted PDF files are detected by file name and excluded from the target
####
#Walk the output directory recursively
exist_filenames = ['']
for root, dirs, files_o in os.walk(output_path):
    for filename in files_o:
        #Convert full-width to half-width, remove the .gdoc extension
        exist_filename = full_to_half(filename).replace('.gdoc', '')
        #Add existing file name
        exist_filenames.append(exist_filename)
#####
#Create a list of PDFs to convert
####
#Walk the input directory recursively
pdf_infos = []
for root, dirs, files in os.walk(input_path):
    for filename in files:
        #Skip file names that have already been converted
        if full_to_half(filename) in exist_filenames:
            continue
        #Only PDF files (extension compared case-insensitively)
        if filename.lower().endswith('.pdf'):
            filepath = os.path.join(root, filename) #Local file path on Colab
            pdf_infos.append({
                'path': filepath,
                'name': filename
            })
#print('number of files: ' + str(len(pdf_infos)))
#MIME type of a Google Docs file
MIME_TYPE = 'application/vnd.google-apps.document'
#####
#Convert the target PDF files
####
for pdf_info in pdf_infos:
    pdf_path = pdf_info['path']
    #File name after OCR: convert full-width alphabetic characters to half-width
    pdf_filename = full_to_half(pdf_info['name'])
    body = {
        'name': pdf_filename,
        'mimeType': MIME_TYPE, #Target type; this is what triggers the conversion
        'parents': ['Output destination Drive directory ID']
    }
    try:
        #The uploaded content itself is a PDF
        media_body = MediaFileUpload(pdf_path, mimetype='application/pdf', resumable=True)
        drive_service.files().create(
            body=body,
            media_body=media_body,
        ).execute()
    except Exception as e:
        print('error: Failed to create Documents file.')
        print(pdf_path, e)
Make some preparations before running the above code.
Colab has a mount feature that lets you treat Google Drive as if it were a local file system. You can also operate Drive through the Google API client, but every operation then goes through the Web API, which takes time and lowers performance. To keep processing fast, do as much of the work as possible on the mounted file system.
To mount Drive on Colab, connect to the runtime and click the Mount Drive icon in the file browser.
The following code is then inserted; run it.
from google.colab import drive
drive.mount('/content/drive')
Open the displayed URL in your browser, copy the verification code shown there, and paste it into the text box.
This completes the mount.
Install the Google API client for Python.
!pip install google-api-python-client
I will explain the implementation of the source code mentioned above.
Get a Service object to work with Drive in the Google API client.
Authenticate using Colab's auth and get the Drive Service object in the Google API client.
from google.colab import auth
from googleapiclient.discovery import build
#Authentication
auth.authenticate_user()
#Get Service to operate Drive
drive_service = build('drive', 'v3')
In this setup, the converted files are all stored in one place. A duplicate check is also performed so that the script can be re-run after processing was interrupted midway or after new PDFs were added.
The code recursively walks the output directory on the mounted Drive and appends each file name it finds to the variable exist_filenames (a list).
#Walk the output directory recursively
exist_filenames = ['']
for root, dirs, files_o in os.walk(output_path):
    for filename in files_o:
        #Convert full-width to half-width, remove the .gdoc extension
        exist_filename = full_to_half(filename).replace('.gdoc', '')
        #Add existing file name
        exist_filenames.append(exist_filename)
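As a rough sketch, the duplicate check above could also be wrapped in a function so that it can be tried on a temporary directory before pointing it at the real Drive mount. The helper name collect_existing_names is hypothetical and not part of the original code. It returns a set instead of a list and strips only a trailing .gdoc, which is a small variation on the code above rather than the same logic.

```python
import os
import tempfile


def full_to_half(val):
    """Map full-width ASCII variants (U+FF01..U+FF5E) to half-width."""
    return val.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))


def collect_existing_names(output_dir):
    """Return the normalized file names found under output_dir (hypothetical helper)."""
    names = set()
    for root, dirs, files in os.walk(output_dir):
        for filename in files:
            name = full_to_half(filename)
            # Strip only a trailing .gdoc, not every occurrence in the name
            if name.endswith('.gdoc'):
                name = name[:-len('.gdoc')]
            names.add(name)
    return names


# Demo on a throwaway directory instead of the real Drive mount
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel in ('report.pdf.gdoc', os.path.join('sub', 'scan.pdf.gdoc')):
        open(os.path.join(tmp, rel), 'w').close()
    demo_names = collect_existing_names(tmp)
    print(sorted(demo_names))
```

A set makes the later membership test independent of how many files have already been converted, which may matter once the output directory grows.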
Create the list of PDFs to convert at run time. A file whose name matches one of the already converted names collected in step 2 is skipped. A PDF file that does not match is new, so it is added to the variable pdf_infos (a list) as a PDF to be processed.
#Walk the input directory recursively
pdf_infos = []
for root, dirs, files in os.walk(input_path):
    for filename in files:
        #Skip file names that have already been converted
        if full_to_half(filename) in exist_filenames:
            continue
        #Only PDF files (extension compared case-insensitively)
        if filename.lower().endswith('.pdf'):
            filepath = os.path.join(root, filename) #Local file path on Colab
            pdf_infos.append({
                'path': filepath,
                'name': filename
            })
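The selection logic can be checked in isolation with a small sketch like the one below. The function name find_new_pdfs and the sample file names are made up for illustration. It assumes the already converted names are passed in as a collection of half-width names.

```python
import os
import tempfile


def full_to_half(val):
    """Map full-width ASCII variants (U+FF01..U+FF5E) to half-width."""
    return val.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))


def find_new_pdfs(input_dir, existing_names):
    """Return path/name dicts for PDFs that are not yet converted (hypothetical helper)."""
    pdf_infos = []
    for root, dirs, files in os.walk(input_dir):
        for filename in sorted(files):
            # Already converted: skip
            if full_to_half(filename) in existing_names:
                continue
            # Accept .pdf and .PDF alike
            if filename.lower().endswith('.pdf'):
                pdf_infos.append({
                    'path': os.path.join(root, filename),
                    'name': filename,
                })
    return pdf_infos


# Demo: a.pdf counts as already converted, c.txt is not a PDF
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.pdf', 'b.PDF', 'c.txt'):
        open(os.path.join(tmp, name), 'w').close()
    demo_infos = find_new_pdfs(tmp, {'a.pdf'})
    print([info['name'] for info in demo_infos])
```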
Convert the PDF files based on the list built in steps 1 to 3.
Create a new file in Drive by calling files().create().execute() on the Drive Service object. If you specify the Google Docs MIME type at that point, the file is automatically converted into an OCR-processed Docs file.
Specify the converted file name, MIME type, and parent directory ID in the body parameter of create(). For the media_body parameter, pass the PDF file to be uploaded to Google, wrapped in MediaFileUpload.
for pdf_info in pdf_infos:
    pdf_path = pdf_info['path']
    #File name after OCR: convert full-width alphabetic characters to half-width
    pdf_filename = full_to_half(pdf_info['name'])
    body = {
        'name': pdf_filename,
        'mimeType': MIME_TYPE, #Target type; this is what triggers the conversion
        'parents': ['Output destination Drive directory ID']
    }
    try:
        #The uploaded content itself is a PDF
        media_body = MediaFileUpload(pdf_path, mimetype='application/pdf', resumable=True)
        drive_service.files().create(
            body=body,
            media_body=media_body,
        ).execute()
    except Exception as e:
        print('error: Failed to create Documents file.')
        print(pdf_path, e)
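The 'parents' entry needs the ID of the output folder in Drive. One way to obtain it, sketched below, is to take it from the folder's URL in the browser. This assumes the URL has the form https://drive.google.com/drive/folders/<ID>. The helper names folder_id_from_url and build_ocr_body and the sample URL are hypothetical. Only the body dictionary is built here, so the sketch runs without calling the Drive API.

```python
import re

# MIME type of a Google Docs file (same value as MIME_TYPE in the article)
DOCS_MIME_TYPE = 'application/vnd.google-apps.document'


def folder_id_from_url(url):
    """Return the folder ID from a Drive folder URL, or None if none is found."""
    match = re.search(r'/folders/([A-Za-z0-9_-]+)', url)
    return match.group(1) if match else None


def build_ocr_body(pdf_filename, parent_folder_id):
    """Build the metadata dict passed as body= to files().create()."""
    return {
        'name': pdf_filename,
        'mimeType': DOCS_MIME_TYPE,
        'parents': [parent_folder_id],
    }


# Made-up URL for illustration only
sample_url = 'https://drive.google.com/drive/folders/abc123_XYZ-9?usp=sharing'
sample_body = build_ocr_body('scan.pdf', folder_id_from_url(sample_url))
print(sample_body)
```

Keeping the body construction separate from the upload call makes it easy to inspect what will be sent before running the conversion on many files.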
Docs files created by OCR conversion of PDF files end up with full-width alphabetic characters in their file names. I investigated this with the following code.
chars = [
    'ｍ', #Character copied from the Documents file name (full-width, U+FF4D)
    'm' #Character entered by direct typing (half-width, U+006D)
]
#Full-width (file name after conversion)
print(hex(ord(chars[0])))
#Half-width
print(hex(ord(chars[1])))
#Convert full-width alphabetic characters to half-width alphabetic characters
print(hex(ord(chars[0].translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)})))))
Execution result
0xff4d
0x6d
0x6d
The results show that the converted file name contains full-width characters and that they can be converted to half-width.
For the conversion, I referred to this article: [Python] Convert full-width and half-width characters to each other in one line (alphabet + number + symbol) - Qiita
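As a point of comparison, here is a small sketch that puts the translate-based conversion next to Unicode NFKC normalization from the standard library. NFKC is not what this article uses. It is shown only as an alternative, and it changes more than the full-width ASCII range (for example, it turns half-width katakana into full-width), so it may or may not suit your file names. The sample string is made up.

```python
import unicodedata


def full_to_half(val):
    """Map full-width ASCII variants (U+FF01..U+FF5E) to half-width."""
    return val.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))


def to_half_nfkc(val):
    """Alternative: apply Unicode NFKC compatibility normalization."""
    return unicodedata.normalize('NFKC', val)


# Full-width "manual1" followed by a half-width ".pdf"
sample = '\uff4d\uff41\uff4e\uff55\uff41\uff4c\uff11.pdf'
print(full_to_half(sample))
print(to_half_nfkc(sample))

# Half-width katakana KA (U+FF76): untouched by full_to_half, widened by NFKC
half_ka = '\uff76'
print(full_to_half(half_ka) == half_ka)
print(to_half_nfkc(half_ka) == '\u30ab')
```

Because the translate table touches only U+FF01 to U+FF5E, it leaves Japanese text in file names as it is, which is probably why it is a good fit for the duplicate check in this article.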
That completes the implementation of OCR conversion of PDF files. I hope it serves as a useful reference.