Convert PDF to Documents by OCR

Introduction

This article describes how to run OCR on PDF files (converting them to Google Docs) from Python in the Google Colab environment.

Google Drive can convert a PDF into a Documents file by OCR. This article shows how to use that feature from Python code. The goals are:

  1. Extract text from PDF
  2. Use the OCR function of Google Drive for text extraction
  3. Convert to Google Documents by OCR processing and extract text
  4. Work around the problem that alphabetic characters in the file name become full-width when converted to Documents

In particular, I could not find any information about the full-width file name problem in item 4, so I am sharing it here for anyone who runs into the same problem.

Technical elements

- Google Colaboratory (Colab)

Source code

This is the final source code. Processing follows this flow:

  1. Authenticate and get the Drive Service
  2. Check already-processed PDF files by file name and exclude them from the targets
  3. Create a list of PDFs to convert
  4. Convert the target PDF files

Each step is explained in detail below.

def full_to_half(val):
  """
  Convert full-width characters to half-width.
  * Works around the problem that alphabetic characters in the file name become full-width after OCR.
  """
  return val.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))

import os
import glob
from google.colab import auth
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# Authenticate
auth.authenticate_user()
# Get the Service object used to operate Drive
drive_service = build('drive', 'v3')

# Local paths on the Drive mounted in Colab
input_path = 'drive/My Drive/PDF/INPUT'    # Input (PDF) directory path
output_path = 'drive/My Drive/PDF/OUTPUT'  # Output destination directory path

#####
# Check already-processed PDF files by file name and exclude them from the targets
####
# Walk the output directory recursively
exist_filenames = ['']
for root, dirs, files_o in os.walk(output_path):
    for filename in files_o:
        # Convert full-width to half-width and remove the .gdoc extension
        exist_filename = full_to_half(filename).replace('.gdoc', '')
        # Record the existing file name
        exist_filenames.append(exist_filename)

#####
# Create a list of PDFs to convert
####
# Walk the input directory recursively
pdf_infos = []
for root, dirs, files in os.walk(input_path):
    for filename in files:
        # Skip files that have already been converted
        if full_to_half(filename) in exist_filenames:
            continue
        # Only files with a PDF extension (case-insensitive)
        if filename.lower().endswith('.pdf'):
            filepath = os.path.join(root, filename)  # Local file path on Colab
            pdf_infos.append({
                'path': filepath,
                'name': filename
            })

#print('number of files: ' + str(len(pdf_infos)))

#MIME type of Google Docs file
MIME_TYPE = 'application/vnd.google-apps.document'

#####
# Convert the target PDF files
####
for pdf_info in pdf_infos:
  pdf_path = pdf_info['path']

  # File name after OCR
  pdf_filename = pdf_info['name']

  # Convert full-width alphabetic characters to half-width
  pdf_filename = full_to_half(pdf_filename)

  body = {
      'name': pdf_filename,
      'mimeType': MIME_TYPE,  # Target type: Google Docs (this triggers the OCR conversion)
      'parents': ['Output destination Drive directory ID']
  }
  try:
    # The uploaded content itself is a PDF, so declare it as such
    media_body = MediaFileUpload(pdf_path, mimetype='application/pdf', resumable=True)

    drive_service.files().create(
        body=body,
        media_body=media_body,
    ).execute()
  except Exception as e:
    print('error:Failed to create Documents file.')
    print(pdf_path)
    print(e)

Preparation

Make some preparations before running the above code.

Google Drive mount

Colab has a mount feature that lets you treat Google Drive as if it were a local file system. You can also operate on Drive through the Google API client, but every operation then goes through the Web API, which is slower. To keep processing fast, work on the mounted path wherever possible.

To mount Drive on Colab, connect to the runtime and press the Drive mount icon in the file browser.

The following code is then inserted into the notebook. Run it.

from google.colab import drive
drive.mount('/content/drive')

Open the displayed URL in your browser, copy the verification code shown there, and paste it into the text box.

This completes the mount.

Install Google API client

Install the Google API client for Python.

!pip install google-api-python-client

Implementation

This section explains the implementation of the source code shown above.

1. Authenticate, get Drive Service

Authenticate with Colab's auth module, then build the Service object that the Google API client uses to work with Drive.

from google.colab import auth
from googleapiclient.discovery import build

#Authentication
auth.authenticate_user()
#Get Service to operate Drive
drive_service = build('drive', 'v3')

2. Check already-processed PDF files by file name and exclude them from the targets

In this setup, all converted files are stored in one place. A duplicate check is performed so that the script can be re-run after it was interrupted partway through, or after new PDFs have been added.

The code recursively walks the output directory on the mounted Drive and appends each file name it finds to the list exist_filenames.

# Walk the output directory recursively
exist_filenames = ['']
for root, dirs, files_o in os.walk(output_path):
    for filename in files_o:
        # Convert full-width to half-width and remove the .gdoc extension
        exist_filename = full_to_half(filename).replace('.gdoc', '')
        # Record the existing file name
        exist_filenames.append(exist_filename)
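
As a variation on the duplicate check above, here is a minimal sketch that collects the names into a set (faster membership tests than a list) and strips only a trailing .gdoc instead of replacing it anywhere in the name. The helper name collect_existing_names and the temporary directory are assumptions for illustration; they are not part of the original script.

```python
import os
import tempfile


def full_to_half(val):
    """Map full-width ASCII variants (U+FF01 to U+FF5E) to half-width."""
    table = str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)})
    return val.translate(table)


def collect_existing_names(output_dir):
    """Return the normalized names of already-converted files as a set.

    Hypothetical helper: walks output_dir recursively, converts each file
    name to half-width, and drops a trailing '.gdoc' extension.
    """
    names = set()
    for _root, _dirs, files in os.walk(output_dir):
        for filename in files:
            normalized = full_to_half(filename)
            if normalized.endswith('.gdoc'):
                normalized = normalized[:-len('.gdoc')]
            names.add(normalized)
    return names


# Demonstration with a temporary directory standing in for the mounted Drive
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'ｒｅｐｏｒｔ.pdf.gdoc'), 'w').close()
    existing = collect_existing_names(tmp)

print(existing)
```

With the mounted Drive, the same helper would be called with output_path in place of the temporary directory.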

3. Create a list of PDFs to convert

Create the list of PDFs to convert at runtime. A file whose name matches one of the already-converted names collected in step 2 is skipped. A PDF file that does not match is a new addition, so it is appended to the list pdf_infos as a PDF to be processed.

# Walk the input directory recursively
pdf_infos = []
for root, dirs, files in os.walk(input_path):
    for filename in files:
        # Skip files that have already been converted
        if full_to_half(filename) in exist_filenames:
            continue
        # Only files with a PDF extension (case-insensitive)
        if filename.lower().endswith('.pdf'):
            filepath = os.path.join(root, filename)  # Local file path on Colab
            pdf_infos.append({
                'path': filepath,
                'name': filename
            })
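
The filtering step can also be written as a small function, which makes it easy to try out on a scratch directory before pointing it at Drive. This is only a sketch: find_new_pdfs and its normalize parameter are hypothetical names, and the temporary directory stands in for input_path.

```python
import os
import tempfile


def find_new_pdfs(input_dir, existing_names, normalize=lambda name: name):
    """Return [{'path': ..., 'name': ...}] for PDFs not yet converted.

    Hypothetical helper: `normalize` is applied to each file name before the
    lookup (the article's full_to_half would be passed here).
    """
    pdf_infos = []
    for root, _dirs, files in os.walk(input_dir):
        for filename in sorted(files):
            # Accept .pdf / .PDF and any other casing
            if not filename.lower().endswith('.pdf'):
                continue
            # Skip names that were already converted
            if normalize(filename) in existing_names:
                continue
            pdf_infos.append({
                'path': os.path.join(root, filename),
                'name': filename,
            })
    return pdf_infos


# Demonstration: one converted PDF, one new PDF, one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ('done.pdf', 'new.PDF', 'notes.txt'):
        open(os.path.join(tmp, name), 'w').close()
    result = find_new_pdfs(tmp, {'done.pdf'})
    new_names = [info['name'] for info in result]

print(new_names)
```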

4. Convert the target PDF file

Convert the PDF files based on the list built in steps 1 to 3.

Create a new file in Drive with files().create().execute() on the Drive Service object. If you specify the Google Docs MIME type for the new file, the uploaded PDF is automatically converted into an OCR-processed Documents file.

Specify the converted file name, the MIME type, and the parent directory ID in the body parameter of create(). For the media_body parameter, pass a MediaFileUpload object that wraps the PDF file to upload.

for pdf_info in pdf_infos:
  pdf_path = pdf_info['path']

  # File name after OCR
  pdf_filename = pdf_info['name']

  # Convert full-width alphabetic characters to half-width
  pdf_filename = full_to_half(pdf_filename)

  body = {
      'name': pdf_filename,
      'mimeType': MIME_TYPE,  # Target type: Google Docs (this triggers the OCR conversion)
      'parents': ['Output destination Drive directory ID']
  }
  try:
    # The uploaded content itself is a PDF, so declare it as such
    media_body = MediaFileUpload(pdf_path, mimetype='application/pdf', resumable=True)

    drive_service.files().create(
        body=body,
        media_body=media_body,
    ).execute()
  except Exception as e:
    print('error:Failed to create Documents file.')
    print(pdf_path)
    print(e)
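
The script stops once the Documents file exists. To get the recognized text itself, one option is to export the converted document as plain text. The sketch below assumes the Drive v3 files().export call with mimeType 'text/plain', and assumes that execute() returns the exported content as bytes; export_doc_text and the file ID are hypothetical. A small stand-in object replaces the real Drive Service so the sketch runs without credentials.

```python
def export_doc_text(drive_service, file_id):
    """Export a Google Docs file as plain text and return it as str.

    Hypothetical helper. Assumes drive_service is a Drive v3 Service object
    and file_id is the ID of a converted Documents file.
    """
    data = drive_service.files().export(
        fileId=file_id,
        mimeType='text/plain',
    ).execute()
    # Assumption: the client returns raw bytes here; accept str as well.
    if isinstance(data, bytes):
        # utf-8-sig drops a leading byte-order mark if one is present
        return data.decode('utf-8-sig')
    return data


# Stand-in objects so the sketch runs offline (not part of any real API)
class _FakeRequest:
    def __init__(self, payload):
        self._payload = payload

    def execute(self):
        return self._payload


class _FakeFiles:
    def export(self, fileId, mimeType):
        return _FakeRequest('\ufeffHello OCR'.encode('utf-8'))


class _FakeDrive:
    def files(self):
        return _FakeFiles()


text = export_doc_text(_FakeDrive(), 'hypothetical-file-id')
print(text)
```

In Colab, drive_service from step 1 would be passed instead of the stand-in, together with the ID of the file created in this step.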

Working around full-width alphabetic characters in the converted file name

Documents files created by OCR conversion of PDF files end up with full-width alphabetic characters in their file names. I investigated this with the following code.

chars = [
  'ｍ',  # Character copied from the Documents file name (full-width)
  'm'   # Character typed directly (half-width)
]

# Full-width (file name after conversion)
print(hex(ord(chars[0])))
# Half-width
print(hex(ord(chars[1])))

#Convert full-width alphabetic characters to half-width alphabetic characters
print(hex(ord(chars[0].translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)})))))

Execution result

0xff4d
0x6d
0x6d

These results show that the characters in the converted file name are full-width (0xff4d) and that the translation table converts them to half-width (0x6d).
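
For comparison, the standard library's unicodedata.normalize with the 'NFKC' form also folds full-width alphabetic characters to half-width. It is a different technique from the translation table used in this article and it changes more than that range (for example, it turns half-width katakana into full-width), so the sketch below only illustrates the difference. The sample strings are made up.

```python
import unicodedata


def full_to_half(val):
    """The article's table: U+FF01 to U+FF5E mapped onto U+0021 to U+007E."""
    table = str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)})
    return val.translate(table)


name = 'ｒｅｐｏｒｔ２０２０．ｐｄｆ'  # made-up full-width file name

by_table = full_to_half(name)
by_nfkc = unicodedata.normalize('NFKC', name)
print(by_table)
print(by_nfkc)

# The two approaches differ outside the full-width ASCII range:
kana = 'ｶ'  # half-width katakana, U+FF76
kana_by_table = full_to_half(kana)                  # left unchanged
kana_by_nfkc = unicodedata.normalize('NFKC', kana)  # becomes full-width
print(kana_by_table == kana, kana_by_nfkc)
```

The translation table touches only the 94 full-width ASCII variants, which keeps Japanese characters in file names exactly as they are.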

For the conversion, I referred to this article: [Python] Convert full-width and half-width characters to each other in one line (alphabet + number + symbol) - Qiita

In conclusion

With the above, OCR conversion of PDF files is implemented. I hope this serves as a useful reference.
