Try translating a PDF with Python while maintaining the layout

Introduction

There are [various ways to machine translate a PDF](https://needtec.sakura.ne.jp/wod07672/2020/05/07/pdf%e3%82%92%e7%bf%bb%e8%a8%b3%e3%81%97%e3%81%9f%e3%81%84/). Among them, DocTranslator is easy to use because it translates while preserving the layout. However, it appears to have a file size limit, and documents such as PDF32000_2008.pdf fail to translate.

This time, let's think about how to translate a PDF without breaking its layout. To state the conclusion first: the method described here is cumbersome, so if another method works for you, I recommend using it instead.

Embed translation as annotation in PDF

It would be nice to embed the translation in the body text without breaking the layout, as DocTranslator does, but I expect that to go wrong in various ways, so I embed the translation as an annotation (comment) instead.

For example, if you translate PDF32000_2008.pdf, the result will be as follows.

http://needtec.sakura.ne.jp/doc/tmp/output.pdf

When viewed in Adobe Acrobat Reader, the translations are displayed as comments attached to the text.

Overview

The mechanism of this translation is as follows.

First, a CSV containing the text to translate and a JSON file storing information such as the position of each text are extracted from the original PDF. The CSV is then uploaded to Google Drive and opened in Google Sheets, where the source text is translated with the GOOGLETRANSLATE function. Finally, the CSV edited in Google Sheets is downloaded, and annotations containing the translations are added to the PDF based on that CSV and the JSON that stores the text positions.

How to use

Advance preparation

  1. Prepare Python 3.7.5.
  1. Install the library.

  1. Set up the Google Drive API and the Google Sheets API. Running the quickstarts below should be enough.

Google Drive API- Python Quickstart https://developers.google.com/drive/api/v3/quickstart/python

Google Sheets API- Python Quickstart https://developers.google.com/sheets/api/quickstart/python

The quickstart creates a JSON file that stores the credentials; use that file.

  1. Download the required scripts from https://github.com/mima3/pdf_translate

Translation method

  1. Download the PDF you want to translate to your local PC.
  1. Run the following command to create, from the PDF to be translated, a JSON file that records the text and its position and a CSV that lists the texts.

python ./analyze_pdf_text.py PDF32000_2008.pdf

The following files will be created: PDF32000_2008.pdf.json and PDF32000_2008.pdf.csv.

  1. Enter the translations in the second column of PDF32000_2008.pdf.csv. This time I upload the file to Google Sheets and translate it with the GOOGLETRANSLATE function. The following script automates this.

python ./translate_google_sheets.py PDF32000_2008.pdf.csv client_secret.json

  1. Run the following command to embed the translations as annotations.

python ./embed_annots.py PDF32000_2008.pdf.json output.pdf

Description of the libraries used

How to get the text information in a PDF

I use PyMuPDF.

The Page.getText method returns the position and contents of each text block. (Newer PyMuPDF releases name this method Page.get_text.)

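**Sample code**

    import fitz  # PyMuPDF

    doc = fitz.open('PDF32000_2008.pdf')
    # Print the text blocks of the page at index 5 together with their positions.
    # Newer PyMuPDF releases spell this method get_text.
    print(doc[5].getText('blocks'))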

**Output example**

[(36.779998779296875, 39.29692077636719, 130.1901397705078, 52.363121032714844, 'PDF 32000-1:2008', 0, 0), ...(rest omitted)]
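As a rough illustration of how block tuples like the one above could be split into the two intermediate files, here is a minimal sketch. It is not the code of analyze_pdf_text.py: the function name `split_blocks`, the record keys (`page`, `block`, `bbox`, `row`) and the idea of linking CSV and JSON by row index are all assumptions made for this example. It only relies on the tuple layout shown in the output (x0, y0, x1, y1, text, block number, block type).

```python
import csv
import io
import json


def split_blocks(page_no, blocks):
    """Split 'blocks' tuples into CSV rows (text) and JSON records (positions).

    Hypothetical helper: the record layout is an assumption for illustration.
    """
    rows, records = [], []
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        if block_type != 0:  # 0 marks a text block; skip anything else
            continue
        text = " ".join(text.split())  # collapse line breaks inside the block
        if not text:
            continue
        rows.append([text, ""])  # second column stays empty for the translation
        records.append({
            "page": page_no,
            "block": block_no,
            "bbox": [x0, y0, x1, y1],
            "row": len(rows) - 1,  # assumed link between CSV row and position
        })
    return rows, records


# The block tuple from the output example above (page index 5).
sample = [(36.779998779296875, 39.29692077636719, 130.1901397705078,
           52.363121032714844, 'PDF 32000-1:2008', 0, 0)]
rows, records = split_blocks(5, sample)

# Serialize in memory; a real script would write these to files instead.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()
json_text = json.dumps(records)
print(csv_text)
print(json_text)
```

Keeping the positions out of the CSV means the spreadsheet only ever contains text, so nothing but the translation column needs to be read back afterwards.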

You can also get the text word by word by passing "words" as the first argument of getText.

**Output example**

[(36.779998779296875, 39.29692077636719, 58.759761810302734, 52.363121032714844, 'PDF', 0, 0, 0), ...(rest omitted)]

If you give "json" or "dict", you can get more detailed information such as fonts and colors.

{
 "width":595.0,
 "height":842.0,
 "blocks":[
  {
   "type":0,
   "bbox":[
    36.779998779296875,
    39.29692077636719,
    130.1901397705078,
    52.363121032714844
   ],
   "lines":[
    {
     "wmode":0,
     "dir":[
      1.0,
      0.0
     ],
     "bbox":[
      36.779998779296875,
      39.29692077636719,
      130.1901397705078,
      52.363121032714844
     ],
     "spans":[
      {
       "size":10.979999542236328,
       "flags":20,
       "font":"DDPEIM+Helvetica-Bold",
       "color":0,
       "text":"PDF 32000-1:2008",
       "bbox":[
        36.779998779296875,
        39.29692077636719,
        130.1901397705078,
        52.363121032714844
       ]
      }
     ]
    }
   ]
  },
(rest omitted)
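To show how this nested structure can be consumed, the following sketch walks blocks, lines and spans and collects the text together with its font and size. The helper name `iter_spans` is made up for this example, and the sample dictionary is just the fragment shown above, typed in by hand rather than read from a PDF.

```python
def iter_spans(page_dict):
    """Yield (text, font, size, bbox) for every span of every text block."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # only text blocks carry "lines"
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span["text"], span["font"], span["size"], tuple(span["bbox"])


# Hand-copied fragment of the "dict" output shown above.
bbox = [36.779998779296875, 39.29692077636719,
        130.1901397705078, 52.363121032714844]
sample_page = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [{
        "type": 0,
        "bbox": bbox,
        "lines": [{
            "wmode": 0,
            "dir": [1.0, 0.0],
            "bbox": bbox,
            "spans": [{
                "size": 10.979999542236328,
                "flags": 20,
                "font": "DDPEIM+Helvetica-Bold",
                "color": 0,
                "text": "PDF 32000-1:2008",
                "bbox": bbox,
            }],
        }],
    }],
}

spans = list(iter_spans(sample_page))
for text, font, size, span_bbox in spans:
    print(text, font, size, span_bbox)
```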

How do you embed annotations?

Use Page.addTextAnnot. If you want to try annotation types other than the simple one added here, the following code is a helpful reference.

https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/annotations/new-annots.py
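The embedding step might look roughly like the sketch below. It is not the code of embed_annots.py: the function `embed_translations`, the record layout with a `bbox` key, the pairing of records and translations by order, and the placeholder translation string are assumptions for illustration. The function accepts any page object that offers `add_text_annot(point, text)` (the name in newer PyMuPDF releases) or `addTextAnnot` (the older name used in this article), so the sketch uses a small stand-in class and runs without a PDF.

```python
def embed_translations(page, records, translations):
    """Add one text annotation per record at the top-left corner of its bbox.

    Hypothetical helper; returns the number of annotations added.
    """
    # Prefer the newer method name and fall back to the older camelCase one.
    add_annot = getattr(page, "add_text_annot", None) or page.addTextAnnot
    count = 0
    for record, translated in zip(records, translations):
        if not translated:  # nothing to embed for untranslated rows
            continue
        x0, y0, _x1, _y1 = record["bbox"]
        add_annot((x0, y0), translated)
        count += 1
    return count


class FakePage:
    """Stand-in for a PyMuPDF page so the sketch runs without a PDF."""

    def __init__(self):
        self.annots = []

    def add_text_annot(self, point, text):
        self.annots.append((point, text))


records = [{"bbox": [36.78, 39.3, 130.19, 52.36]},
           {"bbox": [36.78, 60.0, 130.19, 72.0]}]
translations = ["(placeholder translation)", ""]  # second row left untranslated

page = FakePage()
added = embed_translations(page, records, translations)
print(added, page.annots)
```

With PyMuPDF installed, the same function could be called with a real page, for example `embed_translations(doc[5], records, translations)` after `doc = fitz.open('PDF32000_2008.pdf')`, followed by `doc.save('output.pdf')`.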

CSV upload

[Upload CSV to Google Drive, edit it as Google Spreadsheet, and download it](https://needtec.sakura.ne.jp/wod07672/2020/05/08/google-drive%e3%81%abcsv%e3%82%92%e3%82%a2%e3%83%83%e3%83%97%e3%83%ad%e3%83%bc%e3%83%89%e3%81%97%e3%81%a6google%e3%82%b9%e3%83%97%e3%83%ac%e3%83%83%e3%83%89%e3%82%b7%e3%83%bc%e3%83%88%e3%81%a8/) has sample code for the upload alone.

Excerpt from the upload process


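    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    # creds: the credentials object obtained as in the quickstart
    service_drive = build('drive', 'v3', credentials=creds)

    # Upload the CSV so that it can be edited in Google Sheets.
    # https://developers.google.com/drive/api/v3/manage-uploads#python
    file_metadata = {
        'name': 'Test',
        # Ask Drive to convert the upload into a Google Sheets spreadsheet
        'mimeType': 'application/vnd.google-apps.spreadsheet'
    }
    media = MediaFileUpload('test.csv',
                            mimetype='text/csv',
                            resumable=True)
    file = service_drive.files().create(body=file_metadata,
                                        media_body=media,
                                        fields='id').execute()
    print('File ID: %s' % file.get('id'))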

The CSV is stored as a Google Sheets spreadsheet because "application/vnd.google-apps.spreadsheet" is specified as the mimeType in the body parameter of create.

Edit as Google Spreadsheet

[Upload CSV to Google Drive, edit it as Google Sheets, and then download it](https://needtec.sakura.ne.jp/wod07672/2020/05/08/google-drive%e3%81%abcsv%e3%82%92%e3%82%a2%e3%83%83%e3%83%97%e3%83%ad%e3%83%bc%e3%83%89%e3%81%97%e3%81%a6google%e3%82%b9%e3%83%97%e3%83%ac%e3%83%83%e3%83%89%e3%82%b7%e3%83%bc%e3%83%88%e3%81%a8/) contains sample code that edits the CSV uploaded to Google Drive through the Google Sheets API.

This works for small CSVs, but when a large number of rows is updated, the GOOGLETRANSLATE function (https://support.google.com/docs/answer/3093331?hl=ja) may not finish evaluating. In that case the cell shows "Loading..." instead of the translated text. The same problem occurs when you edit the sheet on screen rather than through the API.

I could not find a good workaround, so the script checks every 10 seconds until no cell shows "Loading..." any more. https://github.com/mima3/pdf_translate/blob/master/translate_google_sheets.py#L45
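The polling idea can be sketched as follows. This is an illustration, not the code at the link above: `wait_until_translated`, its parameters, and the assumption that pending cells start with the word "Loading" are all made up for this example. The sheet is read through a caller-supplied `fetch_values` function so that the sketch runs without any Google credentials.

```python
import time


def wait_until_translated(fetch_values, interval=10, max_checks=60,
                          sleep=time.sleep):
    """Poll until no cell still shows the loading placeholder.

    fetch_values: callable returning the sheet contents as a list of rows.
    Raises TimeoutError if cells are still pending after max_checks polls.
    """
    for _ in range(max_checks):
        values = fetch_values()
        # Assumption: a pending GOOGLETRANSLATE cell starts with "Loading".
        pending = any(str(cell).startswith("Loading")
                      for row in values for cell in row)
        if not pending:
            return values
        sleep(interval)
    raise TimeoutError("translation cells are still loading")


# Simulated sheet: the first poll is still loading, the second is finished.
responses = iter([
    [["Hello", "Loading..."]],
    [["Hello", "こんにちは"]],
])
result = wait_until_translated(lambda: next(responses),
                               sleep=lambda seconds: None)
print(result)
```

Against the real API, `fetch_values` could wrap a call such as `service.spreadsheets().values().get(spreadsheetId=..., range=...).execute().get('values', [])`. Capping the number of checks avoids waiting forever if a formula never resolves.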

Also, this time I naively update the cells with the update method, but using batchUpdate would probably be better.
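For reference, a request body for `spreadsheets().values().batchUpdate` that fills the whole translation column in one call might be built like this. The helper `build_translate_body`, the column layout (source text in column A, formulas in column B, starting at row 1) and the language pair "en" to "ja" are assumptions for this sketch, not something taken from the author's script.

```python
def build_translate_body(row_count, source="en", target="ja"):
    """Build a values.batchUpdate body that puts one formula per row in column B."""
    if row_count < 1:
        raise ValueError("row_count must be at least 1")
    formulas = [
        [f'=GOOGLETRANSLATE(A{row},"{source}","{target}")']
        for row in range(1, row_count + 1)
    ]
    return {
        # USER_ENTERED makes Sheets parse the strings as formulas.
        "valueInputOption": "USER_ENTERED",
        "data": [{"range": f"B1:B{row_count}", "values": formulas}],
    }


body = build_translate_body(3)
print(body)
```

The body would then be sent with something like `service_sheets.spreadsheets().values().batchUpdate(spreadsheetId=file_id, body=body).execute()`, where `service_sheets` and `file_id` are the Sheets service object and the ID of the uploaded file.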

Google spreadsheet download

[Upload CSV to Google Drive, edit it as Google Spreadsheet, and download it](https://needtec.sakura.ne.jp/wod07672/2020/05/08/google-drive%e3%81%abcsv%e3%82%92%e3%82%a2%e3%83%83%e3%83%97%e3%83%ad%e3%83%bc%e3%83%89%e3%81%97%e3%81%a6google%e3%82%b9%e3%83%97%e3%83%ac%e3%83%83%e3%83%89%e3%82%b7%e3%83%bc%e3%83%88%e3%81%a8/) has sample code for the download.

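    import io
    from googleapiclient.http import MediaIoBaseDownload

    # Export the spreadsheet as CSV and download it in chunks.
    # file_id: the ID returned when the CSV was uploaded
    request = service_drive.files().export_media(fileId=file_id, mimeType='text/csv')
    fh = io.BytesIO()
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while not done:
        status, done = downloader.next_chunk()
        print("Download %d%%." % int(status.progress() * 100))
    # Write the downloaded bytes to a local file.
    with open('download.csv', 'wb') as f:
        f.write(fh.getvalue())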

There are two ways to download from Google Drive: get_media and export_media. https://developers.google.com/drive/api/v3/manage-downloads

This time I use export_media, because the file was converted to a Google Sheets spreadsheet and has to be exported back to CSV.

Summary

With this approach I was able to translate a PDF of about 700 pages. However, the method is cumbersome and time-consuming, so consider another method if one is available. At the very least, if you can use another translation API, you can skip the trouble of uploading the CSV, inserting the formulas, and downloading it again.
