Introduction

There are [various ways to machine-translate a PDF](https://needtec.sakura.ne.jp/wod07672/2020/05/07/pdf%e3%82%92%e7%bf%bb%e8%a8%b3%e3%81%97%e3%81%9f%e3%81%84/). Of these, DocTranslator is easy to use in that it translates while preserving the layout. However, it seems to have a size limit, and files such as PDF32000_2008.pdf fail to translate.

This time, let's look at a way to translate a PDF without breaking its layout. To state the conclusion first: the method described here is cumbersome, so if another method works for you, I recommend using it instead.

Embed translation as annotation in PDF

It would be nice to embed the translation in the body text without breaking the layout, as DocTranslator does, but I can foresee that failing in various ways, so I embed the translation as annotations instead.

For example, if you translate PDF32000_2008.pdf, the result will be as follows.

http://needtec.sakura.ne.jp/doc/tmp/output.pdf

Please view it with Adobe Acrobat Reader. The annotations are garbled when viewed in a browser.

When viewed with Adobe Acrobat Reader, it will be displayed as follows.

Overview

The mechanism of this translation is as follows.

First, two files are extracted from the original PDF: a CSV containing the text to translate, and a JSON file that stores information such as the position of each piece of text. Next, the CSV is uploaded to Google Drive and opened in Google Sheets, where the source text is translated with the GOOGLETRANSLATE function. Finally, the edited CSV is downloaded from Google Sheets, and annotations containing the translations are added to the PDF based on the CSV and the position data in the JSON.

How to use

Advance preparation

1. Prepare Python 3.7.5.
2. Install the required libraries.
3. Get ready to use the Google Drive API and the Google Sheets API. Running the quickstarts below should be enough.

Google Drive API - Python Quickstart https://developers.google.com/drive/api/v3/quickstart/python

Google Sheets API - Python Quickstart https://developers.google.com/sheets/api/quickstart/python

The quickstart creates a JSON file that stores the credentials, so use that file.

4. Download the required scripts from https://github.com/mima3/pdf_translate

Translation method

1. Download the PDF you want to translate to your local PC.
2. Run the following command to create, from the PDF to be translated, a JSON file that records the text and its position and a CSV that lists the text.

python ./analyze_pdf_text.py PDF32000_2008.pdf

The following files will be created.

PDF32000_2008.pdf.json
PDF32000_2008.pdf.csv
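
As a rough illustration of what these two files could hold, here is a minimal sketch that turns PyMuPDF-style block tuples into position records and a two-column CSV. The record layout and the helper names (`blocks_to_records`, `records_to_csv`) are assumptions made for this example and are not taken from analyze_pdf_text.py. It also assumes that the last two fields of each block tuple are the block number and the block type.

```python
import csv
import io
import json


def blocks_to_records(page_no, blocks):
    """Turn block tuples (x0, y0, x1, y1, text, block_no, block_type) into dicts."""
    records = []
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        if block_type != 0:  # assumption: 0 marks a text block, anything else is skipped
            continue
        records.append({
            "page": page_no,
            "block_no": block_no,
            "bbox": [x0, y0, x1, y1],
            "text": text.strip(),
        })
    return records


def records_to_csv(records):
    """One row per block: source text, then an empty column for the translation."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for rec in records:
        writer.writerow([rec["text"], ""])
    return buf.getvalue()


# The block shown in the PyMuPDF output example for page index 5.
sample_blocks = [
    (36.779998779296875, 39.29692077636719,
     130.1901397705078, 52.363121032714844,
     "PDF 32000-1:2008", 0, 0),
]
records = blocks_to_records(5, sample_blocks)
position_json = json.dumps(records, indent=1)  # would go into the .json file
translation_csv = records_to_csv(records)      # would go into the .csv file
print(position_json)
print(translation_csv)
```

Keeping the positions in JSON and only the text in CSV means the spreadsheet step never has to touch coordinates. The rows only need to stay in the same order.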

3. Enter the translations in the second column of PDF32000_2008.pdf.csv. Here I upload the file to Google Sheets and translate it with the GOOGLETRANSLATE function. The following script automates this.

python ./translate_google_sheets.py PDF32000_2008.pdf.csv client_secret.json

4. Run the following command to embed the translations as annotations.

python ./embed_annots.py PDF32000_2008.pdf.json output.pdf

About the libraries used

How do you get the text information in a PDF?

I am using PyMuPDF.

You can get the position and contents of each text block with the Page.getText method. (Newer PyMuPDF releases name this method Page.get_text.)

**Sample code**

import fitz
doc = fitz.open('PDF32000_2008.pdf')
print(doc[5].getText('blocks'))

**Output example**

[(36.779998779296875, 39.29692077636719, 130.1901397705078, 52.363121032714844, 'PDF 32000-1:2008', 0, 0), ...]

You can also get the text word by word by passing "words" as the first argument of getText.

**Output example**

[(36.779998779296875, 39.29692077636719, 58.759761810302734, 52.363121032714844, 'PDF', 0, 0, 0), ...]
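
Word-level output can be regrouped into lines when needed. The following is a minimal sketch, assuming each word tuple has the form (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper name `words_to_lines` and the coordinates of the second sample word are made up for illustration.

```python
from collections import defaultdict


def words_to_lines(words):
    """Group word tuples by (block_no, line_no) and join them in word order."""
    grouped = defaultdict(list)
    for _x0, _y0, _x1, _y1, word, block_no, line_no, word_no in words:
        grouped[(block_no, line_no)].append((word_no, word))
    return {
        key: " ".join(word for _, word in sorted(items))
        for key, items in sorted(grouped.items())
    }


sample_words = [
    (36.779998779296875, 39.29692077636719,
     58.759761810302734, 52.363121032714844, "PDF", 0, 0, 0),
    # Illustrative coordinates only; they are not taken from the real PDF.
    (61.8, 39.3, 130.19, 52.36, "32000-1:2008", 0, 0, 1),
]
lines = words_to_lines(sample_words)
print(lines)
```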

If you pass "json" or "dict", you get more detailed information such as fonts and colors.

{
 "width":595.0,
 "height":842.0,
 "blocks":[
  {
   "type":0,
   "bbox":[
    36.779998779296875,
    39.29692077636719,
    130.1901397705078,
    52.363121032714844
   ],
   "lines":[
    {
     "wmode":0,
     "dir":[
      1.0,
      0.0
     ],
     "bbox":[
      36.779998779296875,
      39.29692077636719,
      130.1901397705078,
      52.363121032714844
     ],
     "spans":[
      {
       "size":10.979999542236328,
       "flags":20,
       "font":"DDPEIM+Helvetica-Bold",
       "color":0,
       "text":"PDF 32000-1:2008",
       "bbox":[
        36.779998779296875,
        39.29692077636719,
        130.1901397705078,
        52.363121032714844
       ]
      }
     ]
    }
   ]
  },
(rest omitted)
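
The nested structure above (blocks, then lines, then spans) can be walked with a small generator. This is a minimal sketch over a trimmed copy of the sample output. The helper name `iter_spans` is hypothetical, and in real use the dictionary would come from the "dict" output of getText.

```python
def iter_spans(page_dict):
    """Yield every text span in a page dictionary (blocks -> lines -> spans)."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # the sample text block has type 0
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span


# Trimmed copy of the output example above.
page_dict = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [{
        "type": 0,
        "lines": [{
            "spans": [{
                "size": 10.979999542236328,
                "font": "DDPEIM+Helvetica-Bold",
                "color": 0,
                "text": "PDF 32000-1:2008",
            }],
        }],
    }],
}

spans = list(iter_spans(page_dict))
for span in spans:
    print(span["font"], span["size"], span["text"])
```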

How do you embed annotations?

Use Page.addTextAnnot. If you want to try annotations other than the simple ones added here, the following code is a useful reference.

https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/annotations/new-annots.py
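
To show how the position data and the translated CSV might be combined, here is a minimal sketch. The record layout, the helper names, and the placeholder translation are assumptions for illustration and do not reproduce embed_annots.py. The PyMuPDF call is isolated in `embed_annotations`, which tries the newer name `add_text_annot` and falls back to `addTextAnnot`.

```python
import csv
import io


def build_annotation_specs(records, csv_text):
    """Pair each position record with the translation in the CSV's second column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    specs = []
    for rec, row in zip(records, rows):
        translation = row[1].strip() if len(row) > 1 else ""
        if not translation:
            continue  # nothing to annotate for this block
        x0, y0, _x1, _y1 = rec["bbox"]
        specs.append((rec["page"], (x0, y0), translation))
    return specs


def embed_annotations(pdf_in, pdf_out, specs):
    """Add one text annotation per spec and save the result (needs PyMuPDF)."""
    import fitz  # imported here so the pure helper above works without PyMuPDF

    doc = fitz.open(pdf_in)
    for page_no, point, text in specs:
        page = doc[page_no]
        add_annot = getattr(page, "add_text_annot", None) or page.addTextAnnot
        add_annot(point, text)
    doc.save(pdf_out)


# Hypothetical record and CSV row; "<translated text>" is a placeholder.
records = [{"page": 5, "bbox": [36.78, 39.3, 130.19, 52.36],
            "text": "PDF 32000-1:2008"}]
csv_text = "PDF 32000-1:2008,<translated text>\n"
specs = build_annotation_specs(records, csv_text)
print(specs)
# embed_annotations("PDF32000_2008.pdf", "output.pdf", specs)
```

Placing each annotation at the top-left corner of its block is just one simple choice. Any point inside the bounding box would work as well.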

CSV upload

The article [Upload CSV to Google Drive, edit it as Google Spreadsheet, and download it](https://needtec.sakura.ne.jp/wod07672/2020/05/08/google-drive%e3%81%abcsv%e3%82%92%e3%82%a2%e3%83%83%e3%83%97%e3%83%ad%e3%83%bc%e3%83%89%e3%81%97%e3%81%a6google%e3%82%b9%e3%83%97%e3%83%ac%e3%83%83%e3%83%89%e3%82%b7%e3%83%bc%e3%83%88%e3%81%a8/) contains standalone sample code for the upload.

`Excerpt from the upload process`


    from googleapiclient.discovery import build
    from googleapiclient.http import MediaFileUpload

    # creds: the credentials obtained as in the Python quickstart
    service_drive = build('drive', 'v3', credentials=creds)

    # Upload the CSV so that it can be edited in Google Sheets.
    # https://developers.google.com/drive/api/v3/manage-uploads#python
    file_metadata = {
        'name': 'Test',
        'mimeType': 'application/vnd.google-apps.spreadsheet'
    }
    media = MediaFileUpload('test.csv',
                            mimetype='text/csv',
                            resumable=True)
    file = service_drive.files().create(body=file_metadata,
                                        media_body=media,
                                        fields='id').execute()
    print('File ID: %s' % file.get('id'))

The CSV is added as a Google spreadsheet by specifying "application/vnd.google-apps.spreadsheet" as the mimeType in the body parameter of create.

Edit as Google Spreadsheet

The article [Upload CSV to Google Drive, edit it as Google Sheets, and then download it](https://needtec.sakura.ne.jp/wod07672/2020/05/08/google-drive%e3%81%abcsv%e3%82%92%e3%82%a2%e3%83%83%e3%83%97%e3%83%ad%e3%83%bc%e3%83%89%e3%81%97%e3%81%a6google%e3%82%b9%e3%83%97%e3%83%ac%e3%83%83%e3%83%89%e3%82%b7%e3%83%bc%e3%83%88%e3%81%a8/) contains sample code that edits a CSV uploaded to Google Drive with the Google Sheets API.

This works for small CSVs, but when a large number of rows are updated, the GOOGLETRANSLATE function (https://support.google.com/docs/answer/3093331?hl=ja) may not finish evaluating. In that case, the cell displays "Loading ..." instead of the translated text. The same problem occurs when you edit the spreadsheet on screen rather than through the API.

I couldn't find a good workaround this time, so the script checks every 10 seconds until no cell contains "Loading ...". https://github.com/mima3/pdf_translate/blob/master/translate_google_sheets.py#L45
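
The polling idea can be sketched as follows. This is not the code from translate_google_sheets.py. `fetch_values` is a hypothetical callable that returns the current cell values, and matching on the "Loading" prefix and the `max_tries` limit are assumptions added for the example.

```python
import time


def wait_until_translated(fetch_values, interval=10.0, max_tries=60,
                          sleep=time.sleep):
    """Poll fetch_values() until no cell still shows a "Loading" placeholder."""
    for _ in range(max_tries):
        values = fetch_values()
        if not any(str(v).startswith("Loading") for v in values):
            return values
        sleep(interval)  # wait before asking the sheet again
    raise TimeoutError("cells still loading after %d checks" % max_tries)


# Demo with canned responses and a no-op sleep, so nothing actually waits.
responses = iter([
    ["Loading ...", "first translation"],
    ["second translation", "first translation"],
])
result = wait_until_translated(lambda: next(responses), sleep=lambda s: None)
print(result)
```

Passing the sleep function as a parameter keeps the loop easy to try out without waiting 10 seconds per check.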

Also, this time I naively update the cells with the update method, but using batchUpdate would probably be better.
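
As a sketch of the batchUpdate alternative, the request body for spreadsheets.values.batchUpdate can be built in one go. The helper name `build_translate_body`, the language pair ("en" to "ja"), and the assumption that the source text sits in column A starting at row 1 are illustrative choices, not details taken from the script.

```python
def build_translate_body(row_count, source_lang="en", target_lang="ja"):
    """Build a values.batchUpdate body that puts a GOOGLETRANSLATE formula in column B."""
    formulas = [
        ['=GOOGLETRANSLATE(A%d,"%s","%s")' % (row, source_lang, target_lang)]
        for row in range(1, row_count + 1)
    ]
    return {
        # USER_ENTERED makes Sheets parse the strings as formulas.
        "valueInputOption": "USER_ENTERED",
        "data": [{"range": "B1:B%d" % row_count, "values": formulas}],
    }


body = build_translate_body(3)
print(body)

# With an authorized Sheets service object (service_sheets and spreadsheet_id
# are not defined in this sketch), the body would be sent in a single request:
# service_sheets.spreadsheets().values().batchUpdate(
#     spreadsheetId=spreadsheet_id, body=body).execute()
```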

Google spreadsheet download

    import io
    from googleapiclient.http import MediaIoBaseDownload

    # Download the spreadsheet, exported as CSV.
    request = service_drive.files().export_media(fileId=file_id, mimeType='text/csv')
    fh = io.BytesIO()
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while not done:
        status, done = downloader.next_chunk()
        print("Download %d%%." % int(status.progress() * 100))
    with open('download.csv', 'wb') as f:
        f.write(fh.getvalue())

There are two ways to download from Google Drive: get_media and export_media. https://developers.google.com/drive/api/v3/manage-downloads

Here I use export_media, because the file was converted to a Google spreadsheet and has to be downloaded as CSV.

Summary

For now, I was able to translate a PDF of about 700 pages this way. However, this method is cumbersome and time-consuming, so consider another method if one is available. At the very least, if you can use another translation API, you can skip the trouble of uploading the file, embedding the formula, and downloading it again.

Try translating with Python while maintaining the PDF layout