[There are various ways to machine-translate a PDF](https://needtec.sakura.ne.jp/wod07672/2020/05/07/pdf%e3%82%92%e7%bf%bb%e8%a8%b3%e3%81%97%e3%81%9f%e3%81%84/). Of these, DocTranslator is easy to use because it translates while preserving the layout. However, it seems to have a file size limit, and it failed to translate files such as PDF32000_2008.pdf.
So this time, let's think about how to translate a PDF without breaking its layout. To state the conclusion first: the method described here is cumbersome, so if another method works for you, I recommend using that instead.
It would be nice to embed the translation in the body text without breaking the layout, as DocTranslator does, but I can foresee many ways for that to go wrong, so I embed the translation as annotations (comments) instead.
For example, if you translate PDF32000_2008.pdf, the result will be as follows.
http://needtec.sakura.ne.jp/doc/tmp/output.pdf
When viewed with Adobe Acrobat Reader, it will be displayed as follows.
The mechanism of this translation is as follows.
First, two files are extracted from the original PDF: a CSV containing the text to translate, and a JSON storing information such as the position of each text. The CSV is then uploaded to Google Drive and opened in Google Sheets, where the source text is translated with the GOOGLETRANSLATE formula. Finally, the CSV edited in Google Sheets is downloaded, and annotations containing the translations are added to the PDF based on that CSV and the JSON holding the text positions.
Install the required libraries by following these quickstarts.
Google Drive API Python Quickstart: https://developers.google.com/drive/api/v3/quickstart/python
Google Sheets API Python Quickstart: https://developers.google.com/sheets/api/quickstart/python
Following the quickstart also produces a JSON file that stores the credentials; that file is used below.
Run the following command on the PDF to be translated. It creates a JSON that records the text and its positions, and a CSV that lists the texts.
python ./analyze_pdf_text.py PDF32000_2008.pdf
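The following files will be created: PDF32000_2008.pdf.csv and PDF32000_2008.pdf.json.
Next, translate the CSV with Google Sheets, passing the credentials JSON created during the quickstart:
python ./translate_google_sheets.py PDF32000_2008.pdf.csv client_secret.json
Finally, embed the translations into the PDF as annotations:
python ./embed_annots.py PDF32000_2008.pdf.json output.pdf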
I use PyMuPDF to work with the PDF.
The Page.getText method returns the position and contents of each text block. (Recent PyMuPDF releases name this method Page.get_text.)
**Sample code**
import fitz
doc = fitz.open('PDF32000_2008.pdf')
print(doc[5].getText('blocks'))
**Output example**
[(36.779998779296875, 39.29692077636719, 130.1901397705078, 52.363121032714844, 'PDF 32000-1:2008', 0, 0), ...]
Passing "words" as the first argument of getText returns the text word by word instead.
**Output example**
[(36.779998779296875, 39.29692077636719, 58.759761810302734, 52.363121032714844, 'PDF', 0, 0, 0), ...]
Passing "json" or "dict" returns more detailed information, such as fonts and colors.
{
  "width":595.0,
  "height":842.0,
  "blocks":[
    {
      "type":0,
      "bbox":[
        36.779998779296875,
        39.29692077636719,
        130.1901397705078,
        52.363121032714844
      ],
      "lines":[
        {
          "wmode":0,
          "dir":[
            1.0,
            0.0
          ],
          "bbox":[
            36.779998779296875,
            39.29692077636719,
            130.1901397705078,
            52.363121032714844
          ],
          "spans":[
            {
              "size":10.979999542236328,
              "flags":20,
              "font":"DDPEIM+Helvetica-Bold",
              "color":0,
              "text":"PDF 32000-1:2008",
              "bbox":[
                36.779998779296875,
                39.29692077636719,
                130.1901397705078,
                52.363121032714844
              ]
            }
          ]
        }
      ]
    },
    (remainder omitted)
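As a rough illustration of how the "dict" output could be split into the two files used later, the sketch below flattens one page dictionary into a list of texts (for the CSV) and position records (for the JSON). The sample dictionary is trimmed from the output shown above. The function name `flatten_page` and the record layout are assumptions made for illustration, not the actual format written by analyze_pdf_text.py.

```python
import csv
import io
import json


def flatten_page(page_dict, page_number):
    """Collect the text and bounding box of every line in a page dictionary."""
    records = []
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # type 0 is a text block; skip image blocks
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                records.append({"page": page_number, "bbox": line["bbox"], "text": text})
    return records


# Sample trimmed (and rounded) from the "dict" output shown above.
sample_page = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {
            "type": 0,
            "bbox": [36.78, 39.30, 130.19, 52.36],
            "lines": [
                {
                    "bbox": [36.78, 39.30, 130.19, 52.36],
                    "spans": [{"text": "PDF 32000-1:2008"}],
                }
            ],
        }
    ],
}

records = flatten_page(sample_page, page_number=5)

# One text per CSV row (for translation); positions are kept separately as JSON.
csv_buffer = io.StringIO()
csv.writer(csv_buffer).writerows([[r["text"]] for r in records])
positions_json = json.dumps(records, ensure_ascii=False)
print(csv_buffer.getvalue().strip())
```

Keeping the row order of the CSV identical to the order of the position records is what would let the translations be matched back to their positions by row index later on.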
To add the annotations, use Page.addTextAnnot (Page.add_text_annot in recent PyMuPDF releases). If you want to try annotation types other than the simple one added here, the following code is a useful reference.
https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/annotations/new-annots.py
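A minimal sketch of the annotation step might look like the following. It uses the newer PyMuPDF method name `add_text_annot`. The record layout (`page`, `bbox`, `translation` keys) and the helper names are hypothetical, and this is not the code of embed_annots.py.

```python
def note_anchor(bbox):
    """Return the top-left corner of a text bounding box (x0, y0, x1, y1)."""
    x0, y0, _x1, _y1 = bbox
    return (x0, y0)


def embed_translations(src_pdf, records, out_pdf):
    """Add one text annotation per record and save the result as a new PDF.

    `records` is a hypothetical list of dicts with "page", "bbox" and
    "translation" keys; the real JSON layout may differ.
    """
    import fitz  # PyMuPDF; imported lazily so note_anchor works without it

    doc = fitz.open(src_pdf)
    for record in records:
        page = doc[record["page"]]
        point = fitz.Point(*note_anchor(record["bbox"]))
        # add_text_annot is the newer name of addTextAnnot
        page.add_text_annot(point, record["translation"])
    doc.save(out_pdf)
    doc.close()
```

Under these assumptions, calling `embed_translations('PDF32000_2008.pdf', records, 'output.pdf')` would place a note icon at the top-left corner of each text, with the translation as the note's content.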
[Upload a CSV to Google Drive, edit it as a Google Sheets spreadsheet, and download it](https://needtec.sakura.ne.jp/wod07672/2020/05/08/google-drive%e3%81%abcsv%e3%82%92%e3%82%a2%e3%83%83%e3%83%97%e3%83%ad%e3%83%bc%e3%83%89%e3%81%97%e3%81%a6google%e3%82%b9%e3%83%97%e3%83%ac%e3%83%83%e3%83%89%e3%82%b7%e3%83%bc%e3%83%88%e3%81%a8/) has sample code for the upload alone.
Excerpt from the upload process
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# creds: the credentials object obtained as in the quickstart
service_drive = build('drive', 'v3', credentials=creds)

# Upload the CSV so that it can be edited in Google Sheets.
# https://developers.google.com/drive/api/v3/manage-uploads#python
file_metadata = {
    'name': 'Test',
    'mimeType': 'application/vnd.google-apps.spreadsheet'
}
media = MediaFileUpload('test.csv',
                        mimetype='text/csv',
                        resumable=True)
file = service_drive.files().create(body=file_metadata,
                                    media_body=media,
                                    fields='id').execute()
print('File ID: %s' % file.get('id'))
Specifying "application/vnd.google-apps.spreadsheet" as the mimeType in the body parameter of create makes Drive store the uploaded CSV as a Google Sheets spreadsheet.
[Upload a CSV to Google Drive, edit it as a Google Sheets spreadsheet, and download it](https://needtec.sakura.ne.jp/wod07672/2020/05/08/google-drive%e3%81%abcsv%e3%82%92%e3%82%a2%e3%83%83%e3%83%97%e3%83%ad%e3%83%bc%e3%83%89%e3%81%97%e3%81%a6google%e3%82%b9%e3%83%97%e3%83%ac%e3%83%83%e3%83%89%e3%82%b7%e3%83%bc%e3%83%88%e3%81%a8/) also contains sample code that edits the CSV uploaded to Google Drive through the Google Sheets API.
That approach is fine for a small CSV, but when a large number of rows is updated, the GOOGLETRANSLATE formula (https://support.google.com/docs/answer/3093331?hl=ja) may not have finished evaluating. In that case the cell displays "Loading..." instead of the translated text. The same problem occurs when you edit the spreadsheet on screen rather than through the API.
I could not find a good workaround this time, so the script checks every 10 seconds until no cell shows "Loading..." any more. https://github.com/mima3/pdf_translate/blob/master/translate_google_sheets.py#L45
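The polling idea can be sketched as follows. This is not the code at the link above. The function name, the `fetch_values` callback, and the timeout are assumptions, and the exact placeholder string shown by Google Sheets may differ from the "Loading" prefix checked here.

```python
import time

LOADING_PREFIX = "Loading"  # assumed placeholder text; adjust to what the sheet shows


def wait_until_translated(fetch_values, interval=10.0, timeout=600.0, sleep=time.sleep):
    """Poll until no cell still shows the loading placeholder.

    fetch_values: callable returning the translation cells as a list of rows.
    Returns the final rows, or raises TimeoutError if the placeholder persists.
    """
    waited = 0.0
    while True:
        rows = fetch_values()
        pending = sum(
            1 for row in rows for cell in row if str(cell).startswith(LOADING_PREFIX)
        )
        if pending == 0:
            return rows
        if waited >= timeout:
            raise TimeoutError(f"{pending} cells were still loading after {timeout} seconds")
        sleep(interval)
        waited += interval


# Usage with a stub that finishes on the third poll (no real API calls).
snapshots = iter([
    [["Loading..."], ["Loading..."]],
    [["こんにちは"], ["Loading..."]],
    [["こんにちは"], ["世界"]],
])
result = wait_until_translated(lambda: next(snapshots), sleep=lambda seconds: None)
print(result)
```

In a real run, `fetch_values` would wrap a Sheets API read of the translation column only, so that a source text which happens to start with "Loading" is not mistaken for the placeholder.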
Also, this time I naively update the cells with the update method, but using batchUpdate would probably be better.
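As a sketch of the batchUpdate alternative, the function below builds a request body that fills a column with GOOGLETRANSLATE formulas in chunks. The sheet name, the column layout (source text in column A, formulas in column B), the language codes, and the chunk size are all assumptions for illustration.

```python
def build_translate_batch(texts_count, sheet_name="Sheet1", source_lang="en",
                          target_lang="ja", chunk_size=500):
    """Build a values.batchUpdate body that fills column B with GOOGLETRANSLATE formulas.

    Assumes the source texts are in column A, starting at row 1.
    """
    data = []
    for start in range(1, texts_count + 1, chunk_size):
        end = min(start + chunk_size - 1, texts_count)
        formulas = [
            [f'=GOOGLETRANSLATE(A{row},"{source_lang}","{target_lang}")']
            for row in range(start, end + 1)
        ]
        data.append({"range": f"{sheet_name}!B{start}:B{end}", "values": formulas})
    # USER_ENTERED makes Sheets parse the strings as formulas rather than plain text.
    return {"valueInputOption": "USER_ENTERED", "data": data}


body = build_translate_batch(texts_count=3, chunk_size=2)
print(body["data"][0])
```

The resulting body could then be sent in a single request with something like `service_sheets.spreadsheets().values().batchUpdate(spreadsheetId=file_id, body=body).execute()`, where `service_sheets` is a Sheets API service object built as in the quickstart.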
[Upload a CSV to Google Drive, edit it as a Google Sheets spreadsheet, and download it](https://needtec.sakura.ne.jp/wod07672/2020/05/08/google-drive%e3%81%abcsv%e3%82%92%e3%82%a2%e3%83%83%e3%83%97%e3%83%ad%e3%83%bc%e3%83%89%e3%81%97%e3%81%a6google%e3%82%b9%e3%83%97%e3%83%ac%e3%83%83%e3%83%89%e3%82%b7%e3%83%bc%e3%83%88%e3%81%a8/) also has a download sample.
import io
from googleapiclient.http import MediaIoBaseDownload

# Download the spreadsheet, exported as CSV.
# file_id: the ID returned when the file was uploaded
request = service_drive.files().export_media(fileId=file_id, mimeType='text/csv')
fh = io.BytesIO()
downloader = MediaIoBaseDownload(fh, request)
done = False
while not done:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
with open('download.csv', 'wb') as f:
    f.write(fh.getvalue())
There are two ways to download from Google Drive: get_media and export_media. https://developers.google.com/drive/api/v3/manage-downloads
Here I use export_media, because the file was converted to a Google Sheets spreadsheet on upload and has to be exported back to CSV for download.
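Once the export is in memory, it can also be parsed directly instead of being written to a file first. The sketch below assumes the export is UTF-8 and that column A holds the source text and column B the translation; both the layout and the function name are assumptions.

```python
import csv
import io


def parse_exported_csv(raw_bytes):
    """Decode exported CSV bytes into (source, translation) pairs."""
    text = raw_bytes.decode("utf-8-sig")  # tolerate a BOM if one is present
    pairs = []
    for row in csv.reader(io.StringIO(text, newline="")):
        if not row:
            continue
        source = row[0]
        translation = row[1] if len(row) > 1 else ""
        pairs.append((source, translation))
    return pairs


# Sample bytes standing in for fh.getvalue() from the download code.
sample = (
    'PDF 32000-1:2008,PDF 32000-1:2008\r\n'
    '"Document management, PDF","文書管理、PDF"\r\n'
).encode("utf-8")
pairs = parse_exported_csv(sample)
print(pairs[1])
```

Using the csv module rather than splitting on commas matters here, because both the source text and the translations can contain commas and line breaks inside quoted cells.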
In the end, I was able to translate a PDF of about 700 pages this way. However, the method is cumbersome and time-consuming, so consider another method if one is available. At the very least, if you can use a different translation API, you can skip the hassle of uploading, inserting the formulas, and downloading again.