I needed to translate 20 English PDF files whose text cannot be copied (like the one below), so I wanted to extract the text and feed it to Google Translate or a similar service.
Extract text from pdf file.
This time I used pdfminer: https://github.com/pdfminer/pdfminer.six
I also referred to the following article: https://qiita.com/mczkzk/items/894110558fb890c930b5
The program works as follows:
1. At the prompt "Please input pdf path :", enter the PDF file name.
2. The extension of the input file name is changed to .txt and a text file is created.
3. The result is written to that file.
It is a simple procedure.
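Step 2 above can be sketched on its own as follows. This is only a minimal illustration, not the program from this post: the helper name text_path_for and the sample file names are made up for the example. It uses pathlib because a file name may contain more than one dot, and only the last extension should be replaced.

```python
from pathlib import Path


def text_path_for(pdf_path: str) -> str:
    """Return pdf_path with its final extension replaced by .txt."""
    # with_suffix only touches the last extension, so dots earlier in the
    # file name are left alone.
    return str(Path(pdf_path).with_suffix(".txt"))


# Hypothetical file names, used only for illustration.
for name in ["paper.pdf", "report.v2.pdf"]:
    print(name, "->", text_path_for(name))
```

A name without any extension simply gets .txt appended, so the helper does not fail on unusual input.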
The result for the PDF file mentioned above is as follows.
Only a single arrow was output. That was strange, so to check against another PDF file, I specified the following PDF created with Word.
The result is as follows.
Again the arrow above was output, but both the English and the Japanese text were extracted correctly. The program does not seem to be the problem. I suspected that the PDF's protection was the cause, so I tried to remove the protection with "Print to PDF", but again only a single arrow was output.
Since pdfminer itself was confirmed to work, the problem appears to lie in the PDF file. I think the cause is that the target PDF was scanned, so the image quality is poor. pdfminer reads the text embedded in a PDF and does not perform OCR, so a page that is only a scanned image yields no text.
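One way to tell this case apart from a bug is to check whether the extracted text contains anything other than whitespace. The sketch below is an assumption-laden illustration, not part of the original program: the helper names has_extractable_text and check_pdf are made up, and it relies on the extract_text function from pdfminer.high_level in pdfminer.six. My guess (unconfirmed) is that the "arrow" is the form feed character that pdfminer writes at the end of each page, which some editors display as a symbol.

```python
def has_extractable_text(text: str) -> bool:
    """Return True if text contains anything besides whitespace."""
    # str.strip() also removes form feeds ("\x0c"), which pdfminer writes
    # at the end of each page.
    return bool(text.strip())


def check_pdf(pdf_path: str) -> bool:
    """Extract text with pdfminer.six and report whether any was found."""
    # Imported here so has_extractable_text can be used without pdfminer.
    from pdfminer.high_level import extract_text

    return has_extractable_text(extract_text(pdf_path))


print(has_extractable_text("\x0c"))         # a page with no text layer
print(has_extractable_text("Hello\n\x0c"))  # a page with real text
```

If check_pdf returns False for a file, the pages probably have no text layer, and an OCR tool would be needed instead of pdfminer.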
Because pdfminer is so convenient, the program turned out to be very short.
pdf2text.py
import os

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

input_path = input("Please input pdf path : ")
# Swap the extension for .txt; splitext also copes with extra dots in the path.
output_path = os.path.splitext(input_path)[0] + ".txt"

manager = PDFResourceManager()

# TextConverter encodes the text itself, so the output file is opened in binary mode.
with open(output_path, "wb") as output:
    with open(input_path, "rb") as pdf_file:
        with TextConverter(manager, output, codec="utf-8", laparams=LAParams()) as conv:
            interpreter = PDFPageInterpreter(manager, conv)
            # Feed every page through the interpreter; the converter writes the text.
            for page in PDFPage.get_pages(pdf_file):
                interpreter.process_page(page)