Reference: Extract Japanese text from PDF with PDFMiner
I followed the method in the reference almost exactly, so I haven't done anything original here.
I use a library called PDFMiner. It installs with a single pip command.
pip install pdfminer.six
The reference site mentions extra handling for Japanese, but installing with pip alone was enough for Japanese text to be recognized properly.
The CSV is laid out as follows:
- The CSV creation date and time goes in the "Update date and time" column.
- The text extracted from the PDF goes in the "Sentence" column.
- The PDF page number goes in the "page number" column.
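As a rough illustration of that layout, the sketch below writes two made-up rows with the standard csv module. The helper name build_csv, the sample text, and the fixed timestamp are placeholders of mine, not output from a real PDF; only the column names come from the script in this post.

```python
import csv
import datetime
from io import StringIO

# Column names as used in the script in this post.
COLUMNS = ["Update date and time", "Sentence", "page number"]

def build_csv(page_texts, now):
    """Return CSV text with one row per page (hypothetical helper)."""
    out = StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    # Page numbers start at 1, like cnt in the script.
    for page_number, text in enumerate(page_texts, start=1):
        writer.writerow([now.isoformat(sep=" "), text, page_number])
    return out.getvalue()

# Placeholder data: two invented page texts and a fixed timestamp.
sample = build_csv(["一ページ目の文章", "二ページ目の文章"],
                   datetime.datetime(2020, 1, 1, 12, 0, 0))
print(sample)
```

Every row carries the same timestamp because the script fills that column once, after all pages have been processed.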
About 90% of this source comes from the reference site.
from io import StringIO
import datetime
import re

import pandas as pd
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


class converter(object):
    def pdf_to_csv(self, p_d_f):
        df = pd.DataFrame(columns=["Update date and time", "Sentence", "page number"])
        # PDF text extraction from here
        cnt = 1
        space = re.compile("[ ]+")
        with open(p_d_f, 'rb') as fp:
            for page in PDFPage.get_pages(fp):
                # Re-initialize everything per page so outfp holds one page only
                rsrcmgr = PDFResourceManager()
                outfp = StringIO()
                codec = 'utf-8'
                laparams = LAParams()
                laparams.detect_vertical = True  # also detect vertical text
                device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                interpreter.process_page(page)
                # Remove runs of spaces from the extracted text
                text = re.sub(space, "", outfp.getvalue())
                df.loc[cnt, ["Sentence", "page number"]] = [text, cnt]
                cnt += 1
                # Close per page; device does not exist if the PDF has no pages
                device.close()
                outfp.close()
        # Stamp every row with the time the CSV was created
        now = datetime.datetime.now()
        df["Update date and time"] = now
        csv_path = p_d_f.replace('.pdf', '.csv')
        df.to_csv(csv_path, encoding='CP932', index=False)


if __name__ == "__main__":
    p_d_f = "Somehow.pdf"
    con = converter()
    con.pdf_to_csv(p_d_f)
The difference from the reference site is that the box (outfp) that holds the text extracted from the PDF is re-initialized for every page, at the point where its contents go into the data frame. If it were left as it is, the text of all pages would keep piling up in it. Once the data is in a data frame it is easy to work with, so adding a few small columns later should be quick.
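To see why the reset matters, here is a small sketch with no PDF involved. The helper names and the page strings are invented for illustration; the writes to the StringIO stand in for what TextConverter does to outfp.

```python
from io import StringIO

# Invented sample data standing in for the text of three PDF pages.
pages = ["page one text", "page two text", "page three text"]

def collect_shared(pages):
    """One buffer for all pages: each getvalue() also contains earlier pages."""
    buf = StringIO()
    rows = []
    for text in pages:
        buf.write(text)
        rows.append(buf.getvalue())
    buf.close()
    return rows

def collect_fresh(pages):
    """A new buffer per page, as in the script: each row holds one page only."""
    rows = []
    for text in pages:
        buf = StringIO()
        buf.write(text)
        rows.append(buf.getvalue())
        buf.close()
    return rows

shared_rows = collect_shared(pages)
fresh_rows = collect_fresh(pages)
print(shared_rows[2])  # text of all three pages run together
print(fresh_rows[2])   # text of the third page only
```

Keeping a single buffer and calling seek(0) and truncate(0) on it after each page would probably work just as well; creating a new StringIO is simply what the script does.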
Maybe because it is so easy, searching did not turn up a one-shot way to convert a PDF to CSV, so I am writing this down as a note.