This is a short follow-up to the previous post, because I decided the result needed fixing.
Writing each PDF page out to CSV worked, but the data itself turned out to be a mess. Specifically, subheadings ended up in the middle of the body text. It's a small problem, but a painful one.
While searching for a similar case, I found the following article: Analyzing the list of black companies of the Ministry of Health, Labor and Welfare with Python (PDFMiner.six)
It told me that I wasn't alone and that the coordinates could get me out of this. So I will try it.
Reference: Select PDFMiner to extract text information from PDF
It seems that pdfminer can also return the coordinates of the layout. Until now I only extracted character data with TextConverter. PDFPageAggregator appears to return both coordinates and character data, so I will use that.
First, let's check what kind of coordinates come back. Sorry, I couldn't prepare a sample PDF ...
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter, PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox, LTTextLine, LTChar
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(self, p_d_f):
    fp = open(p_d_f, 'rb')
    for page in PDFPage.get_pages(fp):
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        laparams.detect_vertical = True
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Get coordinates and character data from the PDF
        interpreter.process_page(page)
        layout = device.get_result()
        # Display the characters and their coordinates
        for node in layout:
            if isinstance(node, LTTextBox) or isinstance(node, LTTextLine):
                print(node.get_text())  # text
                word = input(node.bbox)  # coordinates (waits for Enter)
        word = input("---page end---")
An inefficient script that you step through at the command prompt.
To be honest, I don't really understand checks like the one for LTTextBox, but I put them in as a magic incantation. I should look into it properly.
Here is an excerpt of the output. The text is dummy data.
---page end---
About popcorn machines
(68.28, 765.90036, 337.2, 779.9403599999999)
It is a machine that pops and makes popcorn.
(67.8, 697.71564, 410.4000000000001, 718.47564)
Please be careful when using it.
(67.8, 665.29564, 339.8400000000002, 686.05564)
The usage is as follows.
(67.8, 643.69564, 279.3600000000001, 653.65564)
Description
(67.8, 730.11564, 87.96000000000001, 740.07564)
The tuples are the coordinates, in the order (x0, y0, x1, y1). For details, see the reference site! Put simply, y1 is the position of the top edge of the text, measured from the bottom of the page. In other words, if the y1 values within a page are in descending order, the text is arranged from the top of the page down, which is the correct order (in this case).
Looking at the output with that in mind, the y1 of the last entry is the second largest, so the output is not simply ordered from the top. It may be sorted by x0 instead; I really don't know. The coordinates themselves look right, so I will do something with y1.
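As a quick check of that reasoning, here is a minimal sketch that sorts the blocks from the excerpt above by y1 in descending order. The coordinates are rounded copies of the printed values, and the names `blocks`, `ordered`, and `reading_order` are only for illustration; no PDF is read.

```python
# (bbox, text) pairs taken from the output excerpt, with coordinates rounded.
blocks = [
    ((68.28, 765.90036, 337.2, 779.94036), "About popcorn machines"),
    ((67.8, 697.71564, 410.4, 718.47564), "It is a machine that pops and makes popcorn."),
    ((67.8, 665.29564, 339.84, 686.05564), "Please be careful when using it."),
    ((67.8, 643.69564, 279.36, 653.65564), "The usage is as follows."),
    ((67.8, 730.11564, 87.96, 740.07564), "Description"),
]

# bbox is (x0, y0, x1, y1). y1 is the top edge measured from the bottom of
# the page, so descending y1 means "from the top of the page down".
ordered = sorted(blocks, key=lambda block: block[0][3], reverse=True)
reading_order = [text for _bbox, text in ordered]

for text in reading_order:
    print(text)
```

With this ordering, "Description" moves up to second place, right under the title, which is where it sits on the page.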
① Make a dictionary ② Sort the dictionary (keys in descending order) ③ Join it into a single string ④ Clean up line breaks
This should work. If you are impatient, skip ahead and look only at the finished code.
d = {}
for node in layout:
    if isinstance(node, LTTextBox) or isinstance(node, LTTextLine):
        y1 = node.bbox[3]
        # In a table, several cells share the same y1, so concatenate their text
        if y1 in d:
            d[y1] += "|" + node.get_text()
        else:
            d[y1] = node.get_text()
This quickly builds a dictionary of coordinates and text. I also added a loose workaround for tables.
To be honest, though, polishing this approach is wasted effort because it has holes. It looks as if coordinates are assigned line by line, but the mechanism actually seems to be that margin values are configured and characters close to each other are grouped into a "block" (I think).
Put plainly, if you set nothing, the default margins apply, and text with tight line spacing or a fine-grained table gets recognized as a single multi-line block. Once several lines of text come back under one set of coordinates, my makeshift table handling falls apart.
The obvious answer is to set the margins properly, but I don't need that much this time, so I'm leaving them alone. When a table shows up, I'll just go with "sorry!"
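One more weak point of the table workaround is that it relies on exact float equality of y1. The sketch below is a variation, not what the finished code does: it rounds y1 before grouping and orders the cells of a row by x0. The function `group_rows`, its `ndigits` parameter, and the sample cells are all made up for illustration, and I have not verified that pdfminer really reports slightly different y1 values for cells in one row.

```python
from collections import defaultdict


def group_rows(blocks, ndigits=1):
    """Group (bbox, text) pairs into rows keyed by rounded y1, top row first."""
    rows = defaultdict(list)
    for (x0, _y0, _x1, y1), text in blocks:
        rows[round(y1, ndigits)].append((x0, text.strip()))
    result = []
    for y1 in sorted(rows, reverse=True):
        # Within a row, order the cells from left to right by x0.
        cells = [text for _x0, text in sorted(rows[y1])]
        result.append("|".join(cells))
    return result


# Hypothetical table cells: the two cells of each row differ slightly in y1.
cells = [
    ((200.0, 480.0, 260.0, 500.0), "100 yen\n"),
    ((67.8, 480.0, 120.0, 500.0000001), "Salt\n"),
    ((67.8, 450.0, 120.0, 470.0), "Caramel\n"),
    ((200.0, 450.0, 260.0, 470.0000002), "150 yen\n"),
]
table_rows = group_rows(cells)
print(table_rows)
```

Rounding is a crude tolerance: two values that straddle a rounding boundary would still land in different rows, so this only narrows the hole rather than closing it.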
Reference: Summary of Python sort (list, dictionary type, Series, DataFrame)
d2 = sorted(d.items(), key=lambda x: -x[0])
Done! My first lambda! Note that after this the dictionary becomes a list (of (key, value) tuples). I don't really care as long as it's sorted.
text = ""
for d0 in d2:
    text += d0[1]
It's just a plain loop that glues everything together.
Reference: Split comma-separated strings with Python, split, remove whitespace and list. I am always indebted to this site.
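The sort and the concatenation can be tried on a small hand-made dictionary. This sketch uses made-up keys and values; it also shows that `reverse=True` gives the same order as negating the key, and that `"".join` does the same job as the loop.

```python
# Hand-made stand-in for the {y1: text} dictionary.
d = {653.65564: "c\n", 779.94036: "a\n", 718.47564: "b\n"}

# The approach used in this post: negate the key to sort in descending order.
by_lambda = sorted(d.items(), key=lambda x: -x[0])

# Equivalent here, because the dictionary keys are unique.
by_reverse = sorted(d.items(), reverse=True)

# sorted() always returns a list, here a list of (y1, text) tuples.
print(type(by_lambda).__name__)
print(by_lambda == by_reverse)

# Same result as the "text += d0[1]" loop.
text = "".join(value for _key, value in by_lambda)
print(repr(text))
```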
space = re.compile("[ ]+")
text = re.sub(space, "", text )
l_text = [a for a in text.splitlines() if a != '']
text = '\n'.join(l_text).replace('\n|', '|')
The text has a lot of spaces and line breaks, and this takes care of them: spaces are stripped out, and blank lines are dropped by going through a list. While I'm at it, the line break before the "|" marker that was added for tables is removed as well.
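To see what the cleanup actually does, here is the same sequence run on a hand-made string. The input is invented to imitate the concatenated text: stray spaces, blank lines, and a table row joined with "|".

```python
import re

space = re.compile("[ ]+")

# Made-up input imitating the concatenated page text.
text = "About popcorn  machines\n\n \nSalt \n|100 yen\n"

# 1. Delete every run of spaces (all spaces, not just the extra ones).
text = re.sub(space, "", text)
# 2. Drop the empty lines.
l_text = [a for a in text.splitlines() if a != ""]
# 3. Re-join, and pull each "|" marker back onto the previous line.
text = "\n".join(l_text).replace("\n|", "|")

print(text)
```

Note that step 1 removes the spaces between words too. That is harmless for text without word spacing, such as Japanese, but for English text you would probably want to collapse runs of spaces into a single space instead.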
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter, PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox, LTTextLine, LTChar
from pdfminer.pdfpage import PDFPage
import csv,re,datetime
import pandas as pd

class converter(object):
    def convert_pdf_to_txt(self,p_d_f):
        print("system: reading pdf [" + p_d_f + "]")
        df = pd.DataFrame(columns=["Update date and time","Sentence","page number"])
        cnt = 1
        space = re.compile("[ ]+")
        fp = open(p_d_f, 'rb')
        # Extract coordinates and character data from the pdf, page by page
        for page in PDFPage.get_pages(fp):
            rsrcmgr = PDFResourceManager()
            laparams = LAParams()
            laparams.detect_vertical = True
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            # Get coordinates and character data from the PDF
            interpreter.process_page(page)
            layout = device.get_result()
            # Create a dictionary of y1 coordinate -> text
            d={}
            for node in layout:
                if isinstance(node, LTTextBox) or isinstance(node, LTTextLine):
                    y1 = node.bbox[3]
                    # In a table, several cells share the same y1, so concatenate their text
                    if y1 in d:
                        d[y1] += "|" + node.get_text()
                    else:
                        d.update({y1 : node.get_text()})
            # Sort by coordinate (y1 descending = top of the page first)
            d2 = sorted(d.items(), key=lambda x: -x[0])
            # Join everything into one string
            text = ""
            for d0 in d2:
                text += d0[1]
            # Remove spaces and blank lines
            text = re.sub(space, "", text)
            l_text = [a for a in text.splitlines() if a != '']
            text = '\n'.join(l_text).replace('\n|', '|')
            df.loc[cnt,["Sentence","page number"]] = [text,cnt]
            cnt += 1
        fp.close()
        device.close()
        now = datetime.datetime.now()
        df["Update date and time"] = now
        csv_path = p_d_f.replace('.pdf', '.csv')
        with open(csv_path, mode='w', encoding='cp932', errors='ignore', newline='\n') as f:
            df.to_csv(f,index=False)

if __name__ == "__main__":
    p_d_f = "Somehow.pdf"
    con=converter()
    con.convert_pdf_to_txt(p_d_f)
I haven't tested this thoroughly, because I built it by adding to and trimming the previous version, but something along these lines worked. If you get an error, please fix it yourself.