[DOCKER] Extract text information from PDF with PDFMiner

This post shows how to extract text information from a PDF with PDFMiner, running inside a Docker container.

Environment

Dockerfile


FROM python:3.6
ENV LC_ALL C.UTF-8
ENV LANG C.UTF-8  
RUN apt-get -y update && \
    apt-get install -y --fix-missing \
    build-essential \
    software-properties-common \
    poppler-utils && \
    apt-get clean && \
    rm -rf /tmp/* /var/tmp/* && \
    mkdir /api
WORKDIR /api
COPY requirements.txt /api/requirements.txt
RUN pip3 install --upgrade pip && \
    pip3 install --upgrade -r requirements.txt
EXPOSE 8888
ENTRYPOINT jupyter notebook --ip=0.0.0.0 --allow-root --no-browser

requirements.txt


pandas==0.24.2
pillow==7.0.0
opencv-python==3.4.2.16
pdfminer==20191125
jupyter==1.0.0

Build the image and start the container:

$ docker build -t pdfminer -f ./Dockerfile .
$ docker run -it -v `pwd`:/api -p 8888:8888 --name pdfminer pdfminer bash

Extract text information from PDF

If the container starts successfully, Jupyter starts automatically, so create a Python file. The code below is a minimal example that extracts the text and saves it to a text file. Here, a PDF published by the Financial Services Agency is saved as test.pdf: https://www.fsa.go.jp/news/30/wp/supervisory_approaches_revised.pdf

test.py


from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextBox, LTTextLine, LTChar
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdfminer_config(line_overlap, word_margin, char_margin, line_margin, detect_vertical):
    # Layout analysis parameters control how characters are grouped into lines and boxes.
    laparams = LAParams(line_overlap=line_overlap,
                        word_margin=word_margin,
                        char_margin=char_margin,
                        line_margin=line_margin,
                        detect_vertical=detect_vertical)
    resource_manager = PDFResourceManager()
    # PDFPageAggregator collects the layout of each processed page.
    device = PDFPageAggregator(resource_manager, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)
    return (interpreter, device)

def find_textboxes(layout_obj):
    if isinstance(layout_obj, LTTextBox):
        return [layout_obj]
    if isinstance(layout_obj, LTContainer):
        boxes = []
        for child in layout_obj:
            boxes.extend(find_textboxes(child))
        return boxes
    return []

def find_textlines(layout_obj):
    if isinstance(layout_obj, LTTextLine):
        return [layout_obj]
    if isinstance(layout_obj, LTTextBox):
        lines = []
        for child in layout_obj:
            lines.extend(find_textlines(child))
        return lines
    return []

def find_characters(layout_obj):
    if isinstance(layout_obj, LTChar):
        return [layout_obj]
    if isinstance(layout_obj, LTTextLine):
        characters = []
        for child in layout_obj:
            characters.extend(find_characters(child))
        return characters
    return []

def write_text(text_file, text):
    text_file.write(text)

# Open both files with context managers so they are closed even if an error occurs.
with open('output.txt', 'w') as text_file, open("./test.pdf", 'rb') as f:
    interpreter, device = pdfminer_config(line_overlap=0.5, word_margin=0.1, char_margin=2, line_margin=0.5, detect_vertical=True)
    for page in PDFPage.get_pages(f):
        interpreter.process_page(page)  # Process the page.
        layout = device.get_result()  # Get the LTPage object.
        boxes = find_textboxes(layout)
        for box in boxes:
            # Box texts are written back to back, without a separator.
            write_text(text_file, box.get_text().strip())

Adjustment by laparams

If you don't get the text you want, adjust the parameters passed to LAParams. Changing char_margin, word_margin, and line_margin changes how characters are grouped. Set detect_vertical to True if the PDF contains vertical text, as Japanese documents often do.

test.py


interpreter, device = pdfminer_config(line_overlap=0.5, word_margin=0.1, char_margin=2.0, line_margin=0.5, detect_vertical=False)
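
To build intuition for char_margin, here is a simplified toy model in pure Python. It is not PDFMiner's code. It assumes the rule described in the PDFMiner documentation, that two characters join the same line when the horizontal gap between them is smaller than char_margin times the character width, and it ignores vertical overlap, word_margin, and line_margin. The function name group_by_char_margin and the sample coordinates are made up for illustration.

```python
def group_by_char_margin(chars, char_margin):
    """Group (x0, x1) character spans into runs, left to right.

    Toy rule (an assumption modelled on PDFMiner's documented behaviour):
    a character joins the current run when the gap to the previous
    character is smaller than char_margin * the wider of the two widths.
    """
    groups = []
    for x0, x1 in sorted(chars):
        if groups:
            prev_x0, prev_x1 = groups[-1][-1]
            gap = x0 - prev_x1
            widest = max(x1 - x0, prev_x1 - prev_x0)
            if gap < widest * char_margin:
                groups[-1].append((x0, x1))
                continue
        # Either the first character or the gap was too large: start a new run.
        groups.append([(x0, x1)])
    return groups


# Hypothetical 10 pt wide characters: two close together, one 15 pt further right.
chars = [(0, 10), (12, 22), (37, 47)]

tight = group_by_char_margin(chars, char_margin=1.0)
loose = group_by_char_margin(chars, char_margin=2.0)
print(len(tight))  # 2 runs: the 15 pt gap is not below 10 * 1.0
print(len(loose))  # 1 run: the 15 pt gap is below 10 * 2.0
```

In this model, raising char_margin merges text that a smaller value keeps apart, which is the kind of change to expect when tuning the real parameter.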


Contents of boxes

The boxes obtained in the code above carry the following information:

- Text
- Character position (the unit is pt, so convert from pt to pixels before processing with OpenCV etc.)

print(boxes[0])
# >> <LTTextBoxHorizontal(0) 92.160,755.000,524.296,766.952 'However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.\n'>
print(boxes[0].get_text())
# >>However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.
print(boxes[0].bbox)
# >> (92.15997480600001, 754.9998879965001, 524.2961793060001, 766.9523361965001)
# >> The tuple is (x0, y0, x1, y1): (x0, y0) is the bottom-left corner and (x1, y1) is the top-right corner, measured from the bottom-left of the page.
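
The pt-to-pixel conversion mentioned above could look like the following minimal sketch. It assumes the page image was rendered at a known DPI and that the page height in points is known (in the code above it should be available as layout.height on the LTPage). PDF coordinates start at the bottom-left with y growing upwards, while OpenCV images start at the top-left, so the y axis is flipped. The function name bbox_pt_to_px, the 842 pt page height (A4), and the 144 dpi value are assumptions for illustration.

```python
def bbox_pt_to_px(bbox, page_height_pt, dpi=72):
    """Convert a PDF bbox (x0, y0, x1, y1) in points to image pixels.

    Returns (left, top, right, bottom) with the origin at the top-left,
    which is the convention OpenCV images use.
    """
    scale = dpi / 72.0  # 1 pt = 1/72 inch
    x0, y0, x1, y1 = bbox
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the y axis: PDF y grows upwards, image y grows downwards.
    top = round((page_height_pt - y1) * scale)
    bottom = round((page_height_pt - y0) * scale)
    return left, top, right, bottom


# bbox rounded from the output above; page height and dpi are assumed values.
rect = bbox_pt_to_px((92.16, 755.0, 524.296, 766.952), page_height_pt=842, dpi=144)
print(rect)  # (184, 150, 1049, 174)
```

The resulting tuple can be used directly as the two corner points of a rectangle on the rendered page image.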

Contents of lines

Each box contains LTTextLine objects. Let's get them with find_textlines, which was not used in the code above.

test.py


lines = find_textlines(boxes[0])
print(lines[0])
# >><LTTextLineHorizontal 92.160,755.000,524.296,766.952 'However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.\n'>
print(lines[0].get_text())
# >>However, in the past, the international sector of the Financial Services Agency exchanged information so that the burden of introducing international regulations would be as small as possible.
print(lines[0].bbox)
# >> (92.15997480600001, 754.9998879965001, 524.2961793060001, 766.9523361965001)
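
Because every line carries a bbox, lines collected from several boxes can be put into reading order. The sketch below handles horizontal text only and sorts top to bottom, then left to right. FakeLine is a stand-in for LTTextLine so the example runs without a PDF, and sort_reading_order and the sample coordinates are made up for illustration. Vertical text would need a different sort key.

```python
from collections import namedtuple

# Stand-in for LTTextLine; only the bbox attribute (x0, y0, x1, y1) is used.
FakeLine = namedtuple("FakeLine", ["text", "bbox"])


def sort_reading_order(lines):
    """Sort horizontal lines top to bottom, then left to right.

    PDF y coordinates grow upwards, so a larger y1 means higher on the page.
    """
    return sorted(lines, key=lambda line: (-line.bbox[3], line.bbox[0]))


sample_lines = [
    FakeLine("second", (92.0, 700.0, 300.0, 712.0)),
    FakeLine("first", (92.0, 755.0, 524.0, 767.0)),
    FakeLine("third", (92.0, 650.0, 200.0, 662.0)),
]
ordered = [line.text for line in sort_reading_order(sample_lines)]
print(ordered)  # ['first', 'second', 'third']
```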

Contents of characters

Each line, in turn, contains LTChar objects. Besides the character and its position, each one also carries font information.

test.py


characters = find_characters(lines[0])
print(characters[0])
# >><LTChar 92.160,755.000,104.160,766.952 matrix=[12.00,0.00,0.00,12.00, (92.16,756.68)] font='AHTYXM+MS-PGothic' adv=1.0 text='Or'>
print(characters[0].get_text())
# >>Or
print(characters[0].bbox)
# >> (92.15997480600001, 754.9998879965001, 104.16042480600001, 766.9523361965001)
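
As one way to use the font information, the sketch below counts how many characters use each font. It relies on the fontname attribute of LTChar, which is the value shown as font=... in the output above. The SimpleNamespace objects are stand-ins so the example runs without a PDF, and count_fonts and the second font name are made up for illustration.

```python
from collections import Counter
from types import SimpleNamespace


def count_fonts(characters):
    """Count how many characters use each font name."""
    return Counter(ch.fontname for ch in characters)


# Stand-ins for LTChar objects; only the fontname attribute is used.
sample_chars = [
    SimpleNamespace(fontname="AHTYXM+MS-PGothic"),
    SimpleNamespace(fontname="AHTYXM+MS-PGothic"),
    SimpleNamespace(fontname="ABCDEF+Example-Bold"),
]
fonts = count_fonts(sample_chars)
print(fonts.most_common(1))  # [('AHTYXM+MS-PGothic', 2)]
```

With real data, the same function could be called as count_fonts(find_characters(lines[0])).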

If I have time, I would like to show how to change the color of the extracted parts.
