What I want to do

I wanted to quickly check the contents of a large number of PDFs (10,000 or more!), for example to see whether each file name matches what is written inside the file.

Environment

Python 2.7, Windows 7 64-bit

Library to use

I use PDFMiner. The official website shows examples of running it from the command prompt, but for some reason I could not find any information on how to import it and use it as a library, so I was a little confused.

Installation

Extract the archive downloaded from the official site. Since this is Windows, run the following commands inside the pdfminer-20140328 folder.

mkdir pdfminer\cmap
python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
python setup.py install

If you skip the conv_cmap.py steps and only run the install, all Japanese text is output as placeholders like (cid: 0000).
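One way to check whether the CMap installation worked is to scan the extracted text for leftover placeholders. The following is only a minimal sketch: the function name `find_unmapped_cids` is made up for this example, and the pattern assumes the placeholders look like `(cid:7711)`, optionally with a space after the colon.

```python
# -*- coding: utf-8 -*-
import re

# Assumed placeholder shape: "(cid:7711)" or "(cid: 7711)".
CID_PLACEHOLDER = re.compile(r"\(cid:\s*(\d+)\)")


def find_unmapped_cids(text):
    """Return the CID numbers of all placeholders found in text.

    An empty list means no placeholder was found.
    (Hypothetical helper, not part of PDFMiner.)
    """
    return [int(number) for number in CID_PLACEHOLDER.findall(text)]
```

If nearly every character of a Japanese PDF shows up in the returned list, the CMap files were probably not installed. If only a few numbers appear, the cause is more likely individual glyphs.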

Actually extracting the text

`pdf2txt.py`


# -*- coding: utf-8 -*-

import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

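# Runs of half-width and full-width spaces
space = re.compile(ur"[ 　]+")

def convert_pdf_to_txt(path, txtname, buf=True):
    rsrcmgr = PDFResourceManager()
    # buf=True keeps the text in memory; buf=False writes it to txtname
    if buf:
        outfp = StringIO()
    else:
        outfp = open(txtname, 'w')
    codec = 'utf-8'
    laparams = LAParams()
    laparams.detect_vertical = True  # also analyze vertically written text
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)

    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()
    device.close()
    if buf:
        # getvalue() returns UTF-8 bytes; decode them first so that the
        # unicode pattern can also match the full-width space
        text = space.sub(u"", outfp.getvalue().decode(codec))
        print text.encode(codec)
    outfp.close()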


convert_pdf_to_txt("TEST.pdf", "test.txt")

**Prepared PDF**

Result

old pond
Jump into the frog
The sound of water

Cocoa*Chocolate

Cocoa is the solid content of cocoa butter separated from cocoa mass.
Or the abbreviation for the powdered cocoa powder. Also, melt the cocoa powder
It is also used as an abbreviation for a savory beverage. As shown in the history below,
Until the separation of cocoa butter from chaos, the word cocoa
There is only pasty chocolate that is neither solid nor liquid
Was there.

laparams.detect_vertical seems to be an important parameter. For PDFs with vertical text or a complex layout, if it is not set to True, the Japanese text is split into individual characters and the structure comes out jumbled. re.sub then removes the spaces that get in the way. After that, all that is left is to check the in-memory string with the in operator! By the way, if you pass buf=False as an argument, the text is written to a file instead.
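The check with the in operator could look something like the sketch below. All of the names here (`normalize`, `name_matches_contents`, `find_mismatches`) are made up for illustration. `extract_text` is a hypothetical callable that takes a PDF path and returns its text as a unicode string; it is not something PDFMiner provides. Unlike the script above, this sketch also strips tabs and line breaks, on the assumption that a title may be wrapped across lines.

```python
# -*- coding: utf-8 -*-
import os
import re

# Half-width spaces, full-width spaces (U+3000), tabs and line breaks.
WHITESPACE = re.compile(u"[ \u3000\t\r\n]+")


def normalize(text):
    """Remove all whitespace so layout differences do not break the match."""
    return WHITESPACE.sub(u"", text)


def name_matches_contents(pdf_path, text):
    """Return True if the file name (without extension) occurs in text."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return normalize(stem) in normalize(text)


def find_mismatches(root, extract_text):
    """Walk root and return the PDFs whose name is not found in their text.

    extract_text is a hypothetical callable: path -> unicode text.
    """
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                if not name_matches_contents(path, extract_text(path)):
                    mismatches.append(path)
    return mismatches
```

Passing the extraction function in as an argument keeps the comparison logic separate from PDFMiner, so it can be tried out on plain strings before running it over 10,000 files.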

Digression

Characters rendered as variant glyphs, such as 辻 and 逗, could not be converted to Japanese text properly. They are output like (cid: 7711). I don't understand CID fonts yet, so I need to study them.

By the way, when I tried text extraction with Ghostscript, text in fonts that are not installed on Windows was forcibly output as Shift_JIS character codes in &#; notation, so the characters were garbled (I worked around this by force-converting them with a Shift_JIS-to-Unicode correspondence table). PyPDF2 couldn't convert Japanese well either. (The official documentation also says "This will be refined in the future.")

The site I referred to

http://stackoverflow.com/questions/26748788/extraction-of-text-from-pdf-with-pdfminer-gives-multiple-copies

Extract Japanese text from PDF with PDFMiner