I have a large number of PDFs (10,000 or more!) and wanted a quick way to check whether anything is wrong with their contents (for example, whether the file name matches the contents).
Environment: Python 2.7, Windows 7 64-bit
I used PDFMiner. The official website shows sample usage from the command prompt, but for some reason I could not find information on how to import and use it as a library, so I was a little confused.
Extract the archive downloaded from the official site. Since this is Windows, run the following commands inside the pdfminer-20140328 folder.
mkdir pdfminer\cmap
python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
python setup.py install
If you skip this procedure, all Japanese text is output as placeholders like (cid:0000).
pdf2txt.py
# -*- coding: utf-8 -*-
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

# Runs of spaces to strip from the extracted text
space = re.compile(ur"[ ]+")


def convert_pdf_to_txt(path, txtname, buf=True):
    rsrcmgr = PDFResourceManager()
    # buf=True: collect the text in memory; buf=False: write it to txtname
    if buf:
        outfp = StringIO()
    else:
        outfp = file(txtname, 'w')
    codec = 'utf-8'
    laparams = LAParams()
    # Needed for vertical text and PDFs with complex layouts
    laparams.detect_vertical = True
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)

    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()
    device.close()
    if buf:
        # Remove the spaces that get in the way, then show the text
        text = re.sub(space, "", outfp.getvalue())
        print text
    outfp.close()


convert_pdf_to_txt("TEST.pdf", "test.txt")
**Prepared PDF**
Result:
An old pond
A frog jumps in
The sound of water
Cocoa*Chocolate
Cocoa is the solid content of cocoa butter separated from cocoa mass.
Or the abbreviation for the powdered cocoa powder. Also, melt the cocoa powder
It is also used as an abbreviation for a savory beverage. As shown in the history below,
Until the separation of cocoa butter from chaos, the word cocoa
There is only pasty chocolate that is neither solid nor liquid
Was there.
laparams.detect_vertical seems to be an important parameter.
For PDFs with vertical text or a complex structure, if it is not set to True, the Japanese text is split up character by character and the structure comes out jumbled.
Also, re.sub removes the spaces that get in the way.
After that, you can check the contents of the in-memory string simply with the in operator!
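The check can be sketched as follows. This assumes the goal is to confirm that the file name (without its extension) appears somewhere in the extracted text, and that the text has already been decoded to a Unicode string. The helpers `normalize` and `name_in_text` are hypothetical names for illustration and are not part of PDFMiner.

```python
# -*- coding: utf-8 -*-
import os
import re

# Hypothetical helpers for illustration; not part of PDFMiner.
# \s with re.UNICODE also covers the full-width space (U+3000).
_WHITESPACE = re.compile(r"\s+", re.UNICODE)


def normalize(text):
    """Remove every whitespace character from text."""
    return _WHITESPACE.sub(u"", text)


def name_in_text(pdf_path, text):
    """Return True if the file name (minus extension) occurs in text."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return normalize(stem) in normalize(text)
```

Both sides are normalized the same way, so line breaks inserted by the layout analysis do not cause a false mismatch. Note that an empty file stem would match any text, so such names may need separate handling.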
By the way, if you pass buf=False as an argument, the result is written to a text file instead.
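To apply this to a whole folder of PDFs, as in the goal stated at the start, a driver along these lines could work. It is only a sketch: `iter_pdf_paths` and `scan` are hypothetical names, and `extract_text` is assumed to be any callable that takes a path and returns the extracted text (for example, a variant of convert_pdf_to_txt that returns the string instead of printing it).

```python
import os


def iter_pdf_paths(root):
    """Yield the path of every .pdf file under root (case-insensitive)."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def scan(root, extract_text):
    """Extract text from every PDF under root.

    Returns (texts, errors): texts maps path -> extracted text,
    errors maps path -> error message for files that failed.
    """
    texts, errors = {}, {}
    for path in iter_pdf_paths(root):
        try:
            texts[path] = extract_text(path)
        except Exception as exc:
            # Record the failure and keep going with the next file
            errors[path] = str(exc)
    return texts, errors
```

With thousands of files, some are likely to be broken or encrypted, so failures are collected rather than allowed to stop the run.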
Characters displayed as variant glyphs, such as 辻 (Tsuji) and 逗, could not be converted to Japanese properly. They come out like (cid:7711). I don't understand CID fonts yet, so I need to study them.
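When processing many files, it may help to flag the ones that still contain such placeholders. Below is a minimal sketch, assuming the placeholders have the form shown above ("(cid:" followed by a number and ")", possibly with a space after the colon). `find_unmapped_cids` is a hypothetical helper name, not a PDFMiner function.

```python
import re

# Matches placeholders such as "(cid:7711)" or "(cid: 0000)".
# The exact format is an assumption based on the output quoted above.
CID_MARK = re.compile(r"\(cid:\s*(\d+)\)")


def find_unmapped_cids(text):
    """Return the sorted, distinct CID numbers left unconverted in text."""
    return sorted(set(int(num) for num in CID_MARK.findall(text)))
```

An empty list means no placeholders were found. A non-empty list shows which CIDs still need a mapping.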
By the way, with GhostScript text extraction, when I tried to extract text set in fonts that are not installed on Windows, the Shift_JIS character codes were forcibly output in &#...; form, so the characters were garbled (I dealt with this by force-converting them using a Shift_JIS to Unicode correspondence table). PyPDF2 could not convert Japanese well either. (The official documentation also says "This will be refined in the future.")
http://stackoverflow.com/questions/26748788/extraction-of-text-from-pdf-with-pdfminer-gives-multiple-copies