I want to read a large number of English PDF files, but I don't really know English words in the first place. So for now: convert the PDF files to text files, list the words they contain, and memorize the most frequently used ones from the top down. Surely that will make the reading go faster! Or so I decided to believe.
That's why I decided to throw a big batch of English PDF files into a pot, boil them, and convert them to text files. It feels like boiling a mountain of soba for a Wanko soba tournament.
macOS / Python 3.6 / Anaconda
A large number of PDF files that are hard to digest
pdfminer (for the installation procedure, see the reference URLs at the end), plus the standard-library modules os and re
pdfminer seems to give better results than PyPDF2.
Note that it (probably) does not handle Japanese text.
PdfToTextConverter.py
#! python3
# PdfToTextConverter.py
#Read the contents of a PDF file and output it as a text file
import os
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
pdf_folder_path = os.getcwd()  # Get the path of the current folder
text_folder_path = os.getcwd() + '/' + 'text_folder'  # Path notation is macOS-style. On Windows, replace '/' with '\\' (or use os.path.join; see the sketch after the script).
os.makedirs(text_folder_path, exist_ok=True)
pdf_file_name = os.listdir(pdf_folder_path)
# Returns True if name is a PDF file (ends with .pdf), otherwise returns False.
# Quoted (and partially modified) from this post -> http://qiita.com/korkewriya/items/72de38fc506ab37b4f2d
def pdf_checker(name):
    pdf_regex = re.compile(r'.+\.pdf')
    if pdf_regex.search(str(name)):
        return True
    else:
        return False
# Convert a PDF file to a text file
def convert_pdf_to_txt(path, txtname, buf=True):
    rsrcmgr = PDFResourceManager()
    if buf:
        outfp = StringIO()
    else:
        outfp = open(txtname, 'w')  # file() is Python 2 only; use open() in Python 3
    codec = 'utf-8'
    laparams = LAParams()
    laparams.detect_vertical = True
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()
    device.close()
    if buf:
        text = outfp.getvalue()
        make_new_text_file = open(text_folder_path + '/' + path + '.txt', 'w')
        make_new_text_file.write(text)
        make_new_text_file.close()
    outfp.close()
# Go through the file names in the folder and convert every PDF
for name in pdf_file_name:
    if pdf_checker(name):
        convert_pdf_to_txt(name, name + '.txt')  # pdf_checker returned True (the name ends with .pdf), so proceed with the conversion
    else:
        pass  # Skip anything that is not a PDF file
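As noted in the comment on text_folder_path, the hard-coded '/' separator is macOS-specific. A minimal sketch of a cross-platform alternative, assuming the same folder layout as above, is to build the paths with os.path.join (the name 'sample.pdf' below is just an example):

import os

pdf_folder_path = os.getcwd()
text_folder_path = os.path.join(os.getcwd(), 'text_folder')  # separator picked automatically on macOS and Windows
os.makedirs(text_folder_path, exist_ok=True)

# Output path for one converted file, e.g. text_folder/sample.pdf.txt
out_path = os.path.join(text_folder_path, 'sample.pdf' + '.txt')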
A large number of text files that are likely to upset your stomach
Next, pour the mountain of text files into a bowl and extract the 500 or so most frequently used words, then memorize their meanings (I have no idea whether this actually helps you read English faster).
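Here is a minimal sketch of that word-listing step, assuming the text files sit in the text_folder created by the script above; the file name WordCounter.py, the simple regex tokenization, and the cutoff of 500 are all just illustrative choices:

#! python3
# WordCounter.py (sketch): count word frequencies across the converted text files
import os
import re
from collections import Counter

text_folder_path = os.path.join(os.getcwd(), 'text_folder')
counter = Counter()

for name in os.listdir(text_folder_path):
    if not name.endswith('.txt'):
        continue  # skip anything that is not a text file
    with open(os.path.join(text_folder_path, name), encoding='utf-8') as f:
        words = re.findall(r"[a-zA-Z']+", f.read().lower())  # crude English-only tokenization
        counter.update(words)

# Print the 500 most frequent words with their counts
for word, count in counter.most_common(500):
    print(word, count)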
http://qiita.com/korkewriya/items/72de38fc506ab37b4f2d → the part that converts a PDF file to a text file is quoted (and partially modified) from this article.
https://kusanohitoshi.blogspot.jp/2017/01/python3cstringiostringio.html → how to deal with the StringIO import error.
"Let Python do the boring things" (the Japanese edition of "Automate the Boring Stuff with Python") → how to use the os module.
http://www.unixuser.org/%7Eeuske/python/pdfminer/index.html → the pdfminer page.
https://github.com/conda-forge/pdfminer-feedstock and https://conda-forge.org/feedstocks → the installation procedure for pdfminer in an Anaconda environment.