I want to read a large number of English PDF files, but I don't really know English words in the first place. So for now: convert the PDF files to text files, list the words they contain, and memorize the most frequently used ones from the top down. Surely that will make the reading go faster! Or so I decided to believe.
That's why I decided to throw a big batch of English PDF files into a pot, boil them, and convert them to text files. It feels like boiling a mountain of soba for a Wanko soba tournament.
macOS / Python 3.6 / Anaconda
A large number of PDF files that are hard to digest
pdfminer (for the installation procedure, see the reference URLs at the end), plus the standard-library modules os and re
pdfminer seems to give better results than PyPDF2.
Note that it (probably) does not handle Japanese text.
PdfToTextConverter.py
#! python3
# PdfToTextConverter.py
#Read the contents of a PDF file and output it as a text file
import os
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
pdf_folder_path = os.getcwd()  # Get the path of the current folder
text_folder_path = os.getcwd() + '/' + 'text_folder'  # Path notation is macOS-style. On Windows, replace '/' with '\\' (or use os.path.join; see the sketch after the script).
os.makedirs(text_folder_path, exist_ok=True)
pdf_file_name = os.listdir(pdf_folder_path)
# Returns True if name is a PDF file (ends with .pdf), otherwise returns False.
# Quoted (and partially modified) from this post -> http://qiita.com/korkewriya/items/72de38fc506ab37b4f2d
def pdf_checker(name):
    pdf_regex = re.compile(r'.+\.pdf')
    if pdf_regex.search(str(name)):
        return True
    else:
        return False
# Convert a PDF file to a text file
def convert_pdf_to_txt(path, txtname, buf=True):
    rsrcmgr = PDFResourceManager()
    if buf:
        outfp = StringIO()
    else:
        outfp = open(txtname, 'w')  # file() is Python 2 only; use open() in Python 3
    codec = 'utf-8'
    laparams = LAParams()
    laparams.detect_vertical = True
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    fp.close()
    device.close()
    if buf:
        text = outfp.getvalue()
        make_new_text_file = open(text_folder_path + '/' + path + '.txt', 'w')
        make_new_text_file.write(text)
        make_new_text_file.close()
    outfp.close()
# Go through the file names in the folder and convert every PDF
for name in pdf_file_name:
    if pdf_checker(name):
        convert_pdf_to_txt(name, name + '.txt')  # pdf_checker returned True (the name ends with .pdf), so proceed with the conversion
    else:
        pass  # Skip anything that is not a PDF file
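As noted in the comment on text_folder_path, the hard-coded '/' separator is macOS-specific. A minimal sketch of a cross-platform alternative, assuming the same folder layout as above, is to build the paths with os.path.join (the name 'sample.pdf' below is just an example):

import os

pdf_folder_path = os.getcwd()
text_folder_path = os.path.join(os.getcwd(), 'text_folder')  # separator picked automatically on macOS and Windows
os.makedirs(text_folder_path, exist_ok=True)

# Output path for one converted file, e.g. text_folder/sample.pdf.txt
out_path = os.path.join(text_folder_path, 'sample.pdf' + '.txt')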
A large number of text files that are likely to upset your stomach
Next, pour the mountain of text files into a bowl and extract the 500 or so most frequently used words, then memorize their meanings (I have no idea whether this actually helps you read English faster).
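Here is a minimal sketch of that word-listing step, assuming the text files sit in the text_folder created by the script above; the file name WordCounter.py, the simple regex tokenization, and the cutoff of 500 are all just illustrative choices:

#! python3
# WordCounter.py (sketch): count word frequencies across the converted text files
import os
import re
from collections import Counter

text_folder_path = os.path.join(os.getcwd(), 'text_folder')
counter = Counter()

for name in os.listdir(text_folder_path):
    if not name.endswith('.txt'):
        continue  # skip anything that is not a text file
    with open(os.path.join(text_folder_path, name), encoding='utf-8') as f:
        words = re.findall(r"[a-zA-Z']+", f.read().lower())  # crude English-only tokenization
        counter.update(words)

# Print the 500 most frequent words with their counts
for word, count in counter.most_common(500):
    print(word, count)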
http://qiita.com/korkewriya/items/72de38fc506ab37b4f2d → the part that converts a PDF file to a text file is quoted (and partially modified) from this article.
https://kusanohitoshi.blogspot.jp/2017/01/python3cstringiostringio.html → how to deal with the StringIO import error.
"Let Python do the boring things" (the Japanese edition of "Automate the Boring Stuff with Python") → how to use the os module.
http://www.unixuser.org/%7Eeuske/python/pdfminer/index.html → the pdfminer page.
https://github.com/conda-forge/pdfminer-feedstock and https://conda-forge.org/feedstocks → the installation procedure for pdfminer in an Anaconda environment.