A certain company sends me emails with PDF attachments.
I could have used a mail service that can also search inside PDFs, such as Gmail, but I didn't want to hand the mail over to an outside service, and I wanted to react appropriately whenever a specific keyword turned up, so I tried Python's pyPdf.
First, the mail settings. Set the mail client you normally use to "Leave mail on the server", then fetch the mail remaining on the server with fetchmail and pass it to procmail. I used the versions in the Cygwin environment and wrote the following configuration files.
.fetchmailrc
defaults
fetchall
keep
mda "/usr/bin/procmail"
poll pop.xxx.com
protocol pop3
port 110
username "XXXXX"
password "XXXXX"
.procmailrc
MAILDIR=$HOME/Mail/
DEFAULT=/dev/null
:0 H
* ^From:.*[email protected]
/var/spool/mail/t.uehara/
With this, mail from test@example is stored under /var/spool/mail, and all other mail goes to /dev/null (that is, it is discarded). The trailing slash on the destination tells procmail to deliver in Maildir format. Because keep is set on the fetchmail side, the mail also stays on the server. The storage location can be anywhere, but I chose /var/spool/mail because it is convenient to read there with Mutt and the like.
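Mutt is one way to confirm that delivery works. Another quick check is Python's standard mailbox module. The following is only a minimal sketch: list_maildir is a name made up for illustration, and it assumes the folder is a Maildir like the one the recipe above writes to.

```python
import mailbox


def list_maildir(path):
    """Return (key, From, Subject) for every message in the Maildir at path.

    list_maildir is a hypothetical helper, not part of fetchmail or procmail.
    """
    # create=False: raise an error instead of silently creating a new,
    # empty folder when the path is mistyped
    box = mailbox.Maildir(path, factory=None, create=False)
    rows = []
    for key in box.keys():
        msg = box[key]
        rows.append((key, msg.get('From', ''), msg.get('Subject', '')))
    return rows
```

Calling list_maildir('/var/spool/mail/t.uehara') after a fetchmail run should return one tuple per delivered message. An empty list would suggest that the From: condition in .procmailrc did not match.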
Next, download pyPdf and run python setup.py install.
http://pybrary.net/pyPdf/
Borrowing freely from the pyPdf sample program and the email package's sample for writing out attachments, I wrote a script that saves each PDF attachment, together with a file containing the text extracted from it, into a folder.
pdfmail.py
import os
import sys
import email
import mailbox
import mimetypes
import pyPdf

def pdfmail(msgfile):
    # Parse one message file from the Maildir
    fp = open(msgfile)
    msg = email.message_from_file(fp)
    fp.close()
    counter = 1
    for part in msg.walk():
        # multipart/* parts are only containers, so skip them
        if part.get_content_maintype() == 'multipart':
            continue
        fname = part.get_filename()
        if not fname:
            # No file name given: build one from the content type
            # (get_type() was removed in Python 2.5; use get_content_type())
            ext = mimetypes.guess_extension(part.get_content_type())
            if not ext:
                ext = '.bin'
            fname = 'part-%03d%s' % (counter, ext)
        counter += 1
        if fname.find('.pdf') != -1:
            print fname
            # Save the attachment under pdf/ (the folder must already exist)
            fp = open('pdf/'+fname, 'wb')
            fp.write(part.get_payload(decode=True))
            fp.close()
            # Save the extracted text next to it as <name>.txt
            c = getPDFContent('pdf/'+fname).encode("ascii","xmlcharrefreplace")
            fp = open('pdf/'+fname+".txt", 'wb')
            fp.write(c)
            fp.close()

def getPDFContent(path):
    content = ""
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

if __name__ == '__main__':
    maildir = '/var/spool/mail/t.uehara'
    m = mailbox.Maildir(maildir)
    # procmail delivers new messages to the new/ subfolder
    for key in m.keys():
        pdfmail(maildir+'/new/'+key)
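One caveat: the script uses the attachment's file name as it arrives in the mail and expects the pdf folder to exist already. A possible hardening is sketched below. The helper names is_pdf_name and safe_pdf_path are hypothetical (they are not part of pyPdf or the email package), and the fallback name attachment.pdf is just an assumption.

```python
import os


def is_pdf_name(fname):
    """True only when the name ends in .pdf, ignoring case."""
    return fname.lower().endswith('.pdf')


def safe_pdf_path(fname, outdir='pdf'):
    """Return a path inside outdir for an attachment named fname."""
    # Drop directory components, including Windows-style ones, so a name
    # like ../../x.pdf cannot escape outdir
    base = os.path.basename(fname.replace('\\', '/'))
    if base in ('', '.', '..'):
        base = 'attachment.pdf'  # assumed fallback name
    # Create the output folder on first use
    if not os.path.isdir(outdir):
        os.makedirs(outdir)
    return os.path.join(outdir, base)
```

In pdfmail(), the test fname.find('.pdf') != -1 could then become is_pdf_name(fname), and each 'pdf/'+fname could become safe_pdf_path(fname).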
Finally, create a batch file like the following and run it regularly with the Windows Task Scheduler.
pdfmail.bat
C:\cygwin\bin\bash --login -i -c "fetchmail"
C:\cygwin\bin\python2.7.exe pdfmail.py
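The keyword check mentioned at the beginning is not part of pdfmail.py, but it could be run on the generated .txt files roughly as follows. This is a sketch under assumptions: find_keywords and to_charrefs are hypothetical helpers, keywords are passed as Unicode strings, and matching is a plain case-insensitive substring test. Because pdfmail.py writes non-ASCII characters as XML character references (&#NNNN;), the keywords get the same conversion before they are compared.

```python
import glob
import io
import os


def to_charrefs(s):
    """Apply the same ASCII conversion that pdfmail.py uses for the .txt files."""
    return s.encode('ascii', 'xmlcharrefreplace').decode('ascii')


def find_keywords(textdir, keywords):
    """Map each *.txt file name in textdir to the keywords found in it."""
    hits = {}
    for path in sorted(glob.glob(os.path.join(textdir, '*.txt'))):
        with io.open(path, encoding='ascii', errors='replace') as fp:
            text = fp.read().lower()
        # Compare in the converted form, but report the original keyword
        found = [k for k in keywords if to_charrefs(k).lower() in text]
        if found:
            hits[os.path.basename(path)] = found
    return hits
```

Something like find_keywords('pdf', [u'invoice']) could be appended to the batch run, and whatever handling is appropriate would be triggered when the returned dictionary is not empty.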