Checking whether a file is a PDF, and exception handling: up to creating your own exception

This post is part of the series "Creating a service that edits the usage history of Mobile Suica so that it can be easily used for expense settlement". The previous post was "Make Mobile Suica usage history PDF into pandas DataFrame format with tabula-py". The finished product is here: https://www.mobilesuica.work

Error handling

Error handling is a high hurdle for an amateur like me. It is difficult because you have to anticipate the exceptions that may occur and the mistakes users may make (although, of course, more cases get added as testing reveals them).

This time I consider the following two kinds of errors.

- The PDF passed to tabula-py is incorrect
  - The uploaded file is not a PDF, it is not a PDF of Mobile Suica, etc.
- Something went wrong in the tabula-py process itself

After looking at the link below and other pages, I concluded that the right approach is to catch the exception with try-except and pass it on with raise. Most of the code you find on GitHub does it that way. * How to implement error notification in a Python program

Checking whether a file is a PDF

I found a handy library called **PyPDF2**.

test.py


import PyPDF2
with open('a.pdf','rb') as f:
    pageNum = PyPDF2.PdfFileReader(f).getNumPages()
    print(f"The number of pages is {pageNum}")

Execution result

(app-root) bash-4.2# python3 test.py
The number of pages is 2

If you pass a non-PDF file, it will look like this

test.py


import PyPDF2

fileList = ['a.pdf','test.py']  # test.py is deliberately not a PDF
for fileName in fileList:
    with open(fileName,'rb') as f:
        pageNum = PyPDF2.PdfFileReader(f).getNumPages()
        print(f"{fileName} has {pageNum} pages")

Execution result

(app-root) bash-4.2# python3 test.py
a.pdf has 2 pages
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    pageNum = PyPDF2.PdfFileReader(f).getNumPages()
  File "/opt/app-root/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/opt/app-root/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1696, in read
    raise utils.PdfReadError("EOF marker not found")
PyPDF2.utils.PdfReadError: EOF marker not found

**PyPDF2.utils.PdfReadError: EOF marker not found** appears to be the exception PyPDF2 raises when the file is not a PDF. The message changes depending on the file passed; I also saw **Could not read malformed PDF file**. In any case, unless this exception is handled properly, the program stops here.
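
As a cheap first filter before handing the file to PyPDF2, one could also peek at the first bytes of the upload. The following is a minimal sketch, assuming the file begins with the `%PDF-` signature that PDF files normally start with. `looks_like_pdf` is a hypothetical helper name, not part of PyPDF2, and passing this check does not prove the file is a valid PDF, so the PdfReadError still has to be handled.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature b'%PDF-'.

    Hypothetical helper: it only inspects the first five bytes, so a
    truncated or corrupted PDF still passes and must be caught later.
    """
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

A check like this only rules out obviously wrong uploads early; it does not replace the exception handling.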

Exception handling

For the time being, it is enough to be able to catch **PyPDF2.utils.PdfReadError**.

test.py


import PyPDF2

try:
    fileList = ['a.pdf','test.py']
    for fileName in fileList:
        with open(fileName,'rb') as f:
            pageNum = PyPDF2.PdfFileReader(f).getNumPages()
            print(f"{fileName} has {pageNum} pages")
except PyPDF2.utils.PdfReadError as e:
    # Raised by PyPDF2 when the file cannot be parsed as a PDF
    print(f"ERROR: {e} PDF is incorrect")

Execution result

(app-root) bash-4.2# python3 test.py
a.pdf has 2 pages
ERROR: EOF marker not found PDF is incorrect

Now the program no longer crashes when the exception occurs. Ultimately, this lets you notify the user who uploaded the non-PDF file.
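
Note that with the `try` wrapped around the whole loop, the first bad file ends the loop, so any files after it are never checked. If every file should be examined, the `try` can go inside the loop instead. A minimal sketch: `FakePdfReadError` and `count_pages` are hypothetical stand-ins for `PyPDF2.utils.PdfReadError` and `PyPDF2.PdfFileReader(f).getNumPages()`, used only so the example runs without PyPDF2 installed.

```python
class FakePdfReadError(Exception):
    """Stand-in for PyPDF2.utils.PdfReadError (assumption for this sketch)."""


def count_pages(data):
    """Stand-in for the PyPDF2 page count; rejects data without a PDF header."""
    if not data.startswith(b"%PDF-"):
        raise FakePdfReadError("EOF marker not found")
    return data.count(b"PAGE")  # dummy page count, for illustration only


def check_files(files):
    """Check each (name, bytes) pair; return (page counts, error messages)."""
    pages, errors = {}, []
    for name, data in files:
        try:
            pages[name] = count_pages(data)
        except FakePdfReadError as e:
            # Record the failure and carry on with the next file.
            errors.append(f"ERROR: {e} {name} is not a valid PDF")
    return pages, errors
```

With this layout a bad file listed first no longer prevents the later files from being processed, and all error messages can be reported together.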

However, this has one drawback. Because no traceback is printed, you cannot tell later where in the source code the exception occurred. The solution is simple: use the **traceback** module.

test.py


import traceback
# ... (omitted)
except PyPDF2.utils.PdfReadError as e:
    print(f"ERROR: {e} PDF is incorrect")
    traceback.print_exc()  # print the full traceback to stderr

Execution result

(app-root) bash-4.2# python3 test.py
a.pdf has 2 pages
ERROR: EOF marker not found PDF is incorrect
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    pageNum = PyPDF2.PdfFileReader(f).getNumPages()
  File "/opt/app-root/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/opt/app-root/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1696, in read
    raise utils.PdfReadError("EOF marker not found")
PyPDF2.utils.PdfReadError: EOF marker not found

Unlike before, the traceback is now printed.
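
`traceback.print_exc()` writes to standard error. If the traceback should be stored or sent to a log, `traceback.format_exc()` returns the same text as a string, and `logging.exception()` logs a message together with the traceback. A minimal sketch using only the standard library; `risky` is a hypothetical function standing in for the PyPDF2 call, and the logger name is arbitrary.

```python
import logging
import traceback

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("pdfcheck")  # logger name is arbitrary


def risky():
    """Hypothetical stand-in for a call that fails on a non-PDF file."""
    raise ValueError("EOF marker not found")


def run():
    """Call risky() and return the traceback text instead of crashing."""
    try:
        risky()
    except ValueError:
        # Logs the message plus the full traceback at ERROR level.
        logger.exception("PDF is incorrect")
        # The same traceback as a string that can be stored or returned.
        return traceback.format_exc()
    return ""
```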

Looking at other people's source code, many write **except Exception as e:**, which catches all ordinary exceptions, because Exception is their base class. From what I have read, if you want to change the processing depending on which exception was caught, you name the specific exception, such as **PyPDF2.utils.PdfReadError**; if the processing is the same for all of them, **except Exception as e:** seems to be fine. For example, something like this?

test.py


import PyPDF2
import tabula

try:
    fileList = ['test.py']
    for fileName in fileList:
        with open(fileName,'rb') as f:
            pageNum = PyPDF2.PdfFileReader(f).getNumPages()
            print(f"{fileName} has {pageNum} pages")
            df = tabula.read_pdf(fileName,pages='all',pandas_options={'dtype':'object'})
except PyPDF2.utils.PdfReadError as e:  # specific exception first
    print(f"ERROR: {e} PDF is incorrect")
except Exception as e:  # everything else
    print(f"ERROR: {e} Something is wrong")

By the way, if you write the base class first, it catches everything and the more specific handlers after it are never reached, so write the base class last.
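
The effect of the handler order can be seen in a small standalone sketch. `PdfError` is a hypothetical exception class standing in for `PyPDF2.utils.PdfReadError`; the two functions differ only in the order of the `except` clauses.

```python
class PdfError(Exception):
    """Hypothetical specific exception, standing in for PdfReadError."""


def base_first():
    try:
        raise PdfError("EOF marker not found")
    except Exception:
        return "generic handler"
    except PdfError:  # never reached: Exception above already matched
        return "specific handler"


def specific_first():
    try:
        raise PdfError("EOF marker not found")
    except PdfError:
        return "specific handler"
    except Exception:
        return "generic handler"
```

Python tries the `except` clauses from top to bottom and runs the first one that matches, so `base_first()` ends up in the generic handler and `specific_first()` in the specific one.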

Error handling with tabula-py

Even if you give tabula-py a PDF that contains no table, no exception is raised, so it seems enough to check whether the file is a Mobile Suica PDF. First, read the header of the DataFrame. I prepared a blank file (blank.pdf) as the wrong PDF.

test.py


import tabula

fileList = ['a.pdf','blank.pdf']
for fileName in fileList:
    # Here read_pdf returns a list of DataFrames, one per table found
    df = tabula.read_pdf(fileName,pages='all',pandas_options={'dtype':'object'})
    h = df[0].columns.values.tolist()
    print(f"The header is {h}")

Execution result

(app-root) bash-4.2# python3 test.py
The header is ['Month', 'Day', 'Type', 'Station', 'Type.1', 'Station.1', 'Balance', 'difference']
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    h = df[0].columns.values.tolist()
IndexError: list index out of range

Apparently this raises an IndexError, so I will handle that exception. Going one step further, someone could upload a PDF containing a table in a different format from the Mobile Suica one; in that case a DataFrame header can be extracted, but it is the wrong header. Checking the contents of the header would catch this, but it seems likely to introduce other bugs, so I will stop the error handling at this point. The final program, which receives the list of files uploaded by the user as fileList and processes them, looks like this.

test.py


import tabula
import PyPDF2
import traceback
import pandas as pd

try:
    fileList = ['a.pdf','blank.pdf']
    dfList = []
    for fileName in fileList:
        with open(fileName,'rb') as f:
            pageNum = PyPDF2.PdfFileReader(f).getNumPages()  # raises PdfReadError if not a PDF
            df = tabula.read_pdf(fileName,pages='all',pandas_options={'dtype':'object'})
            if df[0].columns.values.tolist():  # raises IndexError if no table was found
                for i in range(len(df)):
                    dfList.append(df[i])
                print(f"{fileName} was processed correctly")
    d = pd.concat(dfList,ignore_index=True)

except PyPDF2.utils.PdfReadError as e:
    print(f"ERROR: {e} {fileName} does not seem to be a PDF")
    traceback.print_exc()
except IndexError as e:
    print(f"ERROR: {e} {fileName} is a PDF, but it seems that it is not a PDF of Mobile Suica")
    traceback.print_exc()
except Exception as e:
    print(f"ERROR: {e} Something is wrong with {fileName}")
    traceback.print_exc()

Execution result

(app-root) bash-4.2# python3 test.py
a.pdf was processed correctly
ERROR: list index out of range blank.pdf is a PDF, but it seems that it is not a PDF of Mobile Suica
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    if df[0].columns.values.tolist():
IndexError: list index out of range
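
For reference, the empty result and the header contents could also be checked explicitly rather than relying on the IndexError from `df[0]`. This is only a sketch of what such a check might look like, not what the service does. `EXPECTED_HEADER` is copied from the header printed earlier, and `check_tables` is a hypothetical helper that takes the header of each extracted table as a plain list (with tabula-py this could be built as `[t.columns.tolist() for t in df]`). Only the first table is compared, mirroring the `df[0]` check above.

```python
EXPECTED_HEADER = ['Month', 'Day', 'Type', 'Station',
                   'Type.1', 'Station.1', 'Balance', 'difference']


def check_tables(headers):
    """Return an error message for a list of table headers, or None if OK."""
    if not headers:
        # No table was extracted at all (e.g. a blank PDF).
        return "no table found"
    first = list(headers[0])
    if first != EXPECTED_HEADER:
        return f"unexpected header: {first}"
    return None
```

A strict comparison like this would reject the file whenever the header text changes even slightly, which is one way such a check could introduce new problems.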

Creating your own exception

Since the processing up to this point will be called as a function, the error messages only appear in the server log and are not visible to the user. To make them visible to the user, the error has to be passed back to the caller. If you simply re-raise the exception with **raise**, only the original error message is passed on. Turned into a function, it looks like this:

test.py


def test():
    try:
        fileList = ['copy.sh']
        dfList = []
        for fileName in fileList:
            with open(fileName,'rb') as f:
                pageNum = PyPDF2.PdfFileReader(f).getNumPages()
                df = tabula.read_pdf(fileName,pages='all',pandas_options={'dtype':'object'})
                if df[0].columns.values.tolist():
                    for i in range(len(df)):
                        dfList.append(df[i])
                    print(f"{fileName} was processed correctly")
        d = pd.concat(dfList,ignore_index=True)
        print(d)

    except PyPDF2.utils.PdfReadError as e:
        print(f"ERROR: {e} {fileName} doesn't seem to be a PDF (in test())")
        raise  # re-raise the original exception unchanged
    except IndexError as e:
        print(f"ERROR: {e} {fileName} is a PDF, but it seems that it is not a PDF of Mobile Suica")
    except Exception as e:
        print(f"ERROR: {e} Something is wrong with {fileName}")

try:
    test()
except Exception as e:
    print(f"{e} (caller)")

Execution result

(app-root) bash-4.2# python3 test.py
ERROR: Could not read malformed PDF file copy.sh doesn't seem to be a PDF (in test())
Could not read malformed PDF file (caller)

The caller cannot tell which file was bad. The only way to solve this seemed to be to define my own exception, which was a little discouraging, but I was relieved to find that it is fairly easy. This site has an easy-to-understand explanation. You just add two lines! * Code example (3 types) to create and use exceptions in Python

So here is the version with my own exception.

test.py


import tabula
import PyPDF2
import traceback
import pandas as pd

class ConvertError(Exception):
    pass

def test():
    try:
        fileList = ['copy.sh']
        dfList = []
        for fileName in fileList:
            with open(fileName,'rb') as f:
                pageNum = PyPDF2.PdfFileReader(f).getNumPages()
                df = tabula.read_pdf(fileName,pages='all',pandas_options={'dtype':'object'})
                if df[0].columns.values.tolist():
                    for i in range(len(df)):
                        dfList.append(df[i])
                    print(f"{fileName} was processed correctly")
        d = pd.concat(dfList,ignore_index=True)
        print(d)
        
    except PyPDF2.utils.PdfReadError as e:
        traceback.print_exc()
        errorText = f"ERROR: {e} {fileName} doesn't seem to be a PDF => in test()"
        print(errorText)
        raise ConvertError(errorText)  # pass the message, including the file name, to the caller
    except IndexError as e:
        traceback.print_exc()
        errorText = f"ERROR: {e} {fileName} is a PDF, but it seems that it is not a PDF of Mobile Suica"
        print(errorText)
        raise ConvertError(errorText)
    except Exception as e:
        traceback.print_exc()
        errorText = f"ERROR: {e} Something is wrong with {fileName}"
        print(errorText)
        raise ConvertError(errorText)

try:
    test()
except Exception as e:
    print(f"{e} (caller)")

Execution result

(app-root) bash-4.2# python3 test.py
Traceback (most recent call last):
  File "test.py", line 15, in test
    pageNum = PyPDF2.PdfFileReader(f).getNumPages()
  File "/opt/app-root/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/opt/app-root/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1697, in read
    line = self.readNextEndLine(stream)
  File "/opt/app-root/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1937, in readNextEndLine
    raise utils.PdfReadError("Could not read malformed PDF file")
PyPDF2.utils.PdfReadError: Could not read malformed PDF file
ERROR: Could not read malformed PDF file copy.sh doesn't seem to be a PDF => in test()
ERROR: Could not read malformed PDF file copy.sh doesn't seem to be a PDF => in test() (caller)
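
One possible refinement, not used above: `raise ... from e` keeps the original exception attached as `__cause__`, and the custom exception can carry the file name as an attribute so that the caller does not have to parse the message. A minimal sketch; the `file_name` attribute and the `convert` function are hypothetical, and `ValueError` stands in for `PyPDF2.utils.PdfReadError` so that the example runs without PyPDF2.

```python
class ConvertError(Exception):
    """Raised when an uploaded file cannot be converted."""

    def __init__(self, message, file_name):
        super().__init__(message)
        self.file_name = file_name  # hypothetical extra attribute


def convert(file_name):
    """Pretend to parse a file; anything not ending in .pdf fails."""
    try:
        if not file_name.endswith(".pdf"):
            raise ValueError("Could not read malformed PDF file")
        return "ok"
    except ValueError as e:
        # Chain the original error so its traceback is not lost.
        raise ConvertError(
            f"ERROR: {e} {file_name} doesn't seem to be a PDF", file_name
        ) from e
```

The caller can then catch `ConvertError`, show `str(err)` to the user, and use `err.file_name` or `err.__cause__` for logging.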
