"Creating a service that edits the usage history of Mobile Suica so that it can be easily used for expense settlement" Last time, I created an execution environment with Docker Click here for the finished product https://www.mobilesuica.work
tabula-py tabula-py First of all, I tried this easily
test.py
import tabula
import pandas as pd
df = tabula.read_pdf('a.pdf')
print(df.head())
Unfortunately it was an error
AttributeError: 'list' object has no attribute 'head'
Apparently it's returned in an array If so, should I do this?
test.py
df = tabula.read_pdf('a.pdf')
print(df[0].head())
It was done (the result station name has changed)
Date Type Use Station Type.1 Station used.1 Balance difference
0 2 6 Repeat NaN NaN NaN\9,753 NaN
2 2 7 Enter Tokyo Exit Yokohama\9,573 -180
3 2 7 Enter Yokohama Exit Tokyo\9,393 -180
By default, only the first page is used, so add pages ='all' and combine the two pages. The maximum usage history of Mobile Suica is 100, and if you include the deferral, there are 101, and if you convert it to PDF, it will be up to 2 pages.
test.py
df = tabula.read_pdf('a.pdf',pages='all')
for i in range(len(df)):
print(f"{i+1}On the page{len(df[i])}There is a line")
Execution result
(app-root) bash-4.2# python3 test.py
There are 51 lines on one page
There are 50 lines on 2 pages
You can see that the first page has one more line for "deferred". Since it is difficult to handle if they are separated, we will use one DataFrmae. Use concat.
test.py
df = tabula.read_pdf('a.pdf',pages='all')
d = pd.concat(df,ignore_index=True)
print(f"{len(d)}There is a line")
Execution result
(app-root) bash-4.2# python3 test.py
There are 101 lines
** ignore_index = True **, but if you do not include this, the index will not be a serial number properly. Let's compare.
ignore_index = True None Index returns to 0 at the boundary between the 1st and 2nd pages
50 3 4 Enter Tokyo Depart Yokohama\9,443 -180
0 3 5 Cash NaN NaN NaN\9,543 100
ignore_index = True Yes It is a serial number as it is
50 3 4 Enter Tokyo Depart Yokohama\9,443 -180
51 3 5 Cash NaN NaN NaN\9,543 100
If you go this far, you can do the same with multiple files. There is also convert_into_by_batch in tabula-py, but it's a bit difficult to use because it's like specifying a directory and collecting PDF files under it. (Since I had only one usage history, I read the same file many times)
test.py
fileList = ['a.pdf','a.pdf','a.pdf']
dfList = []
for fileName in fileList:
df = tabula.read_pdf(fileName,pages='all')
for i in range(len(df)):
dfList.append(df[i])
d = pd.concat(dfList,ignore_index=True)
print(f"{len(d)}There is a line")
Execution result
(app-root) bash-4.2# python3 test.py
There are 303 lines
So far, I noticed that there is a slight difference between the first page and the second page.
Charge part on the first page
24 2 21 Cash NaN NaN NaN\9,883 +100
Charge part on the second page
78 3 21 Cash NaN NaN NaN\10,317 100
There is no "+" on the second page. I can do it without any problem in terms of data processing, but it feels bad. When I looked it up, it was the difference between whether it was recognized as a number (second page) or as a character (first page).
14 2 15 Bus etc. OKK NaN NaN\9,103 -2,000
If it exceeds 3 digits, Thousand Separator (,) will be included, so it seems that all the pages are recognized as characters. Page 2 was recognized as a number because it wasn't there (all less than 3 digits). It is more convenient to take it as it is as a character, so when I looked it up with pandas, I found that ** dtype ='object' ** should be used. tabula-py is a great way to pass pandas options as is. The final result is as follows.
test.py
fileList = ['a.pdf','a.pdf','a.pdf']
dfList = []
for fileName in fileList:
df = tabula.read_pdf(fileName,pages='all',pandas_options={'dtype':'object'})
for i in range(len(df)):
dfList.append(df[i])
d = pd.concat(dfList,ignore_index=True)
It's amazing that you can drop a PDF table into a DataFrame with just this.
Recommended Posts