Working with PDF files in Python is hard.
In some cases, the table is embedded in the PDF.
For example
A lot of table data is embedded in 2018 Zeneiren Stress Check Service Implementation Result Report. I will.
For example, suppose you want to extract [Table 14 Percentage of people with high stress by industry] on page 40 of the file.
Select this table and copy and paste it into Excel.
Copy and paste into Excel.
Oh? It doesn't work. All the data is listed in one cell.
In fact, you can use Python to convert the tables in this PDF to csv or Excel relatively easily.
The procedure to convert a table in PDF to csv or excel with Python is 2 steps.
Step 1. Extract the table from PDF as a pandas DataFrame Step 2. Write DataFrame as csv or excel
Let's look at them in order.
To extract a pdf table as a DataFrame, use a module called ** tabula **.
I think that there are many people who do not have tabula installed.
pip install tabula-py
Install with.
In addition, this tabula runs on Java, so you also need to install Java.
Now that tabula is ready, import pandas and tabula.
python
import pandas as pd
import tabula
To extract a table from PDF
tabula.read_pdf("xxx.pdf", lattice=True, pages='xxxx')
Use the function.
--In " xxx.pdf "
, write the path of the PDF file you want to read.
--lattice = True
is an option to determine cells by table borders.
If the table you want to extract is separated by a ruled line, specify lattice = True.
--pages
specifies the pages you want to load.
--If you want to read only the 40th page, specify as pages = '40'
.
--If you want to load pages 40-45, specify as pages = '40 -45'
.
--If you want to load all pages, you can use pages ='all'
.
--The function returns a `list of pandas.DataFrame``. If you have multiple tables, you can retrieve this list in order with a for statement.
Now, let's read the table on page 40 of the "2018 Zeneiren Stress Check Service Implementation Result Report".
python
# lattice=True to determine cells by table axis
dfs = tabula.read_pdf("2018 Zeneiren stress check service implementation result report.pdf", lattice=True, pages = '40')
for df in dfs:
display(df)
It's strange that \ r
is included because the column name was broken in the cell, but the contents of the table can be extracted well.
For the time being, let's correct the column names properly.
You can change the column name with df.rename (columns = {'original column name':'changed column name'})
.
When changing multiple column names, it is OK if you connect the 'original column name':' the changed column name'
by separating them with commas.
python
df = df.rename(columns={'High stress\Number of people': 'High stressス者数', 'High stressス\Percentage of r people': 'High stressス者の割合'})
Originally, pandas has a function to write DataFrame as csv or Excel.
Save as csv: df.to_csv ("filename.csv", index = None)
Save as Excel: df.to_excel ("filename.xlsx", index = None)
Finally, I will summarize this code.
pyhon
import pandas as pd
import tabula
# lattice=True to determine cells by table axis
dfs = tabula.read_pdf("2018 Zeneiren stress check service implementation result report.pdf", lattice=True, pages = '40')
#Check if you can get the PDF table properly
for df in dfs:
display(df)
# csv/Save as Excel(This time dfs[0]only)
df = dfs[0].rename(columns={'High stress\Number of people': 'High stressス者数', 'High stressス\Percentage of r people': 'High stressス者の割合'})
df.to_csv("PDF table.csv", index=None) # csv
df.to_excel("PDF table.xlsx", index=None) # Excel
Here are other automation series I wrote. If you are interested, please!
[Automation] Read a Word document with Python https://qiita.com/konitech913/items/c30236bdf47775535e2f
[Automation] Convert Python code into an exe file https://qiita.com/konitech913/items/6259f13e057bc25ebc23
[Automation] Send Outlook email with Python https://qiita.com/konitech913/items/51867dbe24a2a4272bb6
[Automation] Read Outlook emails with Python https://qiita.com/konitech913/items/8a285522b0c118d5f905
[Automation] Read mail (msg file) with Python https://qiita.com/konitech913/items/fa0cf66aad27d16258c0
[Automation] Operate the clipboard with Python and paste the table into Excel https://qiita.com/konitech913/items/83975332e395a387eace
Recommended Posts