Read table data in a PDF file with Python

PDF data

People seem to love PDF, and even those who say they hate it still have to deal with it. Still, it is natural to feel that spending hours on a PDF is a bit much. Sometimes the table data you need exists only as a PDF, and a super convenient library called tabula-py was useful in such cases, so here is a note on it.

https://github.com/chezou/tabula-py

About tabula

tabula is a Java library for extracting tables from PDFs, and tabula-py is a Python wrapper around it. Therefore, you need to install Java to use it.

After installing Java, install the Python library as follows.

$ pip install tabula-py
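
Because tabula-py relies on Java at run time, it can help to confirm that a java executable is reachable before reading a PDF. The following is a minimal sketch using only the standard library; the helper name java_available is hypothetical and is not part of tabula-py.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None


# Print a hint up front instead of failing later inside tabula-py.
if java_available():
    print("Java found; tabula-py should be able to run.")
else:
    print("Java not found on PATH; install it before using tabula-py.")
```

This only checks that the command exists on the PATH; it does not verify the Java version.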

How to Use

Usage is simple: the read_pdf function reads the tables in a PDF file. The example here is the Ministry of Health, Labour and Welfare document on the number of people who tested positive for the novel coronavirus (excluding those who returned on charter flights) and the number of people given PCR tests (https://www.mhlw.go.jp/content/10906000/000618483.pdf).


from tabula import read_pdf

# Requires Java. The PDF is given here as a URL.
df = read_pdf("https://www.mhlw.go.jp/content/10906000/000618483.pdf")

The result contains more than one table, because the PDF holds multiple tables. Next, specify which table to retrieve.
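
Picking one table can be sketched as follows, assuming the result behaves like a Python list of pandas DataFrames. The two small DataFrames are made-up stand-ins, not the contents of the PDF, and pick_table is a hypothetical helper.

```python
import pandas as pd

# Made-up stand-ins for the tables extracted from a PDF (not real data).
tables = [
    pd.DataFrame({"region": ["A", "B"], "positive": [1, 2]}),
    pd.DataFrame({"region": ["C"], "positive": [3]}),
]


def pick_table(tables, index):
    """Return the table at `index`, with a clearer error if it is missing."""
    if not 0 <= index < len(tables):
        raise IndexError(f"expected an index below {len(tables)}, got {index}")
    return tables[index]


first = pick_table(tables, 0)
print(first)
```

Plain indexing such as tables[0] does the same job; the helper only adds a more readable error when the index is out of range.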

Each table comes back as a pandas DataFrame, which is super convenient. In this PDF file the data is split across two columns, so you need to join the pieces back together. Because they are DataFrames, you can use the pandas concat function for this.
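
Joining the pieces with concat can be sketched as follows. This assumes the extracted table holds two side-by-side blocks with the same columns; the data, the column names, and the helper stack_halves are all hypothetical and only illustrate the idea.

```python
import pandas as pd

# Made-up table whose right half repeats the columns of the left half.
wide = pd.DataFrame({
    "region": ["A", "B"],
    "positive": [1, 2],
    "region.1": ["C", "D"],
    "positive.1": [3, 4],
})


def stack_halves(df):
    """Split a table into left and right halves and stack them vertically."""
    half = df.shape[1] // 2
    left = df.iloc[:, :half]
    right = df.iloc[:, half:].copy()
    # Give the right half the same column names so the rows line up.
    right.columns = left.columns
    return pd.concat([left, right], ignore_index=True)


stacked = stack_halves(wide)
print(stacked)
```

Passing ignore_index=True renumbers the rows from 0, so the stacked table does not carry duplicate index labels from the two halves.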

Since it is a data frame, it is easy to visualize.
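
As a rough illustration of that point, a bar chart can be drawn straight from a DataFrame with pandas' plotting interface, which uses matplotlib. The numbers and column names below are made up and are not taken from the PDF.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd

# Made-up numbers for illustration only (not taken from the PDF).
df = pd.DataFrame({"region": ["A", "B", "C"], "positive": [5, 2, 8]})

# Sort so the tallest bar comes first, then draw one bar per row.
ordered = df.sort_values("positive", ascending=False)
ax = ordered.plot.bar(x="region", y="positive", legend=False)
ax.set_ylabel("positive")
```

To write the chart to an image file, ax.figure.savefig can be called with a file name of your choice.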

As you can see, tabula-py makes it easy to get table data out of a PDF!
