The world seems to love PDF, and even people who say they hate it still have to deal with it. That said, it is only natural to feel that spending hours wrestling with a PDF is a bit much. Sometimes table data is only available as a PDF, and for those cases I found a super convenient library called tabula-py. Here are my notes.
https://github.com/chezou/tabula-py
tabula is a Java library for extracting tables from PDFs, and tabula-py is a Python wrapper around it. You therefore need Java installed to use it.
After installing Java, you can install the Python library as follows.
$ pip install tabula-py
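Since tabula-py relies on Java, it can help to confirm that a java executable is actually on your PATH before going further. The following is only a small sketch using the standard library; the helper name java_available is my own and is not part of tabula-py.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if not java_available():
    # Hypothetical message; adjust to your own setup.
    print("java was not found on PATH, so tabula-py will not be able to run")
```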
It's easy to use: the read_pdf function reads the tables in a PDF file. As an example, I use the Ministry of Health, Labour and Welfare's PDF on the number of people who tested positive for the novel coronavirus (excluding those who returned on charter flights) and the number of people given PCR tests (https://www.mhlw.go.jp/content/10906000/000618483.pdf).
from tabula import read_pdf
df = read_pdf("https://www.mhlw.go.jp/content/10906000/000618483.pdf")
The result of reading the table is displayed below.
The output looks like this because the PDF contains multiple tables. Next, specify which table to retrieve.
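One possible way to pick out a single table is sketched below. It assumes that read_pdf returns a list of DataFrames when multiple_tables=True, which is how recent tabula-py releases behave; older releases could return a single DataFrame, so the sketch normalizes both cases. The helper names as_table_list and load_tables are my own, and which index holds the table you want depends on the PDF.

```python
def as_table_list(result):
    """Wrap a single table in a list so callers can always index into the result."""
    return result if isinstance(result, list) else [result]


def load_tables(source, pages="all"):
    """Read every table tabula-py can find in `source` and return them as a list."""
    # Imported here so the helper above can be used without Java installed.
    from tabula import read_pdf

    return as_table_list(read_pdf(source, pages=pages, multiple_tables=True))


# Example usage (needs Java and network access, so it is left commented out):
# tables = load_tables("https://www.mhlw.go.jp/content/10906000/000618483.pdf")
# print(len(tables))   # how many tables were detected
# df = tables[0]       # pick one table by its position in the list
```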
As you can see above, each table comes back as a pandas DataFrame. It's super convenient. In this PDF file, the data is split into two columns on the page, so the pieces need to be joined back together. Since they are DataFrames, you can do this with the pandas concat function.
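As a rough sketch of the joining step, the example below stacks two pieces with pd.concat after giving them the same column names. The data is a toy stand-in, not the real figures from the PDF, and the helper name stack_halves and the column names are my own assumptions.

```python
import pandas as pd


def stack_halves(left, right, columns):
    """Give both halves the same column names, then stack them vertically."""
    left = left.copy()
    right = right.copy()
    left.columns = columns
    right.columns = columns
    # ignore_index=True renumbers the rows 0..n-1 in the combined table.
    return pd.concat([left, right], ignore_index=True)


# Toy stand-ins for the left and right halves of a table split across the page.
left = pd.DataFrame({"A": ["x", "y"], "B": [1, 2]})
right = pd.DataFrame({"A.1": ["z"], "B.1": [3]})

combined = stack_halves(left, right, ["name", "count"])
print(combined)
```

Renaming the columns first matters because concat aligns on column names; without it, the two halves would end up in separate columns padded with NaN.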
Since the result is a DataFrame, it is also easy to visualize.
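For instance, a bar chart can be drawn straight from the DataFrame with its plot accessor. This is only a sketch with toy data in place of the real table; it assumes matplotlib is installed, and the helper name plot_counts, the column names, and the output file name are all my own.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import pandas as pd


def plot_counts(df, label_col, value_col, out_path):
    """Draw a bar chart of value_col per label_col and save it to out_path."""
    ax = df.plot.bar(x=label_col, y=value_col, legend=False)
    ax.set_ylabel(value_col)
    ax.figure.tight_layout()
    ax.figure.savefig(out_path)
    return ax


# Toy data standing in for a table extracted from a PDF.
toy = pd.DataFrame({"name": ["x", "y", "z"], "count": [1, 2, 3]})
out_path = os.path.join(tempfile.gettempdir(), "counts.png")
ax = plot_counts(toy, "name", "count", out_path)
```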
So, as you can see, you can easily get table data out of a PDF by using tabula-py!