Open data is a political movement that seeks to make government data freely available to process, redistribute, and use commercially. It is currently attracting attention from the perspectives of political transparency and economic revitalization, and the Japanese government has in fact begun releasing data.
-> Reference site: Open DATA METI | Ministry of Economy, Trade and Industry's open data catalog site
However, one problem with open data in Japan is that much of it is released as one-star (☆1) open data. Open data is rated on a five-star scale according to its openness.
One-star open data, that is, PDF, is considered the most closed because it is not structured data.
However, it is difficult to explain the importance of machine readability to civil servants who are not tech-savvy. Even if they understand it, it is doubtful whether they can allocate a budget for it. In practice, we have to deal with PDF.
PDFMiner is an interesting tool for that.
PDFMiner is a Python library mainly for extracting and analyzing text information from PDF files. Judging from Google Trends, it seems to have been attracting attention since around 2011.
There are already applications that convert PDF to TXT / HTML, but PDFMiner manages the components of a PDF page in a tree structure (e.g. LTPage -> LTTextBox -> LTTextLine -> LTChar, LTText), which allows finer adjustments.
Because of this, I think we could, for example, prepare a program that converts the standard PDF format of each ministry and agency to TXT.
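As a rough illustration of the tree structure described above, the following sketch walks a page layout and reports the class name of every node together with its depth. It assumes a PDFMiner version that provides PDFPage.create_pages (pdfminer.six, or the 20140328 release of PDFMiner); older releases use a different setup sequence. The function names walk_layout and iter_layout_pages are made up for this example and are not part of PDFMiner.

```python
def walk_layout(node, depth=0):
    """Yield (depth, class name) for a layout node and all of its descendants."""
    yield depth, type(node).__name__
    try:
        children = iter(node)
    except TypeError:
        return  # leaf objects such as LTChar are not iterable
    for child in children:
        yield from walk_layout(child, depth + 1)


def iter_layout_pages(path):
    """Yield one LTPage per page of the PDF at `path` (hypothetical helper).

    pdfminer is imported here so that walk_layout stays usable on its own.
    Assumption: pdfminer.six or PDFMiner 20140328 is installed.
    """
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(path, "rb") as fp:
        document = PDFDocument(PDFParser(fp))
        rsrcmgr = PDFResourceManager()
        # PDFPageAggregator performs layout analysis and keeps the result in memory.
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            yield device.get_result()
```

To print an indented outline of a file, loop over iter_layout_pages("samples/simple1.pdf") and, for each page, print "  " * depth + name for every pair produced by walk_layout(page). Once you can see which node types a ministry's PDFs actually contain, it becomes easier to decide where to filter or regroup text.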
Follow the steps below to install PDFMiner.
python setup.py install
pdf2txt.py samples/simple1.pdf
# After running the command, you should see "Hello" and "World" printed repeatedly.
# -> The text has been successfully extracted from the sample PDF.
# To process CJK (Chinese, Japanese, Korean) PDFs, one more step is needed:
make cmap
python setup.py install
# On Windows, where make is not available, run the following instead of make cmap:
mkdir pdfminer\cmap
python tools\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt
python tools\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt
python tools\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt
python tools\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt
python setup.py install
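The -c options in the commands above appear to pair a CMap encoding name with a Python codec name (for example RKSJ=cp932). Assuming that reading is right, the following small check confirms that your Python installation knows every codec listed before you run the conversion. The table is copied from the commands above; the function name missing_codecs is made up for this example.

```python
import codecs

# Encoding name -> Python codec name, copied from the -c options above.
CMAP_CODECS = {
    "Adobe-CNS1": {"B5": "cp950", "UniCNS-UTF8": "utf-8"},
    "Adobe-GB1": {"GBK-EUC": "cp936", "UniGB-UTF8": "utf-8"},
    "Adobe-Japan1": {"RKSJ": "cp932", "EUC": "euc-jp", "UniJIS-UTF8": "utf-8"},
    "Adobe-Korea1": {
        "KSC-EUC": "euc-kr",
        "KSC-Johab": "johab",
        "KSCms-UHC": "cp949",
        "UniKS-UTF8": "utf-8",
    },
}


def missing_codecs(table=CMAP_CODECS):
    """Return (registry, encoding name, codec name) for every codec Python cannot find."""
    missing = []
    for registry, pairs in table.items():
        for enc_name, codec_name in pairs.items():
            try:
                codecs.lookup(codec_name)
            except LookupError:
                missing.append((registry, enc_name, codec_name))
    return missing
```

An empty list from missing_codecs() means all of the codecs are available.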
PDFMiner comes with two command-line tools. One is pdf2txt.py, which converts the specified PDF to TXT / HTML.
pdf2txt.py -o output.txt input.pdf
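If you have a whole folder of PDFs, for example the documents published by one ministry, you can drive the same command from Python. This is only a minimal sketch: it assumes pdf2txt.py is on your PATH and directly executable (on Windows you may need to run it through python), and the names pdf2txt_command and convert_all are made up for this example.

```python
import subprocess
from pathlib import Path


def pdf2txt_command(pdf_path, out_dir):
    """Build the pdf2txt.py command line that writes <out_dir>/<name>.txt."""
    pdf_path = Path(pdf_path)
    out_path = Path(out_dir) / (pdf_path.stem + ".txt")
    return ["pdf2txt.py", "-o", str(out_path), str(pdf_path)]


def convert_all(pdf_dir, out_dir):
    """Run pdf2txt.py on every *.pdf file directly inside pdf_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # check=True stops at the first PDF that fails to convert.
        subprocess.run(pdf2txt_command(pdf_path, out_dir), check=True)
```

Calling convert_all("pdfs", "txt") would then write one TXT file per PDF into the txt folder.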
The other is dumppdf.py, a debugging tool that dumps the specified contents of a PDF in a pseudo-XML format. It can also be used to extract only specific elements such as images.
dumppdf.py -a foo.pdf
For detailed specifications, see the original article.
To deepen your understanding of the API, read the tree diagram in the API documentation (English) first, and then the sample code. There is also a page that introduces a more detailed example (English). In addition, there already seems to be a Japanese blog article on the subject.
Give it a try, and enjoy your mining life!