Extract and plot the latest population data from the PDF data provided by the city

Cities and wards tend to publish their data as PDF. Here we convert data in this awkward format to plain text with command-line tools and plot it, using Gotemba City as the example. (This reflects the data the city had published as of June 8, 2017.)

Advance preparation

  1. Python: BeautifulSoup
  2. R: zoo
  3. Linux: poppler, parallel, wget

The installation method is as follows.

# pip install bs4
$ R
> install.packages("zoo")
# pacman -S poppler parallel wget

Scripts

get_pdf_links.py


import urllib.request

from bs4 import BeautifulSoup

url = "http://www.city.gotemba.shizuoka.jp/gyousei/g-6/g-6-1/2475.html"
# Send the request with a browser-like User-Agent header.
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
with urllib.request.urlopen(req) as con:
    soup = BeautifulSoup(con.read(), 'html.parser')

# Collect the href of every list item whose text mentions "PDF".
links = []
for item in soup.find_all("li"):
    if "PDF" in item.get_text():
        anchor = item.find("a")
        # Skip list items that mention PDF but contain no usable link.
        if anchor is not None and anchor.get('href'):
            links.append(anchor['href'])

for link in links:
    print(link)
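
The script prints each href exactly as it appears in the page. If the page used relative links, wget would not be able to fetch them directly. The following is a minimal sketch of resolving hrefs against the page URL with the standard library, assuming relative links may occur; the helper name absolutize and the sample paths are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin

# Page the links were scraped from (same URL as in get_pdf_links.py).
BASE_URL = "http://www.city.gotemba.shizuoka.jp/gyousei/g-6/g-6-1/2475.html"


def absolutize(hrefs, base_url=BASE_URL):
    """Resolve each href against the page URL and keep only links ending in .pdf."""
    resolved = [urljoin(base_url, href) for href in hrefs]
    return [u for u in resolved if u.lower().endswith(".pdf")]


if __name__ == "__main__":
    # Hypothetical href values, for illustration only.
    sample = ["/files/example.pdf", "other.pdf", "index.html"]
    for url in absolutize(sample):
        print(url)
```

Absolute URLs pass through urljoin unchanged, so the helper is safe to apply to every scraped link.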

pdf/print_data.py


import os
import re

# Collect the text files that pdftotext produced in this directory.
txt_files = [f for f in os.listdir('.') if f.endswith('.txt')]

# Skip a stray file named just ".txt" if one exists.
if ".txt" in txt_files:
    txt_files.remove(".txt")

# The first line of each file is assumed to be a Japanese era date
# of the form 平成<year>年<month>月... (Heisei year and month).
header_re = re.compile(r'平成([0-9]+)年([0-9]+)月')

data = []
for filename in txt_files:
    year = month = population = None
    with open(filename) as fp:
        for i, line in enumerate(fp):
            if i == 0:
                m = header_re.match(line)
                if m:
                    year, month = int(m.group(1)), int(m.group(2))
            elif i == 554:
                # Line 555 (index 554) holds the population figure in these files.
                population = int(line.strip().replace(",", ""))
    if None in (year, month, population):
        continue  # the layout differs from what this script expects
    # Heisei 1 is 1989, so add 1988 to get the Gregorian year.
    data.append([year + 1988, month, population])

# Sort chronologically by (year, month).
data.sort()

print("date,population")
for year, month, population in data:
    print("{}-{},{}".format(year, month, population))
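
The date handling above can be isolated into one small function. This is a sketch only, assuming the first line of each converted file is a Japanese era date such as 平成29年6月; the names parse_header, HEADER_RE and HEISEI_OFFSET are hypothetical. Unlike the script, it zero-pads the month so that the resulting strings also sort chronologically as text.

```python
import re

# Heisei 1 corresponds to 1989, so Gregorian year = Heisei year + 1988.
HEISEI_OFFSET = 1988

# Assumed header shape: 平成<year>年<month>月 (era name, year, month).
HEADER_RE = re.compile(r"平成([0-9]+)年([0-9]+)月")


def parse_header(line):
    """Return 'YYYY-MM' for a Heisei-era header line, or None if it does not match."""
    m = HEADER_RE.search(line)
    if m is None:
        return None
    year = int(m.group(1)) + HEISEI_OFFSET
    month = int(m.group(2))
    return "{}-{:02d}".format(year, month)


if __name__ == "__main__":
    print(parse_header("平成29年6月1日現在"))
```

Returning None for a non-matching line lets the caller skip or report a file instead of failing inside int().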

pdf/plot_data.R


library(zoo)
data <- read.csv("data.csv", header=T)
z <- read.zoo(data, FUN = as.yearmon)
plot(z)

Script for execution

process.sh


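#!/bin/bash

# Download every PDF linked from the page, in parallel.
python get_pdf_links.py | parallel --gnu "wget {}"
mv *.pdf pdf
cd pdf
# Convert each PDF to text (foo.pdf becomes foo.pdf.txt).
for file in *.pdf; do pdftotext "$file" "$file.txt"; done
# Remove the converted files that could not be processed (see Caution).
rm dd92f76ed99f94259ade29d559663bc1.pdf.txt
rm 7a76d9a16bcc1ce29875b76a6ef12a2e.pdf.txt
python print_data.py > data.csv
Rscript plot_data.R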

The output is Rplots.pdf in the pdf directory.

Output data

Screenshot from 2017-06-08 15-36-54.png

Caution

PDF files are good for printing and easy for people to read, but they can be tedious to parse as plain text. Some PDFs embed a scanned image instead of text, in which case no text can be extracted at all. For that reason, process.sh uses rm to delete the files that could not be processed. I have no workaround for these files.
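
Instead of hard-coding the file names to delete, the unusable files could be detected by checking whether pdftotext produced any visible text. The sketch below assumes that an image-only PDF yields output consisting of whitespace and form feeds only; that may not be the reason the two files above failed. The function names and the min_chars threshold are hypothetical.

```python
import os


def has_extractable_text(text, min_chars=20):
    """Return True if the text has at least min_chars non-whitespace characters."""
    # str.split() with no arguments also drops form feeds and newlines.
    visible = "".join(text.split())
    return len(visible) >= min_chars


def unusable_files(directory=".", min_chars=20):
    """List the *.pdf.txt files in directory that contain almost no text."""
    names = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".pdf.txt"):
            continue
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", errors="replace") as fp:
            if not has_extractable_text(fp.read(), min_chars):
                names.append(name)
    return names


if __name__ == "__main__":
    for name in unusable_files("."):
        print(name)
```

Run inside the pdf directory after the pdftotext loop, this prints the candidates for removal. A file with a different layout but real text would still pass this check.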

Personal request

If the government wants its data to be visualized, it should publish it not only as PDF but also in a plain-text format such as CSV. You cannot get new insights just by looking at aggregated graphs. With the raw numbers available, a much wider range of analysis becomes possible.
