Cities and wards tend to publish their data as PDF. Here we convert data in this awkward format to plain text with the pdftotext command and plot it, using data from Gotemba City as the example. (The data is what the city administration had published as of June 8, 2017.)
Install the required tools as follows.
# pip install bs4
$ R
> install.packages("zoo")
# pacman -S poppler parallel wget
get_pdf_links.py
import urllib.request
from bs4 import BeautifulSoup
import re
url = "http://www.city.gotemba.shizuoka.jp/gyousei/g-6/g-6-1/2475.html"
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib.request.urlopen(req)
soup = BeautifulSoup(con.read(), 'html.parser')
result = soup.find_all("li")
li = []
for link in result:
    if re.match(r'.*PDF.*', link.get_text()) is not None:
        li.append(link.find("a")['href'])
for link in li:
    print(link)
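The href values printed by this script are passed straight to wget, which only works if they are absolute URLs. A minimal sketch, assuming some of them might be relative paths, of resolving each one against the page URL with the standard library's urljoin. The helper name absolute_links is hypothetical and not part of the original script.

```python
from urllib.parse import urljoin

# Page URL taken from get_pdf_links.py
PAGE_URL = "http://www.city.gotemba.shizuoka.jp/gyousei/g-6/g-6-1/2475.html"


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url.

    Relative paths are joined onto the base; hrefs that are already
    absolute URLs pass through unchanged.
    """
    return [urljoin(base_url, href) for href in hrefs]
```

In get_pdf_links.py this would amount to printing absolute_links(url, li) instead of the raw list.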
pdf/print_data.py
import re, os

# Collect the text files produced by pdftotext
txt_files = []
for filename in os.listdir('.'):
    if filename.endswith('txt'):
        txt_files.append(filename)
# Drop a stray ".txt" entry if there is one (list.remove raises ValueError otherwise)
if ".txt" in txt_files:
    txt_files.remove(".txt")

data = []
for filename in txt_files:
    year = None
    month = None
    population = None
    with open(filename) as fp:
        for i, line in enumerate(fp):
            if i == 0:
                # The first line carries the date in Heisei era notation
                year = re.sub(r'Heisei([0-9]+)Year.*$', r'\1', line)
                year = year.replace("\n", "")
                month = re.sub(r'Heisei[0-9]+Year([0-9]+)Month.*$', r'\1', line)
                month = month.replace("\n", "")
            elif i == 554:
                # The population figure sits at line index 554; strip thousands separators
                population = line.replace(",", "")
                population = population.replace("\n", "")
    data.append([int(year), int(month), int(population)])

# Convert the Heisei year to the Gregorian year (Heisei 1 = 1989)
data_fmt = []
for val in data:
    data_fmt.append([val[0]+1988, val[1], val[2]])
data_fmt.sort()

# Emit CSV with a "year-month" date column
data_fmt2 = []
for val in data_fmt:
    data_fmt2.append([str(val[0])+"-"+str(val[1]), val[2]])
print("date,population")
for val in data_fmt2:
    print(val[0]+","+str(val[1]))
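One detail of the script above is worth spelling out: re.sub returns the line unchanged when the pattern does not match, so a file with an unexpected header only fails later, inside int(). A minimal sketch of a stricter header parser that reports a mismatch explicitly. The function name parse_header is hypothetical, and the pattern assumes the header has the same shape as the one used in print_data.py.

```python
import re

# Same header shape as the pattern in print_data.py (an assumption about the input)
HEADER = re.compile(r'Heisei([0-9]+)Year([0-9]+)Month')


def parse_header(line):
    """Return (gregorian_year, month), or None if the line does not match."""
    m = HEADER.search(line)
    if m is None:
        return None
    heisei_year = int(m.group(1))
    month = int(m.group(2))
    # Heisei 1 corresponds to 1989, so add 1988 to get the Gregorian year
    return heisei_year + 1988, month
```

A caller can then skip or log files for which parse_header returns None instead of crashing partway through the run.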
pdf/plot_data.R
library(zoo)
data <- read.csv("data.csv", header=T)
z <- read.zoo(data, FUN = as.yearmon)
plot(z)
process.sh
#!/bin/bash
python get_pdf_links.py | parallel --gnu "wget {}"
mv *.pdf pdf
cd pdf
for file in *.pdf; do pdftotext "$file" "$file.txt"; done
rm dd92f76ed99f94259ade29d559663bc1.pdf.txt
rm 7a76d9a16bcc1ce29875b76a6ef12a2e.pdf.txt
python print_data.py > data.csv
Rscript plot_data.R
The resulting plot is written to Rplots.pdf in the pdf directory.
PDF files are well suited to printing and are easy to read, but they are tedious to parse as plain text. Some PDFs embed a scanned image instead of text, so no text can be extracted from them at all. For that reason, process.sh deletes the text files of the PDFs that could not be processed with rm. There is no workaround for these files.
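Rather than hard-coding the file names to delete, the unusable files could be detected automatically. A minimal sketch, assuming that an image-only PDF leaves pdftotext with nothing but whitespace (such as form feeds) in its output; whether the two files removed in process.sh fail for exactly this reason is not confirmed by the article. The function names and the min_chars threshold are hypothetical.

```python
import os


def has_extractable_text(txt_path, min_chars=1):
    """True if the file holds at least min_chars non-whitespace characters."""
    with open(txt_path, encoding="utf-8", errors="replace") as fp:
        text = fp.read()
    # str.split() with no arguments drops all whitespace, including form feeds
    return len("".join(text.split())) >= min_chars


def usable_txt_files(directory="."):
    """List the .txt files in directory that contain some extracted text."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(".txt")
        and has_extractable_text(os.path.join(directory, name))
    )
```

print_data.py could iterate over usable_txt_files('.') in place of its own directory listing, which would make the two rm lines unnecessary under the assumption above.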
If the government wants its data to be visualized and analyzed, it should publish it not only as PDF but also in a plain-text format such as CSV. Looking at aggregated graphs alone does not yield new insights. Access to the raw numbers makes a much wider range of analysis possible.