Population values recorded for each single year of age are summarized into 10-year age classes.
Set the classes with pandas `cut` and aggregate with `groupby`.
The data is a CSV export of the Excel file "Population by Age" published by the Statistics Bureau of the Ministry of Internal Affairs and Communications.
For ease of use, delete the description lines at the top of the data, the notes at the bottom, and the "100+" and "Unknown" rows. The adjusted file is `population-by-age.csv`.
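If you would rather do this cleanup in code than by hand, a minimal sketch could look like the following. The raw layout (two description lines at the top, one note line at the bottom) and the sample values are made up for illustration; the real file may differ, so adjust `skiprows` and `skipfooter` to match it.

```python
import io

import pandas as pd

# Hypothetical raw export: 2 description lines, header, data rows, 1 note line.
raw_text = """Population by Age (hypothetical description line)
Unit: persons (hypothetical description line)
age,y1920,y2010
0,100,50
1,90,60
2,80,70
100+,1,5
Unknown,0,3
Note: hypothetical footnote line"""

# skipfooter is only supported by the Python parsing engine.
raw = pd.read_csv(io.StringIO(raw_text), skiprows=2, skipfooter=1,
                  engine='python', index_col='age')

# Drop the rows that are not a single year of age, then make the index numeric.
cleaned = raw.drop(['100+', 'Unknown'])
cleaned.index = cleaned.index.astype(int)
print(cleaned)
```

The cleaned frame could then be written out with `cleaned.to_csv('population-by-age.csv')` and read back as shown in the rest of this post.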
First, import the `numpy` and `pandas` modules.
I also added a setting for drawing graphs in IPython.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
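# pd.options.display.mpl_style = 'default' was removed from later pandas
# releases; a matplotlib style sheet is used here instead.
plt.style.use('ggplot')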
Read the CSV file, specifying the first column (`age`) as the index. After reading, check the data types.
df = pd.read_csv('population-by-age.csv', index_col='age')
print(df.dtypes)
y1920 int64
y1930 int64
y1940 int64
y1950 int64
y1960 int64
y1970 int64
y1980 int64
y1990 int64
y2000 int64
y2010 int64
dtype: object
In addition, let's check the first rows, the last rows, and the summary statistics. The output is omitted.
print(df.head(3))
print(df.tail(3))
print(df.describe())
Use `cut` to set the classes.
To change the class width, adjust the third argument of `range`, and keep the step of `np.arange` and the upper bound in the label format (`i + 9`) consistent with it. Whether each class includes its lower or upper edge is specified with options: switch the *include_lowest* and *right* options as needed.
labels = [ "{0} - {1}".format(i, i + 9) for i in range(0, 100, 10) ]
c = pd.cut(df.index, np.arange(0, 101, 10),
           include_lowest=True, right=False,
           labels=labels)
print(df.groupby(c).sum())
            y1920     y1930     y1940     y1950     y1960     y1970     y1980
0 - 9    14314635  16778220  17961607  20728122  17049068  16965066  18547450
10 - 19  11520624  13340649  15816378  17267585  20326076  16921989  17231873
20 - 29   8533259  10367140  11756837  13910662  16527810  19749434  16882381
30 - 39   7020188   7798498   9370143  10250310  13555835  16578939  19973312
40 - 49   5902331   6332741   7041270   8487529   9835689  13217564  16427887
50 - 59   4074855   5046797   5446760   6137697   7842597   9230197  12813527
60 - 69   2968342   2977915   3782574   4074610   5092019   6709761   8429928
70 - 79   1378630   1478319   1541314   1967261   2518482   3401952   5059662
80 - 89    236419    315624    338472    354836    638738    879221   1503633
90 - 99     13657     13997     18567     16258     32043     65629    118391

            y1990     y2000     y2010
0 - 9    13959454  11925887  10882409
10 - 19  18533872  14034777  11984392
20 - 29  16870834  18211769  13720134
30 - 39  16791465  16891475  18127846
40 - 49  19676302  16716227  16774981
50 - 59  15813274  19176162  16308233
60 - 69  11848590  14841772  18247422
70 - 79   6835747  10051176  12904315
80 - 89   2665908   4147012   6768852
90 - 99    286141    688769   1318463
The values recorded for each single year of age are now aggregated into 10-year classes.
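To see what the *right* and *include_lowest* options change, it can help to try `cut` on a few ages that sit on class boundaries. The ages and bin edges below are toy values chosen for illustration, not part of the population data, and the category codes (0 for the first class, -1 for "no class") are printed instead of the interval labels.

```python
import numpy as np
import pandas as pd

ages = np.array([0, 9, 10, 19, 20])   # toy ages on and around the edges
edges = np.arange(0, 31, 10)          # 0, 10, 20, 30

# Classes closed on the left: [0, 10), [10, 20), [20, 30)
left_closed = pd.cut(ages, edges, right=False)

# Classes closed on the right: (0, 10], (10, 20], (20, 30]
right_closed_no_lowest = pd.cut(ages, edges, right=True)

# Same, but the lowest edge is also included in the first class.
right_closed = pd.cut(ages, edges, right=True, include_lowest=True)

print(pd.DataFrame({
    'age': ages,
    'right=False': left_closed.codes,
    'right=True': right_closed_no_lowest.codes,
    'right=True, include_lowest=True': right_closed.codes,
}))
```

With `right=False`, age 10 falls into the second class, which matches labels such as 0 - 9 and 10 - 19. With `right=True` and no *include_lowest*, age 0 is left without a class.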
Aggregate functions other than `sum` can be specified, and several aggregate functions can be given at once.
Let's check the following results.
print(df.groupby(c).agg(['count', 'min', 'max', 'mean', 'std']))
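When a list of functions is passed to `agg`, the result has two levels of columns: the original column name and the function name. The small frame below is made-up data (the names `toy` and `groups` are introduced only for this sketch) to show how one of those results can be picked out.

```python
import pandas as pd

# Toy data: four ages, two census years (values are invented).
toy = pd.DataFrame({'y2000': [5, 7, 9, 11], 'y2010': [4, 8, 10, 14]},
                   index=pd.Index([0, 1, 10, 11], name='age'))

groups = pd.cut(toy.index, [0, 10, 20], right=False,
                labels=['0 - 9', '10 - 19'])

# observed=False keeps every class, even one with no rows in it.
summary = toy.groupby(groups, observed=False).agg(['count', 'sum', 'mean'])
print(summary)

# A single (column, function) pair is selected with a tuple.
print(summary[('y2000', 'mean')])
```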
Since it is difficult to see the relationships from the numbers alone, draw a graph to get an overview. Place two plots side by side to compare the effect of the *stacked* option.
fig, axes = plt.subplots(ncols=2)
df.groupby(c).sum().plot(kind='bar', ax=axes[0])
df.groupby(c).sum().T.plot(kind='bar', stacked=True, ax=axes[1])
Looking at the numbers by 10-year age class, the population aged 60 and over has been increasing in recent years, while the young population is declining. The stacked graph shows that the total population grew steadily from 1920 to 2000, but that growth had nearly stalled by 2010. As for the age distribution, the share taken by the upper segments of each bar (the older age classes) is increasing.
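The shift toward the older classes can also be checked numerically. As a rough sketch, the snippet below takes the y1920 and y2010 columns from the aggregated table printed earlier and converts them to percentage shares. The names `totals`, `shares`, and `elderly` are introduced here for illustration.

```python
import pandas as pd

labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

# Two columns copied from the aggregated table shown earlier.
totals = pd.DataFrame({
    'y1920': [14314635, 11520624, 8533259, 7020188, 5902331,
              4074855, 2968342, 1378630, 236419, 13657],
    'y2010': [10882409, 11984392, 13720134, 18127846, 16774981,
              16308233, 18247422, 12904315, 6768852, 1318463],
}, index=labels)

# Share of each age class within its census year, in percent.
shares = totals / totals.sum() * 100
print(shares.round(1))

# Combined share of the classes aged 60 and over.
elderly = shares.loc[['60 - 69', '70 - 79', '80 - 89', '90 - 99']].sum()
print(elderly.round(1))
```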
Now that we have looked at the general trends in the aggregated data, let's draw each series of the original data frame. Plotting everything on one graph would be cluttered, so draw a separate graph for each census year. This time the `axes` variable is a two-dimensional array, so be careful when indexing it.
fig, axes = plt.subplots(nrows=5, ncols=2)
# Left column: 1920 to 1960. Only the bottom graph keeps its x axis.
for i, y in enumerate(['y1920', 'y1930', 'y1940', 'y1950', 'y1960']):
    df[y].plot(ax=axes[i, 0])
    axes[i, 0].set_title(y)
    if y != 'y1960':
        axes[i, 0].get_xaxis().set_visible(False)
# Right column: 1970 to 2010.
for i, y in enumerate(['y1970', 'y1980', 'y1990', 'y2000', 'y2010']):
    df[y].plot(ax=axes[i, 1])
    axes[i, 1].set_title(y)
    if y != 'y2010':
        axes[i, 1].get_xaxis().set_visible(False)
Looking at the individual graphs, you can see the impact of the baby boom. You can also see that the number of births has decreased since the second baby boom, and that the elderly end of the distribution has widened since 1970 (life expectancy has increased).
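As a side note on indexing the two-dimensional `axes` array: the 5 x 2 grid of graphs could also be filled with a single loop by computing the row and column from the position of each year. This is only one possible sketch. It uses made-up data in a frame called `demo` instead of the population file, selects the non-interactive Agg backend so it runs without a display, and relies on `sharex=True` rather than hiding each x axis by hand.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

years = ['y%d' % y for y in range(1920, 2011, 10)]  # y1920 ... y2010
ages = np.arange(100)

# Made-up series, one per year, just to have something to draw.
demo = pd.DataFrame({y: (k + 1) * (100 - ages) for k, y in enumerate(years)},
                    index=ages)

fig, axes = plt.subplots(nrows=5, ncols=2, sharex=True)
for k, y in enumerate(years):
    row, col = k % 5, k // 5   # fill the left column first, then the right
    demo[y].plot(ax=axes[row, col])
    axes[row, col].set_title(y)
fig.tight_layout()
```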