This is the story of what to do when you want a table with the year, region, sub-region, country, country code, and population.
LocID,Location,Time,PopTotal,SubRegName,GeoRegName
392,Japan,2019,126860299,Eastern Asia,Asia
You can download the required data from https://population.un.org/wpp/.
The contents of WPP2019_TotalPopulationBySex.csv look like this.
LocID,Location,VarID,Variant,Time,MidPeriod,PopMale,PopFemale,PopTotal,PopDensity
...
392,Japan,2,Medium,1950,1950.5,40602.499,42199.585,82802.084,227.132
392,Japan,2,Medium,1951,1951.5,41380.556,42935.709,84316.265,231.285
This LocID and Location items include region names such as Asia and East Asia in addition to country names, so they cannot be used as they are. Refer to WPP2019_F01_LOCATIONS.XLSX to align only to countries except region names. In this file
There are three sheets, of which the DB is just machine readable. For example, the item of Japan is
Index, Location, Notes, LocID, ISO3_Code, LocType, LocTypeName, ParentID, WorldID, SubRegID, SubRegName, SDGSubRegID, SDGSubRegName, SDGRegID, SDGRegName, GeoRegID, GeoRegName
133, Japan, 392, JPN, 4, Country/Area, 906, 900, 906, Eastern Asia, 1832, Eastern and South-Eastern Asia, 935, Asia
is. So you can see the following.
(Notebook link: https://colab.research.google.com/drive/160xZ5tAGKb1enC0LU2JYEOA6m3l3w1cn?usp=sharing)
After investigating up to this point, the work is finally started. First, load WPP2019_TotalPopulationBySex.csv.
import pandas as pd
population_src = pd.read_csv("WPP2019_TotalPopulationBySex.csv")
population_src.head()
LocID | Location | VarID | Variant | Time | MidPeriod | PopMale | PopFemale | PopTotal | PopDensity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 4099.243 | 3652.874 | 7752.117 | 11.874 |
1 | 4 | Afghanistan | 2 | Medium | 1951 | 1951.5 | 4134.756 | 3705.395 | 7840.151 | 12.009 |
2 | 4 | Afghanistan | 2 | Medium | 1952 | 1952.5 | 4174.450 | 3761.546 | 7935.996 | 12.156 |
3 | 4 | Afghanistan | 2 | Medium | 1953 | 1953.5 | 4218.336 | 3821.348 | 8039.684 | 12.315 |
4 | 4 | Afghanistan | 2 | Medium | 1954 | 1954.5 | 4266.484 | 3884.832 | 8151.316 | 12.486 |
Extract only the necessary information.
population = population_src[population_src.Variant == "Medium"][["LocID", "Location", "Time", "PopTotal"]]
population["PopTotal"] = (population["PopTotal"] * 1000).astype(int)
population
LocID | Location | Time | PopTotal | |
---|---|---|---|---|
0 | 4 | Afghanistan | 1950 | 7752116 |
1 | 4 | Afghanistan | 1951 | 7840151 |
2 | 4 | Afghanistan | 1952 | 7935996 |
3 | 4 | Afghanistan | 1953 | 8039684 |
4 | 4 | Afghanistan | 1954 | 8151316 |
Loading WPP2019_F01_LOCATIONS.XLSX.
locations_src = pd.read_excel('WPP2019_F01_LOCATIONS.XLSX', sheet_name="DB")
locations_src.head()
Index | Location | Notes | LocID | ISO3_Code | LocType | LocTypeName | ParentID | WorldID | SubRegID | SubRegName | SDGSubRegID | SDGSubRegName | SDGRegID | SDGRegName | GeoRegID | GeoRegName | MoreDev | LessDev | LeastDev | oLessDev | LessDev_ExcludingChina | LLDC | SIDS | WB_HIC | WB_MIC | WB_UMIC | WB_LMIC | WB_LIC | WB_NoIncomeGroup | MaxHIV_Male | MaxHIV_Female | MaxHIV_BothSexes | YearMaxHIV_BothSexes | HIVAIDSMortalityImpact_AgePattern | HIVAIDSMortalityImpact_e0 | TotPop2019LessThan90k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | WORLD | NaN | 900 | NaN | NaN | NaN | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2 | UN development groups | a | 1803 | NaN | 25.0 | Label/Separator | 900 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 | More developed regions | b | 901 | NaN | 5.0 | Development group | 1803 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4 | Less developed regions | c | 902 | NaN | 5.0 | Development group | 1803 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5 | Least developed countries | d | 941 | NaN | 5.0 | Development group | 902 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Extract only the necessary information.
location = locations_src[locations_src.LocType == 4][["LocID", "SubRegName", "GeoRegName"]]
location.head()
LocID | SubRegName | GeoRegName | |
---|---|---|---|
26 | 108 | Eastern Africa | Africa |
27 | 174 | Eastern Africa | Africa |
28 | 262 | Eastern Africa | Africa |
29 | 232 | Eastern Africa | Africa |
30 | 231 | Eastern Africa | Africa |
Combine population and country data.
population_by_countries = population.merge(location)
population_by_countries.head()
LocID | Location | Time | PopTotal | SubRegName | GeoRegName | |
---|---|---|---|---|---|---|
0 | 4 | Afghanistan | 1950 | 7752116 | Southern Asia | Asia |
1 | 4 | Afghanistan | 1951 | 7840151 | Southern Asia | Asia |
2 | 4 | Afghanistan | 1952 | 7935996 | Southern Asia | Asia |
3 | 4 | Afghanistan | 1953 | 8039684 | Southern Asia | Asia |
4 | 4 | Afghanistan | 1954 | 8151316 | Southern Asia | Asia |
Let's look for Japanese data.
population_by_countries[(population_by_countries.Location == "Japan") & (population_by_countries.Time == 2019)]
LocID | Location | Time | PopTotal | SubRegName | GeoRegName | |
---|---|---|---|---|---|---|
16377 | 392 | Japan | 2019 | 126860299 | Eastern Asia | Asia |
It looks good so I will save it.
population_by_countries.to_csv("population_by_countries.csv", index=False)
!head population_by_countries.csv
LocID,Location,Time,PopTotal,SubRegName,GeoRegName
4,Afghanistan,1950,7752116,Southern Asia,Asia
4,Afghanistan,1951,7840151,Southern Asia,Asia
4,Afghanistan,1952,7935996,Southern Asia,Asia
4,Afghanistan,1953,8039684,Southern Asia,Asia
4,Afghanistan,1954,8151316,Southern Asia,Asia
4,Afghanistan,1955,8270992,Southern Asia,Asia
4,Afghanistan,1956,8398873,Southern Asia,Asia
4,Afghanistan,1957,8535157,Southern Asia,Asia
4,Afghanistan,1958,8680097,Southern Asia,Asia
So, I'll record here that I've spent a lot of time just to get the mundane data of the world population.