I want to know the population of each country in the world.

Target

This is the story of what to do when you want a table with the year, region, sub-region, country, country code, and population.

LocID,Location,Time,PopTotal,SubRegName,GeoRegName
392,Japan,2019,126860299,Eastern Asia,Asia

Conclusion

You can download the required data from https://population.un.org/wpp/.

Download population data from All variants (CSV, 21.35 MB) WPP2019_TotalPopulationBySex.csv can.
You can download the country code data from Locations (XLSX, 131 KB) WPP2019_F01_LOCATIONS.XLSX.

The contents of WPP2019_TotalPopulationBySex.csv look like this.

LocID,Location,VarID,Variant,Time,MidPeriod,PopMale,PopFemale,PopTotal,PopDensity
...
392,Japan,2,Medium,1950,1950.5,40602.499,42199.585,82802.084,227.132
392,Japan,2,Medium,1951,1951.5,41380.556,42935.709,84316.265,231.285

LocID: Country code with ISO 3166-1 numbers
Location: The name of the country or region
VarID, Variant: Predicts future values with different fertility, mortality and immigration rates. The standard is Medium.
Time, MidPeriod: Year
PopMale, PopFemale, PopTotal: Gender and total population (1000)
PopDensity: Population density per 1km ^ 2 (1000 people)

This LocID and Location items include region names such as Asia and East Asia in addition to country names, so they cannot be used as they are. Refer to WPP2019_F01_LOCATIONS.XLSX to align only to countries except region names. In this file

Location
DB
NOTES

There are three sheets, of which the DB is just machine readable. For example, the item of Japan is

Index, Location, Notes, LocID, ISO3_Code, LocType, LocTypeName, ParentID, WorldID, SubRegID, SubRegName, SDGSubRegID, SDGSubRegName, SDGRegID, SDGRegName, GeoRegID, GeoRegName
133, Japan, 392, JPN, 4, Country/Area, 906, 900, 906, Eastern Asia, 1832, Eastern and South-Eastern Asia, 935, Asia

is. So you can see the following.

The item LocType = 4 or LocTypeName = Country / Area represents a country.
SubRegName represents the subregion of the country. (Example: Eastern Asia)
GeoRegName represents the region of the country. (Example: Asia)

Create data with Jupyter

(Notebook link: https://colab.research.google.com/drive/160xZ5tAGKb1enC0LU2JYEOA6m3l3w1cn?usp=sharing)

After investigating up to this point, the work is finally started. First, load WPP2019_TotalPopulationBySex.csv.

import pandas as pd

population_src = pd.read_csv("WPP2019_TotalPopulationBySex.csv")
population_src.head()

	LocID	Location	VarID	Variant	Time	MidPeriod	PopMale	PopFemale	PopTotal	PopDensity
0	4	Afghanistan	2	Medium	1950	1950.5	4099.243	3652.874	7752.117	11.874
1	4	Afghanistan	2	Medium	1951	1951.5	4134.756	3705.395	7840.151	12.009
2	4	Afghanistan	2	Medium	1952	1952.5	4174.450	3761.546	7935.996	12.156
3	4	Afghanistan	2	Medium	1953	1953.5	4218.336	3821.348	8039.684	12.315
4	4	Afghanistan	2	Medium	1954	1954.5	4266.484	3884.832	8151.316	12.486

Extract only the necessary information.

population = population_src[population_src.Variant == "Medium"][["LocID", "Location", "Time", "PopTotal"]]
population["PopTotal"] = (population["PopTotal"] * 1000).astype(int)
population

	LocID	Location	Time	PopTotal
0	4	Afghanistan	1950	7752116
1	4	Afghanistan	1951	7840151
2	4	Afghanistan	1952	7935996
3	4	Afghanistan	1953	8039684
4	4	Afghanistan	1954	8151316

Loading WPP2019_F01_LOCATIONS.XLSX.

locations_src = pd.read_excel('WPP2019_F01_LOCATIONS.XLSX', sheet_name="DB")
locations_src.head()

	Index	Location	Notes	LocID	ISO3_Code	LocType	LocTypeName	ParentID	WorldID	SubRegID	SubRegName	SDGSubRegID	SDGSubRegName	SDGRegID	SDGRegName	GeoRegID	GeoRegName	MoreDev	LessDev	LeastDev	oLessDev	LessDev_ExcludingChina	LLDC	SIDS	WB_HIC	WB_MIC	WB_UMIC	WB_LMIC	WB_LIC	WB_NoIncomeGroup	MaxHIV_Male	MaxHIV_Female	MaxHIV_BothSexes	YearMaxHIV_BothSexes	HIVAIDSMortalityImpact_AgePattern	HIVAIDSMortalityImpact_e0	TotPop2019LessThan90k
0	1	WORLD	NaN	900	NaN	NaN	NaN	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2	UN development groups	a	1803	NaN	25.0	Label/Separator	900	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	3	More developed regions	b	901	NaN	5.0	Development group	1803	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	4	Less developed regions	c	902	NaN	5.0	Development group	1803	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	5	Least developed countries	d	941	NaN	5.0	Development group	902	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Extract only the necessary information.

location = locations_src[locations_src.LocType == 4][["LocID", "SubRegName", "GeoRegName"]]
location.head()

	LocID	SubRegName	GeoRegName
26	108	Eastern Africa	Africa
27	174	Eastern Africa	Africa
28	262	Eastern Africa	Africa
29	232	Eastern Africa	Africa
30	231	Eastern Africa	Africa

Combine population and country data.

population_by_countries = population.merge(location)
population_by_countries.head()

	LocID	Location	Time	PopTotal	SubRegName	GeoRegName
0	4	Afghanistan	1950	7752116	Southern Asia	Asia
1	4	Afghanistan	1951	7840151	Southern Asia	Asia
2	4	Afghanistan	1952	7935996	Southern Asia	Asia
3	4	Afghanistan	1953	8039684	Southern Asia	Asia
4	4	Afghanistan	1954	8151316	Southern Asia	Asia

Let's look for Japanese data.

population_by_countries[(population_by_countries.Location == "Japan") & (population_by_countries.Time == 2019)]

	LocID	Location	Time	PopTotal	SubRegName	GeoRegName
16377	392	Japan	2019	126860299	Eastern Asia	Asia

It looks good so I will save it.

population_by_countries.to_csv("population_by_countries.csv", index=False)
!head population_by_countries.csv

LocID,Location,Time,PopTotal,SubRegName,GeoRegName
4,Afghanistan,1950,7752116,Southern Asia,Asia
4,Afghanistan,1951,7840151,Southern Asia,Asia
4,Afghanistan,1952,7935996,Southern Asia,Asia
4,Afghanistan,1953,8039684,Southern Asia,Asia
4,Afghanistan,1954,8151316,Southern Asia,Asia
4,Afghanistan,1955,8270992,Southern Asia,Asia
4,Afghanistan,1956,8398873,Southern Asia,Asia
4,Afghanistan,1957,8535157,Southern Asia,Asia
4,Afghanistan,1958,8680097,Southern Asia,Asia

Bonus hardship story

First, when I looked up the population-ordered list of countries on the Japanese version of Wikipedia and WolframAlpha, a site called http://data.un.org/ appeared, so I decided to pull it from here.
The CSV in the Population, surface area and density links doesn't feel machine readable.
Apparently this is an excerpt from a book called Statistical Yearbook.
According to Annex I of the Yearbook, the country code and country name of this data is called United Nations Standard Country Codes M 49. A description of the M49 can be found at https://unstats.un.org/unsd/methodology/m49/. According to this, it has the following hierarchical structure.
- Global Name: World (1)
- Region Name: Asia (142)
- Sub-region Name: Eastern Asia (30)
Intermediate Region Name: None
- Country or Area: Japan (392)
M49 seems to be the standard that was the basis of ISO 3166-1.
The M49 machine-readable file cannot be found. There is a CSV link, but the country name has a comma and the column is out of alignment.
I felt uncomfortable.
Looking for another example, datahub.io used data from data.worldbank.org.
Example code to actually download: * https://github.com/datasets/population/blob/master/scripts/process.py
Query: http://api.worldbank.org/countries/all/indicators/SP.POP.TOTL?format=csv
- World Bank: http://data.worldbank.org/indicator/SP.POP.TOTL
Searching for the word world population will bring up https://www.worldometers.info/world-population/.
I used https://population.un.org/wpp/ as the material!
This seems to be safer as data to download and use.

So, I'll record here that I've spent a lot of time just to get the mundane data of the world population.