Purpose

I drew graphs in Python using the data on patients who tested positive for the new coronavirus (COVID-19), released by the Tokyo Metropolitan Government.

I kept the code to the minimum necessary, so I hope it will be helpful for those who are planning to start data analysis with Python.

The code reads the public CSV data, which the Tokyo Metropolitan Government updates daily, directly from the web, so there is no need to download the CSV file each time.

If you copy the following Python code to your own execution environment (Jupyter Notebook etc.), you can draw graphs of the latest information every time you run it.

I have also added a link to nationwide CSV data for Japan later in this article; practicing with it should help you build your skills.

Python runtime environment

The Python code in this article has been tested using Jupyter Lab on a Windows 10 machine with Anaconda installed.

Data source

CSV data

The data graphed in this article is the following CSV data. It is updated daily with the results up to the previous day. Tokyo Metropolitan Government_New Coronavirus Positive Patient Announcement Details (CSV Format)

Home page

The following is the home page that links to the CSV data. Details of Tokyo Metropolitan Government_New Coronavirus Positive Patient Announcement

Graphing with Python

Now let's draw graphs in Python using the CSV data.

First, read the data

First, use the Python code below to connect to the Tokyo Metropolitan Government site, fetch the latest data (CSV format), and convert it to a pandas DataFrame.

The point here is that the CSV file is not saved to a local folder but converted directly into a pandas DataFrame (df). Just by running the code below, you are spared the trouble of opening a browser and downloading the latest version of the CSV file, which is updated daily.

import requests
import pandas as pd
import io

#Read the CSV directly into a pandas DataFrame (no local file is saved)
url = 'https://stopcovid19.metro.tokyo.lg.jp/data/130001_tokyo_covid19_patients.csv'
r = requests.get(url).content                     #Raw bytes of the CSV file
df = pd.read_csv(io.StringIO(r.decode('utf-8')))  #Decode as UTF-8 and parse as CSV
df

When the data is loaded successfully, the contents of DataFrame (df) should be displayed.
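For reference, here is a minimal sketch of the same loading step with a little error handling added. It assumes the file is UTF-8 encoded, as in the code above. The helper names `fetch_csv_bytes` and `bytes_to_dataframe` and the 30-second timeout are hypothetical choices made for illustration, not part of any library. The last lines try the parser on a tiny made-up CSV, so they run without a network connection.

```python
import io

import pandas as pd
import requests


def fetch_csv_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw CSV bytes; raise an exception on an HTTP error status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


def bytes_to_dataframe(raw: bytes, encoding: str = "utf-8-sig") -> pd.DataFrame:
    """Parse CSV bytes into a DataFrame without saving a file locally.

    'utf-8-sig' decodes plain UTF-8 and also strips a byte-order mark if present.
    """
    return pd.read_csv(io.StringIO(raw.decode(encoding)))


# Offline check with a tiny in-memory CSV (made-up rows, not real data)
sample = "No,Published_date\n1,2020-01-24\n2,2020-01-25\n".encode("utf-8")
df_sample = bytes_to_dataframe(sample)
print(df_sample.shape)
```

With a network connection, `df = bytes_to_dataframe(fetch_csv_bytes(url))` should give the same kind of DataFrame as the code above, while failing loudly if the download itself goes wrong.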

Transition graph of newly infected people

I will draw graphs using the DataFrame (df) read above. First, a bar graph with the date on the horizontal axis and the number of infected people on the vertical axis. Continue by executing the following code.

(Postscript, 5/15) Since the original CSV data is no longer in chronological order, I added a line near the middle of the code below that sorts the data by Published_date.

#Import matplotlib.pyplot and seaborn for drawing graphs
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Draw a graph
plt.figure(figsize=(13,7))                   #Define the size of the graph

sns.set(font='Yu Gothic', font_scale = 1.2)  #Specify a Japanese font so that Japanese characters are not garbled

df = df.sort_values('Published_date')           #Sort the data by Published_date (5/15 postscript)

sns.countplot(data=df, x='Published_date')      #Count the rows for each date and draw the bar graph with seaborn

plt.title('COVID-19 Changes in the number of newly infected people @ Tokyo')
plt.xticks(rotation=90, fontsize=10)         #Rotate the x-axis date labels 90° so that they do not overlap
plt.ylabel('Number of infected people (persons)')                   #Set the y-axis label

Were you able to draw the graph? The numbers look as though they have settled down, but I wonder what will happen from here.
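One thing to keep in mind is that a count plot treats each date as a category, so days with no reported cases simply do not appear on the axis. As a rough sketch of one way around this (a different approach from the countplot above, shown on made-up sample rows), the dates can be parsed, counted, and then reindexed onto every calendar day. The column name `Published_date` follows the one used in this article; `df_demo` and `daily` are names introduced here for illustration.

```python
import pandas as pd

# Made-up sample rows (not real data); note the gap between 01-25 and 01-30
df_demo = pd.DataFrame(
    {"Published_date": ["2020-01-24", "2020-01-24", "2020-01-25", "2020-01-30"]}
)

# Parse the strings as dates so that sorting is chronological
dates = pd.to_datetime(df_demo["Published_date"])

# Count the rows for each date, then sort by date
daily = dates.value_counts().sort_index()

# Reindex onto every calendar day in the range; days without cases become 0
all_days = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(all_days, fill_value=0)

print(daily)
```

The resulting Series can then be drawn as a bar chart, for example with `daily.plot.bar()`, and the days without cases show up as empty bars instead of disappearing.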

Graph of the number of infected people by day of the week

Next, using the same kind of bar graph, let's put the day of the week on the horizontal axis.

#Draw a graph of the number of infected people by day of the week
sns.countplot(data=df, x="Day of the week")            #Draw a graph
plt.title('Number of newly infected people by day of the week @ Tokyo')    #Show graph title
plt.ylabel('Number of infected people (persons)')                  #Show title on vertical axis

It was easy to draw, but the order of the days of the week is strange.

To sort the days of the week, rewrite as follows.

#Rearrange the horizontal axis of the graph and draw the graph again
list_weekday = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']     #Make a list showing the order of the horizontal axis
sns.countplot(data=df, x="Day of the week",order=list_weekday)     #Draw a graph
plt.title('Number of newly infected people by day of the week @ Tokyo')    #Show graph title
plt.ylabel('Number of infected people (persons)')                  #Show title on vertical axis

The bars are now sorted by day of the week. The counts appear to be high on Fridays and Saturdays, toward the weekend, and low on Sundays and Mondays.
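As an alternative to passing `order=` every time, the order can be stored in the column itself as an ordered categorical. The following is only a sketch on made-up sample rows. The English day labels and the name `weekday_order` are assumptions for illustration, so replace them with the labels that actually appear in your data.

```python
import pandas as pd

# Made-up sample rows (not real data)
df_demo = pd.DataFrame(
    {"Day of the week": ["Fri", "Mon", "Sun", "Fri", "Wed", "Mon", "Fri"]}
)

weekday_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Store the display order in the column as an ordered categorical
df_demo["Day of the week"] = pd.Categorical(
    df_demo["Day of the week"], categories=weekday_order, ordered=True
)

# Counting a categorical column includes every category, even those with no rows;
# sort_index() then puts the result in category order
counts = df_demo["Day of the week"].value_counts().sort_index()
print(counts)
```

Once the column is categorical, aggregations and sorts follow the Monday-to-Sunday order without any extra arguments.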

Graph of the number of infected men and women

Next, let's look at the breakdown by gender.

#Draw a graph of the number of infected people by gender
sns.countplot(data=df, x="patient_sex")      #Draw a graph
plt.title('Number of new infections by gender @ Tokyo')    #Show graph title
plt.ylabel('Number of infected people (persons)')                 #Show title on vertical axis

As reported every day, there are more men. More notably, the data turned out to include the items "under investigation" and "unknown" in addition to "male" and "female". Unexpected discoveries like this are common in data analysis. Just to be sure, let's aggregate the patient_sex data with pivot_table. It takes a single line against the original data.

#Aggregate the patient_sex data
df.pivot_table(index='patient_sex',aggfunc='size').sort_values(ascending=False)

The tabulation result (the count of each item) should be displayed.

In other words, in addition to "male" and "female", the patient_sex column contains six "unknown" entries and one "under investigation" entry.

Unexpected items turn up all the time when analyzing data, so it is very important to learn not only graph visualization but also data aggregation and preprocessing techniques.
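As a small illustration of this kind of preprocessing, the sketch below first counts every distinct value and then folds anything outside the expected labels into a single bucket. The sample rows are made up, the labels follow the English renderings used in this article, and the bucket name 'Other' is an arbitrary choice.

```python
import pandas as pd

# Made-up sample rows (not real data)
df_demo = pd.DataFrame(
    {"patient_sex": ["Male", "Female", "Male", "Unknown", "Under investigation", "Male"]}
)

# Count every distinct value, including missing ones, before plotting
print(df_demo["patient_sex"].value_counts(dropna=False))

# Keep the expected labels and map everything else to a single 'Other' bucket
expected = ["Male", "Female"]
sex_clean = df_demo["patient_sex"].where(df_demo["patient_sex"].isin(expected), "Other")
print(sex_clean.value_counts())
```

Whether to merge such items, drop them, or keep them separate depends on the purpose of the analysis, so this is just one possible treatment.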

Graph of the number of infected people by age

Next, by age group ...

list_age = ['Under 10 years old','10s','20s','30s','40s','50s','60s','70s','80s','90s','100 years and over','unknown']
sns.countplot(data=df, x="patient_Age", order=list_age)
plt.xticks(rotation=90)
plt.ylabel('Number of infected people (persons)')

Looking at it this way, quite apart from the proportion of elderly people in their 60s and above, the number of infected people in their 20s and 30s seems large relative to the population. (It may be more accurate to say that the proportion of people in their 40s and 50s is small relative to the population.)

Graph of population by age group in Tokyo (reference)

For reference, the population of Tokyo by age group (as of January 1, 2020, Reiwa 2) is shown below. Source: Tokyo's households and population (by town and age) based on the Basic Resident Register.

Age group	Total population	Male population	Female population
Under 10 years old	1,048,921	536,920	512,001
10s	1,029,680	526,065	503,615
20s	1,557,966	779,053	778,913
30s	1,842,086	939,710	902,376
40s	2,177,935	1,108,561	1,069,374
50s	1,832,946	946,158	886,788
60s	1,373,395	688,654	684,741
70s	1,414,012	645,774	768,238
80s	794,805	304,309	490,496
90s and over	185,849	47,609	138,240
unknown	1	0	1

Number of infected people per population by age group (total for men and women)

Dividing the number of infected people in each age group by the population of that age group gives the number of infected people per 100,000 people, which is what the graph here compares. I was a little surprised. People in their 90s and above stand out overwhelmingly, followed by those in their 20s, their 30s, and their 40s to 80s.
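The calculation itself is simple: cases per 100,000 people = (number of cases / population) × 100,000. Here is a minimal sketch for three age groups. The population figures are the totals from the table above, while the case counts are made-up placeholder numbers used only to show the arithmetic, not real results.

```python
import pandas as pd

# Total population by age group, taken from the table above
population = pd.Series({"20s": 1_557_966, "30s": 1_842_086, "40s": 2_177_935})

# Placeholder case counts (made-up numbers for illustration, NOT real data)
cases = pd.Series({"20s": 800, "30s": 900, "40s": 850})

# pandas aligns the two Series on their index labels before dividing
rate_per_100k = (cases / population * 100_000).round(1)
print(rate_per_100k)
```

With the real data, `cases` could come from something like `df['patient_Age'].value_counts()`. Note that the population table has a single "90s and over" row, so the "90s" and "100 years and over" case counts would need to be added together first.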

Number of infected people per population by age group (by gender)

Splitting this further into men and women gives the following.

Male	Female

This is also a surprising result. I had thought there were simply many infected people in their 20s, but it is among women that the 20s and 30s tend to have more infections. I don't know the cause, but it is a somewhat worrying result.

Heatmap by age and date

Next, let's look at a heatmap by age and date.

#Create a pivot table with Published_date as rows and patient_Age as columns
df_pivot = df[['Published_date','patient_Age']].pivot_table(index='Published_date',columns='patient_Age',aggfunc='size')

#List the patient_Age categories in display order (used to order the columns of the heatmap)
list_age = ['Under 10 years old','10s','20s','30s','40s','50s','60s','70s','80s','90s','100 years and over','unknown']

plt.figure(figsize=(6,16))                         #Define the size of the graph
plt.yticks(fontsize = 10)                          #Define y-axis font size

sns.heatmap(df_pivot[list_age], annot = True, annot_kws={"size": 10}, linewidths = .1)    #Draw heatmap

It looks the part, but it leaves me with a feeling of "so what?" (-_-;) It seems that other information could be extracted, so I will continue the analysis little by little.
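One caveat with selecting columns by a fixed list, as in `df_pivot[list_age]`, is that pandas raises a KeyError if one of the listed age groups never occurs in the data. As a hedged sketch of a more forgiving variant, the code below swaps in `pd.crosstab` in place of `pivot_table` and uses `reindex`, so missing groups become all-zero columns. The sample rows are made up and `age_order` is a shortened list for illustration.

```python
import pandas as pd

# Made-up sample rows (not real data); column names follow this article
df_demo = pd.DataFrame(
    {
        "Published_date": ["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02"],
        "patient_Age": ["20s", "30s", "20s", "20s"],
    }
)

age_order = ["20s", "30s", "40s"]  # '40s' does not occur in the sample

# Cross-tabulate dates (rows) against age groups (columns); empty cells are 0
table = pd.crosstab(df_demo["Published_date"], df_demo["patient_Age"])

# reindex adds the missing age group as an all-zero column instead of raising KeyError
table = table.reindex(columns=age_order, fill_value=0)
print(table)
```

A table built this way can be passed to `sns.heatmap` in the same way as `df_pivot[list_age]`, with zeros shown where the pivot table would otherwise hold NaN.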

About data trimming

Up to this point I had not checked the contents of the raw data (CSV) at all. Since the code at the beginning already converted the CSV data into a DataFrame (df), let's display its contents again with the following command.

df

There are 4,883 rows of data (as of May 12, 2020), but there seem to be many NaN values, which indicate blanks. To be on the safe side, let's take a look at the unique values contained in each column. Try running the code below.

#For each column of the DataFrame (df) holding the CSV data, print the column name and the unique values stored in it
for i in df.columns:                                       #Repeat for each column
    print('Column name:' + i)                                    #Print the name of the column
    print('Number of unique values:' + str(len(df[i].unique())))    #Count the number of unique values in each column
    print('Unique value:' + str(df[i].unique()))             #Extract unique values for each column
    print('///////////////////////////////////////////')   #Separator

The result is long, so I have collected it below.

Execution result

Column name: No
Number of unique values: 4987
Unique value: [1 2 3 ... 10109 10110 10111]
///////////////////////////////////////////
Column name: National local government code
Number of unique values: 1
Unique value: [130001]
///////////////////////////////////////////
Column name: Prefecture name
Number of unique values: 1
Unique value: ['Tokyo']
///////////////////////////////////////////
Column name: City name
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: Published_date
Number of unique values: 84
Unique value: ['2020-01-24' '2020-01-25' '2020-01-30' '2020-02-13' '2020-02-14' '2020-02-15'
 '2020-02-16' '2020-02-18' '2020-02-19' '2020-02-21' '2020-02-22' '2020-02-24'
 '2020-02-26' '2020-02-27' '2020-02-29' '2020-03-01' '2020-03-03' '2020-03-04'
 '2020-03-05' '2020-03-06' '2020-03-07' '2020-03-10' '2020-03-11' '2020-03-12'
 '2020-03-13' '2020-03-14' '2020-03-15' '2020-03-17' '2020-03-18' '2020-03-19'
 '2020-03-20' '2020-03-21' '2020-03-22' '2020-03-23' '2020-03-24' '2020-03-25'
 '2020-03-26' '2020-03-27' '2020-03-28' '2020-03-29' '2020-03-30' '2020-03-31'
 '2020-04-01' '2020-04-02' '2020-04-03' '2020-04-04' '2020-04-05' '2020-04-06'
 '2020-04-07' '2020-04-08' '2020-04-09' '2020-04-10' '2020-04-11' '2020-04-12'
 '2020-04-13' '2020-04-14' '2020-04-15' '2020-04-16' '2020-04-17' '2020-04-18'
 '2020-04-19' '2020-04-20' '2020-04-21' '2020-04-22' '2020-04-23' '2020-04-24'
 '2020-04-25' '2020-04-26' '2020-04-27' '2020-04-28' '2020-04-29' '2020-04-30'
 '2020-05-01' '2020-05-02' '2020-05-03' '2020-05-04' '2020-05-05' '2020-05-06'
 '2020-05-07' '2020-05-08' '2020-05-09' '2020-05-10' '2020-05-11' '2020-05-12']
///////////////////////////////////////////
Column name: Day of the week
Number of unique values: 7
Unique value: ['Fri' 'Sat' 'Thu' 'Sun' 'Tue' 'Wed' 'Mon']
///////////////////////////////////////////
Column name: Onset_date
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: patient_residence
Number of unique values: 7
Unique value: ['Wuhan City, Hubei Province' 'Changsha City, Hunan Province' 'Tokyo' 'Outside Tokyo' nan 'Under investigation' '―']
///////////////////////////////////////////
Column name: patient_Age
Number of unique values: 13
Unique value: ['40s' '30s' '70s' '50s' '80s' '60s' '20s' 'Under 10 years old' '90s' '10s' '100 years and over' 'unknown' '-']
///////////////////////////////////////////
Column name: patient_sex
Number of unique values: 4
Unique value: ['Male' 'Female' 'Under investigation' 'Unknown']
///////////////////////////////////////////
Column name: patient_attribute
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: patient_status
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: patient_symptoms
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: patient_travel history flag
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: Remarks
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: Discharged flag
Number of unique values: 2
Unique value: [1. nan]
///////////////////////////////////////////

At least the following columns appear to be entirely blank (NaN).

"City name"
"Onset_date"
"patient_attribute"
"patient_status"
"patient_symptoms"
"patient_travel history flag"
"Remarks"

In addition, "National local government code" and "Prefecture name" hold the same value in every row, so they carry no information for the analysis. It is desirable to remove such unnecessary data in advance. Let's create a new DataFrame (df_extract) by extracting only the necessary columns. Execute the following code.

#Trim unnecessary columns (extract only necessary columns)
df_extract = df[['No','Published_date','Day of the week','patient_residence','patient_Age','patient_sex','Discharged flag']]
df_extract = df_extract.set_index('No')     #Set the "No" column to index.
df_extract

That looks much cleaner. I think trimming, that is, properly judging which data is unnecessary for the analysis and excluding it, is also a very important skill.
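The visual check above can also be done programmatically. The following is a rough sketch on a made-up sample DataFrame that finds columns that are entirely blank and columns that hold only one distinct value. The names `empty_cols`, `constant_cols`, and `df_trimmed` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample with one empty column and one constant column (not real data)
df_demo = pd.DataFrame(
    {
        "No": [1, 2, 3],
        "Prefecture name": ["Tokyo", "Tokyo", "Tokyo"],
        "City name": [np.nan, np.nan, np.nan],
        "patient_Age": ["20s", "30s", "20s"],
    }
)

# Columns in which every value is missing
empty_cols = df_demo.columns[df_demo.isna().all()].tolist()

# Columns with exactly one distinct non-missing value (nunique ignores NaN by default)
constant_cols = df_demo.columns[df_demo.nunique() == 1].tolist()

print(empty_cols)
print(constant_cols)

# Drop both groups to keep only the informative columns
df_trimmed = df_demo.drop(columns=empty_cols + constant_cols)
print(df_trimmed.columns.tolist())
```

Be careful with the second check: a column such as "Discharged flag", whose values are either 1 or blank, would also be reported as constant even though the blanks themselves carry meaning, so the final decision still needs a human look.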

Practice data

This article used data from Tokyo, but Jag Japan Co., Ltd. has released nationwide CSV data. https://dl.dropboxusercontent.com/s/6mztoeb6xf78g5w/COVID-19.csv It contains a lot of data, and I think it is just right for practicing data analysis with Python. The procedure is almost the same, so if you are interested, why not try it yourself?

Bonus

In fact, you can do almost the same thing in Excel (and relatively easily)

Below are pivot charts drawn in Excel from the same data. In fact, Excel can easily do almost everything shown here, including the heatmap introduced in this article. I love Python too and am quite attached to it, but when I think about what data analysis is for and who it is for, I believe every day that the basic approach should be to do in Excel what can be done in Excel, rather than insisting on Python.

Transition graph of newly infected people drawn in Excel

Heat map (of sorts) drawn in Excel

Thank you for reading through to the end.

I will keep updating this article as I work on improving my skills.

I drew a Python graph using public data on the number of patients positive for the new coronavirus (COVID-19) in Tokyo + with a link to the national version of practice data