I drew graphs in Python from the data on patients who tested positive for the new coronavirus (COVID-19), published by the Tokyo Metropolitan Government.
I wrote it with the minimum necessary code, so I hope it will be helpful for those who are thinking of starting data analysis with Python.
The code reads the public csv data, which the Tokyo Metropolitan Government updates daily, directly from the web, so there is no need to download the csv file each time.
If you copy the following Python code to your own execution environment (Jupyter Notebook etc.), you can draw graphs of the latest data every time you run it.
I also added a link to a nationwide version of the csv data later in this article, so I think practicing with it will help you build your skills.
The Python code in this article has been tested using Jupyter Lab on a Windows 10 machine with Anaconda installed.
The data graphed this time is the following csv data, which is updated daily with the results up to the previous day: "Tokyo Metropolitan Government_New Coronavirus Positive Patient Announcement Details" (CSV format).
The following page has the link to the csv data: "Details of Tokyo Metropolitan Government_New Coronavirus Positive Patient Announcement".
Now let's draw a graph in Python using csv data.
First, use the Python code below to connect to the Tokyo Metropolitan Government site, acquire the latest data (csv format), and convert it to a pandas DataFrame.
The point here is that the csv file is not saved in a local folder but is converted directly to a pandas DataFrame (df). Just by running the code below, you are spared the trouble of opening a browser and downloading the latest version of the csv file, which is updated daily.
import requests
import pandas as pd
import io
#Import csv directly into pandas dataframe
url = 'https://stopcovid19.metro.tokyo.lg.jp/data/130001_tokyo_covid19_patients.csv'
r = requests.get(url).content
df = pd.read_csv(io.StringIO(r.decode('utf-8')))
df
When the data is loaded successfully, the contents of DataFrame (df) should be displayed.
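As a supplement, here is a slightly more defensive sketch of the same loading step, which checks the HTTP status before parsing. The helper names `csv_bytes_to_frame` and `fetch_frame`, the timeout value, and the two-row sample are my own assumptions for illustration and are not part of the code above.

```python
import io

import pandas as pd


def csv_bytes_to_frame(raw: bytes, encoding: str = "utf-8") -> pd.DataFrame:
    """Decode downloaded csv bytes and parse them without touching the disk."""
    return pd.read_csv(io.StringIO(raw.decode(encoding)))


def fetch_frame(url: str, timeout: float = 30.0) -> pd.DataFrame:
    """Download a csv file and return it as a DataFrame (hypothetical helper)."""
    import requests  # imported here so the parsing helper works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 404/500 instead of parsing an error page
    return csv_bytes_to_frame(response.content)


# Offline check with a made-up two-row sample instead of the real file
sample = "No,Published_date\n1,2020-01-24\n2,2020-01-25\n".encode("utf-8")
df_sample = csv_bytes_to_frame(sample)
print(df_sample.shape)
```

In a notebook you would call `fetch_frame(url)` with the url shown earlier. Separating the download from the parsing makes it easy to test the parsing part without a network connection.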
I will draw graphs using the DataFrame (df) read above. First comes a bar graph with the date on the horizontal axis and the number of infected people on the vertical axis. Let's continue by executing the following code.
(Added 5/15) Since the original csv data is no longer in chronological order, I added a line that sorts the data by Published_date near the center of the code below.
# Import matplotlib.pyplot and seaborn for drawing graphs
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Draw a graph
plt.figure(figsize=(13,7)) # Define the size of the graph
sns.set(font='Yu Gothic', font_scale = 1.2) # Specify a Japanese font; otherwise Japanese characters are garbled
df = df.sort_values('Published_date') # Sort the data by Published_date (added 5/15)
sns.countplot(data=df, x='Published_date') # Count the rows per date with seaborn and draw them as bars
plt.title('COVID-19 Changes in the number of newly infected people @ Tokyo')
plt.xticks(rotation=90, fontsize=10) # The dates on the x-axis overlap, so rotate them 90°
plt.ylabel('Number of infected people (persons)') # Label the y-axis
Did you get a graph like the one below? It feels like the numbers have settled down, but I wonder what will happen in the future...
By the way, if you use the same bar graph and split the horizontal axis by day of the week, you get the following.
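One caveat: a count plot only shows the dates that actually appear in the data, so a day with no reported patients is silently skipped on the horizontal axis. The sketch below shows one way to count per day with pandas and fill such days with 0. The four sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample: one row per reported patient, as in the csv
df_sample = pd.DataFrame(
    {"Published_date": ["2020-04-02", "2020-04-01", "2020-04-02", "2020-04-04"]}
)

# Parse the strings as dates so that sorting is chronological
dates = pd.to_datetime(df_sample["Published_date"])

# Count rows per day, then fill the days without reports with 0
daily = dates.value_counts().sort_index()
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(full_range, fill_value=0)
print(daily)
```

The resulting Series can be drawn with `daily.plot.bar()` if you prefer to keep the empty days visible.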
#Draw a graph of the number of infected people by day of the week
sns.countplot(data=df, x="Day of the week") #Draw a graph
plt.title('Number of newly infected people by day of the week @ Tokyo') #Show graph title
plt.ylabel('Number of infected people (persons)') #Show title on vertical axis
It was easy to draw, but the order of the days of the week is strange.
To sort the days of the week, rewrite as follows.
# Rearrange the horizontal axis of the graph and draw the graph again
# Order of the horizontal axis, Monday to Sunday (translated labels; the csv itself stores 月, 火, 水, 木, 金, 土, 日)
list_weekday = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
sns.countplot(data=df, x="Day of the week",order=list_weekday) #Draw a graph
plt.title('Number of newly infected people by day of the week @ Tokyo') #Show graph title
plt.ylabel('Number of infected people (persons)') #Show title on vertical axis
The bars are now sorted by day of the week. The counts seem to be high on Fridays and Saturdays, toward the weekend, and low on Sundays and Mondays.
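Instead of relying on the weekday column of the csv, the day of the week can also be derived from the date itself. The following is a minimal sketch under that assumption. The English labels in `weekday_order` and the sample rows are mine, not values from the csv.

```python
import pandas as pd

# Made-up sample dates (2020-04-03 was a Friday)
df_sample = pd.DataFrame(
    {"Published_date": ["2020-04-03", "2020-04-04", "2020-04-06", "2020-04-03"]}
)
weekday_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# dayofweek is 0 for Monday ... 6 for Sunday
day_index = pd.to_datetime(df_sample["Published_date"]).dt.dayofweek
df_sample["weekday"] = pd.Categorical(
    [weekday_order[i] for i in day_index], categories=weekday_order, ordered=True
)

# Sorting the categorical index puts the counts in Monday-to-Sunday order;
# weekdays that never occur are reported as 0
counts = df_sample["weekday"].value_counts().sort_index()
print(counts)
```

Because the column is an ordered categorical, the order is stored with the data, so it does not have to be passed to every plotting call.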
Next, the male-female ratio is ...
#Draw a graph of the number of infected people by gender
sns.countplot(data=df, x="patient_sex") #Draw a graph
plt.title('Number of new infections by gender @ Tokyo') #Show graph title
plt.ylabel('Number of infected people (persons)') #Show title on vertical axis
As reported every day, there are more men here. More importantly, it turned out that the data includes the items "under investigation" and "unknown" in addition to "male" and "female". Unexpected discoveries like this are common in data analysis. Just in case, let's aggregate the patient_sex data with pivot_table. You can aggregate from the original data with the following single line.
# Aggregate the patient_sex data
df.pivot_table(index='patient_sex',aggfunc='size').sort_values(ascending=False)
I think the following tabulation results (the count for each item) will appear.
In other words, in the patient_sex column, six "unknown" and one "under investigation" are mixed in with "male" and "female".
Unexpected items often turn up when analyzing data, so it is very important to learn not only graph visualization but also data aggregation and preprocessing techniques.
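The same kind of tabulation can also be written with `value_counts`. As far as I know, grouping-based aggregation leaves out blank cells by default, so the sketch below passes `dropna=False` to keep them visible. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample with one blank cell
df_sample = pd.DataFrame(
    {"patient_sex": ["male", "female", "male", "unknown", np.nan, "male"]}
)

# dropna=False keeps blank cells as their own row instead of hiding them
sex_counts = df_sample["patient_sex"].value_counts(dropna=False)
print(sex_counts)

# Share of each label, useful for spotting rare, unexpected categories
sex_share = df_sample["patient_sex"].value_counts(normalize=True, dropna=False)
print(sex_share)
```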
Next, by age group ...
list_age = ['Under 10 years old','10s','20s','30s','40s','50s','60s','70s','80s','90s','100 years and over','unknown'] # Display order of the age groups
sns.countplot(data=df, x="patient_Age", order=list_age)
plt.xticks(rotation=90)
plt.ylabel('Number of infected people (persons)')
Looking at it this way, apart from the share of elderly people in their 60s and above, the number of infected people in their 20s and 30s looks large relative to their population. (It may be more accurate to say that the number for people in their 40s and 50s is small relative to their population.)
For reference, the population of Tokyo by age group (as of January 1, 2020, Reiwa 2) is shown below.[^1]

[^1]: From Tokyo's households and population (by town and age) based on the Basic Resident Register

Age | Total population | Male population | Female population |
---|---|---|---|
Under 10 years old | 1,048,921 | 536,920 | 512,001 |
10s | 1,029,680 | 526,065 | 503,615 |
20s | 1,557,966 | 779,053 | 778,913 |
30s | 1,842,086 | 939,710 | 902,376 |
40s | 2,177,935 | 1,108,561 | 1,069,374 |
50s | 1,832,946 | 946,158 | 886,788 |
60s | 1,373,395 | 688,654 | 684,741 |
70s | 1,414,012 | 645,774 | 768,238 |
80s | 794,805 | 304,309 | 490,496 |
90s and over | 185,849 | 47,609 | 138,240 |
unknown | 1 | 0 | 1 |
The graph below compares the number of infected people per 100,000 population, obtained by dividing the number of infected people in each age group by the population of that age group. I was a little surprised... People in their 90s and above stand out by far, followed by those in their 20s and 30s, and then those in their 40s to 80s.
And here is the same graph split into men and women.
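The calculation behind this kind of graph can be sketched as follows. The population figures are copied from the table above, but the case counts are hypothetical numbers for illustration only and are not the real figures. Note also that the csv's two oldest age groups would first have to be merged to match the table's "90s and over" row.

```python
import pandas as pd

# Total population by age group, copied from the table above
population = pd.Series(
    {"20s": 1_557_966, "30s": 1_842_086, "40s": 2_177_935, "90s and over": 185_849}
)

# Hypothetical case counts, for illustration only (not real figures)
cases = pd.Series({"20s": 800, "30s": 900, "40s": 850, "90s and over": 150})

# Cases per 100,000 residents; the two Series are aligned on their index labels
per_100k = (cases / population * 100_000).round(1)
print(per_100k.sort_values(ascending=False))
```

Since pandas aligns on the labels, the order of the two Series does not matter, but a label that exists in only one of them produces NaN, which is a quick way to notice mismatched age group names.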
(Two figures side by side: infected people per 100,000 population by age group, men on the left and women on the right.)
This is also a surprising result. I had wondered why there were so many infected people in their 20s, and it turns out that it is among women that the numbers in their 20s and 30s tend to be high. I don't know the cause, but it is a slightly worrying result.
And if you look at the heatmaps by age and date, ...
# Create a pivot table with Published_date as rows and patient_Age as columns
df_pivot = df[['Published_date','patient_Age']].pivot_table(index='Published_date',columns='patient_Age',aggfunc='size')
# List the patient_Age categories in display order (used for the horizontal axis of the heat map)
list_age = ['Under 10 years old','10s','20s','30s','40s','50s','60s','70s','80s','90s','100 years and over','unknown']
plt.figure(figsize=(6,16)) # Define the size of the graph
plt.yticks(fontsize = 10) # Define y-axis font size
sns.heatmap(df_pivot[list_age], annot = True, annot_kws={"size": 10}, linewidths = .1) # Draw heatmap
It looks plausible, but it also feels like "so what?"... (-_-;) It seems that other information can be extracted, so I will continue the analysis little by little.
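One thing to watch in the heatmap step: selecting columns with a fixed list fails with a KeyError if one of the age groups does not occur in the data at all. The sketch below swaps in `pd.crosstab` plus `reindex` as one way around that. This is my own variation, not the code above, and the sample rows and the shortened `age_order` list are made up.

```python
import pandas as pd

# Made-up sample: "10s" never occurs
df_sample = pd.DataFrame(
    {
        "Published_date": ["2020-04-01", "2020-04-01", "2020-04-02"],
        "patient_Age": ["20s", "30s", "20s"],
    }
)
age_order = ["10s", "20s", "30s"]

# crosstab counts the rows for each (date, age group) pair, with 0 for empty pairs
table = pd.crosstab(df_sample["Published_date"], df_sample["patient_Age"])

# reindex adds missing age columns as 0 instead of raising KeyError
table = table.reindex(columns=age_order, fill_value=0)
print(table)
```

A table built this way contains only integers, so it can be passed to a heatmap without blank (NaN) cells.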
By the way, I have not checked the contents of the raw data (csv) at all so far. Since the csv data was converted to a DataFrame (df) by the code at the beginning, let's display its contents again with the following command.
df
There are 4,883 rows of data (as of May 12, 2020), but there seem to be many NaN values, which indicate blanks. To be on the safe side, let's take a look at the unique values contained in each column. Try running the code below.
# For each column of the DataFrame (df), print the column name and the unique values stored in it
for i in df.columns: # Repeat for each column
    print('Column name:' + i) # Print the name of the column
    print('Number of unique values:' + str(len(df[i].unique()))) # Count the unique values in the column (NaN counts as one value)
    print('Unique value:' + str(df[i].unique())) # Print the unique values in the column
    print('///////////////////////////////////////////') # Separator
Since the result is long, I have collapsed it below.
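When the full listing is too long to read, a compact per-column summary may be enough for a first look. The following is a minimal sketch. The column names in the sample, other than `No` and `patient_sex`, are placeholders I made up.

```python
import numpy as np
import pandas as pd

# Made-up sample with one constant column and one completely blank column
df_sample = pd.DataFrame(
    {
        "No": [1, 2, 3],
        "prefecture": ["Tokyo", "Tokyo", "Tokyo"],
        "remarks": [np.nan, np.nan, np.nan],
        "patient_sex": ["male", "female", "male"],
    }
)

# One row per column: number of distinct non-blank values and number of blank cells
summary = pd.DataFrame(
    {"n_unique": df_sample.nunique(), "n_missing": df_sample.isna().sum()}
)
print(summary)
```

A column whose `n_unique` is 0 is entirely blank, and one whose `n_unique` is 1 holds a single repeated value.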
At least for the following columns everything seems to be blank (nan).
In addition, the "national local government code" and "prefecture name" columns hold the same value in every row, so they carry no information for data analysis. It is desirable to remove such unnecessary data in advance. Let's create a new data frame (df_extract) by extracting only the necessary columns. Execute the following code.
#Trim unnecessary columns (extract only necessary columns)
df_extract = df[['No','Published_date','Day of the week','patient_residence','patient_Age','patient_sex','Discharged flag']]
df_extract = df_extract.set_index('No') #Set the "No" column to index.
df_extract
That tidied things up nicely. I think trimming, which means properly judging and excluding unnecessary data before analysis, is also a very important skill.
This time I used data from Tokyo, but Jag Japan Co., Ltd. has released a national version of the csv data. https://dl.dropboxusercontent.com/s/6mztoeb6xf78g5w/COVID-19.csv There is a lot of data, and I think it is just right for practicing data analysis with Python. The procedure is almost the same, so if you are interested, why don't you try it yourself?
Below is a pivot graph drawn in Excel using the same data. In fact, you can easily do almost the same thing with Excel, including the heatmap introduced in this article. I love Python and have strong feelings for it, but when I think about what data analysis is for and who it is for, I think every day that the basic style should be to do in Excel what can be done in Excel, rather than doing it in Python.
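The columns to keep were chosen by hand above. As a complement, the sketch below shows one way to drop entirely blank and single-valued columns automatically. It is only an illustration on a made-up sample, and whether a constant column is really unnecessary still has to be judged by a person.

```python
import numpy as np
import pandas as pd

# Made-up sample with one constant column and one completely blank column
df_sample = pd.DataFrame(
    {
        "No": [1, 2, 3],
        "prefecture": ["Tokyo", "Tokyo", "Tokyo"],
        "remarks": [np.nan, np.nan, np.nan],
        "patient_sex": ["male", "female", "male"],
    }
)

# Drop the columns in which every cell is blank
trimmed = df_sample.dropna(axis=1, how="all")

# Drop the columns that hold a single repeated value
constant_cols = [c for c in trimmed.columns if trimmed[c].nunique() <= 1]
trimmed = trimmed.drop(columns=constant_cols)
print(list(trimmed.columns))
```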
Thank you for reading through to the end.
I will continue to update it to improve my skills.