Introduction
We are developing CovsirPhy, a Python package that makes it easy to download and analyze COVID-19 data (such as the number of PCR-positive cases).
English-language documentation: CovsirPhy: COVID-19 analysis with phase-dependent SIRs, and Kaggle: COVID-19 data with SIR model.
**This time, I will explain how to download the actual COVID-19 data.**
CovsirPhy can be installed as follows. Please use Python 3.7 or later, or Google Colaboratory.
- Stable version: pip install covsirphy --upgrade
- Development version: pip install "git+https://github.com/lisphilar/covid19-sir.git#egg=covsirphy"
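Before installing, it can help to confirm that the interpreter meets the Python 3.7 requirement mentioned above. The following is only a minimal sketch using the standard library; the helper name `is_supported` is hypothetical and is not part of CovsirPhy.

```python
import sys

# Minimum version stated in this article (Python 3.7 or later).
MIN_VERSION = (3, 7)


def is_supported(version_info=None):
    """Return True if the given (or current) interpreter is at least MIN_VERSION."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor), e.g. (3, 8) >= (3, 7).
    return tuple(version_info)[:2] >= MIN_VERSION


print(is_supported())
```

If this prints False, upgrade Python or switch to Google Colaboratory before running the pip commands.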
# For data display
from pprint import pprint
# CovsirPhy
import covsirphy as cs
cs.__version__
# '2.8.2'
Execution environment | |
---|---|
OS | Windows Subsystem for Linux |
Python | version 3.8.5 |
The tables and graphs in this article were created using the data as of September 11, 2020.
You can download the data in the following 4 lines.
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
oxcgrt_data = data_loader.oxcgrt()
The following three types of data are downloaded from COVID-19 Data Hub [^1] and automatically saved in the "input" directory (folder). Data formatting is also taken care of.
- Time-series data on the number of infected / recovered / fatal cases for each country / region
- Population data for each country / region
- Oxford Covid-19 Government Response Tracker (OxCGRT): data quantifying the measures each country has taken against COVID-19
Data formatting is done on the CovsirPhy side, but the download itself relies on covid19dh, the official package of COVID-19 Data Hub. We work with its developers [^2] to prevent errors, but if anything goes wrong, please contact us via the CovsirPhy issue page!
data_loader = cs.DataLoader("input")
You can change the directory name. The default of the first argument is "input" and can be omitted.
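To see what was actually saved after the download, you could simply list the directory. This is a rough sketch using only the standard library; the helper `list_saved_files` is hypothetical, and the names of the saved files are not stated in this article, so none are assumed here.

```python
from pathlib import Path


def list_saved_files(directory="input"):
    """Return the sorted names of the files in the directory, or [] if it does not exist."""
    path = Path(directory)
    if not path.is_dir():
        # Nothing has been downloaded to this directory yet.
        return []
    return sorted(p.name for p in path.iterdir() if p.is_file())


print(list_saved_files("input"))
```

An empty list suggests that the directory passed to cs.DataLoader() differs from the one being inspected, or that no download has run yet.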
# verbose=True: display the data source at download time
jhu_data = data_loader.jhu(verbose=True)
type(jhu_data)
# -> <class 'covsirphy.cleaning.jhu_data.JHUData'>
The method is named "jhu" because the data was originally downloaded directly from Johns Hopkins University.
The data source [^3] can be checked from the DataLoader instance.
[^3]: COVID-19 Data Hub provides secondary data! It is based on the Johns Hopkins University data, with preprocessing such as missing-value handling performed on the database side. Thank you very much.
# Information about COVID-19 Data Hub -> (output omitted)
print(jhu_data.citation)
# List of data citation sources -> (output omitted)
print(data_loader.covid19dh_citation)
# View the downloaded data (pandas.DataFrame) -> (output omitted)
jhu_data.raw.tail()
JHUData.cleaned() returns a pandas.DataFrame containing the date, country name, region name, total number of confirmed cases (PCR positives), current number of infected people, total number of deaths, and total number of recovered cases.
jhu_data.cleaned().tail()
Index | Date | Country | Province | Confirmed | Infected | Fatal | Recovered |
---|---|---|---|---|---|---|---|
211098 | 2020-09-07 | Colombia | Vichada | 14 | 0 | 0 | 14 |
211099 | 2020-09-08 | Colombia | Vichada | 14 | 0 | 0 | 14 |
211100 | 2020-09-09 | Colombia | Vichada | 14 | 0 | 0 | 14 |
211101 | 2020-09-10 | Colombia | Vichada | 14 | 0 | 0 | 14 |
211102 | 2020-09-11 | Colombia | Vichada | 14 | 0 | 0 | 14 |
Depending on the country, both the nationwide value and the values for each region are registered, so jhu_data.cleaned().groupby("Country").sum() does not give correct country-level totals. Therefore, we provide the method JHUData.subset(country, province), which retrieves the data for a specific country or region. The country and region name columns are omitted from the output.
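The double-counting problem described above can be sketched with plain Python and made-up numbers. The country "Exampleland", its provinces, and the counts are invented for illustration, and the use of "-" to mark the nationwide row is an assumption, not something confirmed in this article.

```python
from collections import defaultdict

# Made-up records for illustration only (not real data).
# Assumption: Province "-" marks the row holding the nationwide value.
records = [
    {"Country": "Exampleland", "Province": "-", "Confirmed": 100},
    {"Country": "Exampleland", "Province": "North", "Confirmed": 60},
    {"Country": "Exampleland", "Province": "South", "Confirmed": 40},
]


def naive_sum(rows):
    """Sum every row per country, like groupby("Country").sum() would."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["Country"]] += row["Confirmed"]
    return dict(totals)


def country_level_only(rows):
    """Keep only the nationwide rows, so regional rows are not counted again."""
    return {row["Country"]: row["Confirmed"] for row in rows if row["Province"] == "-"}


naive = naive_sum(records)
correct = country_level_only(records)
print(naive)    # {'Exampleland': 200}
print(correct)  # {'Exampleland': 100}
```

The naive sum counts the 100 nationwide cases once more through the two regional rows, which is why selecting a specific country or region is safer than summing everything.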
# Specify only the country name -> (output omitted)
jhu_data.subset(country="Japan")
# An ISO3 code also works as the country name -> (output omitted)
jhu_data.subset(country="JPN")
# Specify a region name as well
jhu_data.subset(country="JPN", province="Tokyo").tail()
Index | Date | Confirmed | Infected | Fatal | Recovered |
---|---|---|---|---|---|
172 | 2020-09-07 | 21849 | 2510 | 372 | 18967 |
173 | 2020-09-08 | 22019 | 2470 | 378 | 19171 |
174 | 2020-09-09 | 22168 | 2349 | 379 | 19440 |
175 | 2020-09-10 | 22444 | 2478 | 379 | 19587 |
176 | 2020-09-11 | 22631 | 2439 | 380 | 19812 |
Note: These are fourth-hand data (Tokyo Metropolitan Government / national government / domestic volunteer organization / COVID-19 Data Hub) and may differ from the figures announced by the Tokyo Metropolitan Government.
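The Tokyo rows above are consistent with the relationship Infected = Confirmed - Fatal - Recovered. The sketch below checks this for the five rows shown in the table; it only verifies the displayed values and assumes nothing about how CovsirPhy computes the column internally. The helper name `infected_from` is hypothetical.

```python
# Rows copied from the Tokyo table above:
# (Date, Confirmed, Infected, Fatal, Recovered)
tokyo_rows = [
    ("2020-09-07", 21849, 2510, 372, 18967),
    ("2020-09-08", 22019, 2470, 378, 19171),
    ("2020-09-09", 22168, 2349, 379, 19440),
    ("2020-09-10", 22444, 2478, 379, 19587),
    ("2020-09-11", 22631, 2439, 380, 19812),
]


def infected_from(confirmed, fatal, recovered):
    """Currently infected = total confirmed minus those who died or recovered."""
    return confirmed - fatal - recovered


# Compare the value in the table with the value derived from the other columns.
all_consistent = all(
    infected == infected_from(confirmed, fatal, recovered)
    for _, confirmed, infected, fatal, recovered in tokyo_rows
)
print(all_consistent)  # True
```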
To create a time-series graph, use the cs.line_plot() function (we are considering deprecating this function and replacing it with a class).
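# Tokyo subset retrieved with JHUData.subset(), as in the example above
subset_df = jhu_data.subset(country="JPN", province="Tokyo")
cs.line_plot(
    subset_df.set_index("Date").drop("Confirmed", axis=1),
    title="Japan/Tokyo: cases over time",
    filename=None,  # Set a file name to save the figure to a file
    y_integer=True,  # Show the y-axis as integers, without x10-style notation
)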
In addition, the method JHUData.total() returns the worldwide totals, together with ratio columns.
jhu_data.total().tail()
Date | Confirmed | Infected | Fatal | Recovered | Fatal per Confirmed | Recovered per Confirmed | Fatal per (Fatal or Recovered) |
---|---|---|---|---|---|---|---|
2020-09-07 | 2.71499e+07 | 8.06515e+06 | 890441 | 1.81943e+07 | 0.0163986 | 0.335071 | 0.0466573 |
2020-09-08 | 2.73868e+07 | 8.10302e+06 | 895203 | 1.83886e+07 | 0.0163437 | 0.33572 | 0.0464225 |
2020-09-09 | 2.76653e+07 | 8.15167e+06 | 901058 | 1.86126e+07 | 0.016285 | 0.336388 | 0.0461758 |
2020-09-10 | 2.7954e+07 | 8.2298e+06 | 906678 | 1.88175e+07 | 0.0162173 | 0.33658 | 0.0459678 |
2020-09-11 | 2.79547e+07 | 8.22937e+06 | 906696 | 1.88187e+07 | 0.0162172 | 0.336592 | 0.045966 |
population_data = data_loader.population()
print(type(population_data))
# -> <class 'covsirphy.cleaning.population.PopulationData'>
PopulationData.cleaned() returns the ISO3 code, country, region, date, and population. To get the value for a specific country / region, use PopulationData.value(country, province).
# Get the formatted data as a data frame -> (output omitted)
population_data.cleaned().tail()
# Specify only the country name -> int
population_data.value(country="Japan")
# An ISO3 code also works as the country name -> int
population_data.value(country="JPN")
# Specify a region name as well -> int
population_data.value(country="JPN", province="Tokyo")
Population values can be updated with the PopulationData.update(value, country, province) method.
# Before update -> 13942856
population_data.value(country="Japan", province="Tokyo")
# Update
# https://www.metro.tokyo.lg.jp/tosei/hodohappyo/press/2020/06/11/07.html
population_data.update(14_002_973, "Japan", province="Tokyo")
# After update -> 14002973
population_data.value("Japan", province="Tokyo")
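As a rough illustration of why the population value matters, the sketch below computes confirmed cases per 100,000 people for Tokyo with the population before and after the update. The per-capita rate is not something this article computes; it is added here only as an example, and the helper `per_100k` is hypothetical. The inputs are taken from the Tokyo table and the update example above.

```python
def per_100k(count, population):
    """Return the number of cases per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 100_000


# Confirmed cases in Tokyo on 2020-09-11 (from the Tokyo table above).
tokyo_confirmed = 22_631
# Population values before and after PopulationData.update().
population_before = 13_942_856
population_after = 14_002_973

rate_before = per_100k(tokyo_confirmed, population_before)
rate_after = per_100k(tokyo_confirmed, population_after)
print(f"{rate_before:.2f} -> {rate_after:.2f} cases per 100,000 people")
```

The difference is small here, but any analysis that divides by the population inherits whichever value is registered.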
oxcgrt_data = data_loader.oxcgrt()
print(type(oxcgrt_data))
# -> <class 'covsirphy.cleaning.oxcgrt.OxCGRTData'>
OxCGRTData.cleaned() returns the ISO3 code, country name, date, and each index. Regional data is not included, so OxCGRTData.subset(country) does not accept a region name either.
# Get the formatted data as a data frame -> (output omitted)
oxcgrt_data.cleaned().tail()
# Only the country name can be specified
oxcgrt_data.subset(country="Japan")
# An ISO3 code also works as the country name
oxcgrt_data.subset(country="JPN")
Index | Date | School_closing | Workplace_closing | Cancel_events | Gatherings_restrictions | Transport_closing | Stay_home_restrictions | Internal_movement_restrictions | International_movement_restrictions | Information_campaigns | Testing_policy | Contact_tracing | Stringency_index |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
247 | 2020-09-07 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 30.56 |
248 | 2020-09-08 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 30.56 |
249 | 2020-09-09 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 30.56 |
250 | 2020-09-10 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 30.56 |
251 | 2020-09-11 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 30.56 |
This time, I explained how to get each dataset using CovsirPhy. I worked hard to make the data obtainable with short code, so please give it a try! We welcome your feedback.
Next time, I will explain the analysis methods using the actual data. In addition to usage examples, I would like to describe the technical background as much as possible. Thank you!
Thank you for your hard work!