Introduction
We are creating a Python package CovsirPhy that allows you to easily download and analyze COVID-19 data (such as the number of PCR positives).
Introductory article:
** This time, we would like to introduce S-R trend analysis (an analysis method for trends in the spread of infection). ** **
The English version of the document is Covsir Phy: COVID-19 analysis with phase-dependent SIRs, [Kaggle: COVID-19 data with SIR model]( Please refer to https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model).
CovsirPhy can be installed by the following method! Please use Python 3.7 or above, or Google Colaboratory.
--Stable version: pip install covsirphy --upgrade
--Development version: pip install" git + https://github.com/lisphilar/covid19-sir.git#egg=covsirphy "
import covsirphy as cs
cs.__version__
# '2.8.2'
Execution environment | |
---|---|
OS | Windows Subsystem for Linux |
Python | version 3.8.5 |
The tables and graphs in this article were created using the data as of 9/11/2020. Click here for the code [^ 2] to download the actual data from COVID-19 Data Hub [^ 1]:
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
Regarding the number of infected people in Japan, a value different from the value [^ 3] of Ministry of Health, Labor and Welfare HP is output. It seems. If you want to use the value of MHLW, download the data from COVID-19 dataset in Japan with the following code. Please replace. If ʻImportError / ModuleNotFoundError: requests and aiohttp occur at that time, please install by
pip install requests aiohttp` (during the cause investigation, requests should be included in the dependent package ...).
[^ 3]: When the "number of positives", "number of people discharged or canceled" and "number of deaths" defined by the Ministry of Health, Labor and Welfare are treated as confirmed cases, recovered, recovered, and fatal. ..
japan_data = data_loader.japan()
jhu_data.replace(japan_data)
# Citation -> str
print(japan_data.citation)
Before going into the explanation of S-R trend analysis, let's look at the actual data (Japanese data as an example) using Scenario.records ()
.
snl = cs.Scenario(jhu_data, population_data, country="Japan")
#When not displaying the graph
snl.records(show_figure=False)
Date | Confirmed | Infected | Fatal | Recovered | |
---|---|---|---|---|---|
209 | 2020-09-07 | 71856 | 7957 | 1363 | 62536 |
210 | 2020-09-08 | 72234 | 7575 | 1377 | 63282 |
211 | 2020-09-09 | 72726 | 7233 | 1393 | 64100 |
212 | 2020-09-10 | 73221 | 6980 | 1406 | 64835 |
213 | 2020-09-11 | 73901 | 6899 | 1412 | 65590 |
#Display the graph
# filename:Default value=None (Do not save as an image file)
_ = snl.records(filename=None)
When dealing with infectious disease data using simultaneous ordinary differential equations such as the SIR model, it is common to perform analysis under one of the following assumptions.
--Model parameters are constant from the beginning to the end of the epidemic --Model parameters fluctuate daily and follow Poisson distribution etc.
However, when analyzing COVID-19, I thought that it was necessary to customize rather than adopt either one.
First of all, in this epidemic, each country and each individual has taken some measures to control the epidemic from around February and March 2020 when the infection began to spread to the world. Parameters such as infection rate fluctuate from time to time and cannot be assumed to be constant until the end (clear from the above-mentioned actual data and SIR-F model waveform [^ 4]).
[^ 4]: [CovsirPhy] COVID-19 Python package for data analysis: SIR-F model
On the other hand, if the parameters fluctuate every day, the analysis will depend on each value of the acquired data. Especially in the case of COVID-19, it is necessary to grasp the outline of the epidemic instead of the daily data because the inspection system was not established at the initial stage and there is a possibility that it has infectivity even if it is asymptomatic. I thought.
Therefore, CovsirPhy decided to take advantage of both by subdividing the parameters into periods that seem to be constant. The "period during which the parameters are constant" is called "Phase". From the start to the end of the epidemic, there are multiple Phases with different parameters. The code will be described in another article, but I think that the execution reproduction number $ R_ {\ mathrm {t}} $ of the SIR-F model, for example, changes stepwise as shown in the graph.
So how do you set the Phase? Since it takes some time to calculate the parameters [^ 5], I would like to find the branch point (date when the parameter value changes) of Phase without calculating the parameters.
[^ 5]: It takes about 2 minutes for parallel calculation of 10 phases and 8 CPUs. You can wait for the analysis of one country, but it is time-consuming and difficult considering that the data increases every day and the data of many countries may be analyzed.
Using S-R trend analysis, it is possible to find the turning point of the phase from the transition of the number of susceptibility holders $ S $ and the number of recoverers $ R $. I will omit the process until I come up with it because it will be long, but [Kaggle: COVID-19 data with SIR model, SR trend analysis](https://www.kaggle.com/lisphilar/covid-19-data-with- sir-model # SR-trend-analysis), so please refer to it.
In the simultaneous ordinary differential equations of SIR model [^ 6] and SIR-F model [^ 4], it is as follows for $ S $ and $ R $.
[^ 6]: [CovsirPhy] COVID-19 Python package for data analysis: SIR model
\begin{align*}
& \frac{\mathrm{d}S}{\mathrm{d}T}= - N^{-1}\beta S I \\
& \frac{\mathrm{d}R}{\mathrm{d}T}= \gamma I \\
\end{align*}
Let's divide $ \ frac {\ mathrm {d} S} {\ mathrm {d} T} $ by $ \ frac {\ mathrm {d} R} {\ mathrm {d} T} $.
\begin{align*}
\cfrac{\mathrm{d}S}{\mathrm{d}R} &= - \cfrac{\beta}{N \gamma} S \\
\end{align*}
If you integrate this and put $ a = \ frac {\ beta} {N \ gamma} $ [^ 7],
[^ 7]: Pioneer in analyzing SR planes in SIR model: Balkew, Teshome Mogessie, "The SIR Model When S (t) is a Multi-Exponential Function." (2010) .Electronic Theses and Dissertations.Paper 1747 .
S_{(R)} = N e^{-a R}
Divide both sides by logarithm,
\log S_{(R)} = - a R + \log N
** When the model parameters are constant, draw a semi-log graph with the x-axis as the number of recoverers $ R $ and the y-axis as the number of sensitivity holders $ S $, and it will be a straight line! ** **
(Graph: From the Kaggle Notebook mentioned above, the difference in the slope of the straight line when the parameters are changed)
It can be executed in one line by the Scenario.trend ()
method. A semi-logarithmic graph of S-R is displayed. Phase is named Initial (0th) phase, 1st phase, 2nd phase, ... [^ 8].
[^ 8]: Related technology: [Python] Convert natural numbers to ordinal numbers
# show_figure:Display the graph
# filename:Default value=None (Do not save as an image file)
snl.trend(show_figure=True, filename=None)
A list of start and end dates for each Phase can be obtained in data frame format using the Scenario.summary ()
method.
snl.summary()
Type | Start | End | Population | |
---|---|---|---|---|
0th | Past | 06Feb2020 | 21Apr2020 | 126529100 |
1st | Past | 22Apr2020 | 03Jul2020 | 126529100 |
2nd | Past | 04Jul2020 | 23Jul2020 | 126529100 |
3rd | Past | 24Jul2020 | 31Jul2020 | 126529100 |
4th | Past | 01Aug2020 | 10Aug2020 | 126529100 |
5th | Past | 11Aug2020 | 21Aug2020 | 126529100 |
6th | Past | 22Aug2020 | 29Aug2020 | 126529100 |
7th | Past | 30Aug2020 | 11Sep2020 | 126529100 |
Since the value of the number of susceptibility holders $ S $ changes depending on the value of the total population, it is displayed together as an important analysis condition. In addition, although not shown in this article, parameter estimates can be listed using the Scenario.summary ()
method. If there is a Phase for which the actual data is not registered, the Type
column of the corresponding line will be" Future ".
Although omitted this time, the above Phase setting is performed by Scenario.add ()
, Scenario.combine ()
, Scenario.delete ()
, Scenario.separate ()
, Scenario.clear ()
. It is also possible to edit.
I'm using the ruptures package to find the number of branch points in Phase and calculate the number of recoverers that will be the branch points. I originally used fbprophet, which is famous as a time series analysis package, but there was a problem that the number of branch points had to be set manually. Ilyass Tabial of Colaborator told me [^ 9], and I made it possible to automatically estimate the number of branch points using ruptures.
From covsirphy.ChangeFinder implementation code, the main steps are as follows.
Next time, I will explain how to estimate the parameters for each Phase.
Thank you for your hard work!
Recommended Posts