Introduction
We are creating a Python package CovsirPhy that allows you to easily download and analyze COVID-19 data (such as the number of PCR positives).
Introductory article:
** This time, we will introduce Parameter estimation (parameter estimation such as SIR-F model). ** **
The English version of the document is Covsir Phy: COVID-19 analysis with phase-dependent SIRs, [Kaggle: COVID-19 data with SIR model]( Please refer to https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model).
CovsirPhy can be installed by the following method! Please use Python 3.7 or above, or Google Colaboratory.
--Stable version: pip install covsirphy --upgrade
--Development version: pip install" git + https://github.com/lisphilar/covid19-sir.git#egg=covsirphy "
import covsirphy as cs
cs.__version__
# '2.8.2'
Execution environment | |
---|---|
OS | Windows Subsystem for Linux / Ubuntu |
Python | version 3.8.5 |
The tables and graphs in this article were created using the data as of 9/12/2020. Click here for the code [^ 2] to download the actual data from COVID-19 Data Hub [^ 1]:
data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
In addition, please confirm the actual data and execute S-R trend analysis [^ 3] with the following code.
[^ 3]: [CovsirPhy] COVID-19 Python package for data analysis: S-R trend analysis
# (Optional)Acquisition of data from the Ministry of Health, Labor and Welfare
japan_data = data_loader.japan()
jhu_data.replace(japan_data)
print(japan_data.citation)
#Instantiation of analysis class
snl = cs.Scenario(jhu_data, population_data, country="Japan")
#Confirmation of actual data
snl.records(filename=None)
# S-R trend analysis
snl.trend(filename=None)
#Check Phase settings
snl.summary()
Actual data graph:
S-R trend analysis:
Phase setting:
Type | Start | End | Population | |
---|---|---|---|---|
0th | Past | 06Feb2020 | 21Apr2020 | 126529100 |
1st | Past | 22Apr2020 | 04Jul2020 | 126529100 |
2nd | Past | 05Jul2020 | 23Jul2020 | 126529100 |
3rd | Past | 24Jul2020 | 01Aug2020 | 126529100 |
4th | Past | 02Aug2020 | 14Aug2020 | 126529100 |
5th | Past | 15Aug2020 | 29Aug2020 | 126529100 |
6th | Past | 30Aug2020 | 12Sep2020 | 126529100 |
By S-R trend analysis, it was possible to divide into "Phase" (period during which the parameter is constant). In this article, the parameter values are estimated using the data of each "Phase" (for example, the data of 2020/2/6 --2020/4/21 for the 0th phase).
I will write another article about the estimation mechanism. The parameter values are proposed using the ʻoptunapackage, the numerical solution is calculated by
scipy.integrate.solve_ivp`, and the parameter set with little error from the actual data is selected.
How to read the result will be described later, but you can execute and get the result list in the following two lines. This time, I used SIR-F model [^ 4].
[^ 4]: [CovsirPhy] COVID-19 Python package for data analysis: SIR-F model
# Parameter estimation with SIR-F model
snl.estimate(cs.SIRF)
#Get parameter list
snl.summary()
#Example of standard output (processing time varies depending on the number of CPUs, etc.)
#Details below: Latest Phase=After estimating including tau in the 6th phase, fix tau and 0-5th parameter estimation
<SIR-F model: parameter estimation>
Running optimization with 8 CPUs...
6th phase (30Aug2020 - 12Sep2020): finished 704 trials in 0 min 25 sec
5th phase (15Aug2020 - 29Aug2020): finished 965 trials in 1 min 0 sec
3rd phase (24Jul2020 - 01Aug2020): finished 965 trials in 1 min 0 sec
1st phase (22Apr2020 - 04Jul2020): finished 913 trials in 1 min 0 sec
4th phase (02Aug2020 - 14Aug2020): finished 969 trials in 1 min 0 sec
0th phase (06Feb2020 - 21Apr2020): finished 853 trials in 1 min 0 sec
2nd phase (05Jul2020 - 23Jul2020): finished 964 trials in 1 min 0 sec
Completed optimization. Total: 1 min 27 sec
Type | Start | End | Population | ODE | Rt | theta | kappa | rho | sigma | tau | 1/alpha2 [day] | 1/gamma [day] | alpha1 [-] | 1/beta [day] | RMSLE | Trials | Runtime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0th | Past | 06Feb2020 | 21Apr2020 | 126529100 | SIR-F | 5.54 | 0.0258495 | 0.0002422 | 0.0322916 | 0.00543343 | 480 | 1376 | 61 | 0.026 | 10 | 1.17429 | 804 | 1 min 0 sec |
1st | Past | 22Apr2020 | 04Jul2020 | 126529100 | SIR-F | 0.41 | 0.0730748 | 0.000267108 | 0.0118168 | 0.0264994 | 480 | 1247 | 12 | 0.073 | 28 | 1.11459 | 861 | 1 min 0 sec |
2nd | Past | 05Jul2020 | 23Jul2020 | 126529100 | SIR-F | 2.01 | 0.000344333 | 7.92419e-05 | 0.0467789 | 0.023201 | 480 | 4206 | 14 | 0 | 7 | 0.0331522 | 910 | 1 min 0 sec |
3rd | Past | 24Jul2020 | 01Aug2020 | 126529100 | SIR-F | 1.75 | 0.00169155 | 4.05087e-05 | 0.0459332 | 0.0260965 | 480 | 8228 | 12 | 0.002 | 7 | 0.0201773 | 923 | 1 min 0 sec |
4th | Past | 02Aug2020 | 14Aug2020 | 126529100 | SIR-F | 1.46 | 0.000634554 | 0.000116581 | 0.0325815 | 0.0221345 | 480 | 2859 | 15 | 0.001 | 10 | 0.0751473 | 909 | 1 min 0 sec |
5th | Past | 15Aug2020 | 29Aug2020 | 126529100 | SIR-F | 0.8 | 0.00244294 | 9.30884e-05 | 0.0272693 | 0.0337857 | 480 | 3580 | 9 | 0.002 | 12 | 0.0420563 | 907 | 1 min 0 sec |
6th | Past | 30Aug2020 | 12Sep2020 | 126529100 | SIR-F | 0.69 | 5.36037e-05 | 0.000467824 | 0.0219379 | 0.0312686 | 480 | 712 | 10 | 0 | 15 | 0.0132161 | 724 | 0 min 25 sec |
It's long sideways, so let's look at it in order. First is the estimated value of the parameter.
# cs.SIRF.PARAMETERS: SIR-F model parameter name list
cols = ["Start", "End", "ODE", "tau", *cs.SIRF.PARAMETERS]
snl.summary(columns=cols)
Start | End | ODE | tau | theta | kappa | rho | sigma | |
---|---|---|---|---|---|---|---|---|
0th | 06Feb2020 | 21Apr2020 | SIR-F | 480 | 0.0258495 | 0.0002422 | 0.0322916 | 0.00543343 |
1st | 22Apr2020 | 04Jul2020 | SIR-F | 480 | 0.0730748 | 0.000267108 | 0.0118168 | 0.0264994 |
2nd | 05Jul2020 | 23Jul2020 | SIR-F | 480 | 0.000344333 | 7.92419e-05 | 0.0467789 | 0.023201 |
3rd | 24Jul2020 | 01Aug2020 | SIR-F | 480 | 0.00169155 | 4.05087e-05 | 0.0459332 | 0.0260965 |
4th | 02Aug2020 | 14Aug2020 | SIR-F | 480 | 0.000634554 | 0.000116581 | 0.0325815 | 0.0221345 |
5th | 15Aug2020 | 29Aug2020 | SIR-F | 480 | 0.000644851 | 0.000383424 | 0.0274104 | 0.0337156 |
6th | 30Aug2020 | 12Sep2020 | SIR-F | 480 | 5.36037e-05 | 0.000467824 | 0.0219379 | 0.0312686 |
SIR-F model[^4]:
\begin{align*}
\mathrm{S} \overset{\beta I}{\longrightarrow} \mathrm{S}^\ast \overset{\alpha_1}{\longrightarrow}\ & \mathrm{F} \\
\mathrm{S}^\ast \overset{1 - \alpha_1}{\longrightarrow}\ & \mathrm{I} \overset{\gamma}{\longrightarrow} \mathrm{R} \\
& \mathrm{I} \overset{\alpha_2}{\longrightarrow} \mathrm{F} \\
\end{align*}
$ \ Alpha_2 $, $ \ beta $, $ \ gamma $ has "time" as a dimension. This "time" is $ f (T + \ Delta T) --f (T) = x (T) \ Delta T $, which is a discretized equation of simultaneous ordinary differential equations $ f'(T) = x (T) $. It depends on \ Delta T $. Since data is acquired every other day, $ \ Delta T $ is less than one day (= 1440 min), but I do not know the specific value. It is difficult to compare the dimensional parameters $ \ alpha_2 $, $ \ beta $, $ \ gamma $ of different countries, which may vary by country or region.
Therefore, we set the variable $ \ tau $, which corresponds to $ \ Delta T $, to make the dimensional parameters dimensionless.
\begin{align*}
(S, I, R, F) = & N \times (x, y, z, w) \\
(T, \alpha_1, \alpha_2, \beta, \gamma) = & (\tau t, \theta, \tau^{-1}\kappa, \tau^{-1}\rho, \tau^{-1}\sigma) \\
1 \leq \tau & \leq 1440 \\
\end{align*}
At this time [^ 4],
\begin{align*}
& 0 \leq (x, y, z, w, \theta, \kappa, \rho, \sigma) \leq 1 \\
\end{align*}
After estimating the dimensionless parameters as $ \ tau $ using the latest Phase data so that the value of $ \ tau $ is constant in the same country, the same value is calculated as $ \ tau $ when estimating the parameters of other Phases. I am trying to use it as. This time it was 480 min as shown in the table. I'm programming to use a divisor of 1440 min per day to simplify the calculation.
Now, let's graph the transition of each dimensionless parameter.
\begin{align*}
& \frac{\mathrm{d}x}{\mathrm{d}t}= - \rho x y \\
& \frac{\mathrm{d}y}{\mathrm{d}t}= \rho (1-\theta) x y - (\sigma + \kappa) y \\
& \frac{\mathrm{d}z}{\mathrm{d}t}= \sigma y \\
& \frac{\mathrm{d}w}{\mathrm{d}t}= \rho \theta x y + \kappa y \\
\end{align*}
When Susceptible comes into contact with Infected, the probability of getting infected $ \ rho $ changes:
snl.history(target="rho", filename=None)
It is interpreted that the effect of the state of emergency (area limited 2020/4/7, nationwide 2020/4/16, nationwide cancellation 2020/5/25) appeared in the latter half of April and was maintained until the beginning of July. I will. After that, it jumped up and gradually decreased. Details need to be discussed, but it is a parameter that directly reflects the effects of measures such as three-cs avoidance.
Probability of transition from Infected to Recovered $ \ sigma $ transition:
snl.history(target="sigma", filename=None)
After a large rise in the latter half of April, it is on an upward trend while repeating rising / falling. This parameter reflects the medical care provision system and the development / supply status of new drugs.
Changes in infected mortality rate $ \ kappa $:
snl.history(target="kappa", filename=None)
It seems that it fluctuates greatly, but since the absolute value of the value is small, it can be considered that it is kept constant to some extent (verification by comparison with other countries is necessary). It is necessary to put in place a medical system and bring $ \ kappa $ as close to 0 as possible with a sufficient supply of new drugs.
Percentage of infected people who received a definitive diagnosis who died at the time of the definitive diagnosis Changes in $ \ theta $:
snl.history(target="theta", filename=None)
In the early stages of infection, the medical care provision system itself was extremely tight, so it is difficult to interpret, but it is thought that the value will increase if the examination is delayed and appropriate treatment cannot be performed.
The dimensional parameters are also included in the table. To make it easier to interpret, the unit is [day], which is the reciprocal except for the dimensionless $ \ alpha_1 $.
cols = ["Start", "End", "ODE", "tau", *cs.SIRF.DAY_PARAMETERS]
fh.write(snl.summary(columns=cols).to_markdown())
Start | End | ODE | tau | alpha1 [-] | 1/alpha2 [day] | 1/beta [day] | 1/gamma [day] | |
---|---|---|---|---|---|---|---|---|
0th | 06Feb2020 | 21Apr2020 | SIR-F | 480 | 0.026 | 1376 | 10 | 61 |
1st | 22Apr2020 | 04Jul2020 | SIR-F | 480 | 0.073 | 1247 | 28 | 12 |
2nd | 05Jul2020 | 23Jul2020 | SIR-F | 480 | 0 | 4206 | 7 | 14 |
3rd | 24Jul2020 | 01Aug2020 | SIR-F | 480 | 0.002 | 8228 | 7 | 12 |
4th | 02Aug2020 | 14Aug2020 | SIR-F | 480 | 0.001 | 2859 | 10 | 15 |
5th | 15Aug2020 | 29Aug2020 | SIR-F | 480 | 0.002 | 3580 | 12 | 9 |
6th | 30Aug2020 | 12Sep2020 | SIR-F | 480 | 0 | 712 | 15 | 10 |
The dimensional parameter is omitted because it shows the same progress as the dimensionless parameter, but the graph can be obtained with the following code.
#For beta->Graph omitted
snl.history(target="1/beta [day]", filename=None)
The basic / effective reproduction number Rt of SIR-F model [^ 4] is defined as follows.
\begin{align*}
R_t = \rho (1 - \theta) (\sigma + \kappa)^{-1} = \beta (1 - \alpha_1) (\gamma + \alpha_2)^{-1}
\end{align*}
Transition:
#List
cols = ["Start", "End", "ODE", "tau", "Rt"]
snl.summary(columns=cols)
#Graph
snl.history(target="Rt", filename="rt.jpg ")
Start | End | ODE | Rt | |
---|---|---|---|---|
0th | 06Feb2020 | 21Apr2020 | SIR-F | 5.54 |
1st | 22Apr2020 | 04Jul2020 | SIR-F | 0.41 |
2nd | 05Jul2020 | 23Jul2020 | SIR-F | 2.01 |
3rd | 24Jul2020 | 01Aug2020 | SIR-F | 1.75 |
4th | 02Aug2020 | 14Aug2020 | SIR-F | 1.46 |
5th | 15Aug2020 | 29Aug2020 | SIR-F | 0.8 |
6th | 30Aug2020 | 12Sep2020 | SIR-F | 0.69 |
Since $ Rt> 1 $ is one indicator of the spread of infection, the horizontal line $ Rt = 1 $ is displayed.
The table also includes the RMSLE score, which measures the accuracy of the parameters, the number of parameter sets proposed by the ʻoptuna` package to estimate the parameters, and the execution time.
cols = ["Start", "End", "RMSLE", "Trials", "Runtime"]
snl.summary(columns=cols)
Start | End | RMSLE | Trials | Runtime | |
---|---|---|---|---|---|
0th | 06Feb2020 | 21Apr2020 | 1.17429 | 690 | 1 min 1 sec |
1st | 22Apr2020 | 04Jul2020 | 1.11459 | 764 | 1 min 0 sec |
2nd | 05Jul2020 | 23Jul2020 | 0.0331522 | 810 | 1 min 1 sec |
3rd | 24Jul2020 | 01Aug2020 | 0.0201773 | 816 | 1 min 1 sec |
4th | 02Aug2020 | 14Aug2020 | 0.0751473 | 808 | 1 min 0 sec |
5th | 15Aug2020 | 29Aug2020 | 0.0420563 | 804 | 1 min 0 sec |
6th | 30Aug2020 | 12Sep2020 | 0.0132161 | 658 | 0 min 25 sec |
The definition formula of RMSLE score (Root Mean Squared Log Error) [^ 5] is as follows. It can be said that the closer it is to 0, the better the actual data is reflected. Although omitted, the verification of the estimation method itself is performed using theoretical data (data on the number of patients theoretically created from the SIR-F model formula and parameter set example).
\begin{align*}
& \sqrt{\cfrac{1}{n}\sum_{i=1}^{n}(log_{10}(A_{i} + 1) - log_{10}(P_{i} + 1))^2}
\end{align*}
$ A $ is the actual data and $ P $ is the predicted value. When $ i = 1, 2, 3, 4 (= n) $, $ A_i $ and $ P_i $ are the actual data / predicted values of $ S, I, R, F $, respectively.
Since it is difficult to get an image with just the value, I made a graph. First, about the 0th phase, which has the largest RMSLE value. The top graph shows the difference between the actual data and the predicted value, and the second, third and fourth show both the actual data and the predicted value for each variable.
snl.estimate_accuracy(phase="0th", filename=None)
There is some error. It seems better to split the 0th phase using Scenario.separate ()
etc. (a separate article will be created).
On the other hand, in the 6th phase, which has the lowest RMSLE score, the actual data and the predicted value often overlap.
snl.estimate_accuracy(phase="6th", filename=None)
This time, I explained how to estimate the parameters of each Phase. More than half a year has passed since the epidemic started, and I think we are at a stage where parameter verification, future parameter prediction, and scenario analysis are important. It seems that the attention of COVID-19 as a theme of data science is decreasing, but as the data is accumulated, it has become possible to perform deeper analysis than in the early days when we focused on creating dashboards. It was.
Thank you for your hard work this time as well!
Recommended Posts