[CovsirPhy] COVID-19 Python Package for Data Analysis: Parameter estimation

Introduction

We are creating a Python package CovsirPhy that allows you to easily download and analyze COVID-19 data (such as the number of PCR positives).

Introductory article:

** This time, we will introduce Parameter estimation (parameter estimation such as SIR-F model). ** **

The English version of the document is Covsir Phy: COVID-19 analysis with phase-dependent SIRs, [Kaggle: COVID-19 data with SIR model]( Please refer to https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model).

1. Execution environment

CovsirPhy can be installed by the following method! Please use Python 3.7 or above, or Google Colaboratory.

--Stable version: pip install covsirphy --upgrade --Development version: pip install" git + https://github.com/lisphilar/covid19-sir.git#egg=covsirphy "

import covsirphy as cs
cs.__version__
# '2.8.2'

	Execution environment
OS	Windows Subsystem for Linux / Ubuntu
Python	version 3.8.5

2. Preparation

The tables and graphs in this article were created using the data as of 9/12/2020. Click here for the code [^ 2] to download the actual data from COVID-19 Data Hub [^ 1]:

data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()

In addition, please confirm the actual data and execute S-R trend analysis [^ 3] with the following code.

[^ 3]: [CovsirPhy] COVID-19 Python package for data analysis: S-R trend analysis

# (Optional)Acquisition of data from the Ministry of Health, Labor and Welfare
japan_data = data_loader.japan()
jhu_data.replace(japan_data)
print(japan_data.citation)
#Instantiation of analysis class
snl = cs.Scenario(jhu_data, population_data, country="Japan")
#Confirmation of actual data
snl.records(filename=None)
# S-R trend analysis
snl.trend(filename=None)
#Check Phase settings
snl.summary()

Actual data graph:

S-R trend analysis:

Phase setting:

	Type	Start	End	Population
0th	Past	06Feb2020	21Apr2020	126529100
1st	Past	22Apr2020	04Jul2020	126529100
2nd	Past	05Jul2020	23Jul2020	126529100
3rd	Past	24Jul2020	01Aug2020	126529100
4th	Past	02Aug2020	14Aug2020	126529100
5th	Past	15Aug2020	29Aug2020	126529100
6th	Past	30Aug2020	12Sep2020	126529100

3. Execution example

By S-R trend analysis, it was possible to divide into "Phase" (period during which the parameter is constant). In this article, the parameter values are estimated using the data of each "Phase" (for example, the data of 2020/2/6 --2020/4/21 for the 0th phase).

I will write another article about the estimation mechanism. The parameter values are proposed using the ʻoptunapackage, the numerical solution is calculated byscipy.integrate.solve_ivp`, and the parameter set with little error from the actual data is selected.

How to read the result will be described later, but you can execute and get the result list in the following two lines. This time, I used SIR-F model [^ 4].

[^ 4]: [CovsirPhy] COVID-19 Python package for data analysis: SIR-F model

# Parameter estimation with SIR-F model
snl.estimate(cs.SIRF)
#Get parameter list
snl.summary()

#Example of standard output (processing time varies depending on the number of CPUs, etc.)
#Details below: Latest Phase=After estimating including tau in the 6th phase, fix tau and 0-5th parameter estimation
<SIR-F model: parameter estimation>
Running optimization with 8 CPUs...
        6th phase (30Aug2020 - 12Sep2020): finished  704 trials in 0 min 25 sec
        5th phase (15Aug2020 - 29Aug2020): finished  965 trials in 1 min  0 sec
        3rd phase (24Jul2020 - 01Aug2020): finished  965 trials in 1 min  0 sec
        1st phase (22Apr2020 - 04Jul2020): finished  913 trials in 1 min  0 sec
        4th phase (02Aug2020 - 14Aug2020): finished  969 trials in 1 min  0 sec
        0th phase (06Feb2020 - 21Apr2020): finished  853 trials in 1 min  0 sec
        2nd phase (05Jul2020 - 23Jul2020): finished  964 trials in 1 min  0 sec
Completed optimization. Total: 1 min 27 sec

	Type	Start	End	Population	ODE	Rt	theta	kappa	rho	sigma	tau	1/alpha2 [day]	1/gamma [day]	alpha1 [-]	1/beta [day]	RMSLE	Trials	Runtime
0th	Past	06Feb2020	21Apr2020	126529100	SIR-F	5.54	0.0258495	0.0002422	0.0322916	0.00543343	480	1376	61	0.026	10	1.17429	804	1 min 0 sec
1st	Past	22Apr2020	04Jul2020	126529100	SIR-F	0.41	0.0730748	0.000267108	0.0118168	0.0264994	480	1247	12	0.073	28	1.11459	861	1 min 0 sec
2nd	Past	05Jul2020	23Jul2020	126529100	SIR-F	2.01	0.000344333	7.92419e-05	0.0467789	0.023201	480	4206	14	0	7	0.0331522	910	1 min 0 sec
3rd	Past	24Jul2020	01Aug2020	126529100	SIR-F	1.75	0.00169155	4.05087e-05	0.0459332	0.0260965	480	8228	12	0.002	7	0.0201773	923	1 min 0 sec
4th	Past	02Aug2020	14Aug2020	126529100	SIR-F	1.46	0.000634554	0.000116581	0.0325815	0.0221345	480	2859	15	0.001	10	0.0751473	909	1 min 0 sec
5th	Past	15Aug2020	29Aug2020	126529100	SIR-F	0.8	0.00244294	9.30884e-05	0.0272693	0.0337857	480	3580	9	0.002	12	0.0420563	907	1 min 0 sec
6th	Past	30Aug2020	12Sep2020	126529100	SIR-F	0.69	5.36037e-05	0.000467824	0.0219379	0.0312686	480	712	10	0	15	0.0132161	724	0 min 25 sec

4. Parameter estimates

It's long sideways, so let's look at it in order. First is the estimated value of the parameter.

# cs.SIRF.PARAMETERS: SIR-F model parameter name list
cols = ["Start", "End", "ODE", "tau", *cs.SIRF.PARAMETERS]
snl.summary(columns=cols)

	Start	End	ODE	tau	theta	kappa	rho	sigma
0th	06Feb2020	21Apr2020	SIR-F	480	0.0258495	0.0002422	0.0322916	0.00543343
1st	22Apr2020	04Jul2020	SIR-F	480	0.0730748	0.000267108	0.0118168	0.0264994
2nd	05Jul2020	23Jul2020	SIR-F	480	0.000344333	7.92419e-05	0.0467789	0.023201
3rd	24Jul2020	01Aug2020	SIR-F	480	0.00169155	4.05087e-05	0.0459332	0.0260965
4th	02Aug2020	14Aug2020	SIR-F	480	0.000634554	0.000116581	0.0325815	0.0221345
5th	15Aug2020	29Aug2020	SIR-F	480	0.000644851	0.000383424	0.0274104	0.0337156
6th	30Aug2020	12Sep2020	SIR-F	480	5.36037e-05	0.000467824	0.0219379	0.0312686

SIR-F model[^4]:

\begin{align*}
\mathrm{S} \overset{\beta I}{\longrightarrow} \mathrm{S}^\ast \overset{\alpha_1}{\longrightarrow}\ & \mathrm{F}    \\
\mathrm{S}^\ast \overset{1 - \alpha_1}{\longrightarrow}\ & \mathrm{I} \overset{\gamma}{\longrightarrow} \mathrm{R}    \\
& \mathrm{I} \overset{\alpha_2}{\longrightarrow} \mathrm{F}    \\
\end{align*}

$ \ Alpha_2 $, $ \ beta $, $ \ gamma $ has "time" as a dimension. This "time" is $ f (T + \ Delta T) --f (T) = x (T) \ Delta T $, which is a discretized equation of simultaneous ordinary differential equations $ f'(T) = x (T) $. It depends on \ Delta T $. Since data is acquired every other day, $ \ Delta T $ is less than one day (= 1440 min), but I do not know the specific value. It is difficult to compare the dimensional parameters $ \ alpha_2 $, $ \ beta $, $ \ gamma $ of different countries, which may vary by country or region.

Therefore, we set the variable $ \ tau $, which corresponds to $ \ Delta T $, to make the dimensional parameters dimensionless.

\begin{align*}
(S, I, R, F) = & N \times (x, y, z, w)    \\
(T, \alpha_1, \alpha_2, \beta, \gamma) = & (\tau t, \theta, \tau^{-1}\kappa, \tau^{-1}\rho, \tau^{-1}\sigma)    \\
1 \leq \tau & \leq 1440    \\
\end{align*}

At this time [^ 4],

\begin{align*}
& 0 \leq (x, y, z, w, \theta, \kappa, \rho, \sigma) \leq 1  \\
\end{align*}

After estimating the dimensionless parameters as $ \ tau $ using the latest Phase data so that the value of $ \ tau $ is constant in the same country, the same value is calculated as $ \ tau $ when estimating the parameters of other Phases. I am trying to use it as. This time it was 480 min as shown in the table. I'm programming to use a divisor of 1440 min per day to simplify the calculation.

Now, let's graph the transition of each dimensionless parameter.

\begin{align*}
& \frac{\mathrm{d}x}{\mathrm{d}t}= - \rho x y  \\
& \frac{\mathrm{d}y}{\mathrm{d}t}= \rho (1-\theta) x y - (\sigma + \kappa) y  \\
& \frac{\mathrm{d}z}{\mathrm{d}t}= \sigma y  \\
& \frac{\mathrm{d}w}{\mathrm{d}t}= \rho \theta x y + \kappa y  \\
\end{align*}

transition of rho

When Susceptible comes into contact with Infected, the probability of getting infected $ \ rho $ changes:

snl.history(target="rho", filename=None)

It is interpreted that the effect of the state of emergency (area limited 2020/4/7, nationwide 2020/4/16, nationwide cancellation 2020/5/25) appeared in the latter half of April and was maintained until the beginning of July. I will. After that, it jumped up and gradually decreased. Details need to be discussed, but it is a parameter that directly reflects the effects of measures such as three-cs avoidance.

Transition of sigma

Probability of transition from Infected to Recovered $ \ sigma $ transition:

snl.history(target="sigma", filename=None)

After a large rise in the latter half of April, it is on an upward trend while repeating rising / falling. This parameter reflects the medical care provision system and the development / supply status of new drugs.

Transition of kappa

Changes in infected mortality rate $ \ kappa $:

snl.history(target="kappa", filename=None)

It seems that it fluctuates greatly, but since the absolute value of the value is small, it can be considered that it is kept constant to some extent (verification by comparison with other countries is necessary). It is necessary to put in place a medical system and bring $ \ kappa $ as close to 0 as possible with a sufficient supply of new drugs.

Transition of theta

Percentage of infected people who received a definitive diagnosis who died at the time of the definitive diagnosis Changes in $ \ theta $:

snl.history(target="theta", filename=None)

In the early stages of infection, the medical care provision system itself was extremely tight, so it is difficult to interpret, but it is thought that the value will increase if the examination is delayed and appropriate treatment cannot be performed.

5. Transition of dimensional parameters

The dimensional parameters are also included in the table. To make it easier to interpret, the unit is [day], which is the reciprocal except for the dimensionless $ \ alpha_1 $.

cols = ["Start", "End", "ODE", "tau", *cs.SIRF.DAY_PARAMETERS]
fh.write(snl.summary(columns=cols).to_markdown())

	Start	End	ODE	tau	alpha1 [-]	1/alpha2 [day]	1/beta [day]	1/gamma [day]
0th	06Feb2020	21Apr2020	SIR-F	480	0.026	1376	10	61
1st	22Apr2020	04Jul2020	SIR-F	480	0.073	1247	28	12
2nd	05Jul2020	23Jul2020	SIR-F	480	0	4206	7	14
3rd	24Jul2020	01Aug2020	SIR-F	480	0.002	8228	7	12
4th	02Aug2020	14Aug2020	SIR-F	480	0.001	2859	10	15
5th	15Aug2020	29Aug2020	SIR-F	480	0.002	3580	12	9
6th	30Aug2020	12Sep2020	SIR-F	480	0	712	15	10

The dimensional parameter is omitted because it shows the same progress as the dimensionless parameter, but the graph can be obtained with the following code.

#For beta->Graph omitted
snl.history(target="1/beta [day]", filename=None)

6. Changes in the number of effective reproductions

The basic / effective reproduction number Rt of SIR-F model [^ 4] is defined as follows.

\begin{align*}
R_t = \rho (1 - \theta) (\sigma + \kappa)^{-1} = \beta (1 - \alpha_1) (\gamma + \alpha_2)^{-1}
\end{align*}

Transition:

#List
cols = ["Start", "End", "ODE", "tau", "Rt"]
snl.summary(columns=cols)
#Graph
snl.history(target="Rt", filename="rt.jpg ")

	Start	End	ODE	Rt
0th	06Feb2020	21Apr2020	SIR-F	5.54
1st	22Apr2020	04Jul2020	SIR-F	0.41
2nd	05Jul2020	23Jul2020	SIR-F	2.01
3rd	24Jul2020	01Aug2020	SIR-F	1.75
4th	02Aug2020	14Aug2020	SIR-F	1.46
5th	15Aug2020	29Aug2020	SIR-F	0.8
6th	30Aug2020	12Sep2020	SIR-F	0.69

Since $ Rt> 1 $ is one indicator of the spread of infection, the horizontal line $ Rt = 1 $ is displayed.

7. Parameter accuracy

The table also includes the RMSLE score, which measures the accuracy of the parameters, the number of parameter sets proposed by the ʻoptuna` package to estimate the parameters, and the execution time.

cols = ["Start", "End", "RMSLE", "Trials", "Runtime"]
snl.summary(columns=cols)

	Start	End	RMSLE	Trials	Runtime
0th	06Feb2020	21Apr2020	1.17429	690	1 min 1 sec
1st	22Apr2020	04Jul2020	1.11459	764	1 min 0 sec
2nd	05Jul2020	23Jul2020	0.0331522	810	1 min 1 sec
3rd	24Jul2020	01Aug2020	0.0201773	816	1 min 1 sec
4th	02Aug2020	14Aug2020	0.0751473	808	1 min 0 sec
5th	15Aug2020	29Aug2020	0.0420563	804	1 min 0 sec
6th	30Aug2020	12Sep2020	0.0132161	658	0 min 25 sec

The definition formula of RMSLE score (Root Mean Squared Log Error) [^ 5] is as follows. It can be said that the closer it is to 0, the better the actual data is reflected. Although omitted, the verification of the estimation method itself is performed using theoretical data (data on the number of patients theoretically created from the SIR-F model formula and parameter set example).

\begin{align*}
& \sqrt{\cfrac{1}{n}\sum_{i=1}^{n}(log_{10}(A_{i} + 1) - log_{10}(P_{i} + 1))^2}
\end{align*}

$ A $ is the actual data and $ P $ is the predicted value. When $ i = 1, 2, 3, 4 (= n) $, $ A_i $ and $ P_i $ are the actual data / predicted values of $ S, I, R, F $, respectively.

Since it is difficult to get an image with just the value, I made a graph. First, about the 0th phase, which has the largest RMSLE value. The top graph shows the difference between the actual data and the predicted value, and the second, third and fourth show both the actual data and the predicted value for each variable.

snl.estimate_accuracy(phase="0th", filename=None)

There is some error. It seems better to split the 0th phase using Scenario.separate () etc. (a separate article will be created).

On the other hand, in the 6th phase, which has the lowest RMSLE score, the actual data and the predicted value often overlap.

snl.estimate_accuracy(phase="6th", filename=None)

8. Postscript

This time, I explained how to estimate the parameters of each Phase. More than half a year has passed since the epidemic started, and I think we are at a stage where parameter verification, future parameter prediction, and scenario analysis are important. It seems that the attention of COVID-19 as a theme of data science is decreasing, but as the data is accumulated, it has become possible to perform deeper analysis than in the early days when we focused on creating dashboards. It was.

Thank you for your hard work this time as well!