[CovsirPhy] COVID-19 Python package for data analysis: S-R trend analysis

Introduction

We are creating a Python package CovsirPhy that allows you to easily download and analyze COVID-19 data (such as the number of PCR positives).

Introductory article:

** This time, we would like to introduce S-R trend analysis (an analysis method for trends in the spread of infection). ** **

The English version of the document is Covsir Phy: COVID-19 analysis with phase-dependent SIRs, [Kaggle: COVID-19 data with SIR model]( Please refer to https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model).

1. Execution environment

CovsirPhy can be installed by the following method! Please use Python 3.7 or above, or Google Colaboratory.

--Stable version: pip install covsirphy --upgrade --Development version: pip install" git + https://github.com/lisphilar/covid19-sir.git#egg=covsirphy "

import covsirphy as cs
cs.__version__
# '2.8.2'

	Execution environment
OS	Windows Subsystem for Linux
Python	version 3.8.5

The tables and graphs in this article were created using the data as of 9/11/2020. Click here for the code [^ 2] to download the actual data from COVID-19 Data Hub [^ 1]:

data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()

Regarding the number of infected people in Japan, a value different from the value [^ 3] of Ministry of Health, Labor and Welfare HP is output. It seems. If you want to use the value of MHLW, download the data from COVID-19 dataset in Japan with the following code. Please replace. If ʻImportError / ModuleNotFoundError: requests and aiohttp occur at that time, please install by pip install requests aiohttp` (during the cause investigation, requests should be included in the dependent package ...).

[^ 3]: When the "number of positives", "number of people discharged or canceled" and "number of deaths" defined by the Ministry of Health, Labor and Welfare are treated as confirmed cases, recovered, recovered, and fatal. ..

japan_data = data_loader.japan()
jhu_data.replace(japan_data)
# Citation -> str
print(japan_data.citation)

2. Check the data

Before going into the explanation of S-R trend analysis, let's look at the actual data (Japanese data as an example) using Scenario.records ().

snl = cs.Scenario(jhu_data, population_data, country="Japan")
#When not displaying the graph
snl.records(show_figure=False)

	Date	Confirmed	Infected	Fatal	Recovered
209	2020-09-07	71856	7957	1363	62536
210	2020-09-08	72234	7575	1377	63282
211	2020-09-09	72726	7233	1393	64100
212	2020-09-10	73221	6980	1406	64835
213	2020-09-11	73901	6899	1412	65590

#Display the graph
# filename:Default value=None (Do not save as an image file)
_ = snl.records(filename=None)

3. Purpose of S-R trend analysis

When dealing with infectious disease data using simultaneous ordinary differential equations such as the SIR model, it is common to perform analysis under one of the following assumptions.

--Model parameters are constant from the beginning to the end of the epidemic --Model parameters fluctuate daily and follow Poisson distribution etc.

However, when analyzing COVID-19, I thought that it was necessary to customize rather than adopt either one.

First of all, in this epidemic, each country and each individual has taken some measures to control the epidemic from around February and March 2020 when the infection began to spread to the world. Parameters such as infection rate fluctuate from time to time and cannot be assumed to be constant until the end (clear from the above-mentioned actual data and SIR-F model waveform [^ 4]).

[^ 4]: [CovsirPhy] COVID-19 Python package for data analysis: SIR-F model

On the other hand, if the parameters fluctuate every day, the analysis will depend on each value of the acquired data. Especially in the case of COVID-19, it is necessary to grasp the outline of the epidemic instead of the daily data because the inspection system was not established at the initial stage and there is a possibility that it has infectivity even if it is asymptomatic. I thought.

Therefore, CovsirPhy decided to take advantage of both by subdividing the parameters into periods that seem to be constant. The "period during which the parameters are constant" is called "Phase". From the start to the end of the epidemic, there are multiple Phases with different parameters. The code will be described in another article, but I think that the execution reproduction number $ R_ {\ mathrm {t}} $ of the SIR-F model, for example, changes stepwise as shown in the graph.

So how do you set the Phase? Since it takes some time to calculate the parameters [^ 5], I would like to find the branch point (date when the parameter value changes) of Phase without calculating the parameters.

[^ 5]: It takes about 2 minutes for parallel calculation of 10 phases and 8 CPUs. You can wait for the analysis of one country, but it is time-consuming and difficult considering that the data increases every day and the data of many countries may be analyzed.

4. How it works

Using S-R trend analysis, it is possible to find the turning point of the phase from the transition of the number of susceptibility holders $ S $ and the number of recoverers $ R $. I will omit the process until I come up with it because it will be long, but [Kaggle: COVID-19 data with SIR model, SR trend analysis](https://www.kaggle.com/lisphilar/covid-19-data-with- sir-model # SR-trend-analysis), so please refer to it.

In the simultaneous ordinary differential equations of SIR model [^ 6] and SIR-F model [^ 4], it is as follows for $ S $ and $ R $.

[^ 6]: [CovsirPhy] COVID-19 Python package for data analysis: SIR model

\begin{align*}
& \frac{\mathrm{d}S}{\mathrm{d}T}= - N^{-1}\beta S I  \\
& \frac{\mathrm{d}R}{\mathrm{d}T}= \gamma I  \\
\end{align*}

Let's divide $ \ frac {\ mathrm {d} S} {\ mathrm {d} T} $ by $ \ frac {\ mathrm {d} R} {\ mathrm {d} T} $.

\begin{align*}
\cfrac{\mathrm{d}S}{\mathrm{d}R} &= - \cfrac{\beta}{N \gamma} S  \\
\end{align*}

If you integrate this and put $ a = \ frac {\ beta} {N \ gamma} $ [^ 7],

[^ 7]: Pioneer in analyzing SR planes in SIR model: Balkew, Teshome Mogessie, "The SIR Model When S (t) is a Multi-Exponential Function." (2010) .Electronic Theses and Dissertations.Paper 1747 .

S_{(R)} = N e^{-a R}

Divide both sides by logarithm,

\log S_{(R)} = - a R + \log N

** When the model parameters are constant, draw a semi-log graph with the x-axis as the number of recoverers $ R $ and the y-axis as the number of sensitivity holders $ S $, and it will be a straight line! ** **

(Graph: From the Kaggle Notebook mentioned above, the difference in the slope of the straight line when the parameters are changed)

5. Code example

It can be executed in one line by the Scenario.trend () method. A semi-logarithmic graph of S-R is displayed. Phase is named Initial (0th) phase, 1st phase, 2nd phase, ... [^ 8].

[^ 8]: Related technology: [Python] Convert natural numbers to ordinal numbers

# show_figure:Display the graph
# filename:Default value=None (Do not save as an image file)
snl.trend(show_figure=True, filename=None)

A list of start and end dates for each Phase can be obtained in data frame format using the Scenario.summary () method.

snl.summary()

	Type	Start	End	Population
0th	Past	06Feb2020	21Apr2020	126529100
1st	Past	22Apr2020	03Jul2020	126529100
2nd	Past	04Jul2020	23Jul2020	126529100
3rd	Past	24Jul2020	31Jul2020	126529100
4th	Past	01Aug2020	10Aug2020	126529100
5th	Past	11Aug2020	21Aug2020	126529100
6th	Past	22Aug2020	29Aug2020	126529100
7th	Past	30Aug2020	11Sep2020	126529100

Since the value of the number of susceptibility holders $ S $ changes depending on the value of the total population, it is displayed together as an important analysis condition. In addition, although not shown in this article, parameter estimates can be listed using the Scenario.summary () method. If there is a Phase for which the actual data is not registered, the Type column of the corresponding line will be" Future ".

Although omitted this time, the above Phase setting is performed by Scenario.add (), Scenario.combine (), Scenario.delete (), Scenario.separate (), Scenario.clear (). It is also possible to edit.

6. Implementation

I'm using the ruptures package to find the number of branch points in Phase and calculate the number of recoverers that will be the branch points. I originally used fbprophet, which is famous as a time series analysis package, but there was a problem that the number of branch points had to be set manually. Ilyass Tabial of Colaborator told me [^ 9], and I made it possible to automatically estimate the number of branch points using ruptures.

From covsirphy.ChangeFinder implementation code, the main steps are as follows.

Get the S, R values for each date
Calculate log10 (S)
Set the value of log10 (S) for R = 1, 2, 3. .: If it is not in the data, fill in the values before and after.
Sampling to reduce the amount of calculation: Sampling number = number of original data
Using the Ruptures package, find the set (R, log10 (S)) of points where log10 (S) changes significantly with respect to the amount of change in the number of recoverers.
Collate log10 (S) of the set of change points with the original data to obtain the date of each change point (approximate value if it is not in the original data. Log10 (S) because the number of confirmed cases has a larger variation than R. It was used)
Calculate the start and end dates of each Phase from the date of change

Afterword

Next time, I will explain how to estimate the parameters for each Phase.

Thank you for your hard work!