It's been a while since I last posted to Qiita. Recently I started analyzing statistical data on the new coronavirus pandemic (as a personal lifework, not as a job), and I have posted some articles on my blog:

- Situation of each country from the viewpoint of the lifesaving rate of the new corona: What is the policy taken by that developed country with an extremely low lifesaving rate? │ YUUKOU's experience value
- [Understanding the transition of the lifesaving rate of the new Corona: US strength, critical UK, Netherlands, China whose transition is too beautiful │ YUUKOU's experience value](https://yuukou-exp.plus/covid19-rescue-ratio-timeline-analysis-20200401/)
For example, one result of that analysis is a chart plotting the time-series transition of the lifesaving rate. (Although the counting criteria for infected people differ from country to country, the data suggests that Japan's medical practice is excellent by world standards.)
This time, I would like to share the preparation code for analyzing the new coronavirus statistical data provided by Johns Hopkins University.
- Public data of Johns Hopkins University
With this code, you'll be able to generate a data frame of the new coronavirus statistics and be ready to work on your own data analysis. I hope it will be of some small help to you.
Johns Hopkins University publishes statistical data on new coronavirus infections worldwide (and in chronological order!) on GitHub.

- Repost: Public data from Johns Hopkins University
The overall flow is to fetch the data with `urllib` and then process it. The statistics published by Johns Hopkins University comprise three datasets:

- `confirmed`
- `deaths`
- `recovered`
In addition, the records are kept at a granularity down to the regional (province/state) level within each country. This time, we will aggregate and analyze by country.

There is one caveat, however. Even though the data is a time series, dozens of date columns are lined up side by side in the column direction, so we have to convert them into a structure that is easier to work with.
For example, this is what the data frame looks like (in the case of the number of confirmed infections): you can see the columns that look like dates lined up.
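Since the original screenshot is not reproduced here, this small sketch (my own addition; `url_infection` is defined further below) shows the column layout you would see:

```python
import pandas as pd

# The first four columns are metadata; every remaining column is one date.
_df = pd.read_csv(url_infection)
print(_df.columns[:6].tolist())
# -> ['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20']
```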
By structurally transforming the time-series columns into rows and aggregating by country, we can settle on an orthodox data frame that is easy to handle.
This time, I implemented it on Jupyter Notebook, so I think it will work if you paste the code from this entry as-is and execute it in order from the top.
First, define a crawler class. The name says it all. It will likely be reused in other notebooks, so I made it a class for now.
```python
import urllib.request
import json
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import io
from dateutil.parser import parse
from tqdm import tqdm, tqdm_notebook


class Crowler():

    def __init__(self):
        """
        Crawler class
        """
        self._ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '\
                   'AppleWebKit/537.36 (KHTML, like Gecko) '\
                   'Chrome/55.0.2883.95 Safari/537.36 '

    def fetch(self, url):
        """
        Execute an HTTP request for the given URL.

        :param url:
        :return: request result (response body)
        """
        req = urllib.request.Request(url, headers={'User-Agent': self._ua})
        return urllib.request.urlopen(req)
```
Declare a crawler instance and define the URL of each data source.
```python
# Crawler instance
cr = Crowler()

# Time series data of confirmed infections
url_infection = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'

# Time series data of deaths
url_deaths = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'

# Time series data of recoveries
url_recover = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'
```
Crawl the three data sources and convert each into a data frame.
```python
url_map = {'infection': url_infection,
           'deaths': url_deaths,
           'recover': url_recover}
df_house = {}

for _k, _url in url_map.items():
    _body_csv = cr.fetch(_url)
    df_house[_k] = pd.read_csv(_body_csv)
```
`df_house` is a dictionary that stores the three data frames.
The contents are as follows.
- Data frame of the confirmed number of infected people
- Data frame of fatalities
- Data frame of the number of recovered patients
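As a quick check (my own addition, not in the original entry), you can peek at the size and head of each frame:

```python
# Peek at the size of each data frame fetched above
for _name, _df in df_house.items():
    print(_name, _df.shape)

# And the first few rows of the confirmed-infection frame
df_house['infection'].head()
```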
The time-series columns have a format like `3/27/20`, which is ambiguous to convert reliably as-is with Python's `dateutil.parser.parse`. It's a bit crude, but let's define a function that converts them to the standard `YYYY-mm-dd` format.
```python
def transform_date(s):
    """
    Convert a date like '3/15/20' to 'YYYY-mm-dd' format, e.g. '2020-03-15'.
    """
    _chunk = str(s).split('/')
    return '20{year}-{month:02d}-{day:02d}'.format(year=_chunk[2], month=int(_chunk[0]), day=int(_chunk[1]))
```
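A quick sanity check (these example dates are my own, not from the original entry):

```python
# Zero-padding is applied to single-digit months and days
print(transform_date('3/15/20'))   # -> '2020-03-15'
print(transform_date('12/1/20'))   # -> '2020-12-01'
```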
Next, convert the time-series columns into rows in each of the three data frames, so that each frame has a single column named `date` holding the time series.
```python
df_buffer_house = {}
for _k, _df in df_house.items():
    # Buffer that accumulates the long-format rows for this dataset
    df_buffer_house[_k] = {'Province/State': [],
                           'Country/Region': [],
                           'date': [],
                           _k: []}
    # The first four columns are metadata; the rest are date columns
    _col_dates = _df.columns[4:]
    for _k_date in tqdm(_col_dates):
        for _idx, _r in _df.iterrows():
            df_buffer_house[_k]['Province/State'].append(_r['Province/State'])
            df_buffer_house[_k]['Country/Region'].append(_r['Country/Region'])
            df_buffer_house[_k]['date'].append(transform_date(_k_date))
            df_buffer_house[_k][_k].append(_r[_k_date])
```
When executed on Jupyter Notebook, the conversion proceeds while displaying progress bars like this:
```
100%|██████████████████████████████████████████| 72/72 [00:05<00:00, 12.37it/s]
100%|██████████████████████████████████████████| 72/72 [00:05<00:00, 12.89it/s]
100%|██████████████████████████████████████████| 72/72 [00:05<00:00, 13.27it/s]
```
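As an aside, the same wide-to-long conversion can be done more concisely with pandas' built-in `melt`. This is a sketch of an alternative, not the code the rest of this entry uses; the `id_vars` column names are assumed from the JHU CSV layout:

```python
# One-call wide-to-long conversion: every date column becomes a row
df_long = df_house['infection'].melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    var_name='date',
    value_name='infection')
df_long['date'] = df_long['date'].apply(transform_date)
```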
The structure of the three data frames is now much better, so all that remains is to combine them, but there is one caveat.
In the infection counts (`infection`) and death counts (`deaths`), multiple `Province/State` rows may be recorded per country, whereas the recovery counts (`recover`) are recorded at the country level for some countries (for example, `Canada`).
Therefore, it is necessary to aggregate each data frame by country and then combine them.
```python
df_integrated = pd.DataFrame()
col_integrated = ['Country/Region', 'date']
df_chunk = {}

for _k, _df_dict in df_buffer_house.items():
    _df_raw = pd.DataFrame.from_dict(_df_dict)
    # Aggregate by 'Country/Region' and 'date', summing over Province/State
    _df_grouped_buffer = {'Country/Region': [], 'date': [], _k: []}
    for _idx, _grp in tqdm(_df_raw.groupby(col_integrated)):
        _df_grouped_buffer['Country/Region'].append(_idx[0])
        _df_grouped_buffer['date'].append(_idx[1])
        _df_grouped_buffer[_k].append(_grp[_k].sum())
    df_chunk[_k] = pd.DataFrame.from_dict(_df_grouped_buffer)

# Combine the three country-level frames into one
df_integrated = df_chunk['infection'].merge(df_chunk['deaths'], on=col_integrated, how='outer')
df_integrated = df_integrated.merge(df_chunk['recover'], on=col_integrated, how='left')
```
Running this displays progress bars like the following:
```
100%|██████████████████████████████████| 13032/13032 [00:08<00:00, 1621.81it/s]
100%|██████████████████████████████████| 13032/13032 [00:08<00:00, 1599.91it/s]
100%|██████████████████████████████████| 13032/13032 [00:07<00:00, 1647.02it/s]
```
Let's check whether Canada, mentioned in the example above, has been converted into proper data, with a quick filter like the sketch below.
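A minimal inspection sketch (my own addition; `Canada` is the country name as it appears in the JHU data):

```python
# Look at the most recent merged records for Canada
df_canada = df_integrated[df_integrated['Country/Region'] == 'Canada']
df_canada.sort_values('date').tail(10)
```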
Looks okay! There is no sign of records riddled with NaN, and we can confirm that the numbers change in chronological order.
I would like to introduce an example of analysis code using the statistical data of the new coronavirus obtained by this conversion.
The lifesaving rate (`rescue_ratio`) is defined here as the ratio of patients who have recovered (Total Recovered Cases) to patients whose treatment has concluded (Closed Cases = recovered + deaths) (*).

The phase position (`phase_position`) is a number that shows how close the epidemic in each country is to its end: the ratio of patients whose treatment has concluded to the total number of infected people. `phase_position` takes a value between `0.0` and `1.0`. The closer it is to `0.0`, the earlier the infection phase; the closer it is to `1.0`, the more the infection phase is in its final stage.
```python
df_grouped = df_integrated
df_grouped['date'] = pd.to_datetime(df_grouped['date'])

# Lifesaving rate: recovered / (recovered + deaths)
df_grouped['rescue_ratio'] = df_grouped['recover'] / (df_grouped['recover'] + df_grouped['deaths'])
df_grouped['rescue_ratio'] = df_grouped['rescue_ratio'].fillna(0)

# Phase position: closed cases (recovered + deaths) / total infections
df_grouped['phase_position'] = (df_grouped['recover'] + df_grouped['deaths']) / df_grouped['infection']
```
Let's check the calculation results using the United States as an example, with a filter like the sketch below; a data frame like the following will be displayed.
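A minimal sketch (my own addition; `US` is the country name as recorded in the JHU data):

```python
# Most recent indicator values for the United States
df_us = df_grouped[df_grouped['Country/Region'] == 'US']
df_us.sort_values('date').tail(10)
```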
The United States is still in the early stages of its epidemic, and although the lifesaving rate is picking up, you can see that the situation remains severe.
So, I have introduced the preparation code for analyzing the statistical data of the new coronavirus. The Johns Hopkins University statistics are one of the data sources attracting the most attention in the world right now, so I hope you will try out various analytical ideas and actively share your findings!
And with that, coming back to where we started, I would like to conclude by introducing the new corona analysis entries I have written:
- Situation of each country from the viewpoint of the lifesaving rate of the new corona: What is the policy taken by that developed country with an extremely low lifesaving rate? │ YUUKOU's experience value
- [Understanding the transition of the lifesaving rate of the new Corona: US strength, critical UK, Netherlands, China whose transition is too beautiful │ YUUKOU's experience value](https://yuukou-exp.plus/covid19-rescue-ratio-timeline-analysis-20200401/)
- Results of quantifying the infection phases of the new coronavirus and countries around the world: The United States is dangerous │ YUUKOU's experience value