There is a lot of information about the coronavirus in circulation these days, but it is hard to judge whether it is correct when you don't know who published it.

**To get truly reliable information, you should analyze the primary data yourself as much as possible.** In this article, we compare severity rates by age group using the positive patient attribute data published by Hokkaido.

Loading the data

Use Python's Pandas for analysis. First, import Pandas.

import pandas as pd

Then load the data.

df = pd.read_csv("https://www.harp.lg.jp/opendata/dataset/1369/resource/3132/010006_hokkaido_covid19_patients.csv", encoding="shift-jis")
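
The encoding argument matters because the file is not UTF-8. As a minimal sketch of why, the following round-trips a tiny hypothetical sample through Shift-JIS in memory. The column names and rows are placeholders made up for illustration, not the actual contents of the Hokkaido file.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real file; the column
# names are placeholders, not the actual headers of the Hokkaido CSV.
sample_text = "No,patient_Age\n1,20代\n2,非公表\n"
sample_bytes = sample_text.encode("shift-jis")

# Reading with the matching encoding recovers the Japanese labels intact.
sample_df = pd.read_csv(io.BytesIO(sample_bytes), encoding="shift-jis")
print(sample_df)
```

Reading the same bytes with pandas' default UTF-8 decoding would most likely raise a UnicodeDecodeError, which is a useful hint that the encoding needs to be set.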

You can inspect the loaded data with the head method.

df.head()

Categorization of age and severity

In this article we compare severity across age groups. First, let's look at how the raw data is categorized.

Let's start with the age.

df["patient_Age"].value_counts()

Undisclosed 231
20s 223
70s 219
60s 202
50s 193
40s 176
80s 163
30s 157
90s 75
Teen 33
Less than 10 16
100s 5
Under 10 years old 4
Elderly 1
Name: patient_Age, dtype: int64

Ages are mostly recorded by decade ("20s", "30s", and so on), but the notation is inconsistent ("Less than 10" and "Under 10 years old") and some entries give no usable age ("Elderly" and "Undisclosed").

Because the existing labels are hard to analyze as they are, we define our own categories and assign one to each row.

First, define which new category each original label maps to.

# Keys must match the labels exactly as they appear in the value_counts() output above.
age_dict = {
    "Less than 10": "Teens and younger",
    "Under 10 years old": "Teens and younger",
    "Teen": "Teens and younger",
    "20s": "20s",
    "30s": "30s",
    "40s": "40s",
    "50s": "50s",
    "60s": "60s",
    "70s": "70s",
    "80s": "80s",
    "90s": "90s and over",
    "100s": "90s and over",
    "Undisclosed": "unknown",
    "Elderly": "unknown"
}

Then add a new category column to the DataFrame.

df["Age category"] = [age_dict[key] for key in df["patient_Age"]]
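
The list comprehension above raises a KeyError as soon as the data contains a label that is missing from the dictionary. As an alternative sketch (not the method used in this article), pandas' Series.map returns NaN for unmapped labels, which can then be sent to "unknown". The data below is a toy example, and "Infant" is a hypothetical label invented to show the fallback.

```python
import pandas as pd

# Toy data; "Infant" is a hypothetical label that the mapping does not cover.
ages = pd.Series(["20s", "Less than 10", "100s", "Infant"])

age_map = {
    "Less than 10": "Teens and younger",
    "20s": "20s",
    "100s": "90s and over",
}

# Series.map yields NaN for labels missing from the dict, so fillna can
# route them to "unknown" instead of raising a KeyError.
age_category = ages.map(age_map).fillna("unknown")
print(age_category.tolist())
```

The trade-off is that silently falling back to "unknown" can hide new notation variants, so the stricter KeyError behavior may be preferable while the mapping is still being built.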

The number of severe cases will be counted using the age categories defined here.

In the same way, check the original labels for patient status and define new categories.

df["patient_Status"].value_counts()

Mild conversation possible 1004
−              108
Undisclosed 102
Asymptomatic 102
Asymptomatic conversation possible 97
Mild 88
Mild, conversation possible 54
Moderate conversation possible 35
Mild / conversation possible 30
Moderate 29
Severe 13
Severe conversation not possible 9
Rest on bed, conversation possible 7
Asymptomatic, conversation possible 5
Serious injury: No conversation 3
Positive after death 2
Moderate conversation not possible 2
No symptom, conversation possible 2
Rest on the bed, conversation possible 1
Negative confirmed 1
Mild high fever 1
Under investigation 1
Degree of communication 1
Moderate / conversation possible 1
Name: patient_Status, dtype: int64

stat_dict = {
    "Severe": "3.Severe",
    "Severe conversation not possible": "3.Severe",
    "Serious injury: No conversation": "3.Severe",
    "Moderate conversation possible": "2.Moderate",
    "Moderate": "2.Moderate",
    "Moderate conversation not possible": "2.Moderate",
    "Moderate / conversation possible": "2.Moderate",
    "Mild conversation possible": "1.Mild",
    "Mild": "1.Mild",
    "Mild, conversation possible": "1.Mild",
    "Mild / conversation possible": "1.Mild",
    "Mild high fever": "1.Mild",
    "Asymptomatic conversation possible": "0.No symptoms",
    "Asymptomatic": "0.No symptoms",
    "Asymptomatic, conversation possible": "0.No symptoms",
    "No symptom, conversation possible": "0.No symptoms",
    "−": "unknown",
    "Undisclosed": "unknown",
    "Rest on bed, conversation possible": "unknown",
    "Positive after death": "unknown",
    "Degree of communication": "unknown",
    "Negative confirmed": "unknown",
    "Under investigation": "unknown",
    "Rest on the bed, conversation possible": "unknown"
}

df["State category"] = [stat_dict[key] for key in df["patient_Status"]]
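
With this many notation variants it is easy to miss a label. A small check like the following sketch lists every label that appears in the data but not in the dictionary, before the mapping is applied. The series and dictionary here are toy stand-ins, and "Critical" is a hypothetical label used only to trigger the check.

```python
import pandas as pd

# Toy data; "Critical" is a hypothetical label missing from the mapping.
status = pd.Series(["Mild", "Severe", "Critical", "Mild"])

status_map = {"Mild": "1.Mild", "Severe": "3.Severe"}

# Labels present in the data but absent from the dict keys.
# dropna() keeps missing values out of the comparison.
unmapped = sorted(set(status.dropna().unique()) - set(status_map))
print(unmapped)
```

An empty list means every label is covered. Otherwise the listed labels can be added to the dictionary one by one.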

This completes the assignment of age and state categories. You can check how it was actually assigned with the head method.

df.head()

Aggregation of the number of severely ill persons by age

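Now that the categories are ready, let's count the number of patients in each status category by age category. We use pandas' crosstab function for the aggregation, and convert each row to proportions so that age groups of different sizes can be compared.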

# Japanese font support for matplotlib.
# Install the package first with: pip install japanize-matplotlib
import japanize_matplotlib
import seaborn as sns
sns.set(font="IPAexGothic")

pd.crosstab(df["Age category"], df["State category"]).apply(
    lambda x: x/sum(x), axis=1
).plot(
    kind="bar",
    logy=True,
    rot=45,
    figsize=(8,4),
    color=["grey", "grey", "orange", "red", "grey"]
).legend(loc="upper left")
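
The row-wise division in the apply call turns counts into per-age-group proportions. As a sketch of the same idea, crosstab also accepts normalize="index", which does this normalization directly. The frame below is toy data made up for illustration, not figures from the Hokkaido dataset.

```python
import pandas as pd

# Toy data (not real figures): four cases in each of two age groups.
toy = pd.DataFrame({
    "Age category": ["20s", "20s", "20s", "20s", "80s", "80s", "80s", "80s"],
    "State category": ["1.Mild", "1.Mild", "1.Mild", "0.No symptoms",
                       "1.Mild", "1.Mild", "2.Moderate", "3.Severe"],
})

# normalize="index" divides each row by its row total,
# so every age group sums to 1.
shares = pd.crosstab(toy["Age category"], toy["State category"], normalize="index")
print(shares)
```

Each row of shares sums to 1, so the bars for different age groups are directly comparable even when the groups differ in size.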

Because moderate and severe cases are a small share of the total (less than 10%), the y-axis uses a logarithmic scale.

It is often said that the coronavirus tends to stay mild in young people and to become severe in the elderly, and the aggregated data does show this tendency.

**There are almost no moderate or severe cases up to the 30s, and the proportion of severe cases clearly rises with age from the 40s to the 80s.**
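
To read the trend as numbers rather than from a chart, one option is to compute the share of moderate or severe cases per age group, leaving out rows whose status is unknown. This is only a sketch on toy data made up for illustration. It assumes the "Age category" and "State category" labels defined earlier, and excluding unknowns is a choice made here, not something the plot above does.

```python
import pandas as pd

# Toy data (not real figures) using the category labels defined earlier.
toy = pd.DataFrame({
    "Age category": ["30s", "30s", "30s", "70s", "70s", "70s", "70s"],
    "State category": ["1.Mild", "0.No symptoms", "unknown",
                       "1.Mild", "2.Moderate", "3.Severe", "unknown"],
})

# Keep only rows with a known status, then flag moderate or severe cases.
known = toy[toy["State category"] != "unknown"]
is_serious = known["State category"].isin(["2.Moderate", "3.Severe"])

# The mean of a boolean flag per group is the share of flagged rows.
serious_rate = is_serious.groupby(known["Age category"]).mean()
print(serious_rate)
```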

Summary

Using the coronavirus positive patient attribute data for Hokkaido, we confirmed that the severity rate rises with age.

In this way, you can obtain more accurate knowledge by **analyzing the primary data published by the national and prefectural governments yourself**.

Open data is not always correct, but why not try the method introduced here as one way to get reasonably accurate information quickly?

That's all.
