You can download the data here: https://catalog.data.metro.tokyo.lg.jp/dataset/t000010d0000000068/resource/c2d997db-1450-43fa-8037-ebb11ec28d4c It is in CSV format, and although many of its columns are empty, the data is clean and easy to work with. At the time of writing, it covers cases up to July 9.
This time, I built the analysis environment on Jupyter running in Docker. The setup is fairly rough, but please bear with me: I'm just reusing one I made for another project (this analysis doesn't need all of it). Here is the Dockerfile.
FROM python:3.8.2
USER root
EXPOSE 9999
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1
RUN mkdir /code
WORKDIR /code
RUN apt-get update && apt-get -y install locales default-mysql-client && \
localedef -f UTF-8 -i ja_JP ja_JP.UTF-8
ENV LANG ja_JP.UTF-8
ENV LANGUAGE ja_JP:ja
ENV LC_ALL ja_JP.UTF-8
ENV TZ JST-9
ENV TERM xterm
ADD ./requirements_python.txt /code
RUN pip install --upgrade pip
RUN pip install -r /code/requirements_python.txt
WORKDIR /root
RUN jupyter notebook --generate-config
RUN echo c.NotebookApp.port = 9999 >> ~/.jupyter/jupyter_notebook_config.py
RUN echo c.NotebookApp.token = \'jupyter\' >> ~/.jupyter/jupyter_notebook_config.py
CMD jupyter lab --no-browser --ip=0.0.0.0 --allow-root
requirements_python.txt
glob2
json5
jupyterlab
numpy
pandas
pyOpenSSL
scikit-learn
scipy
setuptools
tqdm
urllib3
matplotlib
xlrd
From here on, I describe the implementation.
First, load the required packages and the data file. This time I use only matplotlib and pandas. The publication date column is converted to the datetime type at this point.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
df_patient = pd.read_csv('./data/130001_tokyo_covid19_patients.csv')
df_patient['Published_date'] = pd.to_datetime(df_patient['Published_date'])
df_patient.head()
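As an alternative to converting the column after loading, pandas can parse dates while reading the file. The following is a minimal sketch, assuming a tiny in-memory sample in place of the real CSV; the sample rows and the name df_sample are made up for illustration, and only the column names follow this article.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
# The column names follow the ones used in this article.
sample_csv = io.StringIO(
    "No,Published_date,patient_Age\n"
    "1,2020-01-24,40s\n"
    "2,2020-01-25,30s\n"
    "3,2020-01-25,30s\n"
)

# parse_dates converts the column while reading, so a separate
# pd.to_datetime call afterwards is not needed.
df_sample = pd.read_csv(sample_csv, parse_dates=["Published_date"])
print(df_sample.dtypes)
```

Either approach should give the same datetime column; which one to use is mostly a matter of taste.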
The data is processed as follows.
Counting the cases per day and age group is not difficult with groupby. Also, keep only the necessary columns at this point.
df_patient_day = df_patient.groupby(['Published_date','patient_Age']).count().reset_index()[['Published_date','patient_Age','No']]
df_patient_day
Also, if the age labels contain Japanese, the text will be garbled in matplotlib (fixing this properly is a hassle because I'm reusing an existing environment, as mentioned above), so replace the labels as follows.
# Keys must match the age labels exactly as they appear in the CSV.
genes_dict = {
    'Under 10 years old': 'under 10',
    "10's": '10',
    "20's": '20',
    '30s': '30',
    'Forties': '40',
    '50s': '50',
    '60s': '60',
    '70s': '70',
    '80s': '80',
    '90s': '90',
    '100 years and over': 'over 100',
    "'-": '-',
    'unknown': 'unknown',
}
df_patient_day['patient_Age'] = [genes_dict[x] for x in df_patient_day['patient_Age'].values.tolist()]
df_patient_day
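The list comprehension above raises a KeyError if the CSV ever contains a label that is missing from the dictionary. A possible variant, sketched here with made-up labels, uses Series.map and then lists any label that was not translated. The names ages, label_map, and unmapped are hypothetical.

```python
import pandas as pd

# Hypothetical labels; "??" stands for a label missing from the mapping.
ages = pd.Series(["20s", "30s", "??"])
label_map = {"20s": "20", "30s": "30"}

# Series.map returns NaN for values that are absent from the dict,
# whereas label_map[x] in a list comprehension would raise KeyError.
mapped = ages.map(label_map)

# Collect the original labels that could not be translated.
unmapped = ages[mapped.isna()].unique().tolist()
print(unmapped)  # ['??']
```

Checking for unmapped labels before continuing makes it easier to notice when the source data adds a new category.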
There is a problem with the result above: if there were no new cases for a given day and age group, no row exists for that combination, which causes trouble when taking the moving average later. So here I create the Cartesian product of the age groups × the full date range and left-join the DataFrame above onto it, filling the missing counts with 0 (please let me know if you know a better way!).
genes = [
    'under 10',
    '10',
    '20',
    '30',
    '40',
    '50',
    '60',
    '70',
    '80',
    '90',
    'over 100',
    '-',
    'unknown',
]
days = pd.date_range(start=df_patient['Published_date'].min(), end=df_patient['Published_date'].max(), freq='D')
data = [[x, y] for x in days for y in genes]
df_data = pd.DataFrame(data, columns=['Published_date', 'patient_Age'])
df_data = pd.merge(df_data, df_patient_day, on=['Published_date', 'patient_Age'], how='left').fillna(0)
df_data = df_data.rename(columns={'No':'Number of people'})
df_data
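One possible alternative to building the product with a list comprehension and a merge is pd.MultiIndex.from_product combined with reindex. This is only a sketch on a made-up miniature of the data (the names cases, ages, counts, full_index, and df_full are hypothetical), not necessarily a better fit for the real dataset.

```python
import pandas as pd

# Hypothetical miniature of df_patient: one row per reported case.
cases = pd.DataFrame({
    "Published_date": pd.to_datetime(
        ["2020-07-01", "2020-07-01", "2020-07-03"]),
    "patient_Age": ["20", "30", "20"],
})
ages = ["20", "30"]
days = pd.date_range(cases["Published_date"].min(),
                     cases["Published_date"].max(), freq="D")

# Count cases per (date, age); size() counts the rows in each group.
counts = cases.groupby(["Published_date", "patient_Age"]).size()

# Build every (date, age) combination and reindex onto it, so that
# combinations without cases become 0 instead of being absent.
full_index = pd.MultiIndex.from_product(
    [days, ages], names=["Published_date", "patient_Age"])
df_full = (counts.reindex(full_index, fill_value=0)
                 .rename("Number of people")
                 .reset_index())
print(df_full)
```

Here 2020-07-02 has no cases at all, yet it still appears in df_full with a count of 0 for both age groups.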
Take a moving average for each age group. pandas makes this easy: for a 7-day window, just call rolling(7), and to take the average over that window, call rolling(7).mean(). Since the first 6 days will be NaN, remove them with dropna(). For the later steps, I build a separate time series for each age group and store it in a dictionary. That's all there is to it!
result_diff = {}
for x in genes:
    df = df_data[df_data['patient_Age'] == x]
    df = pd.Series(df['Number of people'].values.tolist(), index=df['Published_date'].values)
    result_diff[x] = df.rolling(7).mean().dropna()
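The same moving average can also be computed for all age groups at once by pivoting the table to one column per age group. The sketch below uses a made-up 8-day table shaped like df_data; long_df, wide, and moving_avg are hypothetical names.

```python
import pandas as pd

# Hypothetical long-format table shaped like df_data:
# 8 days x 2 age groups.
days = pd.date_range("2020-07-01", periods=8, freq="D")
long_df = pd.DataFrame({
    "Published_date": list(days) * 2,
    "patient_Age": ["20"] * 8 + ["30"] * 8,
    "Number of people": [1, 2, 3, 4, 5, 6, 7, 8] + [0] * 8,
})

# One column per age group, one row per day.
wide = long_df.pivot(index="Published_date", columns="patient_Age",
                     values="Number of people")

# rolling(7).mean() is applied column by column. The first 6 rows are
# NaN because their window is incomplete, and dropna() removes them.
moving_avg = wide.rolling(7).mean().dropna()
print(moving_avg)
```

With 8 days of input, only the last 2 days have a complete 7-day window, so moving_avg has 2 rows.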
Finally, plot the moving averages.
fig, axe = plt.subplots()
for x in genes:
    df_diff = result_diff[x]
    axe.plot(df_diff.index, df_diff.values, label=x)
axe.legend()
axe.set_ylim([0,65])
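The upper limit of 65 is hard-coded, so it needs updating as the numbers grow. A small sketch of one way to derive it from the data instead, using made-up moving averages (sample_result and y_max are hypothetical names, and the 10% headroom is an arbitrary choice):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical moving averages keyed by age group, shaped like result_diff.
idx = pd.date_range("2020-07-01", periods=3, freq="D")
sample_result = {
    "20": pd.Series([10.0, 20.0, 40.0], index=idx),
    "30": pd.Series([5.0, 8.0, 12.0], index=idx),
}

fig, ax = plt.subplots()
for age, series in sample_result.items():
    ax.plot(series.index, series.values, label=age)

# Use the largest plotted value plus 10% headroom as the upper limit.
y_max = max(series.max() for series in sample_result.values())
ax.set_ylim(0, y_max * 1.1)
ax.legend()
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```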
Finally, I will display the result.
For people in their 20s through 50s, the plot confirms that the increase moved through the age groups in order, starting with the youngest. Even so, the rise among people in their twenties is striking. The announcement by Tokyo was not a lie.
I don't intend to discuss the causes here, but with public data like this you can quickly check the content of a report yourself, so why not try it as practice? There is still plenty to investigate, for example by comparing these numbers with the actual population distribution, and I think this makes good teaching material for practicing data processing.