I would like to take a more concrete look at what multiple regression analysis can tell us. To do so, we will build sample data from familiar public statistics, run a multiple regression on it, and try interpreting the result. The sample data draws on the following two sources.
➀ Portal Site of Official Statistics of Japan "e-Stat", prefectural data: https://www.e-stat.go.jp/regional-statistics/ssdsview/prefectures

➁ Ministry of Health, Labour and Welfare, "Novel Coronavirus Infection": status of test-positive persons in each prefecture: https://www.mhlw.go.jp/content/10906000/000646813.pdf
First, the objective variable is the rate of people infected with the novel coronavirus. As explanatory variables that may influence it, we prepared the following seven indicators related to the so-called "Three Cs" (closed spaces, crowded places, close-contact settings) and to people's activity, both of which are thought to drive the spread of infection.
Indicator name | Calculation formula | Survey year |
---|---|---|
Densely Inhabited District Population Ratio (%) | Densely inhabited district population (persons) / Total population (persons) | 2015 |
Day / night population ratio (%) | Daytime population / Nighttime population | 2015 |
Employment ratio (%) | Number of employed persons (persons) / Total population (persons) | 2015 |
Restaurant / accommodation business employment ratio (%) | Number of employed persons in restaurants and accommodation (persons) / Number of employed persons (persons) | 2005 |
Travel behavior rate (%) | Travel behavior rate, age 10 and over (%) | 2016 |
Foreign guest ratio (%) | Total number of foreign guests (persons) / Total number of guests (persons) | 2018 |
Single household ratio (%) | Number of single-person households (households) / Number of households (households) | 2018 |
Infected person rate per 100,000 population (persons) | Number of test-positive persons (persons) / Total population (persons) × 100,000 | As of July 5, 2020 |
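To make the formulas in the table concrete, here is a minimal sketch of how a ratio indicator and the per-100,000 rate can be derived from raw counts. The counts are made-up placeholders, not figures from e-Stat or the Ministry's report, and the helper names `ratio_percent` and `rate_per_100k` are hypothetical.

```python
def ratio_percent(part, whole):
    """Share of `part` in `whole`, expressed as a percentage."""
    return part / whole * 100


def rate_per_100k(cases, population):
    """Number of cases per 100,000 residents."""
    return cases / population * 100_000


# Placeholder counts for one imaginary prefecture (not real statistics)
population = 1_000_000
did_population = 600_000
confirmed_cases = 250

did_ratio = ratio_percent(did_population, population)
infection_rate = rate_per_100k(confirmed_cases, population)

print(f"DID population ratio: {did_ratio:.1f} %")
print(f"Infected per 100,000: {infection_rate:.2f}")
```

Scaling to 100,000 residents keeps small prefectures comparable with large ones, which is why the objective variable is a rate rather than a raw count.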
import numpy as np #Numerical calculation
import pandas as pd #Data frame manipulation
from sklearn import linear_model #Linear model of machine learning
#Specify the URL and read the csv file
url = 'https://raw.githubusercontent.com/yumi-ito/sample_data/master/covid19_factors_prefecture.csv'
df = pd.read_csv(url)
#Check the contents by displaying the first 5 lines of data
df.head()
The sample data (covid19_factors_prefecture.csv) is on GitHub, so I read it from there. The table includes a "densely inhabited district" (DID) column. A DID is an urban area delimited by fixed criteria applied to statistical data; put simply, it is a district with a particularly high population density. The ratio therefore tells us, roughly, how much of a prefecture's population is concentrated in its urban areas. Hokkaido is a good example: its plain population density is low because its area is so large, yet its DID population ratio is 75.2%, meaning 3 out of 4 residents live in urban areas, which is a remarkable concentration.
#Get summary statistics for each column
df.describe()
Use the pandas `describe` method.
For the "infected person rate per 100,000 population", as you would expect, the maximum is 46.34 in Tokyo and the minimum is 0.00 in Iwate Prefecture (cases per 100,000 residents).
Next, let's look at the top 5 prefectures for each variable.
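One way to pull out such a top-5 list is pandas `nlargest`. The sketch below runs on a small invented table; the column name `Prefectures` for the prefecture labels and the helper `top_n` are assumptions of mine, so adjust them to whatever the real CSV uses.

```python
import pandas as pd

# Hypothetical stand-in for the real table: the "Prefectures" column name
# and every value here are invented for illustration.
sample = pd.DataFrame({
    "Prefectures": ["A", "B", "C", "D", "E", "F", "G"],
    "Employment ratio": [48.1, 51.3, 46.9, 50.2, 49.5, 47.7, 52.0],
    "Single household ratio": [30.5, 41.2, 28.8, 35.1, 33.0, 29.9, 38.4],
})


def top_n(df, column, name_column="Prefectures", n=5):
    """Return the n rows with the largest values in `column`."""
    return df.nlargest(n, column)[[name_column, column]].reset_index(drop=True)


top_employment = top_n(sample, "Employment ratio")
top_single = top_n(sample, "Single household ratio")

print(top_employment)
print(top_single)
```

Looping `top_n` over every explanatory column would produce one small ranking per variable.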
Tokyo, unsurprisingly, tops most of the rankings, but what catches my eye is that two Hokuriku prefectures, Ishikawa and Toyama, rank among the top 5 for infection rate. Neither stands out in the absolute number of infected people (300 in Ishikawa Prefecture and 228 in Toyama Prefecture), but the number relative to the prefecture's population is high.

In Ishikawa Prefecture, restaurant and accommodation workers make up 5.8% of all workers in the prefecture, tying it with Nagano Prefecture for fifth highest in Japan. Using this variable stretches the interpretation a little, but the idea is that where tourism-related economic activity such as lodging and the dining that goes with it is brisk, there are many opportunities for person-to-person contact. Seen that way, Okinawa Prefecture's prominence makes sense. Hokkaido also had a large number of infected people, and there were many media reports about the impact of tourism; indeed, its ratio of foreign guests to total overnight guests for the year is 25.3%, which puts it 4th, right behind Kyoto Prefecture.
Now, let's do multiple regression analysis.
#Extract only the explanatory variable and store it in the variable X
X = df.loc[:, 'Densely Inhabited District Population Ratio':'Single household ratio']
#Extract only the objective variable and store it in variable Y
Y = df["Infected person rate per 100,000 population"]
#Instantiate a linear model
model = linear_model.LinearRegression()
#Pass data to generate model
model.fit(X,Y)
#Get the value of the coefficient and store it in the variable coefficient
coefficient = model.coef_
#Convert to a data frame with a column name and index labels
#(index=X.columns gives a flat index; wrapping it in a list would create a MultiIndex)
df_coefficient = pd.DataFrame(coefficient, columns=["Partial regression coefficient"], index=X.columns)
df_coefficient
A partial regression coefficient represents the **magnitude of the effect of each explanatory variable on the objective variable**: how much the objective variable changes when that variable increases by one unit while the others are held fixed. The "restaurant / accommodation business employment ratio" is by far the largest, followed by the "travel behavior rate" and the "day / night population ratio." Stays involving accommodation are mostly for either sightseeing or business. In other words, a prefecture where many people arrive from outside and stay for some time, and where a large share of workers is therefore employed in restaurants and lodging, tends to show a high infection rate. Likewise, the more people move in and out of the prefecture for travel, work, or school, the more infected people there will be, so, in short, "restricting movement" looks effective in preventing infection.
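Raw partial regression coefficients depend on the unit and spread of each variable, so ranking them by size can mislead. A common complement, which the analysis above does not compute, is the standardized partial regression coefficient: z-score every column first, then fit. The sketch below uses synthetic data; the names `x_small`, `x_large`, `X_demo` and `y_demo` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic data, invented for illustration: x_small varies little,
# x_large varies a lot, and both contribute to y.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "x_small": rng.normal(0.0, 1.0, n),    # standard deviation about 1
    "x_large": rng.normal(0.0, 100.0, n),  # standard deviation about 100
})
y_demo = 2.0 * X_demo["x_small"] + 0.1 * X_demo["x_large"] + rng.normal(0.0, 1.0, n)

# Raw coefficients: effect of a one-unit change in each variable
raw_fit = linear_model.LinearRegression().fit(X_demo, y_demo)

# Standardized coefficients: z-score X and y before fitting,
# so each coefficient is the effect of a one-standard-deviation change
X_std = (X_demo - X_demo.mean()) / X_demo.std()
y_std = (y_demo - y_demo.mean()) / y_demo.std()
standardized_fit = linear_model.LinearRegression().fit(X_std, y_std)

comparison = pd.DataFrame(
    {"raw": raw_fit.coef_, "standardized": standardized_fit.coef_},
    index=X_demo.columns,
)
print(comparison)
```

In this toy case `x_small` has the larger raw coefficient while `x_large` has the larger standardized one, which shows how the ranking can flip once differences in spread are taken into account.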
#Get intercept
model.intercept_
The intercept (the point where the fitted equation crosses the Y-axis) is obtained from the `intercept_` attribute, which completes the multiple regression equation.
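To see how the intercept and the coefficients combine into one equation, here is a small sketch on made-up data (three invented variables rather than the seven used above; `X_demo`, `y_demo` and `model_demo` are hypothetical names). It writes the prediction out by hand and confirms that it matches `predict`.

```python
import numpy as np
from sklearn import linear_model

# Made-up data: 47 rows and 3 invented explanatory variables
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(47, 3))
y_demo = 5.0 + X_demo @ np.array([0.3, -0.2, 0.1]) + rng.normal(0, 1, 47)

model_demo = linear_model.LinearRegression().fit(X_demo, y_demo)

# y_hat = b0 + b1*x1 + b2*x2 + b3*x3, written out by hand
manual = model_demo.intercept_ + X_demo @ model_demo.coef_

# The hand-written equation gives the same numbers as the library's prediction
assert np.allclose(manual, model_demo.predict(X_demo))

# Print the fitted equation in readable form
terms = " + ".join(f"({b:.3f})*x{i}" for i, b in enumerate(model_demo.coef_, start=1))
equation = f"y = {model_demo.intercept_:.3f} + {terms}"
print(equation)
```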
#Get the coefficient of determination
model.score(X, Y)
Finally, compute the coefficient of determination $R^2$ with the `score` method to check the "goodness of fit" of the multiple regression equation. In other words, how much of the variation in the actual infection rates does this multiple regression equation explain?
To be honest, I had hoped it would exceed 0.8. I put the explanatory variables together rather quickly, so their composition needs to be reconsidered.
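For reference, here is a sketch of what `score` computes, again on made-up data shaped like this dataset (47 rows, 7 variables; all names and values are invented). It also shows the adjusted $R^2$, which I did not compute above but which is a common companion figure when there are many explanatory variables and few rows.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(2)
n_rows, n_vars = 47, 7                      # 47 prefectures, 7 explanatory variables
X_demo = rng.normal(size=(n_rows, n_vars))  # made-up explanatory variables
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=n_rows)

model_demo = linear_model.LinearRegression().fit(X_demo, y_demo)
y_hat = model_demo.predict(X_demo)

ss_res = np.sum((y_demo - y_hat) ** 2)          # variation left unexplained
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation in y
r2 = 1 - ss_res / ss_tot                        # same value as model_demo.score(...)

# Adjusted R^2 penalizes the number of explanatory variables
adj_r2 = 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_vars - 1)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Because plain $R^2$ never decreases when a variable is added, the adjusted value is the safer one to watch while revising the set of explanatory variables.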