**What is regression analysis?**

A method for measuring how much an explanatory variable x (the cause) affects the objective variable y (the result). Simple regression analysis is used when there is only one explanatory variable, and multiple regression analysis is used when there are two or more.
**Theoretical model of the regression equation**

y = α + βx + u

Objective variable = intercept + slope × explanatory variable + error term
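As a rough illustration of how α and β can be estimated from data, the sketch below applies the standard least-squares formulas for a single explanatory variable (β = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², α = ȳ − βx̄). The data and the function name `fit_simple_ols` are made up for illustration and are not taken from the data used in this article.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (alpha, beta) that minimize the squared error of y = alpha + beta * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    beta = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    alpha = y_mean - beta * x_mean
    return alpha, beta

# Made-up data generated exactly from y = 2 + 3x (no error term)
x_demo = [0, 1, 2, 3, 4]
y_demo = [2, 5, 8, 11, 14]
alpha_hat, beta_hat = fit_simple_ols(x_demo, y_demo)
print(alpha_hat, beta_hat)
```

With real data the error term u is not zero, so the estimates only approximate the underlying relationship.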
Simple regression analysis can be done in Excel, but this time I tried it in Python for practice. (I wrote this after checking the reference materials and receiving guidance from a university professor, but there may still be mistakes. I would appreciate it if you could point them out :pray:)
**What I want to verify**

This time, I examine how much an increase or decrease in the number of direct flights from China, South Korea, Taiwan, and Hong Kong affects the number of visitors to Japan. The objective variable is the number of visitors to Japan from Asian countries, and the only explanatory variable is the number of direct flights from those countries. Exchange rates, natural disasters, public safety, and other factors are also likely to raise or lower the number of visitors, so multiple regression analysis would be more suitable for this question. I would like to try that next time.
**Data to use**

- Ministry of Land, Infrastructure, Transport and Tourism, Japan Tourism Agency, "Accommodation Travel Statistics Survey", 2015-2018 (http://www.mlit.go.jp/kankocho/siryou/toukei/shukuhakutoukei.html)
- Ministry of Land, Infrastructure, Transport and Tourism, "International Flight Status", 2015-2018 summer and winter timetables (https://www.mlit.go.jp/koku/koku_fr19_000005.html)
I combined the two datasets above into the following Excel sheet. It summarizes, by prefecture, the number of direct flights from each Asian country and the number of visitors to Japan. 0 is entered for prefectures that have no direct flights or no airport in the first place.
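I did this step by hand in Excel, but the same combination could be sketched in pandas roughly as follows. The `Prefecture` key column and all values are hypothetical; only the column names `Korea` and `Number of visitors to Japan` match the sheet used in this article.

```python
import pandas as pd

# Hypothetical excerpts of the two sources (all values are made up)
flights = pd.DataFrame({
    "Prefecture": ["Hokkaido", "Osaka"],
    "Korea": [30, 120],
})
visitors = pd.DataFrame({
    "Prefecture": ["Hokkaido", "Osaka", "Nara"],
    "Number of visitors to Japan": [50000, 200000, 8000],
})

# Keep every prefecture from the visitor table. Prefectures without
# direct flights (or without an airport) have no match and get 0.
merged = visitors.merge(flights, on="Prefecture", how="left")
merged["Korea"] = merged["Korea"].fillna(0).astype(int)
print(merged)
```

A left join is used here so that prefectures without flight data stay in the table instead of being dropped.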
Use pandas to read the data and store it in a DataFrame. Put the number of direct flights in x and the number of visitors to Japan in y.
linear-regression.py
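```python
import pandas as pd
# Recent pandas versions do not accept an `encoding` argument in read_excel, so it is omitted
df = pd.read_excel('2016_summer_original.xlsx', sheet_name='Sheet2')
# x: direct flights from one country (here Korea), y: number of visitors to Japan
x = df[['Korea']]
y = df[['Number of visitors to Japan']]
```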
Perform the simple regression analysis with scikit-learn and plot the result with matplotlib.
linear-regression.py
```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit the simple regression model
model_lr = LinearRegression()
model_lr.fit(x, y)

# Plot the observations as dots and the fitted regression line
plt.plot(x, y, 'o')
plt.plot(x, model_lr.predict(x), linestyle="solid")
plt.show()
```
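The fitted slope and intercept can also be read directly from the scikit-learn model through its `coef_` and `intercept_` attributes. The following is a minimal sketch on made-up data, not the article's Excel sheet; the names `x_demo`, `y_demo`, and `model_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 100 + 50x
x_demo = np.array([[0], [1], [2], [3]])  # 2-D: (n_samples, n_features)
y_demo = np.array([100, 150, 200, 250])  # 1-D target

model_demo = LinearRegression().fit(x_demo, y_demo)
slope = model_demo.coef_[0]        # estimated beta
intercept = model_demo.intercept_  # estimated alpha
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

Note that when y is passed as a one-column DataFrame, as in `y = df[['Number of visitors to Japan']]`, `coef_` has shape (1, 1) and `intercept_` has shape (1,), so the values would be read as `model_lr.coef_[0][0]` and `model_lr.intercept_[0]`.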
Also print the summary statistics of the regression with statsmodels.
linear-regression.py
```python
import statsmodels.api as sm

# Add an intercept column, fit OLS, and show the summary statistics
x_add_const = sm.add_constant(x)
model_sm = sm.OLS(y, x_add_const).fit()
print(model_sm.summary())
```
The following is the execution result. It finished in an instant :laughing:
Let's compare the results of China and Hong Kong in the summer of 2015.
- China model: y = 942.76x + 21142.86, P>|t|: 0.000, R-squared: 0.405
- Hong Kong model: y = 961.33x + 4053.08, P>|t|: 0.000, R-squared: 0.654
At a minimum, check the significance of the result and the explanatory power of the equation by looking at the P value and R² in the summary statistics.
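The P value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis (the opposite of what you want to claim; here, that the slope is zero) is true. If it is below 5%, the null hypothesis is rejected and the result is considered statistically significant.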
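To make the P value a little more concrete, the sketch below computes the two-sided P value of the slope by hand from the t statistic (slope divided by its standard error, with n − 2 degrees of freedom). It assumes SciPy is available; the data and the function name `slope_p_value` are made up for illustration.

```python
import numpy as np
from scipy import stats

def slope_p_value(x, y):
    """Two-sided P value for the null hypothesis 'slope = 0' in simple regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    beta = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    alpha = y.mean() - beta * x.mean()
    residuals = y - (alpha + beta * x)
    # Standard error of the slope, using n - 2 degrees of freedom
    se_beta = np.sqrt(np.sum(residuals ** 2) / (n - 2) / sxx)
    t_stat = beta / se_beta
    # Probability of a |t| at least this large if the true slope were zero
    return 2 * stats.t.sf(abs(t_stat), df=n - 2)

# Made-up data that is close to a straight line
x_demo = [1, 2, 3, 4, 5, 6]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
p_value = slope_p_value(x_demo, y_demo)
print(p_value < 0.05)
```

This should correspond to the `P>|t|` column that statsmodels prints for the explanatory variable.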
The coefficient of determination R² is an index that measures how well the estimated regression line fits the observed data. The closer the value is to 1, the better the fit. In the figure above, the closer the blue dots are to the orange line, the better the fit.
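As a small sketch of what R² measures, it can be computed by hand as 1 − (residual sum of squares) / (total sum of squares). The numbers and the function name `r_squared` below are made up for illustration.

```python
import numpy as np

def r_squared(y_observed, y_predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ss_res = np.sum((y_observed - y_predicted) ** 2)        # unexplained variation
    ss_tot = np.sum((y_observed - y_observed.mean()) ** 2)  # total variation
    return 1 - ss_res / ss_tot

y_obs = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r_squared(y_obs, [1.0, 2.0, 3.0, 4.0])  # every point on the line
r2_mean = r_squared(y_obs, [2.5, 2.5, 2.5, 2.5])     # always predict the mean
print(r2_perfect, r2_mean)
```

Predictions that hit every observation give 1, and predictions that are no better than the mean of y give 0.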
According to the regression model, the number of visitors from China appears to increase by about 943 for each additional direct flight. The result is significant because the P value is shown as 0.000, but the explanatory power of the equation is low (R² = 0.405). For Hong Kong, on the other hand, the number of visitors increases by about 961 for each additional flight, the result is also significant, and the equation has higher explanatory power (R² = 0.654).
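To see how these equations would be used, the sketch below plugs a hypothetical number of direct flights into the two fitted lines reported above. The value of 10 flights and the function name `predict_visitors` are assumptions for illustration, and the output is only what the fitted lines imply, not an observed figure.

```python
def predict_visitors(flights, slope, intercept):
    """Evaluate the fitted line y = slope * x + intercept."""
    return slope * flights + intercept

# Coefficients reported above for summer 2015
china = {"slope": 942.76, "intercept": 21142.86}
hong_kong = {"slope": 961.33, "intercept": 4053.08}

flights = 10  # hypothetical number of direct flights
china_pred = predict_visitors(flights, **china)
hong_kong_pred = predict_visitors(flights, **hong_kong)
print(round(china_pred, 2), round(hong_kong_pred, 2))
```

Because R² is 0.405 for China, a prediction from that line should be treated with more caution than one from the Hong Kong line.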
Since the published direct flight data covers only 2015 to 2018, the analysis is limited to that period, and since the data is not monthly, continuous changes cannot be analyzed, which was a pity. I left it out of the article because it is not the main topic, but this time collecting and preprocessing the data was harder than the analysis itself :sweat_smile: Next time, I would like to try multiple regression analysis.
- I tried to explain how to analyze data with Python for beginners
- [#1 How to perform simple regression analysis with Scikit-learn](https://medium.com/@yamasaKit/scikit-learn%E3%81%A7%E5%8D%98%E5%9B%9E%E5%B8%B0%E5%88%86%E6%9E%90%E3%82%92%E8%A1%8C%E3%81%86%E6%96%B9%E6%B3%95-f6baa2cb761e)
- How to read the results of simple regression analysis [Excel data analysis tool] [Regression analysis series 2] (video)
- [Shinichi Kurihara and Atsushi Maruyama, "Statistics Picture Book", Ohmsha](https://www.amazon.co.jp/%E7%B5%B1%E8%A8%88%E5%AD%A6%E5%9B%B3%E9%91%91-%E6%A0%97%E5%8E%9F-%E4%BC%B8%E4%B8%80/dp/427422080X)