[Statistical test 2nd grade / quasi 1st grade] Regression analysis training in Python (1)

Introduction

After taking the Statistical Test Level 2 in November 2019, I started studying with the intention of aiming for Level 1 as well. From what I hear, it seems necessary to hold down multivariate analysis in order to take quasi-first grade. So, I will read the reference book and learn multivariate analysis, but I think that it is best to move my hands to deepen my understanding, so I will try various analysis methods using Python. I will decide.

However, the title was set to [Statistical test 2nd grade / quasi 1st grade] because it is a little involved in the problem of reading the analysis result of the statistical processing software (R), which I had somehow passed through in the 2nd grade study. There are questions such as "the number of independent variables + 1" reduces the degree of freedom of the regression coefficient of multiple regression analysis, and "the significance level is xx% and the coefficient is not 0". </ del> It is related to the content to be introduced.

In addition, I will not use statistical libraries here, only libraries such as numpy and pandas (because using it in a black box does not help understanding ...).

Reference book

Katsuya Hasegawa, "Multivariate Analysis from Zero", Gijutsu-Hyoronsha (2004) Chapter 2

Subject

This time, we will process and use the data of "Usage status by route" of JR East. The following comma-separated data was created by formatting the route name and operating kilometers, average number of passengers (FY2018), and passenger transportation revenue (FY2018) in the PDF file. (The "Ofunato Line", "Kesennuma Line", and "Yamada Line" with incomplete data are excluded.)

jreast2018.csv


Line name,Business km,Average number of passers,Passenger transportation revenue
Yamanote Line,20.6,1134963,111014
Saikyo Line,5.5,752645,15032
Tokaido Main Line,169.2,367765,238493
Yokohama Line,42.6,231706,38214
Sobu Main Line,145.4,207099,117700
Keiyo Line,54.3,181483,38201
Negishi Line,22.1,177099,16618
Nambu Line,45,163148,29946
Chuo Main Line,247.8,160328,170577
Takasaki Line,74.7,116646,34095
Musashino Line,105.5,114847,43177
Tohoku Line,572,84567,183688
Joban Line,351,71035,100757
Ome Line,37.2,63151,8989
Yokosuka Line,23.9,62379,6107
Kawagoe Line,30.6,56191,6629
Sotobo Line,93.3,34851,12038
Sagami Line,33.3,29643,4372
Itsukaichi Line,11.1,24712,1025
Senseki Line,49,20497,4466
Uchibo Line,119.4,20483,8544
Ito Line,16.9,16907,1849
Hakushin Line,27.3,15978,1686
Narita Line,119.1,15183,8901
Tsurumi Line,9.7,14133,569
Shinonoi Line,66.7,12465,4717
Ryomo Line,84.4,11346,3176
Hachiko Line,92,9338,3612
Shinetsu Main Line,175.3,9233,6260
Senzan Line,58,9046,2176
Togane Line,13.8,8075,373
Tazawako Line,75.6,7023,4421
Mito Line,50.2,7011,1255
Echigo Line,83.8,6021,1862
Nikko,40.5,5806,863
Joetsu Line,164.4,5389,3464
Ou Line,484.5,4983,14111
Aterazawa Line,24.3,3342,244
Oito Line,70.1,3140,937
Yahiko Line,17.4,2338,150
Agatsuma Line,55.3,2327,615
Uetsu Main Line,271.7,2194,2731
Oga Line,26.4,1877,146
Ban-Etsusai Line,175.6,1745,1198
Suigun Line,147,1666,760
Karasuyama Line,20.4,1457,83
Ban-Etsuto Line,85.6,1385,371
Ishinomaki,44.7,1255,190
Kashima Line,17.4,1221,87
Koumi Line,78.9,1194,420
Kururi Line,32.2,1094,96
Rikuuto Line,94.1,906,296
Hachinohe Line,64.9,883,269
Kamaishi Line,90.2,764,308
Gono Line,147.2,631,352
Iiyama Line,96.7,598,199
Ominato Line,58.4,578,168
Tsugaru Line,55.8,464,112
Hanawa Line,106.9,382,124
Yonesaka Line,90.7,379,105
Rikuusai Line,43,345,67
Kitakami Line,61.1,311,70
Tadami Line,135.2,280,140

Save this file as ** jreast2018.csv with ** character code UTF-8.

Practice

Below, we have confirmed the operation with Python 3.6.8.

Data entry

First of all, you have to import the data. Import CSV format data using pandas.

import numpy as np
import pandas as pd
df = pd.read_csv("jreast2018.csv")
print(df) #Check the contents of the imported data
Line name Business km Average number of passengers Passenger transportation revenue
0 Yamanote Line 20.6  1134963  111014
1 Saikyo Line 5.5   752645   15032
2 Tokaido Main Line 169.2   367765  238493
3 Yokohama Line 42.6   231706   38214
4 Sobu Main Line 145.4   207099  117700
..    ...    ...      ...     ...
58 Hanawa Line 106.9      382     124
59 Yonesaka Line 90.7      379     105
60 Rikuusai Line 43.0      345      67
61 Kitakami Line 61.1      311      70
62 Tadami Line 135.2      280     140

Find the coefficient of the regression line (Section 2-1)

First, let the independent variable (explanatory variable) $ x $ be the "average number of passers-by" and the dependent variable (objective variable) $ y $ be the "passenger transportation revenue", and draw a straight line $ y = ax + b $ that expresses the relationship between the two. I think about pulling. When drawing a straight line, consider a straight line that minimizes the sum of squares of the residuals of each data point (the difference between the observed value (realized value) of the dependent variable and the theoretical value (predicted value)). It's the so-called "least squares method". For each data point $ (x_1, y_1), ..., (x_n, y_n) $, if the observed value of the dependent variable is $ y_i $ and the theoretical value is $ \ hat {y} _i = ax_i + b $ The sum of squares of the residuals $ L $ is

\begin{align}
L &= \sum_{i=1}^n (y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n (y_i - ax_i - b)^2
\end{align}

It will be. When $ L $ takes the minimum value, the partial differential coefficient is 0 for $ a and b $ respectively.

\begin{align}
\frac{\partial L}{\partial a} &= \sum_{i=1}^n 2(y_i - ax_i - b)(-x_i) = 0 \\
\frac{\partial L}{\partial b} &= \sum_{i=1}^n 2(y_i - ax_i - b)(-1) = 0 \\
\end{align}

If you organize this for $ a and b $

\begin{align}
\left(\sum_{i=1}^n x_i^2\right) a + \left( \sum_{i=1}^n x_i \right) b &= \sum_{i=1}^n x_i y_i \\
\left( \sum_{i=1}^n x_i \right) a + nb &= \sum_{i=1}^n y_i \\
\end{align}

When solved as a system of equations for $ a and b $

\begin{align}
a  &= \frac{n\sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n \sum_{i=1}^n x_i^2 - \left( \sum_{i=1}^n x_i \right)^2} \\
 b &=  \frac{ \sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i - \sum_{i=1}^n x_i \sum_{i=1}^n x_i y_i}{n \sum_{i=1}^n x_i^2 - \left( \sum_{i=1}^n x_i \right)^2} \\
\end{align}

It can be obtained. It's hard to calculate these $ a and b $ by hand (even if you use a calculator), but you can easily calculate them using a program.

n = len(df)
X = df["Average number of passers"]
y = df["Passenger transportation revenue"]
a = (n * (x*y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum() ** 2)
b = ((x**2).sum() * y.sum() - x.sum() * (x*y).sum()) / (n * (x**2).sum() - x.sum() ** 2)
print(a) # Output: 0.12665116273608049
print(b) # Output: 11411.585376160469

From this, it can be seen that the annual passenger transportation revenue tends to increase by about 0.127 million yen (= 127,000 yen) for each increase in the average number of passengers passing by. Let's draw a graph as well.

import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(x, y, color="tab:blue")
ax.plot(x, a*x + b, color="tab:orange")
fig.savefig("output.png ")

output.png

Well, I don't feel like I've analyzed it very well. As far as the graph is seen, the data does not seem to be very linear. The point on the lower right where the x-axis value is about 750,000 is the "Saikyo Line". If you look closely at the original data, the average number of passengers passing through is large, but the operating kilometers are as short as 5.5km [^ saikyo], so passenger transportation revenue is small.

[^ saikyo]: The "Saikyo Line" here does not refer to the "Saikyo Line" (Osaki-Omiya) in the passenger information, but only to the section (Ikebukuro-Akabane) whose official route name is "Akabane Line". Is shown in the footnote (* C) on the source page.

By the way, what was the "average number of people passing by"? (From "Usage Status by Route")

○ "Average number of passing passengers" represents the number of passengers per 1km per day, and is calculated by the following calculation based on the "Railway Business Performance Report" that we report to the Ministry of Land, Infrastructure, Transport and Tourism every year. I am. [Average number of passengers passing through] = [Passenger transporter kilometers within the year for each route * A] ÷ [Business kilometers within the year for the relevant route] ÷ [Business days within the year]

Since it is "passenger transporter kilometers" (number of users x distance) that actually leads to income, "average number of passers-by x business kilometers x number of business days" (total usage distance throughout the year) is used as an independent variable. Is considered more appropriate. There are 365 days in 2018, so let's calculate the number of business days as 365.

x = df["Average number of passers"] * df["Business km"] * 365
y = df["Passenger transportation revenue"]
n = len(df)
a = (n * (x*y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum() ** 2)
b = ((x**2).sum() * y.sum() - x.sum() * (x*y).sum()) / (n * (x**2).sum() - x.sum() ** 2)
print(a) # Output: 1.082862145743171e-05
print(b) # Output: 331.32038547946183
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(x, y, color="tab:blue")
ax.plot(x, a*x + b, color="tab:orange")
fig.savefig("output.png ")

output.png

it is a good feeling. Assuming that the annual passenger kilometer is $ x $ and the annual passenger transportation revenue (million yen) is $ y $, the theoretical value of $ y $ $ \ hat {y} $ can be calculated as follows.

\hat{y} = (1.083 \times 10^{-5}) x + 331.3

It is calculated that the income will increase by about 10.83 yen every time the railroad is used for 1km. JR (3 companies in Honshu)'s regular fare to kilometer (fare per 1km) is calculated at 16.20 yen (mutual departure and arrival within the main line, if it is 300km or less, excluding tax), but it is regular [^ fare] If you buy a ticket, the fare will be discounted, so the income you will earn from using 1km will be less than 16.20 yen. In addition, there are actually various fare calculation patterns such as short distances of 10km or less, long distances of over 300km, and local transportation lines with few users, and the data does not go straight.

[^ fare]: [JR East: Passenger Business Rules> Volume 2 Passenger Business-Chapter 3 Passenger Fares / Fares-Section 2 Regular Passenger Fares](https://www.jreast.co.jp/ryokaku/02_hen / 03_syo / 02_setsu / index.html)

Find the coefficient of determination (Section 2-2)

Now, let's find the coefficient of determination $ R ^ 2 $ to see how well this regression equation fits. As the fluctuation that the observed value $ y_i $ has

--Total variation: Fluctuation from the mean of all data points (sum of squared deviations) --Regression variation: Fluctuation from the theoretical value obtained by the regression equation, that is, variation explained by the regression equation (sum of squares of deviation) --Residual fluctuation: Fluctuation not explained by regression equation (sum of squares of deviation)

Considering the sum of all data points, (total variation) = (regression variation) + (residual variation) holds. Here, the coefficient of determination $ R ^ 2 $ is the ratio of regression variation to total variation. The larger the coefficient of determination (closer to 1), the better the regression equation fits the data.

TSS = ((y - y.mean()) ** 2).sum()     #Total variation
ESS = ((a*x+b - y.mean()) ** 2).sum() #Regression variation
RSS = ((y - a*x+b) ** 2).sum()        #Residual fluctuation
R_2 = ESS / TSS                       #Coefficient of determination
print(TSS) # 139066190955.65082
print(ESS) # 138362318781.54144
print(RSS) #    731535019.9636117
print(R_2) # 0.9949385816259694

The coefficient of determination is 0.995, which shows that it fits well. I don't think it's that strange because the original fare calculation is based on a method that is proportional to distance. It should be (total variation) = (regression variation) + (residual variation), but it seems that there is an error due to information loss because data with different numbers of digits was handled in a mess.

The coefficient of determination obtained here is actually equal to the square of the correlation coefficient of the independent variable and the dependent variable. Let's try to find and confirm the (sample) correlation coefficient.

cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1) #Specimen covariance
var_x  = ((x - x.mean()) ** 2).sum() / (n - 1)             #Independent variable sample variance
var_y  = ((y - y.mean()) ** 2).sum() / (n - 1)             #Sample variance of the dependent variable
corr   = cov_xy / np.sqrt(var_x * var_y)                   #Correlation coefficient
print(corr ** 2) # 0.994938581625969

Certainly, it seems to match the coefficient of determination obtained earlier (it is not exactly equal due to calculation error).

Because it has become long

Next time. Now let's consider the case where there are two or more independent variables (multivariate), and then look at how to handle the regression equation statistically. Next → [Statistical Test Level 2 / Level 1] Regression Analysis Practice in Python (2) --Qiita

Recommended Posts

[Statistical test 2nd grade / quasi 1st grade] Regression analysis training in Python (2)
[Statistical test 2nd grade / quasi 1st grade] Regression analysis training in Python (1)
Regression analysis in Python
Statistical test grade 2 probability distribution learned in Python ②
Statistical test grade 2 probability distribution learned in Python ①
Simple regression analysis in Python
First simple regression analysis in Python
Statistical test (multiple test) in Python: scikit_posthocs
[Statistical test 2nd grade] Discrete probability distribution
2. Multivariate analysis spelled out in Python 7-3. Decision tree [regression tree]
2. Multivariate analysis spelled out in Python 2-1. Multiple regression analysis (scikit-learn)
2. Multivariate analysis spelled out in Python 1-2. Simple regression analysis (algorithm)
Association analysis in Python
2. Multivariate analysis spelled out in Python 5-3. Logistic regression analysis (stats models)
Algorithm in Python (primality test)
Python Statistical Techniques-Statistical Analysis Against Python-
Axisymmetric stress analysis in Python
Online linear regression in Python
Set python test in jenkins
2. Multivariate analysis spelled out in Python 6-2. Ridge regression / Lasso regression (scikit-learn) [Ridge regression vs. Lasso regression]
2. Multivariate analysis spelled out in Python 2-3. Multiple regression analysis [COVID-19 infection rate]
EEG analysis in Python: Python MNE tutorial
Write selenium test code in python
Planar skeleton analysis in Python (2) Hotfix
Simple regression analysis implementation in Keras
Logistic regression analysis Self-made with python
2. Multivariate analysis spelled out in Python 6-1. Ridge regression / Lasso regression (scikit-learn) [multiple regression vs. ridge regression]
2. Multivariate analysis spelled out in Python 8-2. K-nearest neighbor method [Weighting method] [Regression model]