Last night, I summarized the basics of scientific computing, data processing, and the graph-drawing libraries from [Introduction to Data Scientists]. From tonight, I will finally use them to get into the main subject: descriptive statistics and simple regression analysis, supplementing the explanations in the book. 【Caution】 I will read ["Data Scientist Training Course at the University of Tokyo"](https://www.amazon.co.jp/dp/4839965250) and summarize the parts that I have doubts about or find useful. The summary will therefore be rather plain, but please read it as notes that stand on their own, independent of the book.
Statistical analysis methods are divided into descriptive statistics and inferential statistics.
"Descriptive statistics is a method for grasping the characteristics of the collected data and organizing and presenting them in an easy-to-understand way: you compute summary values such as the mean and standard deviation, classify the data, and express them with figures and graphs."
"The idea of inferential statistics is to perform a precise analysis using a model based on a probability distribution, starting from only partial data, and to infer the whole in order to obtain statistics." "It is also used to predict the future from historical data. This chapter describes simple regression analysis, which is the basis of inferential statistics. More complex inferential statistics will be dealt with in the next four chapters."
import numpy as np
import scipy as sp
import pandas as pd
from pandas import Series, DataFrame
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
from sklearn import linear_model
$ sudo pip3 install scikit-learn
As shown below, it also seems to work on a Raspberry Pi 4.
$ python3
Python 3.7.3 (default, Jul 25 2020, 13:03:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn import linear_model
>>>
Incidentally, python3-sklearn-doc was not found, but on Debian/Ubuntu the library can also be installed with apt:
$ sudo apt-get install python3-sklearn python3-sklearn-lib
...(abridged) 3-2-1-5 Download the data student.zip from the following site with the program below. https://archive.ics.uci.edu/ml/machine-learning-databases/00356/student.zip
import io
import requests, zipfile

# download the zip archive and extract it into the current directory
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00356/student.zip'
r = requests.get(url, stream=True)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()
The following four files were extracted: student.txt, student-mat.csv, student-merge.R, student-por.csv
Following the imports above, execute the following:
student_data_math = pd.read_csv('./chap3/student-mat.csv')
print(student_data_math.head())
Printing the data, you can see that the delimiter is ';':
school;sex;age;address;famsize;Pstatus;Medu;Fedu;Mjob;Fjob;reason;guardian;traveltime;studytime;failures;schoolsup;famsup;paid;activities;nursery;higher;internet;romantic;famrel;freetime;goout;Dalc;Walc;health;absences;G1;G2;G3
0 GP;"F";18;"U";"GT3";"A";4;4;"at_home";"teacher...
1 GP;"F";17;"U";"GT3";"T";1;1;"at_home";"other";...
2 GP;"F";15;"U";"LE3";"T";1;1;"at_home";"other";...
3 GP;"F";15;"U";"GT3";"T";4;2;"health";"services...
4 GP;"F";16;"U";"GT3";"T";3;3;"other";"other";"h...
Specify the separator as ';' and reload.
student_data_math = pd.read_csv('./chap3/student-mat.csv', sep =';')
print(student_data_math.head())
Now it is parsed cleanly.
school sex age address famsize Pstatus Medu Fedu ... goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 ... 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 ... 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 ... 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 ... 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 ... 2 1 2 5 4 6 10 10
[5 rows x 33 columns]
print(student_data_math.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 395 non-null object
1 sex 395 non-null object
2 age 395 non-null int64
3 address 395 non-null object
4 famsize 395 non-null object
5 Pstatus 395 non-null object
6 Medu 395 non-null int64
7 Fedu 395 non-null int64
8 Mjob 395 non-null object
9 Fjob 395 non-null object
10 reason 395 non-null object
11 guardian 395 non-null object
12 traveltime 395 non-null int64
13 studytime 395 non-null int64
14 failures 395 non-null int64
15 schoolsup 395 non-null object
16 famsup 395 non-null object
17 paid 395 non-null object
18 activities 395 non-null object
19 nursery 395 non-null object
20 higher 395 non-null object
21 internet 395 non-null object
22 romantic 395 non-null object
23 famrel 395 non-null int64
24 freetime 395 non-null int64
25 goout 395 non-null int64
26 Dalc 395 non-null int64
27 Walc 395 non-null int64
28 health 395 non-null int64
29 absences 395 non-null int64
30 G1 395 non-null int64
31 G2 395 non-null int64
32 G3 395 non-null int64
dtypes: int64(16), object(17)
memory usage: 102.0+ KB
Looking at student.txt with cat, the data has the following attributes.
$ cat student.txt
# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
2 sex - student's sex (binary: "F" - female or "M" - male)
3 age - student's age (numeric: from 15 to 22)
4 address - student's home address type (binary: "U" - urban or "R" - rural)
5 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
6 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
10 Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
11 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
12 guardian - student's guardian (nominal: "mother", "father" or "other")
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
# these grades are related with the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
31 G2 - second period grade (numeric: from 0 to 20)
32 G3 - final grade (numeric: from 0 to 20, output target)
Additional note: there are several (382) students that belong to both datasets .
These students can be identified by searching for identical attributes
that characterize each student, as shown in the annexed R file.
・Quantitative data: continuous values to which the four arithmetic operations can be applied and for which ratios are meaningful. Examples: number of people, amount of money.
・Qualitative data: discrete data to which the four arithmetic operations cannot be applied, used to express a state. Examples: rankings, categories.
Gender is qualitative data
print(student_data_math['sex'].head())
0 F
1 F
2 F
3 F
4 F
Name: sex, dtype: object
The number of absences is quantitative data:
print(student_data_math['absences'].head())
0 6
1 4
2 10
3 2
4 4
Name: absences, dtype: int64
print(student_data_math.groupby('sex')['age'].mean())
sex
F 16.730769
M 16.657754
Name: age, dtype: float64
Women study more (longer weekly study time):
print(student_data_math.groupby('sex')['studytime'].mean())
sex
F 2.278846
M 1.764706
Name: studytime, dtype: float64
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
y1 = student_data_math['absences']
ax1.hist(y1, bins = 10, range =(0.0,max(y1)))
ax1.set_ylabel('count')
ax1.set_xlabel('absences')
plt.grid(True)
plt.show()
print('Mean {}'.format(student_data_math['absences'].mean()))
print('Median {}'.format(student_data_math['absences'].median()))
print('Mode {}'.format(student_data_math['absences'].mode()))
Mean 5.708860759493671
Median 4.0
Mode 0 0
dtype: int64
Enlarge the figure above and verify these values on the plot.
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
y1 = student_data_math['absences']
ax1.hist(y1, bins = 30, range =(0.0,30)) #,max(y1)
x0 = student_data_math['absences'].mean()
ax1.plot(x0+0.5, 70, 'red', marker = 'o',markersize=10,label ='mean')
x0 = student_data_math['absences'].median()
ax1.plot(x0+0.5, 70, 'blue', marker = 'o',markersize=10,label ='median')
x0 = student_data_math['absences'].mode()
ax1.plot(x0+0.5, 70, 'black', marker = 'o',markersize=10,label ='mode')
ax1.legend()
ax1.set_ylabel('count')
ax1.set_xlabel('absences')
plt.grid(True)
plt.show()
Definition of the variance $\sigma^2$:
\sigma^2 = \frac{1}{n}\Sigma_{i=1}^{n}(x_i-\bar{x})^2
Standard deviation $\sigma$ (std):
\sigma = \sqrt{\frac{1}{n}\Sigma_{i=1}^{n}(x_i-\bar{x})^2}
print('Variance {}'.format(student_data_math['absences'].var(ddof=0)))
print('Standard deviation {}'.format(student_data_math['absences'].std(ddof=0)))
print('Standard deviation {}'.format(np.sqrt(student_data_math['absences'].var())))
Variance 63.887389841371565
Standard deviation 7.99295876640006
Standard deviation 8.00309568710818
The two standard deviations differ because var() and std() default to ddof=1 (the unbiased sample estimate, dividing by n-1), whereas ddof=0 gives the population value of the definition above (dividing by n). Plot the mean ± standard deviation.
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
y1 = student_data_math['absences']
ax1.hist(y1, bins = 30, range =(0.0,30)) #,max(y1)
x0 = student_data_math['absences'].mean()
ax1.plot(x0+0.5, 70, 'red', marker = 'o',markersize=10,label ='mean')
x1 = student_data_math['absences'].std(ddof=0)
ax1.plot(x0+x1+0.5, 70, 'blue', marker = 'o',markersize=10,label ='mean+std')
ax1.plot(x0-x1+0.5, 70, 'black', marker = 'o',markersize=10,label ='mean-std')
ax1.legend()
ax1.set_ylabel('count')
ax1.set_xlabel('absences')
plt.grid(True)
plt.show()
A percentile is the value below which a given percentage of the ranked data falls when the whole is taken as 100: the 25th percentile is the first quartile, the 75th percentile is the third quartile, and the 50th percentile is the median.
print('Summary statistics', student_data_math['absences'].describe())
Summary statistic count 395.000000
mean 5.708861
std 8.003096
min 0.000000
25% 0.000000
50% 4.000000
75% 8.000000
max 75.000000
Name: absences, dtype: float64
The 25th percentile is describe()[4], the 75th percentile is describe()[6], and their difference (the interquartile range) is describe()[6] - describe()[4].
print('75-25 percentile', student_data_math['absences'].describe()[6] - student_data_math['absences'].describe()[4])
75-25 Percentile 8.0
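Indexing describe() by position works, but pandas also exposes quantiles directly; a minimal alternative sketch (not from the book) is:
# interquartile range via quantile(); should also give 8.0
q25 = student_data_math['absences'].quantile(0.25)
q75 = student_data_math['absences'].quantile(0.75)
print('75-25 percentile', q75 - q25)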
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
y1 = student_data_math['absences']
ax1.hist(y1, bins = 30, range =(0.0,30)) #,max(y1)
x0 = student_data_math['absences'].median()
ax1.plot(x0+0.5, 70, 'red', marker = 'o',markersize=10,label ='median')
x1 = student_data_math['absences'].describe()[4]
ax1.plot(x1+0.5, 70, 'blue', marker = 'o',markersize=10,label ='25percentile')
x1 = student_data_math['absences'].describe()[6]
ax1.plot(x1+0.5, 70, 'black', marker = 'o',markersize=10,label ='75percentile')
ax1.legend()
ax1.set_ylabel('count')
ax1.set_xlabel('absences')
plt.grid(True)
plt.show()
print('Full column summary statistics', student_data_math.describe())
Full column summary statistics
age Medu Fedu traveltime ... absences G1 G2 G3
count 395.000000 395.000000 395.000000 395.000000 ... 395.000000 395.000000 395.000000 395.000000
mean 16.696203 2.749367 2.521519 1.448101 ... 5.708861 10.908861 10.713924 10.415190
std 1.276043 1.094735 1.088201 0.697505 ... 8.003096 3.319195 3.761505 4.581443
min 15.000000 0.000000 0.000000 1.000000 ... 0.000000 3.000000 0.000000 0.000000
25% 16.000000 2.000000 2.000000 1.000000 ... 0.000000 8.000000 9.000000 8.000000
50% 17.000000 3.000000 2.000000 1.000000 ... 4.000000 11.000000 11.000000 11.000000
75% 18.000000 4.000000 3.000000 2.000000 ... 8.000000 13.000000 13.000000 14.000000
max 22.000000 4.000000 4.000000 4.000000 ... 75.000000 19.000000 19.000000 20.000000
[8 rows x 16 columns]
A box plot expresses (minimum, first quartile, median, third quartile, maximum) with a box and whiskers, as follows.
fig, (ax1,ax2) = plt.subplots(2, 1, figsize=(8,2*4))
y1 = student_data_math['G1']
ax1.hist(y1, bins = 30, range =(0.0,max(y1))) #,max(y1)
x0 = student_data_math['G1'].median()
ax1.plot(x0+0.5, 60, 'red', marker = 'o',markersize=10,label ='median')
x1 = student_data_math['G1'].describe()[4]
ax1.plot(x1+0.5, 60, 'blue', marker = 'o',markersize=10,label ='25percentile')
x1 = student_data_math['G1'].describe()[6]
ax1.plot(x1+0.5, 60, 'black', marker = 'o',markersize=10,label ='75percentile')
ax2.boxplot(y1)
ax2.set_xlabel('G1')
ax2.set_ylabel('count')
ax1.legend()
ax1.set_ylabel('count')
ax1.set_xlabel('G1')
plt.grid(True)
plt.show()
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
y1 = [student_data_math['G1'],student_data_math['G2'],student_data_math['G3'],student_data_math['absences']]
ax1.boxplot(y1,labels=['G1', 'G2', 'G3', 'absences'])
ax1.set_xlabel('category')
ax1.set_ylabel('count')
ax1.legend()
ax1.set_ylabel('count')
ax1.set_xlabel('category')
plt.grid(True)
plt.show()
The coefficient of variation CV is the standard deviation $\sigma$ divided by the mean $\bar{x}$. Because it does not depend on the scale, it lets you compare the degree of dispersion across columns.
# numeric_only=True skips the non-numeric columns (required on newer pandas)
print(student_data_math.std(numeric_only=True) / student_data_math.mean(numeric_only=True))
age 0.076427
Medu 0.398177
Fedu 0.431565
traveltime 0.481668
studytime 0.412313
failures 2.225319
famrel 0.227330
freetime 0.308725
goout 0.358098
Dalc 0.601441
Walc 0.562121
health 0.391147
absences 1.401873
G1 0.304266
G2 0.351086
G3 0.439881
dtype: float64
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
x = student_data_math['G1']
y = student_data_math['G3']
ax1.plot(x,y, 'o')
ax1.set_xlabel('G1-grade')
ax1.set_ylabel('G3-grade')
ax1.legend()
plt.grid(True)
plt.show()
Those who had a high G1 grade also tend to have a high G3 grade. However, some students have a G3 grade of 0. These are outliers; there are various possible reasons for them, and whether to exclude them is debatable. So let's check the number of absences of the students with G3 = 0 by drawing the following scatter plot.
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
x = student_data_math['G3']
y = student_data_math['absences']
ax1.plot(x,y, 'o')
ax1.set_xlabel('G3-grade')
ax1.set_ylabel('absences')
ax1.legend()
plt.grid(True)
plt.show()
The result is that the students with a G3 grade of 0 also have 0 absences. Something is off: most likely they dropped out partway through and their absences simply stopped being counted. The relationship between G1 and the number of absences looks the same. In fact, some students already have a grade of 0 at G2, and if you plot G2 against G3 you can see that more students have fallen to 0, i.e. their number gradually increases, and the dropouts appear to come from among the low scorers. So rather than rushing to a conclusion from a single graph, it is important to analyze the data from several angles.
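Those extra plots are not reproduced here, but a minimal sketch to draw them (same style as the scatter-plot code above) would be:
# side-by-side scatter plots: G1 vs G2 and G2 vs G3, to spot students dropping to 0
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(student_data_math['G1'], student_data_math['G2'], 'o')
ax1.set_xlabel('G1-grade')
ax1.set_ylabel('G2-grade')
ax1.grid(True)
ax2.plot(student_data_math['G2'], student_data_math['G3'], 'o')
ax2.set_xlabel('G2-grade')
ax2.set_ylabel('G3-grade')
ax2.grid(True)
plt.show()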
The definition of the covariance $S_{xy}$ is as follows:
S_{xy}=\frac{1}{n}\Sigma_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
In the covariance matrix, the diagonal terms are the variances defined above. So what does the off-diagonal term mean? According to the reference, if $x$ and $y$ have an essentially linear relationship, the straight line derived by the least squares method is as follows. 【Reference】 Linear regression analysis / the meaning of least squares, covariance, and the correlation coefficient
y = \frac{S_{xy}}{\sigma^2_x}x + \bar{y} - \frac{S_{xy}}{\sigma^2_x}\bar{x}
When transformed:
\frac{y-\bar{y}}{\sigma_y} = \frac{S_{xy}}{\sigma_x\sigma_y}\cdot\frac{x-\bar{x}}{\sigma_x}
That is, when the linear equation is standardized by the means and standard deviations, its slope is:
r_{xy}=\frac{S_{xy}}{\sigma_x\sigma_y}
In other words, the covariance standardized by the standard deviations is the definition of the so-called correlation coefficient $r_{xy}$.
Here we compute the covariance matrix and the correlation coefficient for G1 and G3. The off-diagonal entries are the covariance and the diagonal entries are the variances of G1 and G3.
print(np.cov(student_data_math['G1'],student_data_math['G3']))
[[11.01705327 12.18768232]
[12.18768232 20.9896164 ]]
The first value is the correlation coefficient and the second is the p-value.
print(sp.stats.pearsonr(student_data_math['G1'],student_data_math['G3']))
(0.801467932017414, 9.001430312277865e-90)
The correlation matrix is calculated below.
print(np.corrcoef(student_data_math['G1'],student_data_math['G3']))
[[1. 0.80146793]
[0.80146793 1. ]]
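As a sanity check (a small sketch, not from the book), the correlation coefficient can be reproduced from the covariance matrix via $r_{xy}=S_{xy}/(\sigma_x\sigma_y)$:
# reproduce the correlation coefficient from the covariance matrix
cov = np.cov(student_data_math['G1'], student_data_math['G3'])
r = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print(r)  # ~0.8015, matching np.corrcoef and pearsonr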
Dalc: weekday alcohol consumption. Walc: weekend alcohol consumption. Draw a scatter plot matrix to see whether they correlate with the G1 and G3 grades. Result: apparently not.
g = sns.pairplot(student_data_math[['Dalc','Walc','G1','G3']])
g.savefig('seaborn_pairplot_g.png')
【Reference】 Create a pair plot (scatter plot matrix) with Python, pandas, seaborn
There is no correlation between Walc and the G3 grade:
print(np.corrcoef(student_data_math['Walc'],student_data_math['G3']))
[[ 1. -0.05193932]
[-0.05193932 1. ]]
The mean G3 per Walc group also barely varies:
print(student_data_math.groupby('Walc')['G3'].mean())
Walc
1 10.735099
2 10.082353
3 10.725000
4 9.686275
5 10.142857
Name: G3, dtype: float64
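The summary statistics below are for the Portuguese-course data (student-por.csv). The loading step is not shown above; a minimal sketch (assuming the file was extracted to ./chap3/ like the math data) would be:
# load the Portuguese-course data the same way as the math data (path assumed)
student_data_por = pd.read_csv('./chap3/student-por.csv', sep=';')
print(student_data_por.describe())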
age Medu Fedu traveltime ... absences G1 G2 G3
count 649.000000 649.000000 649.000000 649.000000 ... 649.000000 649.000000 649.000000 649.000000
mean 16.744222 2.514638 2.306626 1.568567 ... 3.659476 11.399076 11.570108 11.906009
std 1.218138 1.134552 1.099931 0.748660 ... 4.640759 2.745265 2.913639 3.230656
min 15.000000 0.000000 0.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000
25% 16.000000 2.000000 1.000000 1.000000 ... 0.000000 10.000000 10.000000 10.000000
50% 17.000000 2.000000 2.000000 1.000000 ... 2.000000 11.000000 11.000000 12.000000
75% 18.000000 4.000000 3.000000 2.000000 ... 6.000000 13.000000 13.000000 14.000000
max 22.000000 4.000000 4.000000 4.000000 ... 32.000000 19.000000 19.000000 19.000000
key_cols = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
            'Mjob', 'Fjob', 'reason', 'nursery', 'internet']
# merge the math and Portuguese data on the attributes that identify a student
df = student_data_math.merge(student_data_por, on=key_cols, suffixes=('_math', '_por'))
print(df.head())
school sex age address famsize Pstatus ... Walc_por health_por absences_por G1_por G2_por G3_por
0 GP F 18 U GT3 A ... 1 3 4 0 11 11
1 GP F 17 U GT3 T ... 1 3 2 9 11 11
2 GP F 15 U LE3 T ... 3 3 6 12 13 12
3 GP F 15 U GT3 T ... 1 5 0 14 14 14
4 GP F 16 U GT3 T ... 2 5 0 11 13 13
[5 rows x 53 columns]
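The dataset notes above say that 382 students appear in both files, so as a quick check (a sketch) the merged frame should have roughly that many rows:
# the annexed notes say 382 students belong to both datasets
print(df.shape)  # expected: (382, 53)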
gm = sns.pairplot(df[['G1_math','G3_math','G1_por','G3_por']])
gm.savefig('seaborn_pairplot_gm.png')
From the pair plot, the correlation between the math and Portuguese grades seems fairly high, and the variance appears smaller in por than in math. This is also supported by the following results.
print(np.corrcoef(df['G1_math'],df['G3_math']))
[[1. 0.8051287]
[0.8051287 1. ]]
print(np.corrcoef(df['G3_math'],df['G3_por']))
[[1. 0.48034936]
[0.48034936 1. ]]
print(np.cov(df['G1_math'],df['G3_math']))
[[11.2169202 12.63919693]
[12.63919693 21.9702354 ]]
print(np.cov(df['G3_math'],df['G3_por']))
[[21.9702354 6.63169394]
[ 6.63169394 8.67560567]]
"Next to descriptive statistics, let's learn the basics of regression analysis." "Regression analysis is an analysis that predicts numbers .... I've graphed the student data above. From this scatter plot, I can see that G1 and G3 are likely to be related."
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
ax1.plot(student_data_math['G1'],student_data_math['G3'],'o')
ax1.set_xlabel('G1_Grade')
ax1.set_ylabel('G3_Grade')
ax1.grid(True)
plt.show()
"In the regression problem, we assume a relational expression from the given data and find the coefficient that best fits the data. Specifically, we predict the G3 grade based on the G1 grade that we know in advance. That is, there is a target variable G3 (called the objective variable), and the variable G1 (called the explanatory variable) that explains it is used for prediction. In regression analysis, one explanatory variable and one explanatory variable are used. The former is called simple regression and the latter is called multiple regression analysis. In this chapter, we will explain simple regression analysis. "
"Here, we will explain how to solve the regression problem by a method called linear simple regression, which assumes that the output and input have a linear relationship in simple regression analysis."
import pandas as pd
from sklearn import linear_model
reg = linear_model.LinearRegression()
student_data_math = pd.read_csv('./chap3/student-mat.csv', sep =';')
x = student_data_math.loc[:,['G1']].values
y = student_data_math['G3'].values
reg.fit(x,y)
print('Regression coefficient;',reg.coef_)
print('Intercept;',reg.intercept_)
Regression coefficient;[1.10625609]
Intercept;-1.6528038288004616
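As a usage sketch (the value follows from the coefficient and intercept above), the fitted line can predict, for example, the G3 grade of a student with G1 = 10:
# predict G3 for G1 = 10: 1.10625609 * 10 - 1.6528 ≈ 9.41
print(reg.predict([[10]]))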
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
ax1.plot(student_data_math['G1'],student_data_math['G3'],'o')
ax1.plot(x,reg.predict(x))
ax1.set_xlabel('G1_Grade')
ax1.set_ylabel('G3_Grade')
ax1.grid(True)
plt.show()
R^2 = 1- \frac{\Sigma_{i=1}^{n}(y_i-f(x_i))^2}{\Sigma_{i=1}^{n}(y_i-\bar y)^2}
The quantity above is called the coefficient of determination; its maximum value is $R^2 = 1$, and the closer it is to 1, the better the model fits.
print('Coefficient of determination;',reg.score(x,y))
Coefficient of determination; 0.64235084605227
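As a check (a small sketch, not from the book), the same value can be computed directly from the definition above:
# coefficient of determination computed from its definition
y_pred = reg.predict(x)
r2 = 1 - np.sum((y - y_pred)**2) / np.sum((y - y.mean())**2)
print(r2)  # ~0.642, matching reg.score(x, y)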
# male students only
df0 = student_data_math[student_data_math['sex'].isin(['M'])]
# sort by G1 grade in ascending order
df = df0.sort_values(by=['G1'])
# rank 1..n of each student after sorting
df['Ct'] = np.arange(1, len(df) + 1)
x = df['Ct']
print(x)
# cumulative sum of the sorted G1 grades
y = df['G1'].cumsum()
print(y)
# normalized cumulative curve: share of students vs. share of total G1 (Lorenz-curve style)
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
ax1.plot(x/max(x), y/max(y))
ax1.set_xlabel('peoples')
ax1.set_ylabel('G1_Grade.cumsum')
ax1.grid(True)
plt.show()
248 1
144 2
164 3
161 4
153 5
...
113 183
129 184
245 185
42 186
47 187
Name: Ct, Length: 187, dtype: int32
248 3
144 8
164 13
161 18
153 23
...
113 2026
129 2044
245 2062
42 2081
47 2100
Name: G1, Length: 187, dtype: int64
Figures: M (G1 vs. people) and F (G1 vs. people), i.e. the normalized cumulative G1 curves for male and female students.