[Linear regression] On the number of explanatory variables and the (degree-of-freedom adjusted) coefficient of determination

Introduction

Regarding the explanatory variables of linear regression, the author has long held the impression that "variables unrelated to the objective variable should not be used for training" (*1). However, while studying linear regression analysis, I learned that the coefficient of determination has the following property.

"When an explanatory variable is added, the coefficient of determination cannot be lower than that before the addition. (Therefore, the accuracy comparison of models with different numbers of explanatory variables uses the" adjusted coefficient of determination ". ) ”(* 2)

Knowing this property, the author wanted to verify the points described in the following [Overview].

Overview

Regarding the author's assumption (*1) and the property of the coefficient of determination (*2), I verified the following two points (① and ②) by actually building models. This article describes the verification method and the results.

① Comparison of the coefficient of determination → When an explanatory variable unrelated to the objective variable is added, the coefficient of determination does not fall below its value before the addition.

② Comparison of the degree-of-freedom adjusted coefficient of determination → In the case of the adjusted coefficient of determination, isn't the value higher before adding variables unrelated to the objective variable?

Premise

[Linear regression] A method that expresses the objective variable as a linear combination of the explanatory variables, given existing explanatory variables and an objective variable. Because the model is a linear combination, the estimated values form a straight line. (Figure: Normdist_regression.png, a regression line fitted to the data)
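
As a minimal illustration of this premise (a sketch only; the variable names and numbers here are chosen for this example and are not part of the verification below), fitting a single explanatory variable with scikit-learn yields estimated values that lie on a straight line:

import numpy as np
from sklearn.linear_model import LinearRegression

#Hypothetical single-variable example: target = 3 * x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=(100, 1))
y = 3 * x.ravel() + rng.normal(0, 5, size=100)

model = LinearRegression()
model.fit(x, y)

#The estimated values all lie on the straight line: intercept_ + coef_[0] * x
print(model.intercept_, model.coef_)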

[Coefficient of determination] An index showing how well the created linear regression model fits the data, expressed by the following formula.  $ R^{2}=1-\dfrac {\sum e^{2}}{\sum \left( y-\overline {y}\right) ^{2}} $

$ \sum e^{2} $: residual variation (sum of squared differences between the objective variable and the estimated values). $ \sum \left( y-\overline {y}\right) ^{2} $: total variation (sum of squared differences between the objective variable and its mean).

[Degree-of-freedom adjusted coefficient of determination] An index that takes the number of explanatory variables (k) into account in addition to the coefficient of determination. Like the coefficient of determination, it indicates how well the model fits the data.  $ {\widehat {R}^{2}}=1-\dfrac {n-1}{n-k}\left( 1-R^{2}\right) $

$ n $: number of data points. $ k $: number of explanatory variables.
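
As a minimal sketch of the two definitions above (assuming numpy arrays y and y_hat holding the objective variable and the estimated values, and k explanatory variables; the function name is chosen for this example), both indices can be computed directly from the formulas:

import numpy as np

def r2_and_adjusted_r2(y, y_hat, k):
    #Residual variation: sum of squares of (objective variable - estimated value)
    residual_variation = np.sum((y - y_hat) ** 2)
    #Total variation: sum of squares of (objective variable - its mean)
    total_variation = np.sum((y - np.mean(y)) ** 2)
    R2 = 1 - residual_variation / total_variation
    #Degree-of-freedom adjusted coefficient of determination (n: number of data points)
    n = len(y)
    adjusted_R2 = 1 - ((n - 1) / (n - k)) * (1 - R2)
    return R2, adjusted_R2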

Verification details

The two verification points (① and ②) described in the above [Overview] are verified by the following method.

[Data to prepare]
(1) Explanatory variables (related to the objective variable) → created using random numbers. The objective variable is created as a linear combination of these variables.
(2) Explanatory variables (unrelated to the objective variable) → created using random numbers.
(3) Objective variable → created as a linear combination of the explanatory variables in (1) above.

① Comparison of the coefficient of determination → Check the difference between the coefficients of determination of <Model 1> and <Model 2> below (Model 2 - Model 1).
<Model 1> (n patterns) A model built with variables 1 to n of the above variables (1).
<Model 2> (n x m patterns) For each <Model 1> above, a model built with the explanatory variables used for that model plus variables 1 to m of the above variables (2).

② Comparison of the adjusted coefficient of determination → Same as verification ① above, but using the degree-of-freedom adjusted coefficient of determination in place of the coefficient of determination.

Verification procedure

[1] Data creation → Create the following columns.
(1) Explanatory variables (related to the objective variable) → col_rel_[n] (n = 0 to 9)
(2) Explanatory variables (unrelated to the objective variable) → col_no_rel_[m] (m = 0 to 9)
(3) Objective variable → target

[2] Model creation → Create <Model 1> and <Model 2> above.

<Model 1> (n) → key name: nothing_[n]_[m]
<Model 2> (n x m) → key name: contain_[n]_[m]

[3] Calculate the coefficient of determination and the adjusted coefficient of determination → Hold both values in dicts (as values, keyed by the model names above). The dict names are as follows.

Coefficient of determination: dict_R2
Degree-of-freedom adjusted coefficient of determination: dict_adjusted_R2

[4] Combine the aggregated results into one DataFrame.
[5] Create two line graphs → The vertical axis is the (degree-of-freedom adjusted) coefficient of determination, plotted as the difference from the model that uses no variables from (2); the horizontal axis is the number of the above variables (2) used for model training.

Verification code

experiment.ipynb


import random

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#[Parameter]
#Seed value
random.seed(10)

#Number of data
data_num = 10 ** 4

#Number of items(rel:Related to the objective variable, no_rel:Not related to objective variable)
group_num = 10

dict_group_num = dict()
dict_group_num["rel"] = group_num #Related to the objective variable
dict_group_num["no_rel"] = group_num #Not related to objective variable

#[Data creation]
#Explanatory variable
col_format = 'col_{group}_{index}'

all_data = pd.DataFrame({
    col_format.format(
        group = key
        , index = i
    ):[random.randint(0 , 1000) for _ in range(data_num)] 
    for key , val in dict_group_num.items()
    for i in range(val)
})

#Objective variable
w_list = [random.randint(1 , 10) for i in range(dict_group_num["rel"])]
target_series = pd.Series(np.zeros(data_num))
for i in range(dict_group_num["rel"]):
    w = w_list[i]
    col = col_format.format(
        group = "rel"
        , index = i
    )
    
    #Verification
    print('-' * 50)
    print(target_series.head())
    print("w=" , w)
    print(all_data[col].head())
    print(w * all_data[col].head())    
    
    target_series = target_series + w * all_data[col]

    #Verification
    display(target_series.head())
    
all_data['target'] = target_series

#[Model creation]
#All combinations of explanatory variables
dict_features = {}
key_format = "{type_}_{n}_{m}"

#1. "col_rel_x" only (n patterns)
type_ = "nothing"
for n in range(dict_group_num["rel"]):
    cols = [col_format.format(
                group = "rel"
                , index = i
            ) for i in range(n+1)]
    
    key = key_format.format(
        type_ = type_
        , n = n + 1
        , m = 0
    )
    dict_features[key] = cols

#2. "col_rel_x" (n) and "col_no_rel_x" (m)
type_ = "contain"
for n in range(dict_group_num["rel"]):
    cols_rel = [col_format.format(
                group = "rel"
                , index = i
            ) for i in range(n+1)]
    for m in range(dict_group_num["no_rel"]):
        cols_no_rel = [col_format.format(
                    group = "no_rel"
                    , index = i
                ) for i in range(m+1)]
        cols = cols_rel + cols_no_rel
        key = key_format.format(
            type_ = type_
            , n = n + 1
            , m = m + 1
        )
        dict_features[key] = cols

#Verification
type_ = "nothing"
print("-" * 50 , type_)
_dict = {key:val for key , val in dict_features.items() if key[:len(type_)] == type_}
print(list(_dict.items())[:5])

type_ = "contain"
print("-" * 50 , type_)
_dict = {key:val for key , val in dict_features.items() if key[:len(type_)] == type_}
print(list(_dict.items())[:5])

#Modeling
dict_models = {}
for key , feature_cols in dict_features.items():
    #Divide into explanatory variables and objective variables
    train_X = all_data[feature_cols] 
    train_y = all_data['target']
    #Modeling
    model = LinearRegression()
    model.fit(train_X , train_y)
    
    dict_models[key] = model
    
    #Verification
    print("-" * 50)
    print("key={}".format(key))
    print(model.intercept_)
    print(model.coef_)

#Verification
list(dict_models.keys())[:5]

#Coefficient of determination, adjusted coefficient of determination
dict_R2 = {}
dict_adjusted_R2 = {}

for key , model in dict_models.items():
    #Explanatory variables used for this model
    feature_cols = dict_features[key]

    #Extract all data(numpy)
    X = np.array(all_data[feature_cols])
    y = np.array(all_data["target"])
    
    #Coefficient of determination
    R2 = model.score(X , y)
    dict_R2[key] = R2
    
    #Coefficient of determination adjusted for degrees of freedom
    n = data_num
    #k: number of explanatory variables (rel + no_rel), parsed from the key "{type}_{n}_{m}"
    k = int(key.split("_")[1]) + int(key.split("_")[2])
    adjusted_R2 = 1 - ((n-1)/(n-k)) * (1 - R2)
    dict_adjusted_R2[key] = adjusted_R2

#[Verification results]
R2_df = pd.DataFrame({
    "key":list(dict_R2.keys())
    , "R2":list(dict_R2.values())
})
adjusted_R2_df = pd.DataFrame({
    "key":list(dict_adjusted_R2.keys())
    , "adjusted_R2":list(dict_adjusted_R2.values())
})
result_df = pd.merge(R2_df , adjusted_R2_df , on="key" , how='outer')
result_df['rel_num'] = result_df["key"].str.split("_" , expand=True)[1].astype(int)
result_df['no_rel_num'] = result_df["key"].str.split("_" , expand=True)[2].astype(int)

#Verification
print(len(R2_df))
print(len(adjusted_R2_df))
print(len(result_df))
display(R2_df.head())
display(adjusted_R2_df.head())
display(result_df.head())

#[Graph creation]
#Coefficient of determination
value = "R2"
fig = plt.figure(figsize=(10,10))

for i in range(dict_group_num["rel"]):
    rel_num = i + 1
    df = result_df.query("rel_num == {}".format(rel_num))
    #Baseline: the model with no unrelated variables (no_rel_num == 0)
    base = df.query("no_rel_num == 0")[value].iloc[0]
    x = df["no_rel_num"]
    y = df[value] - base
    
    #Verification
#     print("-" * 50)
#     print("base={}".format(base))
#     print("x={}".format(x))
#     print("y={}".format(y))
    
    plt.plot(x, y, marker = 'o' 
             , label="rel_num={}".format(rel_num))

plt.title("Diff of {}".format(value))
plt.legend(loc="upper left" , bbox_to_anchor=(1, 1))
plt.xlabel("no_rel_num")
plt.ylabel(value)
plt.grid()
plt.show()

#Save graph
fig.savefig("plot_R2.png ")

#Coefficient of determination adjusted for degrees of freedom
value = "adjusted_R2"
fig = plt.figure(figsize=(10,10))

for i in range(dict_group_num["rel"]):
    rel_num = i + 1
    df = result_df.query("rel_num == {}".format(rel_num))
    #Baseline: the model with no unrelated variables (no_rel_num == 0)
    base = df.query("no_rel_num == 0")[value].iloc[0]
    x = df["no_rel_num"]
    y = df[value] - base
    
    #Verification
#     print("-" * 50)
#     print("base={}".format(base))
#     print("x={}".format(x))
#     print("y={}".format(y))
    
    plt.plot(x, y, marker = 'o' 
             , label="rel_num={}".format(rel_num))

plt.title("Diff of {}".format(value))
plt.legend(loc="upper left" , bbox_to_anchor=(1, 1))
plt.xlabel("no_rel_num")
plt.ylabel(value)
plt.grid()
plt.show()

#Save graph
fig.savefig("plot_adjusted_R2.png ")

Verification results

[Graph 1] Coefficient of determination
Regardless of the number of explanatory variables related to the objective variable (rel_num), the coefficient of determination increases monotonically as more explanatory variables unrelated to the objective variable are added. In other words, the result is consistent with the theory (*2). (Figure: plot_R2.png)

[Graph 2] Degree-of-freedom adjusted coefficient of determination
When the number of explanatory variables related to the objective variable is 1 to 4, the value is higher when the unrelated explanatory variables are added. (1 to 4 related variables are not enough to explain the objective variable, so adding even a little extra information appears to help.) When it is 9 to 10, the values are almost the same. (9 to 10 related variables seem sufficient to explain the objective variable.) On the other hand, when it is 5 to 8, the adjusted coefficient of determination is higher when the unrelated explanatory variables are not added, which matches the author's expectation (*1). (Figure: plot_adjusted_R2.png)

Summary

When comparing models using the coefficient of determination, one must pay attention to the number of explanatory variables used for training. As the verification results show, the model with more explanatory variables (even when variables that should not be used are included) always has a coefficient of determination greater than or equal to that of the model whose variables it contains. Therefore, to make a more appropriate comparison in such a case, it seems better to use the degree-of-freedom adjusted coefficient of determination.
