1. Summary

While working on my thesis, I ran into this problem: "I have to keep the correlation coefficient between any two features below a certain level, but there are 100 candidate features." I couldn't find anything similar online, probably because it is too basic to count as a technique, so I am writing it up here. The details are described from "2. Code structure" onward.

1.1 Problems when the correlation between features is high

When analyzing with a statistical or machine learning model, including a combination of highly correlated features causes problems such as:

- The analysis results become less persuasive
- The weights are not stable from one training run to the next

These adverse effects are known as multicollinearity. A quick remedy is to "delete one feature from each highly correlated pair".

1.2 Trade-off between "number of features" and "low correlation"

However, if features are deleted without any thought, the number of features may shrink drastically. For example, when several highly correlated pairs overlap, keeping a feature that is highly correlated with many others forces all of those others to be removed, so more features are lost than necessary. In short: "I want to find a combination in which every pairwise correlation coefficient is below a certain level and the number of features is as large as possible."
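
To make the trade-off concrete, here is a minimal sketch on a hand-made absolute correlation matrix. The labels A, B, C, the values, and the 0.5 level are illustrative assumptions, not data from this article. A is highly correlated with both B and C, while B and C are uncorrelated with each other.

```python
import pandas as pd

# Hypothetical absolute correlation matrix: A is a "hub" correlated with B and C
labels = ["A", "B", "C"]
abs_corr = pd.DataFrame(
    [[1.0, 0.7, 0.7],
     [0.7, 1.0, 0.0],
     [0.7, 0.0, 1.0]],
    index=labels, columns=labels)
threshold = 0.5

# Naive rule: for every highly correlated pair, drop the second feature
naive_kept = list(labels)
for i, first in enumerate(labels):
    for second in labels[i + 1:]:
        both_kept = first in naive_kept and second in naive_kept
        if both_kept and abs_corr.loc[first, second] > threshold:
            naive_kept.remove(second)

# Alternative: drop the feature involved in the most high-correlation pairs
over = (abs_corr > threshold).sum() - 1  # minus 1 removes the diagonal
hub = over.idxmax()
hub_kept = [name for name in labels if name != hub]

print(naive_kept)  # ['A']
print(hub_kept)    # ['B', 'C']
```

Both results satisfy the 0.5 level, but dropping the hub keeps two features instead of one.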

1.3 Code and practical examples

Here are some code examples and some practical examples.

Code example

The flow is to define the third function using the first two functions, and to call the third function when running the process. The outline of the third function, no_high_corr(df, threshold), is as follows.

function	argument	Explanation
no_high_corr( )	df, threshold	df: feature data (pandas.DataFrame). threshold: correlation coefficient level, between 0 and 1 inclusive. Return value: the features that remain once the absolute correlation coefficient between every pair is less than or equal to threshold, keeping as many features as the greedy deletion allows (pandas.DataFrame)

The first two functions are summarized in (2.) of "4. Main".

`main_in`


import pandas as pd
import numpy as np

def corr_loc(corr_df, threshold):
    # Count, for each column, the entries greater than threshold
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    for i in range(corr_df.shape[1]):
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j, i] > threshold:
                count_j += 1
        count_sums.iloc[i] = count_j
    # Position of the column with the largest count
    return count_sums.idxmax()


def corr_max(corr_df, threshold):
    # Same counting as corr_loc
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    for i in range(corr_df.shape[1]):
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j, i] > threshold:
                count_j += 1
        count_sums.iloc[i] = count_j
    # Largest count found in any column
    return count_sums.max()


def no_high_corr(df, threshold):
    # Use absolute values so that strong negative correlations are removed too
    corrmat = df.corr()
    a = corrmat.abs()
    b = corr_loc(a, threshold)
    c = corr_max(a, threshold)
    # Each column always counts its own diagonal 1, so loop while the largest count exceeds 1
    while c > 1:
        A = a.drop(a.columns[b], axis=1)
        B = A.drop(A.index[b])
        a = B
        b = corr_loc(B, threshold)
        c = corr_max(B, threshold)
    # Keep only the surviving columns of the original data
    return df.loc[:, a.columns]

Practical example

An example applied to the features of the scikit-learn "Boston house prices" dataset. It outputs a combination in which the absolute correlation coefficient between every pair of features is 0.5 or less, while keeping as many features as the procedure allows.

`main_in`


import pandas as pd
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older version
from sklearn.datasets import load_boston

boston = load_boston()
df_features = pd.DataFrame(boston.data, columns = boston.feature_names)
print('==============Original data==============')
print(df_features.head())

low_corr_features = no_high_corr(df_features,0.5)
print('==============Corrected data==============')
print(low_corr_features)

The following is the output. From the original 13 features (Original data), 6 features (Corrected data) were selected.

`main_out`


==============Original data==============
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  
0     15.3  396.90   4.98  
1     17.8  396.90   9.14  
2     17.8  392.83   4.03  
3     18.7  394.63   2.94  
4     18.7  396.90   5.33  

==============Corrected data==============
     CHAS     DIS  RAD  PTRATIO       B  LSTAT
0     0.0  4.0900  1.0     15.3  396.90   4.98
1     0.0  4.9671  2.0     17.8  396.90   9.14
2     0.0  4.9671  2.0     17.8  392.83   4.03
3     0.0  6.0622  3.0     18.7  394.63   2.94
4     0.0  6.0622  3.0     18.7  396.90   5.33
..    ...     ...  ...      ...     ...    ...
501   0.0  2.4786  1.0     21.0  391.99   9.67
502   0.0  2.2875  1.0     21.0  396.90   9.08
503   0.0  2.1675  1.0     21.0  396.90   5.64
504   0.0  2.3889  1.0     21.0  393.45   6.48
505   0.0  2.5050  1.0     21.0  396.90   7.88

2. Code structure

2.1 Overall flow

The following greedy flow looks for a large combination of features whose pairwise correlations are all at or below a certain level. Note that it is a heuristic and is not guaranteed to find the largest possible combination.

1. Get the correlation matrix
2. In each column, count the correlation coefficients above the level
3. Delete the feature with the largest count from both the rows and the columns of the correlation matrix
4. Repeat steps 2 and 3 until no column contains a correlation coefficient above the level (other than the correlation with itself)
5. Output a new data table with the remaining features
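
The flow above can also be sketched compactly with vectorized pandas operations. This is only an illustration under my own assumptions: drop_high_corr_sketch and the toy columns x, y, w, z are hypothetical names, and the sketch subtracts the diagonal from each count instead of comparing the count with 1.

```python
import numpy as np
import pandas as pd


def drop_high_corr_sketch(df, threshold):
    """Greedy deletion following steps 1 to 5 (hypothetical helper, assumes threshold < 1)."""
    abs_corr = df.corr().abs()                                # step 1
    while True:
        # step 2: per-column count above the level, minus the self-correlation
        counts = (abs_corr > threshold).sum(axis=0) - 1
        if counts.max() <= 0:                                 # step 4: nothing left above the level
            break
        worst = counts.idxmax()                               # first column with the largest count
        abs_corr = abs_corr.drop(index=worst, columns=worst)  # step 3
    return df.loc[:, abs_corr.columns]                        # step 5


# Toy data: x, y and w are strongly correlated with each other, z is not
x = np.arange(10.0)
toy = pd.DataFrame({"x": x, "y": 2 * x + 1, "w": x ** 2,
                    "z": np.tile([1.0, -1.0], 5)})
selected = drop_high_corr_sketch(toy, 0.5)
print(list(selected.columns))  # ['w', 'z']
```

When several columns tie for the largest count, idxmax picks the first one, so the result can depend on the column order.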

2.2 Other

- The code is built mainly on pandas.

3. Data preparation

Five features (A, B, C, D, E) $\times$ 10 rows of random integers are prepared as a pd.DataFrame for the walkthrough. No random seed is set, so the values differ on every run.

`script_1_in`


demo_data = pd.DataFrame(np.random.randint(-1000, 1000,(10,5)),columns=['A','B','C','D','E'])
print(demo_data)

`script_1_out`


     A    B    C    D    E
0 -644  225    8  509 -980
1  809  993  882 -144 -462
2 -501 -505  972 -657  194
3 -980  862  886 -163 -444
4 -757 -254  186 -506 -178
5 -171 -317  973 -237  760
6  831  265  461    0  214
7  814 -466  610 -668  112
8 -281  832 -753  963  306
9  578 -557 -962    3  435

4. Main

Process according to 1 to 5 of "2.1 Overall flow".

(1.) Get the correlation matrix

The correlation matrix is calculated with pd.DataFrame.corr(), which uses the Pearson correlation coefficient by default, as follows.

`Script_2_in`


corrmat = demo_data.corr()
print(corrmat)

`Script_2_out`


          A         B         C         D         E
A  1.000000 -0.093300 -0.070089 -0.112714  0.312305
B -0.093300  1.000000  0.048836  0.559999 -0.472975
C -0.070089  0.048836  1.000000 -0.638188 -0.117121
D -0.112714  0.559999 -0.638188  1.000000 -0.153192
E  0.312305 -0.472975 -0.117121 -0.153192  1.000000
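
Before counting per column, it can help to see which pairs are actually above the level. A minimal sketch, assuming the matrix printed above and a level of 0.2 (the level used later in this walkthrough); the variable names are my own:

```python
import numpy as np
import pandas as pd

# Values copied from the Script_2_out matrix
labels = ["A", "B", "C", "D", "E"]
corrmat = pd.DataFrame(
    [[1.000000, -0.093300, -0.070089, -0.112714, 0.312305],
     [-0.093300, 1.000000, 0.048836, 0.559999, -0.472975],
     [-0.070089, 0.048836, 1.000000, -0.638188, -0.117121],
     [-0.112714, 0.559999, -0.638188, 1.000000, -0.153192],
     [0.312305, -0.472975, -0.117121, -0.153192, 1.000000]],
    index=labels, columns=labels)

threshold = 0.2  # assumed level
abs_corr = corrmat.abs()

# Look only above the diagonal so each pair is listed once
upper = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
high_pairs = [(labels[i], labels[j])
              for i, j in zip(*np.nonzero(upper))
              if abs_corr.iat[i, j] > threshold]
print(high_pairs)  # [('A', 'E'), ('B', 'D'), ('B', 'E'), ('C', 'D')]
```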

(2.) Count the correlation coefficient above a certain level in each column

Define two functions that count, in each column, the correlation coefficients greater than the threshold: corr_loc(corr_df, threshold) returns the position of the column with the largest count, and corr_max(corr_df, threshold) returns that largest count. Steps *1 to *4 are common to both; only the return values (*5 and *6) differ.

function	argument	Explanation
corr_loc( )	corr_df, threshold	corr_df: correlation matrix with the absolute value of each component (pandas.DataFrame). threshold: correlation coefficient level, between 0 and 1 inclusive. Return value: position of the column (feature) that has the most correlation coefficients greater than threshold
corr_max( )	corr_df, threshold	corr_df: correlation matrix with the absolute value of each component (pandas.DataFrame). threshold: correlation coefficient level, between 0 and 1 inclusive. Return value: the largest per-column count of correlation coefficients greater than threshold (the diagonal 1 is included in the count)

`script_3_in`


# Function that returns the position of the column with the largest count
def corr_loc(corr_df, threshold):
    # *1 Empty table to record, for each column i, the count of correlation coefficients above the level
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    # *2 Loop over each column i to count the correlation coefficients above the level
    for i in range(corr_df.shape[1]):
        # *3 Count row j of column i if it is above the level, otherwise skip it
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j, i] > threshold:
                count_j += 1
        # *4 Record the count for column i
        count_sums.iloc[i] = count_j
    # *5 Return the position of the column with the largest count
    return count_sums.idxmax()

# Function that returns the largest count
def corr_max(corr_df, threshold):
    # *1
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    # *2
    for i in range(corr_df.shape[1]):
        # *3
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j, i] > threshold:
                count_j += 1
        # *4
        count_sums.iloc[i] = count_j
    # *6 Return the largest count found in any column
    return count_sums.max()

number	Annotation
*1	An empty table that records, for each column i, the count of correlation coefficients above the level
*2	Loop over each column i to count the correlation coefficients above the level
*3	Count row j of column i if it is above the level, otherwise skip it
*4	Record the count for column i
*5	Return the position of the column with the largest count of correlation coefficients above the level
*6	Return the largest count of correlation coefficients above the level
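
Since both functions share steps *1 to *4, the nested loops could also be replaced by a single vectorized count. The following is only a sketch under my own assumptions: count_over_threshold is a hypothetical helper, and the matrix holds the absolute values of the Script_2_out output.

```python
import pandas as pd


def count_over_threshold(corr_df, threshold):
    """Per-column count of entries greater than threshold (the diagonal is included)."""
    return (corr_df > threshold).sum(axis=0)


# Absolute values of the correlation matrix printed in Script_2_out
labels = ["A", "B", "C", "D", "E"]
abs_corr = pd.DataFrame(
    [[1.000000, 0.093300, 0.070089, 0.112714, 0.312305],
     [0.093300, 1.000000, 0.048836, 0.559999, 0.472975],
     [0.070089, 0.048836, 1.000000, 0.638188, 0.117121],
     [0.112714, 0.559999, 0.638188, 1.000000, 0.153192],
     [0.312305, 0.472975, 0.117121, 0.153192, 1.000000]],
    index=labels, columns=labels)

counts = count_over_threshold(abs_corr, 0.2)
position = int(counts.to_numpy().argmax())  # plays the role of corr_loc
largest = int(counts.max())                 # plays the role of corr_max
print(position, largest)  # 1 3
```

Columns B, D and E tie at a count of 3 here; argmax, like idxmax in the original functions, returns the first of them.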

(3.) Delete the feature with the largest number of correlation coefficients from the rows and columns
(4.) Repeat until there are no correlation coefficients above a certain level

Using pd.DataFrame.drop(), the feature with the largest number of correlation coefficients above the level is repeatedly deleted from both the rows and the columns of the correlation matrix. The loop ends when no column has a correlation coefficient above the level. In the following example, the level for the absolute value of the correlation coefficient is set to 0.2. Note that each column of the correlation matrix contains the correlation with itself ($= 1$), so every count is at least 1 and the loop must continue while the largest count is greater than 1, not 0.

`script_3_in`


# *7 Take the absolute value of each component so that both strong positive and strong negative correlations are removed
a = corrmat.abs()
# *8 This time, remove anything whose absolute correlation coefficient is greater than 0.2
b = corr_loc(a,0.2)
c = corr_max(a,0.2)
# *9 Repeat while some column still has more than one correlation coefficient above the level (the diagonal 1 always counts once)
while c > 1:
    A = a.drop(a.columns[b],axis=1)
    B = A.drop(A.index[b])
    a = B
    b = corr_loc(B,0.2)
    c = corr_max(B, 0.2)
print(a)

`script_3_out`


          D         E
D  1.000000  0.153192
E  0.153192  1.000000

number	Annotation
*7	Take the absolute value of each component of the correlation matrix so that both strong positive and strong negative correlations are removed.
*8	This time, remove anything whose absolute correlation coefficient is greater than 0.2
*9	Repeat while some column still has more than one correlation coefficient above the level (the diagonal 1 always counts once)

(5.) Output a new data table with the remaining features

Extract the remaining features, whose pairwise absolute correlation coefficients are all 0.2 or less, from the original data (demo_data).

`script_4_in`


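# *10 Pass the list of remaining column names (a.columns) to extract those columns
# (named low_corr_features so that it does not shadow the no_high_corr function)
low_corr_features = demo_data.loc[:,a.columns]
print(low_corr_features)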

`script_4_out`


     D    E
0  509 -980
1 -144 -462
2 -657  194
3 -163 -444
4 -506 -178
5 -237  760
6    0  214
7 -668  112
8  963  306
9    3  435

number	Annotation
*10	Pass the list of remaining column names (a.columns) to extract those columns
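
As a cross-check on this walkthrough, a brute-force search over all subsets is feasible for five features. It is exponential in the number of features, so it would not scale to 100 candidates. A minimal sketch, assuming the absolute values of the correlation matrix printed in Script_2_out; largest_low_corr_subset is a hypothetical helper name:

```python
from itertools import combinations

import pandas as pd

# Absolute values of the correlation matrix printed in Script_2_out
labels = ["A", "B", "C", "D", "E"]
abs_corr = pd.DataFrame(
    [[1.000000, 0.093300, 0.070089, 0.112714, 0.312305],
     [0.093300, 1.000000, 0.048836, 0.559999, 0.472975],
     [0.070089, 0.048836, 1.000000, 0.638188, 0.117121],
     [0.112714, 0.559999, 0.638188, 1.000000, 0.153192],
     [0.312305, 0.472975, 0.117121, 0.153192, 1.000000]],
    index=labels, columns=labels)


def largest_low_corr_subset(abs_corr, threshold):
    """Try subsets from largest to smallest and return the first valid one."""
    names = list(abs_corr.columns)
    for size in range(len(names), 0, -1):
        for subset in combinations(names, size):
            pairs = combinations(subset, 2)
            if all(abs_corr.loc[p, q] <= threshold for p, q in pairs):
                return list(subset)
    return []


best = largest_low_corr_subset(abs_corr, 0.2)
print(best)  # ['A', 'B', 'C']
```

On this particular demo data the exhaustive search finds three features (A, B, C), one more than the D, E pair left by the greedy deletion. This shows that the greedy flow is a heuristic, and that its result can depend on which of the tied columns is deleted first.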
