[Python] Keeping correlations below a certain level while maximizing the number of features

1. Summary

While working on my thesis, I ran into this situation: "The correlation coefficient between any two features has to stay below a certain level, but there are 100 candidate features." I couldn't find anything similar online, probably because it is too basic to count as a technique, so I am writing it up here. The details are explained from "2. Code structure" onward.

1.1 Problems when the correlation between features is high

When analyzing data with a statistical or machine learning model, including combinations of highly correlated features causes problems such as:

- The analysis results become less convincing
- The learned weights are not stable from one training run to the next

These adverse effects are known as multicollinearity. A quick remedy is to delete one feature from each highly correlated pair.
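Before deleting anything, it can help to list which pairs are actually above the level. The following is a minimal sketch under stated assumptions: `high_corr_pairs` is a hypothetical helper (not part of the original code) and the toy data is made up for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold):
    """Return (feature_1, feature_2, |r|) for every pair with |r| > threshold."""
    corr_abs = df.corr().abs()
    # combinations() visits each unordered pair once and skips the diagonal
    return [
        (p, q, float(corr_abs.loc[p, q]))
        for p, q in combinations(corr_abs.columns, 2)
        if corr_abs.loc[p, q] > threshold
    ]


# Hypothetical data: x2 is almost a copy of x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

pairs = high_corr_pairs(toy, 0.8)
print(pairs)
```

Only the (x1, x2) pair should be reported here, which is the kind of pair from which one feature would be deleted.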

1.2 Trade-off between "number of features" and "low correlation"

However, if features are deleted without any thought, far more features may be removed than necessary. For example, when several highly correlated pairs overlap, keeping a feature that is highly correlated with many others forces all of those others to be deleted. In short: "I want to find a combination of features whose pairwise correlation coefficients are all below a certain level and which contains as many features as possible."
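The trade-off can be illustrated with a small made-up case. In the sketch below, the correlation values and the feature names ("hub", "s1" to "s3") are hypothetical: one feature is highly correlated with three others that are only weakly correlated with each other.

```python
from itertools import combinations

import pandas as pd

# Hypothetical absolute correlation matrix: "hub" is highly correlated with
# s1, s2 and s3, while s1, s2 and s3 are only weakly correlated with each other
names = ["hub", "s1", "s2", "s3"]
corr_abs = pd.DataFrame(
    [[1.0, 0.7, 0.7, 0.7],
     [0.7, 1.0, 0.4, 0.4],
     [0.7, 0.4, 1.0, 0.4],
     [0.7, 0.4, 0.4, 1.0]],
    index=names, columns=names,
)
threshold = 0.5


def is_valid(corr_abs, keep, threshold):
    """True if every pair of distinct features in keep has |r| <= threshold."""
    return all(corr_abs.loc[p, q] <= threshold for p, q in combinations(keep, 2))


keep_hub = ["hub"]             # keeping the hub forces s1, s2 and s3 out
drop_hub = ["s1", "s2", "s3"]  # deleting only the hub keeps three features

print(is_valid(corr_abs, ["hub", "s1"], threshold))  # hub cannot stay with s1
print(is_valid(corr_abs, keep_hub, threshold), len(keep_hub))
print(is_valid(corr_abs, drop_hub, threshold), len(drop_hub))
```

Both `keep_hub` and `drop_hub` satisfy the threshold, but the second keeps three features instead of one, which is why the feature involved in the most high correlations is the one worth deleting first.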

1.3 Code and practical examples

Here are some code examples and some practical examples.

Code example

The first two functions are used to define the third, and the third is the one you call when running the process. The third function, no_high_corr(df, threshold), is outlined below.

Function        Arguments       Explanation
no_high_corr()  df, threshold   df: feature data, pandas.DataFrame
                                threshold: correlation coefficient level, between 0 and 1 inclusive
                                Return value: the combination with the largest number of features such that the correlation coefficient between any two features is less than or equal to threshold, pandas.DataFrame

The first two functions are summarized in (2.) of "4. Main".

main_in


import pandas as pd
import numpy as np

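def corr_loc(corr_df, threshold):
    """Return the column number with the most coefficients above threshold."""
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    for i in range(corr_df.shape[1]):
        # Count the rows of column i whose coefficient exceeds threshold
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j, i] > threshold:
                count_j += 1
        count_sums.iloc[i] = count_j
    return count_sums.idxmax()


def corr_max(corr_df, threshold):
    """Return the largest per-column count of coefficients above threshold."""
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    for i in range(corr_df.shape[1]):
        # Count the rows of column i whose coefficient exceeds threshold
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j, i] > threshold:
                count_j += 1
        count_sums.iloc[i] = count_j
    return count_sums.max()


def no_high_corr(df, threshold):
    """Drop features until no pair has |correlation| above threshold."""
    corrmat = df.corr()
    # Use absolute values so strong negative correlations are caught too
    a = corrmat.abs()
    b = corr_loc(a, threshold)
    c = corr_max(a, threshold)
    # Every column counts its own self-correlation (1), hence "> 1"
    while c > 1:
        # Remove the feature with the largest count from columns and rows
        A = a.drop(a.columns[b], axis=1)
        B = A.drop(A.index[b])
        a = B
        b = corr_loc(B, threshold)
        c = corr_max(B, threshold)
    return df.loc[:, a.columns]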

Practical example

An example applied to the features of the scikit-learn "Boston house prices" dataset. The output is the combination of features in which every pairwise correlation coefficient is 0.5 or less, with as many features as possible retained.

main_in


import pandas as pd
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older version
from sklearn.datasets import load_boston

boston = load_boston()
df_features = pd.DataFrame(boston.data, columns = boston.feature_names)
print('==============Original data==============')
print(df_features.head())

low_corr_features = no_high_corr(df_features,0.5)
print('==============Corrected data==============')
print(low_corr_features)

The following is the output. From the original 13 features (Original data), 6 features (Corrected data) were selected.

main_out


==============Original data==============
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  
0     15.3  396.90   4.98  
1     17.8  396.90   9.14  
2     17.8  392.83   4.03  
3     18.7  394.63   2.94  
4     18.7  396.90   5.33  

==============Corrected data==============
     CHAS     DIS  RAD  PTRATIO       B  LSTAT
0     0.0  4.0900  1.0     15.3  396.90   4.98
1     0.0  4.9671  2.0     17.8  396.90   9.14
2     0.0  4.9671  2.0     17.8  392.83   4.03
3     0.0  6.0622  3.0     18.7  394.63   2.94
4     0.0  6.0622  3.0     18.7  396.90   5.33
..    ...     ...  ...      ...     ...    ...
501   0.0  2.4786  1.0     21.0  391.99   9.67
502   0.0  2.2875  1.0     21.0  396.90   9.08
503   0.0  2.1675  1.0     21.0  396.90   5.64
504   0.0  2.3889  1.0     21.0  393.45   6.48
505   0.0  2.5050  1.0     21.0  396.90   7.88
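A selected combination can be checked by looking at the largest absolute correlation between two different columns. This is a minimal sketch: `max_offdiag_abs_corr` is a hypothetical helper, and the small table is made-up data rather than the Boston dataset.

```python
import numpy as np
import pandas as pd


def max_offdiag_abs_corr(df):
    """Largest |r| between two different columns of df (0.0 for a single column)."""
    corr_abs = df.corr().abs().to_numpy(copy=True)
    # Self-correlation on the diagonal is always 1, so hide it before taking the max
    np.fill_diagonal(corr_abs, 0.0)
    return float(np.nanmax(corr_abs)) if corr_abs.size else 0.0


# Hypothetical data: q follows p closely, r alternates in sign
toy = pd.DataFrame({
    "p": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "q": [2.0, 4.0, 6.0, 8.0, 10.0, 13.0],
    "r": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})

print(max_offdiag_abs_corr(toy))              # dominated by the (p, q) pair
print(max_offdiag_abs_corr(toy[["p", "r"]]))  # small once q is removed
```

Applied to the DataFrame returned by no_high_corr(df, threshold), the value would be expected not to exceed threshold.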

2. Code structure

2.1 Overall flow

A combination with as many features as possible, whose pairwise correlations are all below a certain level, is obtained with the following flow. The deletion in step 3 is greedy, so the result is not guaranteed to be the largest possible combination.

  1. Get the correlation matrix
  2. In each column, count the correlation coefficients above a certain level
  3. Delete the feature with the largest count from the rows and columns of the correlation matrix
  4. Repeat steps 2 and 3 until no column contains a correlation coefficient above the level (other than the correlation with itself)
  5. Output a new data table with the remaining features
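The five steps can be condensed into one short function. The following is a sketch under stated assumptions, not the article's implementation: `greedy_low_corr` is a hypothetical name, it uses vectorized pandas operations instead of nested loops, and the toy data is made up.

```python
import numpy as np
import pandas as pd


def greedy_low_corr(df, threshold):
    """Greedy sketch of steps 1 to 5; returns the remaining columns of df."""
    # Step 1: correlation matrix, in absolute values
    corr_abs = df.corr().abs()
    while corr_abs.shape[1] > 1:
        # Step 2: per column, count the coefficients above the threshold
        counts = (corr_abs > threshold).sum(axis=0)
        # Each column counts its own self-correlation, so a maximum of 1
        # or less means no pair exceeds the threshold any more (step 4)
        if counts.max() <= 1:
            break
        # Step 3: delete the feature with the largest count from rows and columns
        worst = counts.idxmax()
        corr_abs = corr_abs.drop(index=worst, columns=worst)
    # Step 5: new data table with the remaining features
    return df.loc[:, corr_abs.columns]


# Hypothetical data: a, b and c move together, d alternates in sign
base = np.arange(10.0)
toy = pd.DataFrame({
    "a": base,
    "b": 2.0 * base + 1.0,
    "c": base ** 2,
    "d": np.tile([1.0, -1.0], 5),
})

result = greedy_low_corr(toy, 0.5)
print(list(result.columns))
```

Ties are broken by column order because `idxmax` returns the first maximum, so a and b are deleted here while c and d remain.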

2.2 Other

- The code is built mainly on pandas.

3. Data preparation

Five features (A, B, C, D, E) $\times$ 10 samples are prepared as a pd.DataFrame for the demonstration.

script_1_in


demo_data = pd.DataFrame(np.random.randint(-1000, 1000,(10,5)),columns=['A','B','C','D','E'])
print(demo_data)

script_1_out


     A    B    C    D    E
0 -644  225    8  509 -980
1  809  993  882 -144 -462
2 -501 -505  972 -657  194
3 -980  862  886 -163 -444
4 -757 -254  186 -506 -178
5 -171 -317  973 -237  760
6  831  265  461    0  214
7  814 -466  610 -668  112
8 -281  832 -753  963  306
9  578 -557 -962    3  435

4. Main

Proceed through steps 1 to 5 of "2.1 Overall flow".

(1.) Get the correlation matrix

The correlation matrix is calculated with pd.DataFrame.corr() as follows.

Script_2_in


corrmat = demo_data.corr()
print(corrmat)

Script_2_out


          A         B         C         D         E
A  1.000000 -0.093300 -0.070089 -0.112714  0.312305
B -0.093300  1.000000  0.048836  0.559999 -0.472975
C -0.070089  0.048836  1.000000 -0.638188 -0.117121
D -0.112714  0.559999 -0.638188  1.000000 -0.153192
E  0.312305 -0.472975 -0.117121 -0.153192  1.000000

(2.) Count the correlation coefficient above a certain level in each column

Define two functions: corr_loc(corr_df, threshold), which counts the correlation coefficients above the level in each column and returns the column number with the largest count, and corr_max(corr_df, threshold), which returns that largest count. Steps *1 to *4 are common to both; only the return values (*5 and *6) differ.

Function     Arguments           Explanation
corr_loc()   corr_df, threshold  corr_df: correlation matrix with each element converted to its absolute value, pandas.DataFrame
                                 threshold: correlation coefficient level, between 0 and 1 inclusive
                                 Return value: column number of the feature with the most correlation coefficients greater than threshold
corr_max()   corr_df, threshold  corr_df: correlation matrix with each element converted to its absolute value, pandas.DataFrame
                                 threshold: correlation coefficient level, between 0 and 1 inclusive
                                 Return value: the largest per-column count of correlation coefficients greater than threshold

script_3_in


# Function that returns the column number with the largest count
def corr_loc(corr_df, threshold):
    # *1 Empty table to record, for each column i, the count of coefficients above the level
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    # *2 Loop over each column i
    for i in range(corr_df.shape[1]):
        # *3 Count row j of column i if its coefficient is above the level; otherwise skip it
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j, i] > threshold:
                count_j += 1
        # *4 Record the count for column i
        count_sums.iloc[i] = count_j
    # *5 Return the column number with the largest count
    return count_sums.idxmax()

# Function that returns the largest count
def corr_max(corr_df, threshold):
    # *1
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    # *2
    for i in range(corr_df.shape[1]):
        # *3
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j, i] > threshold:
                count_j += 1
        # *4
        count_sums.iloc[i] = count_j
    # *6 Return the largest count
    return count_sums.max()

Number  Annotation
*1      Empty table that records, for each column i, the count of coefficients above the level
*2      Loop over each column i to count the correlation coefficients above the level
*3      Count row j of column i if its coefficient is above the level; skip it otherwise
*4      Record the count for column i
*5      Return the column number with the largest count of coefficients above the level
*6      Return the largest count of coefficients above the level
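The nested loops in *2 and *3 can also be written as one vectorized expression. The sketch below uses a hypothetical helper `count_above` and a made-up 3 x 3 matrix; unlike corr_loc, `idxmax` on this result returns the feature name (label) rather than the column number, because the counts are indexed by column name.

```python
import pandas as pd


def count_above(corr_df, threshold):
    """Per column, count the entries greater than threshold (diagonal included)."""
    # Comparing the whole frame gives a boolean DataFrame; summing it
    # column-wise counts the True values, as the nested loops do
    return (corr_df > threshold).sum(axis=0)


# Hypothetical absolute correlation matrix for three features
names = ["x", "y", "z"]
corr_abs = pd.DataFrame(
    [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.6],
     [0.3, 0.6, 1.0]],
    index=names, columns=names,
)

counts = count_above(corr_abs, 0.5)
print(counts)           # count per feature name
print(counts.idxmax())  # feature name with the largest count
print(counts.max())     # the largest count itself
```

Computing the counts once and then calling `idxmax()` and `max()` on the same Series avoids running the same loop in two separate functions.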

(3.) Delete the feature with the largest count from the rows and columns
(4.) Repeat until no correlation coefficient exceeds the level

Using pd.DataFrame.drop(), the feature with the largest number of correlation coefficients above the level is repeatedly deleted from both the rows and the columns of the correlation matrix. The loop ends when no column has a correlation coefficient above the level. In the following example, the level is set to 0.2 for the absolute value of the correlation coefficient. Note that every column of the correlation matrix contains the correlation with itself ($= 1$), so each count is at least 1; this is why the loop condition is c > 1 rather than c > 0.

script_3_in


# *7 Both strong positive and strong negative correlations should be removed, so take the absolute value of each element
a = corrmat.abs()
# *8 This time, remove anything whose absolute correlation coefficient is greater than 0.2
b = corr_loc(a,0.2)
c = corr_max(a,0.2)
# *9 Repeat while the largest count is greater than 1 (each column always counts itself)
while c > 1:
    A = a.drop(a.columns[b],axis=1)
    B = A.drop(A.index[b])
    a = B
    b = corr_loc(B,0.2)
    c = corr_max(B, 0.2)
print(a)

script_3_out


          D         E
D  1.000000  0.153192
E  0.153192  1.000000

Number  Annotation
*7      Both strong positive and strong negative correlations should be removed, so take the absolute value of each element of the correlation matrix
*8      This time, remove anything whose absolute correlation coefficient is greater than 0.2
*9      Repeat while the largest count is greater than 1 (each column always counts itself)
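One way to sidestep the end-condition caveat is to hide the diagonal before counting, so that a count of 0 really means "no high correlation left". The sketch below also records the deletion order. `greedy_trace` is a hypothetical helper, and the matrix values are copied from Script_2_out.

```python
import numpy as np
import pandas as pd

# Correlation matrix copied from Script_2_out
names = ["A", "B", "C", "D", "E"]
corrmat = pd.DataFrame(
    [[1.000000, -0.093300, -0.070089, -0.112714, 0.312305],
     [-0.093300, 1.000000, 0.048836, 0.559999, -0.472975],
     [-0.070089, 0.048836, 1.000000, -0.638188, -0.117121],
     [-0.112714, 0.559999, -0.638188, 1.000000, -0.153192],
     [0.312305, -0.472975, -0.117121, -0.153192, 1.000000]],
    index=names, columns=names,
)


def greedy_trace(corrmat, threshold):
    """Return (dropped, kept): the deletion order and the remaining features."""
    values = corrmat.abs().to_numpy(copy=True)
    np.fill_diagonal(values, 0.0)  # hide the self-correlation
    a = pd.DataFrame(values, index=corrmat.index, columns=corrmat.columns)
    dropped = []
    while True:
        counts = (a > threshold).sum(axis=0)
        if counts.max() == 0:  # with the diagonal hidden, 0 means "done"
            break
        worst = counts.idxmax()  # first feature with the largest count
        dropped.append(worst)
        a = a.drop(index=worst, columns=worst)
    return dropped, list(a.columns)


dropped, kept = greedy_trace(corrmat, 0.2)
print("dropped:", dropped)
print("kept:", kept)
```

With the level set to 0.2, B, A and C are deleted in that order and D and E remain, which matches script_3_out.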

(5.) Output a new data table with the remaining features

Extract from the original data (demo_data) the remaining combination, in which every pairwise correlation coefficient has an absolute value of 0.2 or less.

script_4_in


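# *10 Pass the list of remaining column names (a.columns) to extract those columns
# (the result is named low_corr_data so it does not shadow the no_high_corr function)
low_corr_data = demo_data.loc[:,a.columns]
print(low_corr_data)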

script_4_out


     D    E
0  509 -980
1 -144 -462
2 -657  194
3 -163 -444
4 -506 -178
5 -237  760
6    0  214
7 -668  112
8  963  306
9    3  435

Number  Annotation
*10     Pass the list of remaining column names (a.columns) to extract those columns
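Because the deletion is greedy, the result is not necessarily the largest possible combination. For a small number of features this can be cross-checked by exhaustive search. The following is a sketch under stated assumptions: `largest_low_corr_subset` is a hypothetical helper, the matrix values are copied from Script_2_out, and the search cost grows exponentially with the number of features, so it is not practical for something like 100 candidates.

```python
from itertools import combinations

import pandas as pd

# Correlation matrix copied from Script_2_out
names = ["A", "B", "C", "D", "E"]
corrmat = pd.DataFrame(
    [[1.000000, -0.093300, -0.070089, -0.112714, 0.312305],
     [-0.093300, 1.000000, 0.048836, 0.559999, -0.472975],
     [-0.070089, 0.048836, 1.000000, -0.638188, -0.117121],
     [-0.112714, 0.559999, -0.638188, 1.000000, -0.153192],
     [0.312305, -0.472975, -0.117121, -0.153192, 1.000000]],
    index=names, columns=names,
)


def largest_low_corr_subset(corr_abs, threshold):
    """Exhaustively search for a largest subset with all pairwise |r| <= threshold."""
    cols = list(corr_abs.columns)
    # Try the largest subset sizes first and stop at the first valid subset
    for size in range(len(cols), 0, -1):
        for subset in combinations(cols, size):
            if all(corr_abs.loc[p, q] <= threshold
                   for p, q in combinations(subset, 2)):
                return list(subset)
    return []


best = largest_low_corr_subset(corrmat.abs(), 0.2)
print(best)
```

For this demo matrix the search finds A, B and C (pairwise absolute correlations 0.093, 0.070 and 0.049), one feature more than the D and E kept by the greedy deletion, so the greedy result is best read as a quick approximation.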
