While working on my thesis, I ran into this problem: "I have to keep the correlation coefficient between any two features below a certain level, but there are 100 candidate features." I couldn't find anything similar online, probably because it is too basic to count as a technique, so I am writing it up here. The details are in "2. Code structure" and the sections that follow.
When you analyze data with a statistical or machine learning model, including combinations of highly correlated features causes adverse effects such as:
- the analysis results are less convincing
- the weights are not stable from one training run to the next

This is multicollinearity. A quick remedy is to "delete one feature from each highly correlated pair".
However, if features are deleted without any thought, far too many of them may be lost. For example, when the data contain several highly correlated pairs, you may end up keeping a feature that is highly correlated with many others, which forces you to delete more features than necessary. In short, "I want to find the combination in which every pairwise correlation coefficient is below a certain level and the number of features is as large as possible."
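To make the problem concrete, here is a minimal sketch with a made-up absolute correlation matrix (the feature names X, P, Q, R and all values are hypothetical, not taken from any dataset in this article). X is highly correlated with the other three, so deleting the "other" feature of each pair leaves one feature, while deleting X alone leaves three.

```python
from itertools import combinations

import pandas as pd

# Hypothetical absolute correlation matrix: X is highly correlated with P, Q and R,
# while P, Q and R are only weakly correlated with one another.
names = ["X", "P", "Q", "R"]
abs_corr = pd.DataFrame(
    [[1.0, 0.9, 0.8, 0.7],
     [0.9, 1.0, 0.1, 0.2],
     [0.8, 0.1, 1.0, 0.3],
     [0.7, 0.2, 0.3, 1.0]],
    index=names, columns=names,
)
threshold = 0.5


def all_pairs_ok(features):
    """Return True if every pair in `features` has |r| <= threshold."""
    return all(abs_corr.loc[p, q] <= threshold for p, q in combinations(features, 2))


keep_x = ["X"]            # P, Q and R deleted from the pairs (X, P), (X, Q), (X, R)
drop_x = ["P", "Q", "R"]  # X deleted instead

print(all_pairs_ok(keep_x), len(keep_x))  # True 1
print(all_pairs_ok(drop_x), len(drop_x))  # True 3
```

Both choices satisfy the threshold, but the second keeps three times as many features.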
Here are some code examples and some practical examples.
The flow is to define the third function using the first two, and then call the third function to run the process. The third function, no_high_corr(df, threshold), is outlined below.
| Function | Arguments | Description |
|---|---|---|
| no_high_corr() | df, threshold | df: feature data, pandas.DataFrame; threshold: correlation coefficient level, between 0 and 1 inclusive; Return value: the feature set in which the absolute correlation coefficient between any two features is less than or equal to threshold, chosen to keep as many features as possible, pandas.DataFrame |
The first two functions are summarized in (2.) of "4. Main".
main_in
import pandas as pd
import numpy as np

def corr_loc(corr_df,threshold):
    # For each column, count the entries greater than threshold
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    for i in range(corr_df.shape[1]):
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j,i] > threshold:
                count_j += 1
            else:
                pass
        count_sums.iloc[i] = count_j
    # Column number with the largest count
    return count_sums.idxmax()

def corr_max(corr_df,threshold):
    # For each column, count the entries greater than threshold
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    for i in range(corr_df.shape[1]):
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j,i] > threshold:
                count_j += 1
            else:
                pass
        count_sums.iloc[i] = count_j
    # Largest count over all columns
    return count_sums.max()

def no_high_corr(df, threshold):
    corrmat = df.corr()
    # Use absolute values so that strong negative correlations are also removed
    a = corrmat.abs()
    b = corr_loc(a,threshold)
    c = corr_max(a,threshold)
    # Every column counts its own diagonal entry (= 1), so stop when the largest count is 1
    while c > 1:
        A = a.drop(a.columns[b],axis=1)
        B = A.drop(A.index[b])
        a = B
        b = corr_loc(B,threshold)
        c = corr_max(B, threshold)
    return df.loc[:,a.columns]
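The same procedure can also be written without the explicit double loop. The following is a minimal sketch, not a replacement for the code above: the function name drop_high_corr_greedy and the synthetic columns x, x_copy and z are hypothetical. It counts entries above the threshold with a boolean comparison and drops features by label instead of by position.

```python
import numpy as np
import pandas as pd


def drop_high_corr_greedy(df, threshold):
    """Repeatedly drop the feature with the most |r| > threshold entries."""
    a = df.corr().abs()
    counts = (a > threshold).sum(axis=0)  # the diagonal (= 1) is counted too
    while counts.max() > 1:
        worst = counts.idxmax()           # label of the first column with the largest count
        a = a.drop(index=worst, columns=worst)
        counts = (a > threshold).sum(axis=0)
    return df.loc[:, a.columns]


# Synthetic data (assumption for illustration): x_copy is a linear function of x,
# z is generated independently of both.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x, "x_copy": 2 * x + 1, "z": rng.normal(size=200)})

result = drop_high_corr_greedy(df, 0.5)
print(list(result.columns))
```

Because ties are broken by taking the first column with the largest count, this sketch drops x rather than x_copy here, which matches the tie-breaking of idxmax() in the loop-based version.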
An implementation example using the features of the scikit-learn "Boston house prices" dataset. The function outputs a feature set in which the absolute correlation coefficient between every pair of features is 0.5 or less, keeping as many features as possible. Note that load_boston was removed in scikit-learn 1.2, so this example needs an older version.
main_in
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older version

boston = load_boston()
df_features = pd.DataFrame(boston.data, columns = boston.feature_names)
print('==============Original data==============')
print(df_features.head())
low_corr_features = no_high_corr(df_features,0.5)
print('==============Corrected data==============')
print(low_corr_features)
The following is the output. From the 13 original features (Original data), 6 features (Corrected data) were selected.
main_out
==============Original data==============
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0
PTRATIO B LSTAT
0 15.3 396.90 4.98
1 17.8 396.90 9.14
2 17.8 392.83 4.03
3 18.7 394.63 2.94
4 18.7 396.90 5.33
==============Corrected data==============
CHAS DIS RAD PTRATIO B LSTAT
0 0.0 4.0900 1.0 15.3 396.90 4.98
1 0.0 4.9671 2.0 17.8 396.90 9.14
2 0.0 4.9671 2.0 17.8 392.83 4.03
3 0.0 6.0622 3.0 18.7 394.63 2.94
4 0.0 6.0622 3.0 18.7 396.90 5.33
.. ... ... ... ... ... ...
501 0.0 2.4786 1.0 21.0 391.99 9.67
502 0.0 2.2875 1.0 21.0 396.90 9.08
503 0.0 2.1675 1.0 21.0 396.90 5.64
504 0.0 2.3889 1.0 21.0 393.45 6.48
505 0.0 2.5050 1.0 21.0 396.90 7.88
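A result like the one above can be checked by looking at the largest absolute correlation that remains between two different features. The helper below is a minimal sketch under that idea; the name max_abs_offdiag_corr and the two small test frames are hypothetical.

```python
import numpy as np
import pandas as pd


def max_abs_offdiag_corr(df):
    """Largest |r| between two different columns of df (0.0 for a single column)."""
    a = df.corr().abs().to_numpy().copy()
    np.fill_diagonal(a, 0.0)  # ignore each feature's correlation with itself
    return a.max()


# Two perfectly correlated columns and two uncorrelated columns (made-up data)
correlated = pd.DataFrame({"u": [1.0, 2.0, 3.0, 4.0], "v": [2.0, 4.0, 6.0, 8.0]})
uncorrelated = pd.DataFrame({"u": [1.0, -1.0, 1.0, -1.0], "v": [1.0, 1.0, -1.0, -1.0]})

print(max_abs_offdiag_corr(correlated))
print(max_abs_offdiag_corr(uncorrelated))
```

For the selected features, a check such as max_abs_offdiag_corr(low_corr_features) <= 0.5 should hold if the selection worked as intended.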
The largest set of features whose pairwise correlations are below a certain level is obtained with the following flow.
- Built mainly around pandas.
For the walkthrough, demo data with five features (A, B, C, D, E) $\times$ 10 rows is prepared as a pd.DataFrame.
script_1_in
demo_data = pd.DataFrame(np.random.randint(-1000, 1000,(10,5)),columns=['A','B','C','D','E'])
print(demo_data)
script_1_out
A B C D E
0 -644 225 8 509 -980
1 809 993 882 -144 -462
2 -501 -505 972 -657 194
3 -980 862 886 -163 -444
4 -757 -254 186 -506 -178
5 -171 -317 973 -237 760
6 831 265 461 0 214
7 814 -466 610 -668 112
8 -281 832 -753 963 306
9 578 -557 -962 3 435
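Because the demo data is random, the numbers change on every run. If you want the walkthrough to be repeatable, one option is to fix the seed. The sketch below assumes NumPy's Generator API; the function name make_demo_data and the seed value 0 are arbitrary choices, and the values it produces will differ from the ones printed above.

```python
import numpy as np
import pandas as pd


def make_demo_data(seed=0):
    """10 rows x 5 features of integers in [-1000, 1000), reproducible for a given seed."""
    rng = np.random.default_rng(seed)
    values = rng.integers(-1000, 1000, size=(10, 5))  # upper bound is exclusive
    return pd.DataFrame(values, columns=["A", "B", "C", "D", "E"])


demo_data = make_demo_data()
print(demo_data)
```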
Process the data following steps 1 to 5 of "2.1 Overall flow".
The correlation matrix is computed with pd.DataFrame.corr() as follows.
Script_2_in
corrmat = demo_data.corr()
print(corrmat)
Script_2_out
A B C D E
A 1.000000 -0.093300 -0.070089 -0.112714 0.312305
B -0.093300 1.000000 0.048836 0.559999 -0.472975
C -0.070089 0.048836 1.000000 -0.638188 -0.117121
D -0.112714 0.559999 -0.638188 1.000000 -0.153192
E 0.312305 -0.472975 -0.117121 -0.153192 1.000000
Define a function that counts, for each column, the correlation coefficients greater than the threshold and returns the column number with the largest count (corr_loc(corr_df, threshold)), and a function that returns that largest count (corr_max(corr_df, threshold)). Steps \*1 to \*4 are common to both; only the return values, \*5 and \*6, differ.
| Function | Arguments | Description |
|---|---|---|
| corr_loc() | corr_df, threshold | corr_df: correlation matrix with each component converted to its absolute value, pandas.DataFrame; threshold: correlation coefficient level, between 0 and 1 inclusive; Return value: column number of the feature with the largest number of correlation coefficients greater than threshold |
| corr_max() | corr_df, threshold | corr_df: correlation matrix with each component converted to its absolute value, pandas.DataFrame; threshold: correlation coefficient level, between 0 and 1 inclusive; Return value: the largest per-column count of correlation coefficients greater than threshold |
script_3_in
# Function that returns the column number with the largest count
def corr_loc(corr_df,threshold):
    # *1 Empty table to record, for each column i, the count of correlation coefficients above the threshold
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    # *2 Loop over the columns i to count the correlation coefficients above the threshold
    for i in range(corr_df.shape[1]):
        # *3 Count if row j of column i is above the threshold, pass otherwise
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j,i] > threshold:
                count_j += 1
            else:
                pass
        # *4 Record the number of correlation coefficients in column i above the threshold
        count_sums.iloc[i] = count_j
    # *5 Return the column number with the largest count above the threshold
    return count_sums.idxmax()

# Function that returns the largest count
def corr_max(corr_df,threshold):
    # *1
    count_sums = pd.Series(np.zeros(corr_df.shape[1]))
    # *2
    for i in range(corr_df.shape[1]):
        # *3
        count_j = 0
        for j in range(corr_df.shape[1]):
            if corr_df.iloc[j,i] > threshold:
                count_j += 1
            else:
                pass
        # *4
        count_sums.iloc[i] = count_j
    # *6 Return the largest count above the threshold
    return count_sums.max()
| Number | Annotation |
|---|---|
| *1 | Empty table that records, for each column i, the count of correlation coefficients above the threshold |
| *2 | Loop over the columns i to count the correlation coefficients above the threshold |
| *3 | Count if row j of column i is above the threshold, pass otherwise |
| *4 | Record the count for column i |
| *5 | Return the column number with the largest count above the threshold |
| *6 | Return the largest count above the threshold |
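As a cross-check of steps \*1 to \*6, the counts can be computed in one line with a boolean comparison. The sketch below hard-codes the correlation matrix printed in Script_2_out (rounded to six decimals) and uses 0.2 as the threshold. Unlike corr_loc, which returns a column number, idxmax() here returns the column label.

```python
import pandas as pd

labels = ["A", "B", "C", "D", "E"]
# Values copied from the correlation matrix shown in Script_2_out
corrmat = pd.DataFrame(
    [[1.000000, -0.093300, -0.070089, -0.112714, 0.312305],
     [-0.093300, 1.000000, 0.048836, 0.559999, -0.472975],
     [-0.070089, 0.048836, 1.000000, -0.638188, -0.117121],
     [-0.112714, 0.559999, -0.638188, 1.000000, -0.153192],
     [0.312305, -0.472975, -0.117121, -0.153192, 1.000000]],
    index=labels, columns=labels,
)

# Per-column count of |r| > 0.2; each column includes its own diagonal entry
counts = (corrmat.abs() > 0.2).sum(axis=0)

print(counts.tolist())    # [2, 3, 2, 3, 3]
print(counts.idxmax())    # B  (first column with the largest count)
print(int(counts.max()))  # 3
```

B, D and E tie with a count of 3, and the first of them (B, column number 1) is the one selected.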
Using pd.DataFrame.drop(), the feature with the largest number of correlation coefficients above the threshold is repeatedly deleted from both the rows and the columns of the correlation matrix. The loop ends when no column contains a correlation coefficient above the threshold other than its own. In the following example, the threshold for the absolute value of the correlation coefficient is set to 0.2. Note that every column of the correlation matrix contains the correlation of the feature with itself ($= 1$), so each count is at least 1 and the loop has to end when the largest count is 1, not 0.
script_3_in
# *7 Since we want to delete both high positive and high negative correlations, convert each component of the correlation matrix to its absolute value
a = corrmat.abs()
# *8 This time, remove anything whose absolute correlation coefficient is greater than 0.2
b = corr_loc(a,0.2)
c = corr_max(a,0.2)
# *9 Repeat as long as the largest count of correlation coefficients above the threshold is greater than one
while c > 1:
    A = a.drop(a.columns[b],axis=1)
    B = A.drop(A.index[b])
    a = B
    b = corr_loc(B,0.2)
    c = corr_max(B, 0.2)
print(a)
script_3_out
D E
D 1.000000 0.153192
E 0.153192 1.000000
| Number | Annotation |
|---|---|
| *7 | Since we want to delete both high positive and high negative correlations, convert each component of the correlation matrix to its absolute value |
| *8 | This time, remove anything whose absolute correlation coefficient is greater than 0.2 |
| *9 | Repeat as long as the largest count of correlation coefficients above the threshold is greater than one |
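To see what the loop does step by step, the sketch below records the order in which features are removed. It hard-codes the correlation matrix from Script_2_out (rounded to six decimals); the function name greedy_trace is hypothetical, and features are dropped by label rather than by column number.

```python
import pandas as pd

labels = ["A", "B", "C", "D", "E"]
# Values copied from the correlation matrix shown in Script_2_out
corrmat = pd.DataFrame(
    [[1.000000, -0.093300, -0.070089, -0.112714, 0.312305],
     [-0.093300, 1.000000, 0.048836, 0.559999, -0.472975],
     [-0.070089, 0.048836, 1.000000, -0.638188, -0.117121],
     [-0.112714, 0.559999, -0.638188, 1.000000, -0.153192],
     [0.312305, -0.472975, -0.117121, -0.153192, 1.000000]],
    index=labels, columns=labels,
)


def greedy_trace(abs_corr, threshold):
    """Return (features removed in order, features remaining)."""
    a = abs_corr.copy()
    removed = []
    counts = (a > threshold).sum(axis=0)
    while counts.max() > 1:              # 1 = only the diagonal entry is above threshold
        worst = counts.idxmax()          # first column with the largest count
        removed.append(worst)
        a = a.drop(index=worst, columns=worst)
        counts = (a > threshold).sum(axis=0)
    return removed, list(a.columns)


removed, remaining = greedy_trace(corrmat.abs(), 0.2)
print(removed)    # ['B', 'A', 'C']
print(remaining)  # ['D', 'E']
```

After B is removed, the remaining four columns all tie with a count of 2, so the first one (A) is dropped next. The tie-breaking rule therefore has a direct effect on which features survive.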
From the original data (demo_data), extract the features whose pairwise correlation coefficients are 0.2 or less in absolute value.
script_4_in
# *10 Pass the list of remaining column names (a.columns) to extract those columns
no_high_corr = demo_data.loc[:,a.columns]
print(no_high_corr)
script_4_out
D E
0 509 -980
1 -144 -462
2 -657 194
3 -163 -444
4 -506 -178
5 -237 760
6 0 214
7 -668 112
8 963 306
9 3 435
| Number | Annotation |
|---|---|
| *10 | Pass the list of remaining column names (a.columns) to extract those columns |
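For a handful of features, it is possible to check by exhaustive search whether the result is really the largest possible set. The sketch below is an assumption-laden check, not part of the method above: the function name largest_low_corr_subset is hypothetical, and the cost grows exponentially, so it is not practical for something like 100 features. On the correlation matrix from Script_2_out with a threshold of 0.2, it finds the three features A, B, C, whereas the loop above keeps the two features D and E, so the repeated-deletion procedure is best read as a quick heuristic.

```python
from itertools import combinations

import pandas as pd

labels = ["A", "B", "C", "D", "E"]
# Values copied from the correlation matrix shown in Script_2_out
corrmat = pd.DataFrame(
    [[1.000000, -0.093300, -0.070089, -0.112714, 0.312305],
     [-0.093300, 1.000000, 0.048836, 0.559999, -0.472975],
     [-0.070089, 0.048836, 1.000000, -0.638188, -0.117121],
     [-0.112714, 0.559999, -0.638188, 1.000000, -0.153192],
     [0.312305, -0.472975, -0.117121, -0.153192, 1.000000]],
    index=labels, columns=labels,
)


def largest_low_corr_subset(abs_corr, threshold):
    """Try subsets from largest to smallest; return the first with all |r| <= threshold."""
    cols = list(abs_corr.columns)
    for size in range(len(cols), 0, -1):
        for subset in combinations(cols, size):
            if all(abs_corr.loc[p, q] <= threshold for p, q in combinations(subset, 2)):
                return list(subset)
    return []


best = largest_low_corr_subset(corrmat.abs(), 0.2)
print(best)  # ['A', 'B', 'C']
```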