Support vector regression (SVR) is a machine learning method well suited to multivariate nonlinear regression problems, because it estimates the regression curve without assuming a specific functional form. It is also robust against multicollinearity, so the model tends to stay stable even when you throw in as many explanatory variables as you have, without careful selection.
**Example of support vector regression**
test_svr.py

```python
import numpy as np
import random
import matplotlib.pyplot as plt
from sklearn import svm

PI = np.pi

# Create sample points by dividing 0 to 2π into 120 equal parts
X = np.array(range(120))
X = X * 6 * PI / 360

# Compute y = sin(X) and add Gaussian noise
y = np.sin(X)
e = [random.gauss(0, 0.2) for i in range(len(y))]
y += e

# Convert X to a column vector
X = X[:, np.newaxis]

# Fit the model
svr = svm.SVR(kernel='rbf')
svr.fit(X, y)

# Draw the regression curve
X_plot = np.linspace(0, 2 * PI, 10000)
y_plot = svr.predict(X_plot[:, np.newaxis])

# Plot the data and the fitted curve
plt.scatter(X, y)
plt.plot(X_plot, y_plot)
plt.show()
```
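The script above runs with scikit-learn's default SVR hyperparameters. If the fitted curve looks too stiff or too wiggly, `C`, `gamma`, and `epsilon` can be tuned with a grid search, the same tool the feature-selection code later in this post relies on. A minimal sketch (the parameter ranges here are just illustrative assumptions, not tuned values):

```python
from sklearn import svm, grid_search  # on newer scikit-learn: from sklearn.model_selection import GridSearchCV

# Illustrative parameter ranges (assumptions, not tuned values)
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1], 'epsilon': [0.05, 0.1, 0.2]}

gsvr = grid_search.GridSearchCV(svm.SVR(kernel='rbf'), param_grid, cv=5)
gsvr.fit(X, y)                  # X, y from test_svr.py above
print(gsvr.best_params_)        # best combination found by cross-validation
y_plot = gsvr.best_estimator_.predict(X_plot[:, np.newaxis])
```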
In general, the regression curve of support vector regression is built through a nonlinear map into a higher-dimensional feature space. As a result, you cannot simply infer each explanatory variable's contribution from the absolute value of a coefficient, the way you can in multiple regression analysis (at least, not as far as I know). One effective alternative is sensitivity analysis: record how the coefficient of determination changes while removing variables in ascending order of sensitivity, and take the variable set just before the coefficient of determination drops sharply as the effective set of features.
I referred to this document.
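To make the idea concrete, here is a small self-contained sketch of the sensitivity measure on toy data (my own illustration, not the exact procedure from the reference): fix every feature except one at its mean, feed that probe data to the trained SVR, and regress the predictions on the remaining feature; the slope is taken as that feature's sensitivity.

```python
import numpy as np
from sklearn import svm, linear_model

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(200, 2))                    # two explanatory variables
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.1, 200)

svr = svm.SVR(kernel='rbf').fit(X, y)

for j in range(2):
    X_probe = np.tile(X.mean(axis=0), (200, 1))          # all columns fixed at their means
    X_probe[:, j] = X[:, j]                              # except the one under analysis
    lm = linear_model.LinearRegression().fit(X[:, [j]], svr.predict(X_probe))
    print("sensitivity of feature", j, "=", lm.coef_[0]) # feature 0 should come out much larger
```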
Let's verify the approach using the Boston house-price dataset bundled with scikit-learn.
**Number of rounds of feature reduction and coefficient of determination**

It can be seen that the coefficient of determination does not drop sharply even after several features are removed. The results are as follows.
| Number of rounds | Feature removed | Coefficient of determination |
|---|---|---|
| 0 | - | 0.644 |
| 1 | ZN | 0.649 |
| 2 | INDUS | 0.663 |
| 3 | CHAS | 0.613 |
| 4 | CRIM | 0.629 |
| 5 | RAD | 0.637 |
| 6 | NOX | 0.597 |
| 7 | PTRATIO | 0.492 |
| 8 | B | 0.533 |
| 9 | TAX | 0.445 |
| 10 | DIS | 0.472 |
| 11 | AGE | 0.493 |
| 12 | RM | 0.311 |
The last remaining feature is LSTAT.
The meaning of each feature is roughly as follows (see the dataset documentation for details):

- **CRIM**: per-capita crime rate
- **ZN**: proportion of residential land zoned for lots over 25,000 sq. ft.
- **INDUS**: proportion of non-retail business acres
- **CHAS**: whether the tract bounds the Charles River
- **NOX**: nitric oxide concentration
- **RM**: average number of rooms
- **AGE**: proportion of homes built before 1940
- **DIS**: distance to Boston employment centers
- **RAD**: index of accessibility to radial highways
- **TAX**: full-value property tax rate
- **PTRATIO**: pupil-teacher ratio
- **B**: proportion of Black residents (transformed)
- **LSTAT**: percentage of lower-status population
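If you want to read the official description of each column yourself, the dataset ships with one (assuming a scikit-learn version that still bundles `load_boston`; it was removed in scikit-learn 1.2):

```python
from sklearn.datasets import load_boston

boston = load_boston()
print(boston.feature_names)  # the 13 feature names in order
print(boston.DESCR)          # official description of every column
```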
From this result, we can see the following.
- The eight top-ranked features (NOX, PTRATIO, B, TAX, DIS, AGE, RM, and LSTAT) alone give estimation performance comparable to using all the features.
- LSTAT and RM are more important than ZN and INDUS for predicting house prices in Boston.
By combining it with sensitivity analysis in this way, the contribution of each feature can be ranked even when support vector regression is used. Finally, here is the code used for feature selection.
select_features.py

```python
def standardize(data_table):
    for column in data_table.columns:
        if column in ["target"]:
            continue
        if data_table[column].std() == 0:
            data_table.loc[:, column] = 0
        else:
            data_table.loc[:, column] = ((data_table.loc[:, column] - data_table[column].mean())
                                         / data_table[column].std())
    return data_table


# Method that calculates the sensitivity of a single feature
def calculate_sensitivity(data_frame, feature_name, k=10):
    import numpy as np
    import pandas as pd
    from sklearn import svm
    from sklearn import linear_model
    from sklearn import grid_search

    # Parameter grid for the grid search
    tuned_parameters = [{'kernel': ['rbf'], 'gamma': [10**i for i in range(-4, 0)],
                         'C': [10**i for i in range(1, 4)]}]

    # List that stores the slope obtained in each fold
    slope_list = []
    # Sample size
    sample_size = len(data_frame.index)

    features = list(data_frame.columns)
    features.remove("target")

    for number_set in range(k):
        # Split the data into training and test sets
        if number_set < k - 1:
            test_data = data_frame.iloc[number_set*sample_size//k:(number_set+1)*sample_size//k, :]
            learn_data = pd.concat([data_frame.iloc[0:number_set*sample_size//k, :],
                                    data_frame.iloc[(number_set+1)*sample_size//k:, :]])
        else:
            test_data = data_frame[(k-1)*sample_size//k:]
            learn_data = data_frame[:(k-1)*sample_size//k]

        # Separate labels and features
        learn_label_data = learn_data["target"]
        learn_feature_data = learn_data.loc[:, features]
        test_label_data = test_data["target"]
        test_feature_data = test_data.loc[:, features]

        # In the test data, replace every column except the one being analyzed with its column mean
        for column in test_feature_data.columns:
            if column == feature_name:
                continue
            test_feature_data.loc[:, column] = test_feature_data[column].mean()

        # Convert each piece of data into numpy.array format for SVR
        X_test = np.array(test_feature_data)
        X_linear_test = np.array(test_feature_data[feature_name])
        X_linear_test = X_linear_test[:, np.newaxis]
        y_test = np.array(test_label_data)
        X_learn = np.array(learn_feature_data)
        y_learn = np.array(learn_label_data)

        # Fit the SVR and get its predictions on the probe data
        gsvr = grid_search.GridSearchCV(svm.SVR(), tuned_parameters, cv=5, scoring="mean_squared_error")
        gsvr.fit(X_learn, y_learn)
        y_predicted = gsvr.predict(X_test)

        # Fit a linear regression to the predictions
        lm = linear_model.LinearRegression()
        lm.fit(X_linear_test, y_predicted)

        # Store the slope
        slope_list.append(lm.coef_[0])

    return np.array(slope_list).mean()


# Method that calculates the coefficient of determination via k-fold cross-validation
def calculate_R2(data_frame, k=10):
    import numpy as np
    import pandas as pd
    from sklearn import svm
    from sklearn import grid_search

    # Parameter grid for the grid search
    tuned_parameters = [{'kernel': ['rbf'], 'gamma': [10**i for i in range(-4, 0)],
                         'C': [10**i for i in range(1, 4)]}]
    svr = svm.SVR()

    # List that stores the coefficient of determination for each fold
    R2_list = []

    features = list(data_frame.columns)
    features.remove("target")

    # Sample size
    sample_size = len(data_frame.index)

    for number_set in range(k):
        # Split the data into training and test sets
        if number_set < k - 1:
            test_data = data_frame[number_set*sample_size//k:(number_set+1)*sample_size//k]
            learn_data = pd.concat([data_frame[0:number_set*sample_size//k],
                                    data_frame[(number_set+1)*sample_size//k:]])
        else:
            test_data = data_frame[(k-1)*sample_size//k:]
            learn_data = data_frame[:(k-1)*sample_size//k]

        # Separate labels and features
        learn_label_data = learn_data["target"]
        learn_feature_data = learn_data.loc[:, features]
        test_label_data = test_data["target"]
        test_feature_data = test_data.loc[:, features]

        # Convert each piece of data into numpy.array format for SVR
        X_test = np.array(test_feature_data)
        y_test = np.array(test_label_data)
        X_learn = np.array(learn_feature_data)
        y_learn = np.array(learn_label_data)

        # Fit the SVR and compute R^2 on the test data
        gsvr = grid_search.GridSearchCV(svr, tuned_parameters, cv=5, scoring="mean_squared_error")
        gsvr.fit(X_learn, y_learn)
        score = gsvr.best_estimator_.score(X_test, y_test)
        R2_list.append(score)

    # Return the mean R^2
    return np.array(R2_list).mean()


if __name__ == "__main__":
    from sklearn.datasets import load_boston
    from sklearn import svm
    import pandas as pd
    import random
    import numpy as np

    # Load the Boston house-price data
    boston = load_boston()
    X_data, y_data = boston.data, boston.target
    df = pd.DataFrame(X_data, columns=boston["feature_names"])
    df['target'] = y_data

    count = 0
    temp_data = standardize(df)
    # Shuffle the rows for cross-validation (the result must be assigned back)
    temp_data = temp_data.reindex(np.random.permutation(temp_data.index)).reset_index(drop=True)

    # Data frame that stores the sensitivity of each feature and the R^2 of each round
    result_data_frame = pd.DataFrame(np.zeros((len(df.columns), len(df.columns))), columns=df.columns)
    result_data_frame["Coefficient of determination"] = np.zeros(len(df.columns))

    # Repeat until all features have been removed
    while len(temp_data.columns) > 1:
        # Coefficient of determination using all features remaining in this round
        result_data_frame.loc[count, "Coefficient of determination"] = calculate_R2(temp_data, k=10)

        # Data frame that stores the sensitivity of each feature in this round
        temp_features = list(temp_data.columns)
        temp_features.remove('target')
        temp_result = pd.DataFrame(np.zeros(len(temp_features)),
                                   columns=["abs_Sensitivity"], index=temp_features)

        # Loop over every remaining feature
        for i, feature in enumerate(temp_data.columns):
            if feature == "target":
                continue
            # Perform the sensitivity analysis for this feature
            sensitivity = calculate_sensitivity(temp_data, feature)
            result_data_frame.loc[count, feature] = sensitivity
            temp_result.loc[feature, "abs_Sensitivity"] = abs(sensitivity)
            print(feature, sensitivity)
        print(count, result_data_frame.loc[count, "Coefficient of determination"])

        # Drop the feature with the smallest absolute sensitivity
        ineffective_feature = temp_result["abs_Sensitivity"].idxmin()
        print(ineffective_feature)
        temp_data = temp_data.drop(ineffective_feature, axis=1)

        # Write out the transition of the sensitivities and R^2 so far
        result_data_frame.to_csv("result.csv")
        count += 1
```
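Once the script finishes, result.csv holds each round's sensitivities and coefficient of determination. A quick way to eyeball the transition (a usage sketch, assuming pandas and matplotlib are available):

```python
import pandas as pd
import matplotlib.pyplot as plt

results = pd.read_csv("result.csv", index_col=0)

# Plot how the coefficient of determination changes as features are removed
results["Coefficient of determination"].plot(marker="o")
plt.xlabel("Round of feature removal")
plt.ylabel("Coefficient of determination")
plt.show()
```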