○ Main points of this article
A note on reproducing overfitting.
Overfitting: the model fits the training data well but cannot handle unknown data. In other words, it fails to generalize.

○ Source code (Python): Overfitting a model and confirming the overfitting

`Overfitting a model and how to check for it`


# Note: load_boston was removed in scikit-learn 1.2, so this code needs an older version
from sklearn.datasets import load_boston
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
# Jupyter Notebook only: display plots inline
%matplotlib inline

# Data preparation: Boston housing prices
data = load_boston()
X = data.data[:, [5]]  # Use only the average number of rooms (RM) as the explanatory variable
y = data.target

# Split into training data (first 400 rows) and test data (the remaining rows), without shuffling
train_X, test_X = X[:400], X[400:]
train_y, test_y = y[:400], y[400:]

# Train an SVR (support vector regression, a kernel method) model
model_s = SVR(C=1.0, kernel='rbf')  # RBF kernel with regularization parameter C=1
model_s.fit(train_X, train_y)
# Predict on the training data
s_pred = model_s.predict(train_X)
# Predict on the test data (prediction for unknown data)
s_pred_t = model_s.predict(test_X)

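# Plot the data and the prediction curves (points sorted by x so each line is drawn left to right)
fig, ax = plt.subplots()
ax.scatter(train_X, train_y, color='red', marker='s', label='data')
order = train_X[:, 0].argsort()
ax.plot(train_X[order], s_pred[order], color='blue', label='svr_rbf curve(train)')
order_t = test_X[:, 0].argsort()
ax.plot(test_X[order_t], s_pred_t[order_t], color='orange', label='svr_rbf curve(test)')
ax.legend()
plt.show()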

print("○ Mean square error and coefficient of determination of training data")
print(mean_squared_error(train_y, s_pred))
print(r2_score(train_y, s_pred))
print("○ Mean square error and coefficient of determination of test data")
print(mean_squared_error(test_y, s_pred_t))
print(r2_score(test_y, s_pred_t))

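○ Result

(Figure: scatter plot of the training data with the SVR prediction curves for the training data in blue and the test data in orange)

○ Mean square error and coefficient of determination of training data
30.330756428515905
0.6380880725968641
○ Mean square error and coefficient of determination of test data
69.32813164021485
-1.4534559402985217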

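The curve for the training data (blue) follows the data reasonably well, but the curve for the test data (orange) fits poorly. The metrics confirm this: the mean squared error rises from about 30.3 on the training data to about 69.3 on the test data, and the coefficient of determination drops from about 0.64 to about -1.45. A negative coefficient of determination means the model predicts the test data worse than simply predicting the mean of the test targets. This is overfitting.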
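To make the meaning of a negative coefficient of determination concrete, here is a minimal sketch that computes it by hand as 1 - SS_res / SS_tot. The helper name `r2` and the toy values are assumptions made for illustration; they are not part of the code above, which uses `r2_score` from scikit-learn.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]  # toy targets (hypothetical values)

print(r2(y_true, [1.0, 2.0, 3.0]))  # perfect predictions -> 1.0
print(r2(y_true, [2.0, 2.0, 2.0]))  # always predicting the mean -> 0.0
print(r2(y_true, [3.0, 2.0, 1.0]))  # worse than predicting the mean -> -3.0
```

The score falls below zero as soon as the model's squared error exceeds that of the constant mean prediction, which is the situation the test data shows here.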

There are various ways to prevent overfitting. I'll explain them in detail another time.
・ Increase the amount of training data
・ Perform cross-validation
・ Adjust hyperparameters (make the model simpler)
・ Reduce the number of features
・ Apply regularization
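As a rough illustration of two of these items (cross-validation and hyperparameter adjustment), the sketch below scores an SVR with several values of C using 5-fold cross-validation. It is only a sketch under stated assumptions: the data is synthetic (a noisy straight line generated with NumPy, standing in for the Boston data), and the candidate C values are arbitrary. In SVR, a smaller C means stronger regularization.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, not the Boston data):
# one feature in the range 3 to 9 and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(3.0, 9.0, size=(200, 1))
y = 9.0 * X[:, 0] - 34.0 + rng.normal(0.0, 3.0, size=200)

# Shuffled 5-fold split, so every fold covers the whole range of the data
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {}
for C in [0.1, 1.0, 10.0, 100.0]:  # arbitrary candidate values
    model = SVR(C=C, kernel='rbf')
    # R^2 on each held-out fold; the mean estimates performance on unknown data
    scores = cross_val_score(model, X, y, cv=cv, scoring='r2')
    cv_results[C] = scores.mean()
    print(f"C={C}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")

best_C = max(cv_results, key=cv_results.get)
print("best C by cross-validated R^2:", best_C)
```

Because every sample is used for validation exactly once, the averaged score depends less on one particular split than a single train/test comparison does.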
