The code in this article has been confirmed to work on Windows 10 with Python 3.7.3.
import platform
Read regression and binary classification datasets from sklearn.datasets.
from sklearn import datasets
import numpy as np
import pandas as pd
#Regression dataset
#Note: load_boston was removed in scikit-learn 1.2, so this line needs an older version
boston = datasets.load_boston()
boston_X = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_y = pd.Series(boston.target)
#Binary classification dataset
cancer = datasets.load_breast_cancer()
cancer_X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_y = pd.Series(cancer.target)
To compute the evaluation metrics, I fit a quick model and output its predicted values. Since the only goal is to get numbers to evaluate, I skipped EDA, feature engineering, and even validation: the models are fit and scored on the same data. I feel guilty about that.
from sklearn.linear_model import LinearRegression, LogisticRegression
#Regression
slr = LinearRegression()
slr.fit(boston_X, boston_y)
boston_y_pred = slr.predict(boston_X)
#Binary classification
lr = LogisticRegression(solver='liblinear')
lr.fit(cancer_X, cancer_y)
cancer_y_pred = lr.predict(cancer_X)
cancer_y_pred_prob = lr.predict_proba(cancer_X)[:, 1]
I often use RMSE (Root Mean Squared Error) in practice because it shows intuitively how far the predicted values deviate from the true values. Despite it being a major metric, sklearn (at least in the version used here) only provides MSE. I wish it would return the value after np.sqrt.
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(boston_y, boston_y_pred))
MAE (Mean Absolute Error)
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(boston_y, boston_y_pred)
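To make the two regression metrics above concrete, here is a minimal sketch that computes RMSE and MAE by hand and compares them with sklearn. The arrays `y_true` and `y_pred` are made-up toy values for illustration, not data from this article.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy values chosen only for illustration (an assumption, not article data)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_pred - y_true

# RMSE: square the errors, average them, then take the square root
rmse_manual = np.sqrt(np.mean(errors ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# MAE: average of the absolute errors
mae_manual = np.mean(np.abs(errors))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(rmse_manual, rmse_sklearn)
print(mae_manual, mae_sklearn)
```

Both metrics are in the same units as the target. RMSE is never smaller than MAE, because squaring gives large errors more weight.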
R2 (the coefficient of determination): the closer it is to 1, the better the fit, which makes it an easy metric to grasp even for non-technical people. That's why I used to love it, but after getting burned by it in practice I no longer trust it much. Because R2 has the variance of the target in its denominator, even a sloppy model can score relatively high when the data varies widely. Treat it as a reference only; it is safer to judge how well the predictions actually hit by RMSE.
from sklearn.metrics import r2_score
r2 = r2_score(boston_y, boston_y_pred)
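The dependence of R2 on the spread of the target can be sketched as follows. This is only an illustration under my own assumptions: the targets are synthetic uniform random numbers, and the same noise is added as the "prediction error" in both cases, so RMSE is the same while R2 differs wildly.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# The same prediction errors are used in both cases
noise = rng.normal(0.0, 1.0, size=1000)

# Synthetic targets: one with a small spread, one with a large spread
y_narrow = rng.uniform(0.0, 2.0, size=1000)
y_wide = rng.uniform(0.0, 200.0, size=1000)

pred_narrow = y_narrow + noise
pred_wide = y_wide + noise

rmse_narrow = np.sqrt(mean_squared_error(y_narrow, pred_narrow))
rmse_wide = np.sqrt(mean_squared_error(y_wide, pred_wide))

r2_narrow = r2_score(y_narrow, pred_narrow)
r2_wide = r2_score(y_wide, pred_wide)

print('RMSE:', rmse_narrow, rmse_wide)  # practically identical
print('R2  :', r2_narrow, r2_wide)      # very different
```

The predictions are off by the same amount in both cases, yet R2 looks excellent only for the widely spread target.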
Metrics for the binary classification task are treated in two groups, because some take the predicted class (positive or negative) as the predicted value and others take the predicted probability of being positive.
- TP (True Positive): the predicted value is a positive example and the prediction is correct
- TN (True Negative): the predicted value is a negative example and the prediction is correct
- FP (False Positive): the predicted value is a positive example and the prediction is incorrect
- FN (False Negative): the predicted value is a negative example and the prediction is incorrect
The confusion matrix is just a tally, but I think it is the most important of all. For example, suppose a virus that infects only 1 in 100 people is spreading and people are tested for it. A test that simply declares everyone negative reaches 99% accuracy, which looks impressive at first glance... Keep the confusion matrix firmly in mind so that you are not fooled by such tricks. That said, I still get confused about which is which among TP, FN, and the rest.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(cancer_y, cancer_y_pred)
tn, fp, fn, tp = cm.flatten()
print(cm)
[[198  14]
 [  9 348]]
Accuracy is the fraction of predictions that are correct. In the virus example above, the accuracy would be 0.99 if everyone were predicted negative. For a binary classification task, this makes it clear that the first thing to check is whether the data is imbalanced.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(cancer_y, cancer_y_pred)
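The virus example can be reproduced in a few lines. This is a sketch under my own assumptions: 100 hypothetical people of whom exactly one is infected, and a "test" that predicts everyone as negative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical population: 1 infected person (label 1) out of 100
y_true = np.zeros(100, dtype=int)
y_true[0] = 1

# A useless test that declares everyone negative
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
rec = recall_score(y_true, y_pred)

print('accuracy:', acc)
print('TN, FP, FN, TP:', tn, fp, fn, tp)
print('recall:', rec)
```

The accuracy is high only because of the 99 true negatives. The confusion matrix shows that the single infected person was missed (TP = 0, FN = 1), so recall is 0.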
precision is the fraction of the examples predicted to be positive that are actually positive, TP / (TP + FP), and recall is the fraction of the actually positive examples that are predicted to be positive, TP / (TP + FN). The two are in a trade-off relationship. Low precision means many false alarms and low recall means many oversights, so decide which one to prioritize according to the purpose. The virus test example mentioned earlier emphasizes recall, while precision tends to matter more in a marketing context.
from sklearn.metrics import precision_score, recall_score
precision = precision_score(cancer_y, cancer_y_pred)
recall = recall_score(cancer_y, cancer_y_pred)
print(precision, recall)
0.9613259668508287 0.9747899159663865
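As a check, the two values above can be recovered by hand from the confusion matrix counts printed earlier (TN = 198, FP = 14, FN = 9, TP = 348). A minimal sketch:

```python
# Counts taken from the confusion matrix output shown earlier in the article
tn, fp, fn, tp = 198, 14, 9, 348

# precision: of everything predicted positive, how much is really positive
precision_manual = tp / (tp + fp)

# recall: of everything really positive, how much was predicted positive
recall_manual = tp / (tp + fn)

print(precision_manual, recall_manual)
```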
from sklearn.metrics import f1_score, fbeta_score
f1 = f1_score(cancer_y, cancer_y_pred)
fbeta = fbeta_score(cancer_y, cancer_y_pred, beta=0.5)
print(f1, fbeta)
0.968011126564673 0.96398891966759
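The F1 score is the harmonic mean of precision and recall, and the F-beta score generalizes it with a weight beta (a beta below 1 puts more weight on precision). The following sketch recomputes both from the confusion matrix counts shown earlier; the helper `f_beta` is a name I made up for illustration, not an sklearn function.

```python
# Counts taken from the confusion matrix output shown earlier in the article
tn, fp, fn, tp = 198, 14, 9, 348

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)


def f_beta(precision, recall, beta):
    """Hypothetical helper: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


f1_manual = f_beta(precision_manual, recall_manual, beta=1.0)
f05_manual = f_beta(precision_manual, recall_manual, beta=0.5)

print(f1_manual, f05_manual)
```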
MCC (Matthews Correlation Coefficient)
from sklearn.metrics import matthews_corrcoef
mcc = matthews_corrcoef(cancer_y, cancer_y_pred)
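MCC uses all four cells of the confusion matrix, so it remains usable on imbalanced data. As a rough sketch, the standard formula can be checked against sklearn by rebuilding label arrays from the counts shown earlier; the arrays `y_true` and `y_pred` below are reconstructed only for this check.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Counts taken from the confusion matrix output shown earlier in the article
tn, fp, fn, tp = 198, 14, 9, 348

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
numerator = tp * tn - fp * fn
denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc_manual = numerator / denominator

# Rebuild label arrays that produce exactly these counts
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)
mcc_sklearn = matthews_corrcoef(y_true, y_pred)

print(mcc_manual, mcc_sklearn)
```

The value ranges from -1 to 1: 1 is a perfect prediction, 0 is no better than random, and -1 is a completely inverted prediction.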
AUC (Area Under the ROC Curve)
The area under the ROC curve, which plots the false positive rate on the horizontal axis and the true positive rate on the vertical axis.
Sort the records by predicted value in descending order and walk through them from the left: the curve moves up for each record that is actually a positive example and to the right for each record that is actually a negative example.
Therefore, with a perfect prediction the ROC curve jumps straight up to the ceiling at the upper left, and the AUC becomes 1. For random predictions, the curve roughly follows the diagonal and the AUC is about 0.5.
The well-known Gini coefficient is expressed as Gini = 2 * AUC - 1, so it is a linear function of AUC.
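The "move up for a positive, move right for a negative" description can be turned into code directly. The following is a minimal sketch on made-up toy scores with no ties (tied scores would need extra handling that is omitted here); it is not how sklearn implements `roc_curve`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores chosen only for illustration (no tied scores)
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

# Sort the records by predicted value, highest first
order = np.argsort(-y_score)
sorted_true = y_true[order]

n_pos = sorted_true.sum()
n_neg = len(sorted_true) - n_pos

# Each positive moves the curve up by 1/n_pos,
# each negative moves it right by 1/n_neg
tpr = np.concatenate([[0.0], np.cumsum(sorted_true) / n_pos])
fpr = np.concatenate([[0.0], np.cumsum(1 - sorted_true) / n_neg])

# Area under the step curve: width of each horizontal step times its height
auc_manual = np.sum(np.diff(fpr) * tpr[1:])
gini = 2 * auc_manual - 1

print(auc_manual, roc_auc_score(y_true, y_score))
print(gini)
```

Seen this way, AUC depends only on the ordering of the predicted values, not on their absolute size.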
import japanize_matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
%matplotlib inline
auc = roc_auc_score(cancer_y, cancer_y_pred_prob)
fpr, tpr, thresholds = roc_curve(cancer_y, cancer_y_pred_prob)
plt.plot(fpr, tpr, label='AUC={:.2f}'.format(auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend()  #show the AUC label set in plt.plot
plt.show()
from sklearn.metrics import log_loss
logloss = log_loss(cancer_y, cancer_y_pred_prob)
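Log loss takes the predicted probability of the positive class and penalizes confident wrong answers heavily. As a rough sketch of what `log_loss` computes in the binary case, here is the formula written out by hand on toy values of my own choosing.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted probabilities of the positive class (illustrative only)
y_true = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.8, 0.4])

# Binary log loss: -mean( y*log(p) + (1-y)*log(1-p) )
logloss_manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
logloss_sklearn = log_loss(y_true, p)

print(logloss_manual, logloss_sklearn)
```

The lower the value, the better; a perfect prediction gives 0.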
I have no practical experience with multi-class classification or recommendation, and writing about them would amount to copying the textbook by rote, so I will leave them for another occasion. I hope to be able to write about them someday.
