I keep forgetting which evaluation metrics go with which task, so this is a summary of the minimum I want to remember. This article covers the main metrics for regression and binary classification. Reference: [Kaggle Book](https://www.amazon.co.jp/dp/4297108437)
The code in this article has been confirmed to work on Windows 10 with Python 3.7.3.
import platform
print(platform.platform())
print(platform.python_version())
Read regression and binary classification datasets from sklearn.datasets.
from sklearn import datasets
import numpy as np
import pandas as pd
#Regression dataset
boston = datasets.load_boston()
boston_X = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_y = pd.Series(boston.target)
#Binary classification dataset
cancer = datasets.load_breast_cancer()
cancer_X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_y = pd.Series(cancer.target)
To obtain the evaluation metrics, I simply fit models as in the book and output predicted values. Since the only goal is to compute the metrics, I skip EDA, feature engineering, and even validation. I feel a bit guilty about that.
from sklearn.linear_model import LinearRegression, LogisticRegression
#Regression
slr = LinearRegression()
slr.fit(boston_X, boston_y)
boston_y_pred = slr.predict(boston_X)
#Binary classification
lr = LogisticRegression(solver='liblinear')
lr.fit(cancer_X, cancer_y)
cancer_y_pred = lr.predict(cancer_X)
cancer_y_pred_prob = lr.predict_proba(cancer_X)[:, 1]
RMSE (Root Mean Squared Error)
I use it often in practice because it gives an intuitive sense of how far the predictions deviate from the true values. Despite being such a major metric, sklearn only provides MSE; I wish it would return the value after applying np.sqrt.
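For reference, the definition (same notation as the R^2 formula below):
RMSE=\sqrt{\dfrac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2}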
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(boston_y, boston_y_pred))
print(rmse)
4.679191295697281
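As an aside (my addition, not from the book): in scikit-learn 0.22 and later, mean_squared_error can return RMSE directly via squared=False, so whether the np.sqrt wrapper is needed depends on your version.
#Requires scikit-learn >= 0.22; on older versions use the np.sqrt pattern above
rmse_direct = mean_squared_error(boston_y, boston_y_pred, squared=False)
print(rmse_direct)  #same value as the np.sqrt version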
MAE (Mean Absolute Error)
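For reference, the definition; since the errors are not squared, it is less sensitive to outliers than RMSE:
MAE=\dfrac{1}{N}\sum_{i=1}^N|y_i-\hat{y}_i|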
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(boston_y, boston_y_pred)
print(mae)
3.2708628109003115
R^2 (Coefficient of Determination)
R^2=1-\dfrac{\sum_{i=1}^N(y_i-\hat{y}_i)^2}{\sum_{i=1}^N(y_i-\bar{y})^2}
The closer it is to 1, the higher the accuracy, so it is an easy metric to grasp even for non-technical people. I used to love it for that reason, but after getting burned by it in practice I no longer really trust it. Because the coefficient of determination has the variance in the denominator, even a sloppy model will score relatively high when the data vary widely. I keep it as a reference and find it safer to judge how well the predictions actually hit by RMSE.
from sklearn.metrics import r2_score
r2 = r2_score(boston_y, boston_y_pred)
print(r2)
0.7406426641094095
Binary classification is treated separately because in some cases the predicted value is the class label (positive or negative) and in other cases it is the predicted probability of being positive.
Confusion Matrix
- TP (True Positive): the prediction was positive, and it was correct
- TN (True Negative): the prediction was negative, and it was correct
- FP (False Positive): the prediction was positive, but it was wrong
- FN (False Negative): the prediction was negative, but it was wrong
It's just a tally, but I think it's the most important one. For example, suppose a virus that infects only 1 in 100 people is spreading and everyone is tested. If a model simply calls everyone negative, its accuracy still looks like 99% at first glance. You need a firm grasp of the confusion matrix so you aren't fooled by tricks like that. That said, I always get confused about which cell is TP and which is FN.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(cancer_y, cancer_y_pred)
tn, fp, fn, tp = cm.flatten()
print(cm)
[[198  14]
 [  9 348]]
Accuracy
In the earlier example, calling everyone negative would give an accuracy of 0.99. For a binary classification task, it's clear that nothing can start until you first check whether the data is imbalanced.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(cancer_y, cancer_y_pred)
print(accuracy)
0.9595782073813708
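A minimal toy sketch (my own, not from the book) of the 1-in-100 virus example: a model that predicts everyone as negative gets 0.99 accuracy while missing the only positive.
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
#Toy data: 1 positive out of 100, and a "model" that always predicts negative
y_true = np.array([1] + [0] * 99)
y_pred_all_negative = np.zeros(100, dtype=int)
print(accuracy_score(y_true, y_pred_all_negative))   #0.99, looks great
print(recall_score(y_true, y_pred_all_negative))     #0.0, the one positive is missed
print(confusion_matrix(y_true, y_pred_all_negative))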
Precision / Recall
Precision is the proportion of records predicted positive that are truly positive, and recall is the proportion of truly positive records that are predicted positive. The two are in a trade-off relationship. Precision guards against false positives and recall guards against misses, so decide which to prioritize according to the purpose. The virus test example above would emphasize recall, while precision tends to matter more in a marketing context.
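In terms of the confusion matrix cells above:
Precision=\dfrac{TP}{TP+FP},\quad Recall=\dfrac{TP}{TP+FN}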
from sklearn.metrics import precision_score, recall_score
precision = precision_score(cancer_y, cancer_y_pred)
recall = recall_score(cancer_y, cancer_y_pred)
print(precision)
print(recall)
0.9613259668508287
0.9747899159663865
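As a quick illustration of the trade-off (my own sketch, reusing the predicted probabilities from above): moving the decision threshold raises one metric at the expense of the other.
from sklearn.metrics import precision_score, recall_score
#Convert probabilities to labels at several thresholds and compare
for threshold in [0.3, 0.5, 0.7]:
    pred = (cancer_y_pred_prob >= threshold).astype(int)
    print(f'threshold={threshold}: precision={precision_score(cancer_y, pred):.3f}, recall={recall_score(cancer_y, pred):.3f}')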
F1-score
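F1 is the harmonic mean of precision and recall; F_beta generalizes it by weighting recall beta times as much as precision, so beta=0.5 below leans toward precision:
F_1=\dfrac{2\cdot Precision\cdot Recall}{Precision+Recall},\quad F_\beta=\dfrac{(1+\beta^2)\cdot Precision\cdot Recall}{\beta^2\cdot Precision+Recall}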
from sklearn.metrics import f1_score, fbeta_score
f1 = f1_score(cancer_y, cancer_y_pred)
fbeta = fbeta_score(cancer_y, cancer_y_pred, beta=0.5)
print(f1)
print(fbeta)
0.968011126564673
0.96398891966759
MCC (Matthews Correlation Coefficient)
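It ranges from -1 to 1 (1 means perfect prediction, 0 is chance level) and uses all four cells of the confusion matrix, which makes it handy for imbalanced data:
MCC=\dfrac{TP\cdot TN-FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}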
from sklearn.metrics import matthews_corrcoef
mcc = matthews_corrcoef(cancer_y, cancer_y_pred)
print(mcc)
0.9132886202215396
AUC (Area Under the ROC Curve)
The area under the ROC curve, which plots the false positive rate on the horizontal axis and the true positive rate on the vertical axis.
Sort the records by predicted value in descending order from the left, and picture the curve moving up for each record that is actually positive and right for each record that is actually negative.
With a perfect prediction, the ROC curve therefore shoots straight up to the top-left corner and the AUC is 1; a random prediction traces the diagonal.
The well-known Gini coefficient is expressed as Gini = 2AUC - 1, so it is linear in AUC.
import japanize_matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
%matplotlib inline
auc = roc_auc_score(cancer_y, cancer_y_pred_prob)
print(auc)
fpr, tpr, thresholds = roc_curve(cancer_y, cancer_y_pred_prob)
plt.plot(fpr, tpr, label='AUC={:.2f}'.format(auc))
plt.legend()
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.show()
0.9946488029173934
logloss
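Lower is better; it evaluates the predicted probabilities directly and heavily penalizes predictions that are confident but wrong:
logloss=-\dfrac{1}{N}\sum_{i=1}^N\bigl(y_i\log p_i+(1-y_i)\log(1-p_i)\bigr)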
from sklearn.metrics import log_loss
logloss = log_loss(cancer_y, cancer_y_pred_prob)
print(logloss)
0.09214591499092101
I have no practical experience with multi-class classification or recommendation, and writing about them would just be rote copying, so I'll leave them for another time. I hope to be able to write about them someday.