ROC curve and PR curve - Understanding how to evaluate classification performance ② -

Introduction

Machine learning classification tasks have several performance metrics, each suited to a different purpose. This article summarizes the ROC curve and the PR curve, two major performance evaluation metrics for binary classification, together with their AUC (area under the curve).

References

I referred to the following in understanding the ROC curve and PR curve.

- Consideration of the difference between the ROC curve and the PR curve
- Hands-On Machine Learning with Scikit-Learn and TensorFlow

Classification task

The performance evaluation methods are explained with a concrete document classification task. As a first step, this chapter briefly shows how the classification task is carried out; since this is not an article about the classification task itself, a detailed explanation of the model is omitted.

Libraries used

The code in this article uses pandas, NumPy, scikit-learn, LightGBM, and matplotlib.

Dataset

This time, the "livedoor news corpus" is used as the dataset. Please refer to the previously posted article for details of the dataset and the morphological analysis procedure.

In the case of Japanese, preprocessing that decomposes sentences into morphemes is required in advance, so all sentences are first decomposed into morphemes and then loaded into the following data frame.

[Screenshot: the data frame after morphological analysis]

The rightmost column contains each sentence after morphological analysis, with the morphemes separated by half-width spaces. This column is used for the classification task.
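As a reference, such a space-separated column can be produced with a morphological analyzer such as MeCab. The following is only a minimal sketch; the use of MeCab's -Owakati mode and the column indices are assumptions, not necessarily the exact preprocessing used in the earlier article.

#Minimal sketch: produce a space-separated (wakati) column with MeCab
#The -Owakati mode and the column indices used here are assumptions
import MeCab

tagger = MeCab.Tagger('-Owakati')

def to_wakati(text):
    #Return the sentence with its morphemes separated by half-width spaces
    return tagger.parse(text).strip()

#Hypothetical example: column 2 is assumed to hold the raw article text
df[3] = df[2].apply(to_wakati)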

Model creation and classification

This time, we classify "Peachy" articles versus "dokujo-tsushin" articles (both are article categories aimed at women). Since this is a binary classification, it is equivalent to deciding whether each article belongs to "dokujo-tsushin". The dataset is split 7:3, with 70% used for training and 30% for evaluation.


import pandas as pd
import numpy as np
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

#Assumes the data frame after morphological analysis has already been pickled
with open('df_wakati.pickle', 'rb') as f:
    df = pickle.load(f)

#Keep only the two article categories classified this time
ddf = df[(df[1]=='peachy') | (df[1]=='dokujo-tsushin')].reset_index(drop = True)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(ddf[3])

def convert(x):
    if x == 'peachy':
        return 0
    elif x == 'dokujo-tsushin':
        return 1

target = ddf[1].apply(lambda x : convert(x))

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, target, train_size= 0.7, random_state = 0)

import lightgbm as lgb
from sklearn.metrics import classification_report

train_data = lgb.Dataset(X_train, label=y_train)
eval_data = lgb.Dataset(X_test, label=y_test)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'random_state':0
    }

gbm = lgb.train(
    params,
    train_data,
    valid_sets=[eval_data],
)
y_preds = gbm.predict(X_test)

Prediction is now complete. y_preds contains, for each document, the predicted probability that it is a "dokujo-tsushin" article.
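classification_report was imported above but not yet used; as a quick sanity check, the probabilities can be binarized and evaluated per class. The 0.5 threshold below is an arbitrary choice for illustration, not something prescribed in this article.

#Sketch: binarize the predicted probabilities at an arbitrary 0.5 threshold
#and print per-class precision / recall / F1
y_pred_labels = (y_preds >= 0.5).astype(int)
print(classification_report(y_test, y_pred_labels, target_names=['peachy', 'dokujo-tsushin']))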

Evaluation method of classification performance

Before looking at the ROC curve and the PR curve, let's review the confusion matrix that underlies them. The confusion matrix summarizes the output of a binary classification task and is expressed as follows.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | TP (true positive) | FN (false negative) |
| Actually Negative | FP (false positive) | TN (true negative) |

The ROC curve can be drawn from the values in this confusion matrix.
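For reference, here is a minimal sketch that computes the confusion matrix of the classifier above with scikit-learn; the 0.5 threshold is again an arbitrary choice for illustration.

#Sketch: confusion matrix of the classifier above at an arbitrary 0.5 threshold
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, (y_preds >= 0.5).astype(int)).ravel()
print('TP:', tp, 'FN:', fn, 'FP:', fp, 'TN:', tn)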

ROC curve

Overview of ROC curve


\text{FPR} = \frac{FP}{TN + FP}


\text{TPR(recall)} = \frac{TP}{TP + FN}

**The ROC curve is a plot of $\text{TPR}$ (true positive rate) against $\text{FPR}$ (false positive rate).** Let's look at what this plot means. First, let's apply the meanings of $\text{FPR}$ and $\text{TPR}$ (recall) to the concrete example.


\text{FPR} = \frac{\text{Number of actual "Peachy" articles that the model incorrectly predicted to be "dokujo-tsushin" articles}}{\text{Total number of actual "Peachy" articles}}


\text{TPR (recall)} = \frac{\text{Number of actual "dokujo-tsushin" articles that the model correctly predicted to be "dokujo-tsushin" articles}}{\text{Total number of actual "dokujo-tsushin" articles}}

The meaning of this can be summarized as follows.

- $\text{FPR}$ is the proportion of actually negative ("Peachy") articles that were incorrectly classified as positive ("dokujo-tsushin") → **it expresses how often negative data is misjudged (lower is better)**
- $\text{TPR}$ is the proportion of actually positive ("dokujo-tsushin") articles that were correctly classified as positive ("dokujo-tsushin") → **it expresses how exhaustively positives are detected (higher is better)**

In other words, ideally, $\text{FPR}$ is low while $\text{TPR}$ is high.

You can draw a ROC curve by plotting $\text{FPR}$ and $\text{TPR}$ at various thresholds. Given that ideally $\text{FPR}$ should be low while $\text{TPR}$ is high, the closer the shape of the ROC curve is to a right angle, the better, which leads to the idea that **the larger the AUC (area under the ROC curve), the better**.
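To make the threshold sweep concrete, the following sketch computes $\text{FPR}$ and $\text{TPR}$ by hand at a few thresholds; the specific threshold values are arbitrary choices for illustration.

#Sketch: compute FPR and TPR by hand at a few arbitrary thresholds
y_true = np.asarray(y_test)
for th in [0.2, 0.5, 0.8]:
    pred = (y_preds >= th).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fn = np.sum((pred == 0) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    tn = np.sum((pred == 0) & (y_true == 0))
    print(f'threshold={th}: FPR={fp / (fp + tn):.3f}, TPR={tp / (tp + fn):.3f}')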

ROC curve drawing

Let's actually draw the ROC curve.


from sklearn import metrics
import matplotlib.pyplot as plt
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_preds)
auc = metrics.auc(fpr, tpr)
print(auc)

plt.plot(fpr, tpr, label='ROC curve (area = %.2f)'%auc)
plt.plot(np.linspace(1, 0, len(fpr)), np.linspace(1, 0, len(fpr)), label='Random ROC curve (area = %.2f)'%0.5, linestyle = '--', color = 'gray')

plt.legend()
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
plt.show()

[Figure: ROC curve of the classifier]

The ROC curve above was drawn for the classifier created in this article. Considering that the curve is close to a right angle and the AUC is $0.98$ (the maximum is $1$), the classifier is evidently very accurate. For a random classifier the AUC is always $0.5$, which makes comparison against random easy.
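As a side note, the same value can also be obtained directly from the labels and predicted probabilities with roc_auc_score; this is just an equivalent shortcut.

#Sketch: ROC AUC computed directly from labels and predicted probabilities
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_preds))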

PR curve

Overview of PR curve


\text{precision} = \frac{TP}{TP + FP}


\text{recall(TPR)} = \frac{TP}{TP + FN}

**The PR curve is a plot of $\text{precision}$ against $\text{recall}$.** Let's look at what this plot means. First, let's apply the meanings of $\text{precision}$ and $\text{recall}$ to the concrete example.


\text{precision} = \frac{\text{Number of articles the model predicted to be "dokujo-tsushin" that actually are "dokujo-tsushin" articles}}{\text{Total number of articles the model predicted to be "dokujo-tsushin"}}


\text{recall} = \frac{\text{Number of actual "dokujo-tsushin" articles that the model correctly predicted to be "dokujo-tsushin" articles}}{\text{Total number of actual "dokujo-tsushin" articles}}

- $\text{precision}$ is the proportion of the data classified as positive by the model that is truly positive ("dokujo-tsushin" articles) → **it expresses how reliable a positive judgment is (higher is better)**
- $\text{recall}$ is the proportion of actually positive ("dokujo-tsushin") articles that were correctly classified as positive ("dokujo-tsushin") → **it expresses how exhaustively positives are detected (higher is better)**

In other words, ideally, $\text{precision}$ is high (positive judgments are reliable) while $\text{recall}$ is also as high as possible (positives are covered exhaustively).
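As with FPR and TPR, precision and recall can also be checked at a single threshold; the sketch below uses precision_score and recall_score with an arbitrary 0.5 cut-off.

#Sketch: precision and recall at an arbitrary 0.5 threshold
from sklearn.metrics import precision_score, recall_score

y_pred_labels = (y_preds >= 0.5).astype(int)
print('precision:', precision_score(y_test, y_pred_labels))
print('recall   :', recall_score(y_test, y_pred_labels))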

You can draw a PR curve by plotting $\text{precision}$ and $\text{recall}$ at various thresholds. As with the ROC curve, the larger the AUC (area under the PR curve), the better the accuracy.


precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_preds)

auc = metrics.auc(recall, precision)
print(auc)

plt.plot(recall, precision, label='PR curve (area = %.2f)'%auc)
plt.legend()
plt.title('PR curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.grid(True)
plt.show()

[Figure: PR curve of the classifier]

The PR curve above was drawn for the classifier created in this article. Given that the AUC is $0.98$ (the maximum is $1$), the classifier also looks very accurate from the perspective of the PR curve. However, unlike the ROC curve, the AUC of a random classifier is not fixed at $0.5$.
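For reference, the baseline of the PR curve for a chance-level classifier is the proportion of positive samples, since its precision equals the positive-class prevalence regardless of recall. A quick way to check that baseline for this dataset is sketched below.

#Sketch: the PR-curve baseline of a chance-level classifier equals the positive-class ratio
baseline = np.mean(y_test)
print('positive ratio (PR baseline):', baseline)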

ROC curve and PR curve

The question is whether to use the ROC curve or the PR curve. In general, **use the PR curve for imbalanced data (for example, when negatives vastly outnumber positives), and use the ROC curve otherwise**.

As an interpretation, the ROC curve takes both viewpoints into account, namely whether positives are judged as positive and whether negatives are judged as negative, whereas the PR curve focuses only on whether positives are judged as positive. In general, it is therefore better to use the ROC curve, which looks at the balance of both, as the performance index of a classifier. However, when negatives are overwhelmingly in the majority, a classifier that merely judges the large mass of negatives as negative is rated as good by the ROC curve, even if its judgments on positives are haphazard. In that situation, my view is that the PR curve should be used to check whether the small number of positives is judged properly.

As an extreme example, suppose there are 100 positive samples and 99,900 negative samples, and a model that scores the positives more or less at random while reliably giving the negatives low scores (that is, negatives are confidently judged as negative). The ROC curve and PR curve then look as follows.


rand_predict = np.concatenate((np.random.rand(100) , 0.5*np.random.rand(99900)))
rand_test = np.concatenate((np.ones(100), np.zeros(99900)))

from sklearn import metrics
import matplotlib.pyplot as plt
fpr, tpr, thresholds = metrics.roc_curve(rand_test, rand_predict)
auc = metrics.auc(fpr, tpr)
print(auc)

#ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %.2f)'%auc)
plt.plot(np.linspace(1, 0, len(fpr)), np.linspace(1, 0, len(fpr)), label='Random ROC curve (area = %.2f)'%0.5, linestyle = '--', color = 'gray')

plt.legend()
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
plt.show()

##PR curve
precision, recall, thresholds = metrics.precision_recall_curve(rand_test, rand_predict)

auc = metrics.auc(recall, precision)
print(auc)

plt.plot(recall, precision, label='PR curve (area = %.2f)'%auc)
plt.legend()
plt.title('PR curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.grid(True)
plt.show()

[Figure: ROC curve for the imbalanced example]

[Figure: PR curve for the imbalanced example]

Can you see the large difference in AUC (the ROC AUC is $0.72$, while the PR AUC is $0.47$)? Even though the data is the same, the judgment of accuracy changes greatly depending on which index you look at. Basically, it is important to use the ROC curve and the PR curve appropriately from the viewpoint above and to make case-by-case judgments according to the task.

Next, I would like to summarize performance evaluation methods for machine learning tasks other than classification.
