In machine learning, the **ROC curve** and its **AUC (Area Under the Curve)** are used as indicators of how good a classifier is at separating two classes.
Roughly speaking, the ROC curve shows how well the classifier's scores separate the two class distributions, and the AUC summarizes each curve as a single number, which makes it easy to compare several classifiers.
References:

- Easy-to-understand explanation of the meaning and properties of AUC and ROC curves - Mathematics learned with concrete examples
- [Machine learning evaluation metrics - ROC curve and AUC](https://techblog.gmo-ap.jp/2018/12/14/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E8%A9%95%E4%BE%A1%E6%8C%87%E6%A8%99-roc%E6%9B%B2%E7%B7%9A%E3%81%A8auc/)
Depending on the model used for training, a ROC curve may or may not be drawable. A model whose output (the return value of `model.predict()` etc.) is a probability or score can produce a ROC curve, but a model whose output is already a binary label cannot.
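To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the dataset and classifier are arbitrary choices for illustration). `predict_proba()` yields the continuous scores a ROC curve needs, while `predict()` collapses them to 0/1 labels, which leaves almost nothing to plot:

```python
# Sketch: score output vs. binary output (illustrative model and data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression().fit(X, y)

scores = clf.predict_proba(X)[:, 1]      # continuous scores -> full ROC curve
labels = clf.predict(X)                  # hard 0/1 labels -> degenerate "curve"
fpr_s, tpr_s, _ = roc_curve(y, scores)   # many thresholds
fpr_l, tpr_l, _ = roc_curve(y, labels)   # only (0,0), one interior point, (1,1)
```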
In this article, I'm going to experiment with ROC curves using scikit-learn.
As mentioned at the beginning, a ROC curve can only be drawn when the output is a probability-like score, so we assume that kind of output. Suppose we have `y_true` and `y_pred` like the following.
In[1]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
In[2]
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.65, 0.7,
0.35, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.9]
df = pd.DataFrame({'y_true':y_true, 'y_pred':y_pred})
df
The data frame looks like the above.
Let's visualize its distribution.
In[3]
x0 = df[df['y_true']==0]['y_pred']
x1 = df[df['y_true']==1]['y_pred']
fig = plt.figure(figsize=(6,5))
ax = fig.add_subplot(1, 1, 1)
ax.hist([x0, x1], bins=10, stacked=True)
plt.xticks(np.arange(0, 1.1, 0.1), fontsize = 13) #np.arange makes the tick positions easy to specify
plt.yticks(np.arange(0, 6, 1), fontsize = 13)
plt.ylim(0, 4)
plt.show()
It looks like this: the two distributions partially overlap, as is often the case in practice.
The documentation for `roc_curve()` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html). It has three return values: `fpr`, `tpr`, and `thresholds`.
If you fix one threshold and classify each sample as positive or negative against it, you can compute a false positive rate and a true positive rate. The three return values list these rates for every candidate threshold.
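As a check on that definition, here is a hand-rolled sketch (the helper `fpr_tpr_at` is hypothetical, not a scikit-learn function) that computes both rates for a single threshold on the data above:

```python
# Hypothetical helper: FPR and TPR at one threshold, by direct counting.
def fpr_tpr_at(y_true, y_score, threshold):
    tp = sum(t == 1 and s >= threshold for t, s in zip(y_true, y_score))
    fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, y_score))
    fn = sum(t == 1 and s < threshold for t, s in zip(y_true, y_score))
    tn = sum(t == 0 and s < threshold for t, s in zip(y_true, y_score))
    return fp / (fp + tn), tp / (tp + fn)

y_true = [0]*10 + [1]*10
y_pred = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.65, 0.7,
          0.35, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.9]
fpr_05, tpr_05 = fpr_tpr_at(y_true, y_pred, 0.5)  # -> (0.3, 0.8)
```

Sweeping the threshold from high to low traces out exactly the `(fpr, tpr)` pairs that `roc_curve()` returns.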
`metrics.auc()` computes the area under the ROC curve obtained above and returns a value between 0 and 1.
In[4]
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred)
auc = metrics.auc(fpr, tpr)
print('auc:', auc)
Out[4]
auc: 0.8400000000000001
In[5]
plt.figure(figsize = (5, 5)) #a square figure keeps the ROC axes to scale
plt.plot(fpr, tpr, marker='o')
plt.xlabel('FPR: False Positive Rate', fontsize = 13)
plt.ylabel('TPR: True Positive Rate', fontsize = 13)
plt.grid()
plt.show()
Now we can draw the ROC curve.
Next, let's draw ROC curves for various distributions.
First, a distribution that can be separated completely by choosing a suitable threshold.
In[6]
y_pred = [0, 0.15, 0.2, 0.2, 0.25,
0.3, 0.35, 0.4, 0.4, 0.45,
0.5, 0.55, 0.55, 0.65, 0.7,
0.75, 0.8, 0.85, 0.9, 0.95]
y_true = [0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
1, 1, 1, 1, 1,
1, 1, 1, 1, 1]
df = pd.DataFrame({'y_true':y_true, 'y_pred':y_pred})
x0 = df[df['y_true']==0]['y_pred']
x1 = df[df['y_true']==1]['y_pred']
In[7]
#Compute the ROC curve and AUC
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred)
auc = metrics.auc(fpr, tpr)
#Create the figure (fig)
fig = plt.figure(figsize = (12, 4))
fig.suptitle(' AUC = ' + str(auc), fontsize = 16)
fig.subplots_adjust(wspace=0.5, hspace=0.6) #Adjust the spacing between graphs
#Left graph (ax1): stacked histogram
ax1 = fig.add_subplot(1, 2, 1)
ax1.hist([x0, x1], bins=10, stacked = True)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 5)
#Right graph (ax2): ROC curve
ax2 = fig.add_subplot(1, 2, 2)
ax2.plot(fpr, tpr, marker='o')
ax2.set_xlabel('FPR: False Positive Rate', fontsize = 13)
ax2.set_ylabel('TPR: True Positive Rate', fontsize = 13)
ax2.set_aspect('equal')
ax2.grid()
plt.show()
#Note: the figure and axes can also be created at once:
## fig, ax = plt.subplots(1, 2, figsize=(12, 4))
## ...
#Left graph
## ax[0].set_...
#Right graph
## ax[1].set_...
## ...  #also works
The ROC curve looks like this.
Next, let's draw a ROC curve with a distribution that is difficult to separate.
In[8]
y_pred = [0, 0.15, 0.2, 0.2, 0.25,
0.3, 0.35, 0.4, 0.4, 0.45,
0.5, 0.55, 0.55, 0.65, 0.7,
0.75, 0.8, 0.85, 0.9, 0.95]
y_true = [0, 1, 0, 1, 0,
1, 0, 1, 0, 1,
0, 1, 0, 1, 0,
1, 0, 1, 0, 1]
df = pd.DataFrame({'y_true':y_true, 'y_pred':y_pred})
x0 = df[df['y_true']==0]['y_pred']
x1 = df[df['y_true']==1]['y_pred']
In[9]
#Compute the ROC curve and AUC
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred)
auc = metrics.auc(fpr, tpr)
#Create the figure (fig)
fig = plt.figure(figsize = (12, 4))
fig.suptitle(' AUC = ' + str(auc), fontsize = 16)
fig.subplots_adjust(wspace=0.5, hspace=0.6) #Adjust the spacing between graphs
#Left graph (ax1): stacked histogram
ax1 = fig.add_subplot(1, 2, 1)
ax1.hist([x0, x1], bins=10, stacked = True)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 5)
#Right graph (ax2): ROC curve
ax2 = fig.add_subplot(1, 2, 2)
ax2.plot(fpr, tpr, marker='o')
ax2.set_xlabel('FPR: False Positive Rate', fontsize = 13)
ax2.set_ylabel('TPR: True Positive Rate', fontsize = 13)
ax2.set_aspect('equal')
ax2.grid()
plt.show()
The ROC curve looks like this.
Let's examine the return values of `roc_curve()`. As explained earlier, they are `fpr`, `tpr`, and `thresholds`. The first threshold is 1.95, i.e. the next (highest real) threshold plus 1; this extra threshold is a device to include the point where both `fpr` and `tpr` are 0. (Recent scikit-learn versions use `np.inf` for this first threshold instead.)
In[10]
print(fpr.shape, tpr.shape, thres.shape)
ROC_df = pd.DataFrame({'fpr':fpr, 'tpr':tpr, 'thresholds':thres})
ROC_df
Finally, using the data from the completely separable example, let's look at the `drop_intermediate` argument. Its default is `True`, which removes thresholds that do not affect the shape of the ROC curve; pass `drop_intermediate=False` to keep every threshold.
In[11]
y_pred = [0, 0.15, 0.2, 0.2, 0.25,
0.3, 0.35, 0.4, 0.4, 0.45,
0.5, 0.55, 0.55, 0.65, 0.7,
0.75, 0.8, 0.85, 0.9, 0.95]
y_true = [0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
1, 1, 1, 1, 1,
1, 1, 1, 1, 1]
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred, drop_intermediate=True)
print(fpr.shape, tpr.shape, thres.shape)
Out[11]
(10,) (10,) (10,)
With the intermediate thresholds dropped, the number of points on the curve is reduced accordingly.
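For comparison, here is a sketch with `drop_intermediate=False` on the same arrays: every distinct score then becomes a threshold (17 distinct scores here, plus the extra starting threshold), so the curve has more points than the 10 kept above.

```python
# Sketch: drop_intermediate=True vs. False on the separable example.
from sklearn import metrics

y_pred = [0, 0.15, 0.2, 0.2, 0.25, 0.3, 0.35, 0.4, 0.4, 0.45,
          0.5, 0.55, 0.55, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
y_true = [0]*10 + [1]*10

fpr_d, tpr_d, thres_d = metrics.roc_curve(y_true, y_pred, drop_intermediate=True)
fpr_a, tpr_a, thres_a = metrics.roc_curve(y_true, y_pred, drop_intermediate=False)
# fpr_d has 10 points; fpr_a has 18 (17 distinct scores + 1 extra threshold).
```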
In this article, I summarized how to use ROC curves when visualizing the results of a machine learning classifier.