In machine learning, the **ROC curve** and its **AUC (Area Under the Curve)** are used as indicators of how good a classifier is at separating two classes.
Roughly speaking, the ROC curve shows how well the classifier's scores separate the two class distributions, and the AUC summarizes each curve as a single number, which makes it easy to compare several classifiers.
References:

- Easy-to-understand explanation of the meaning and properties of AUC and ROC curves - Mathematics learned with concrete examples
- [Machine learning evaluation metrics - ROC curve and AUC](https://techblog.gmo-ap.jp/2018/12/14/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E8%A9%95%E4%BE%A1%E6%8C%87%E6%A8%99-roc%E6%9B%B2%E7%B7%9A%E3%81%A8auc/)
Depending on the model used for training, a ROC curve may or may not be drawable. A model whose output (the return value of `model.predict()` etc.) is a probability or score can produce a ROC curve, but a model whose output is already a binary label cannot.
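To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the dataset and classifier are arbitrary choices for illustration). `predict_proba()` yields the continuous scores a ROC curve needs, while `predict()` collapses them to 0/1 labels, which leaves almost nothing to plot:

```python
# Sketch: score output vs. binary output (illustrative model and data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression().fit(X, y)

scores = clf.predict_proba(X)[:, 1]      # continuous scores -> full ROC curve
labels = clf.predict(X)                  # hard 0/1 labels -> degenerate "curve"
fpr_s, tpr_s, _ = roc_curve(y, scores)   # many thresholds
fpr_l, tpr_l, _ = roc_curve(y, labels)   # only (0,0), one interior point, (1,1)
```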
In this article, I'm going to experiment with ROC curves using scikit-learn.
As mentioned at the beginning, a ROC curve can only be drawn when the output is a probability-like score, so we assume that kind of output. Suppose we have `y_true` and `y_pred` like the following.
In[1]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
In[2]
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.65, 0.7,
0.35, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.9]
df = pd.DataFrame({'y_true':y_true, 'y_pred':y_pred})
df
The data frame looks like the above.
Let's visualize its distribution.
In[3]
x0 = df[df['y_true']==0]['y_pred']
x1 = df[df['y_true']==1]['y_pred']
fig = plt.figure(figsize=(6,5))
ax = fig.add_subplot(1, 1, 1)
ax.hist([x0, x1], bins=10, stacked=True)
plt.xticks(np.arange(0, 1.1, 0.1), fontsize = 13) #np.arange makes the tick positions easy to specify
plt.yticks(np.arange(0, 6, 1), fontsize = 13)
plt.ylim(0, 4)
plt.show()
It looks like this: the two distributions partially overlap, as is often the case in practice.
The documentation for `roc_curve()` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html). It has three return values: `fpr`, `tpr`, and `thresholds`.
If you fix one threshold and classify each sample as positive or negative against it, you can compute a false positive rate and a true positive rate. The three return values list these rates for every candidate threshold.
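As a check on that definition, here is a hand-rolled sketch (the helper `fpr_tpr_at` is hypothetical, not a scikit-learn function) that computes both rates for a single threshold on the data above:

```python
# Hypothetical helper: FPR and TPR at one threshold, by direct counting.
def fpr_tpr_at(y_true, y_score, threshold):
    tp = sum(t == 1 and s >= threshold for t, s in zip(y_true, y_score))
    fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, y_score))
    fn = sum(t == 1 and s < threshold for t, s in zip(y_true, y_score))
    tn = sum(t == 0 and s < threshold for t, s in zip(y_true, y_score))
    return fp / (fp + tn), tp / (tp + fn)

y_true = [0]*10 + [1]*10
y_pred = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.65, 0.7,
          0.35, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.9]
fpr_05, tpr_05 = fpr_tpr_at(y_true, y_pred, 0.5)  # -> (0.3, 0.8)
```

Sweeping the threshold from high to low traces out exactly the `(fpr, tpr)` pairs that `roc_curve()` returns.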
`metrics.auc()` computes the area under the ROC curve obtained above and returns a value between 0 and 1.
In[4]
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred)
auc = metrics.auc(fpr, tpr)
print('auc:', auc)
Out[4]
auc: 0.8400000000000001
In[5]
plt.figure(figsize = (5, 5)) #a square figure keeps the ROC axes to scale
plt.plot(fpr, tpr, marker='o')
plt.xlabel('FPR: False Positive Rate', fontsize = 13)
plt.ylabel('TPR: True Positive Rate', fontsize = 13)
plt.grid()
plt.show()
Now we can draw the ROC curve.
Next, let's draw ROC curves for various distributions.
First, a distribution that can be separated completely by choosing a suitable threshold.
In[6]
y_pred = [0, 0.15, 0.2, 0.2, 0.25,
0.3, 0.35, 0.4, 0.4, 0.45,
0.5, 0.55, 0.55, 0.65, 0.7,
0.75, 0.8, 0.85, 0.9, 0.95]
y_true = [0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
1, 1, 1, 1, 1,
1, 1, 1, 1, 1]
df = pd.DataFrame({'y_true':y_true, 'y_pred':y_pred})
x0 = df[df['y_true']==0]['y_pred']
x1 = df[df['y_true']==1]['y_pred']
In[7]
#Compute the ROC curve and AUC
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred)
auc = metrics.auc(fpr, tpr)
#Create the figure (fig)
fig = plt.figure(figsize = (12, 4))
fig.suptitle(' AUC = ' + str(auc), fontsize = 16)
fig.subplots_adjust(wspace=0.5, hspace=0.6) #Adjust the spacing between graphs
#Left graph (ax1): stacked histogram
ax1 = fig.add_subplot(1, 2, 1)
ax1.hist([x0, x1], bins=10, stacked = True)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 5)
#Right graph (ax2): ROC curve
ax2 = fig.add_subplot(1, 2, 2)
ax2.plot(fpr, tpr, marker='o')
ax2.set_xlabel('FPR: False Positive Rate', fontsize = 13)
ax2.set_ylabel('TPR: True Positive Rate', fontsize = 13)
ax2.set_aspect('equal')
ax2.grid()
plt.show()
#Note: the figure and axes can also be created at once:
## fig, ax = plt.subplots(1, 2, figsize=(12, 4))
## ...
#Left graph
## ax[0].set_...
#Right graph
## ax[1].set_...
## ...  #also works
The ROC curve looks like this.
Next, let's draw a ROC curve with a distribution that is difficult to separate.
In[8]
y_pred = [0, 0.15, 0.2, 0.2, 0.25,
0.3, 0.35, 0.4, 0.4, 0.45,
0.5, 0.55, 0.55, 0.65, 0.7,
0.75, 0.8, 0.85, 0.9, 0.95]
y_true = [0, 1, 0, 1, 0,
1, 0, 1, 0, 1,
0, 1, 0, 1, 0,
1, 0, 1, 0, 1]
df = pd.DataFrame({'y_true':y_true, 'y_pred':y_pred})
x0 = df[df['y_true']==0]['y_pred']
x1 = df[df['y_true']==1]['y_pred']
In[9]
#Compute the ROC curve and AUC
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred)
auc = metrics.auc(fpr, tpr)
#Create the figure (fig)
fig = plt.figure(figsize = (12, 4))
fig.suptitle(' AUC = ' + str(auc), fontsize = 16)
fig.subplots_adjust(wspace=0.5, hspace=0.6) #Adjust the spacing between graphs
#Left graph (ax1): stacked histogram
ax1 = fig.add_subplot(1, 2, 1)
ax1.hist([x0, x1], bins=10, stacked = True)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 5)
#Right graph (ax2): ROC curve
ax2 = fig.add_subplot(1, 2, 2)
ax2.plot(fpr, tpr, marker='o')
ax2.set_xlabel('FPR: False Positive Rate', fontsize = 13)
ax2.set_ylabel('TPR: True Positive Rate', fontsize = 13)
ax2.set_aspect('equal')
ax2.grid()
plt.show()
The ROC curve looks like this.
Let's examine the return values of `roc_curve()`. As explained earlier, they are `fpr`, `tpr`, and `thresholds`. The first threshold is 1.95, i.e. the next (highest real) threshold plus 1; this extra threshold is a device to include the point where both `fpr` and `tpr` are 0. (Recent scikit-learn versions use `np.inf` for this first threshold instead.)
In[10]
print(fpr.shape, tpr.shape, thres.shape)
ROC_df = pd.DataFrame({'fpr':fpr, 'tpr':tpr, 'thresholds':thres})
ROC_df
Finally, using the data from the completely separable example, let's look at the `drop_intermediate` argument. Its default is `True`, which removes thresholds that do not affect the shape of the ROC curve; pass `drop_intermediate=False` to keep every threshold.
In[11]
y_pred = [0, 0.15, 0.2, 0.2, 0.25,
0.3, 0.35, 0.4, 0.4, 0.45,
0.5, 0.55, 0.55, 0.65, 0.7,
0.75, 0.8, 0.85, 0.9, 0.95]
y_true = [0, 0, 0, 0, 0,
0, 0, 0, 0, 0,
1, 1, 1, 1, 1,
1, 1, 1, 1, 1]
fpr, tpr, thres = metrics.roc_curve(y_true, y_pred, drop_intermediate=True)
print(fpr.shape, tpr.shape, thres.shape)
Out[11]
(10,) (10,) (10,)
With the intermediate thresholds dropped, the number of points on the curve is reduced accordingly.
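For comparison, here is a sketch with `drop_intermediate=False` on the same arrays: every distinct score then becomes a threshold (17 distinct scores here, plus the extra starting threshold), so the curve has more points than the 10 kept above.

```python
# Sketch: drop_intermediate=True vs. False on the separable example.
from sklearn import metrics

y_pred = [0, 0.15, 0.2, 0.2, 0.25, 0.3, 0.35, 0.4, 0.4, 0.45,
          0.5, 0.55, 0.55, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
y_true = [0]*10 + [1]*10

fpr_d, tpr_d, thres_d = metrics.roc_curve(y_true, y_pred, drop_intermediate=True)
fpr_a, tpr_a, thres_a = metrics.roc_curve(y_true, y_pred, drop_intermediate=False)
# fpr_d has 10 points; fpr_a has 18 (17 distinct scores + 1 extra threshold).
```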
In this article, I summarized how to use ROC curves when visualizing the results of a machine learning classifier.