When performing classification with machine learning, you often want the probability of belonging to each class as well as the predicted label. When the number of positive examples is extremely small compared to the number of negative examples (so-called imbalanced data), a model built on all of the data tends to predict negative for almost everything, making it difficult to classify the positive examples accurately. A common remedy is to build the model on undersampled data in which the negatives have been reduced to match the number of positives. This makes it possible to classify positive examples with high accuracy, but because the class balance now differs from the original data, the predicted probabilities end up biased by the undersampling.
How to deal with this problem has already been summarized in the blog posts below, but as a memorandum I will summarize how to remove this undersampling bias from the probabilities output by such a model. In this article, we simply use a logistic regression model for probability prediction.
- [Correct the bias of the prediction probability when dealing with imbalanced data with Undersampling + bagging and visualize the result](https://tjo.hatenablog.com/entry/2019/08/04/150431)
- Bias of prediction probability due to downsampling
The method of correcting the bias caused by undersampling was proposed in the paper [Calibrating Probability with Undersampling for Unbalanced Classification](https://www3.nd.edu/~dial/publications/dalpozzolo2015calibrating.pdf).
Now consider a binary classification task that predicts an objective variable $Y$ (taking the value 0 or 1) from explanatory variables $X$. The original dataset $(X, Y)$ is imbalanced, with an extremely small number of positive examples; let $(X_s, Y_s)$ be the dataset obtained by undersampling the negatives until their number equals the number of positives. We also introduce a sampling variable $s$ that takes the value 1 for a sample of $(X, Y)$ that is also contained in $(X_s, Y_s)$, and 0 for a sample that is not.
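To make this setup concrete, here is a minimal sketch on synthetic data (the variable names are my own, introduced just for illustration) showing how the undersampled dataset $(X_s, Y_s)$ and the sampling variable $s$ relate to $(X, Y)$:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# A synthetic imbalanced dataset (X, Y): roughly 5% positive examples
n = 1000
X = pd.DataFrame({'x': rng.randn(n)})
Y = pd.Series((rng.rand(n) < 0.05).astype(int))

# Undersample: keep every positive example, draw the same number of negatives
pos_idx = Y[Y == 1].index
neg_idx = Y[Y == 0].sample(n=len(pos_idx), random_state=0).index
sampled_idx = pos_idx.union(neg_idx)
X_s, Y_s = X.loc[sampled_idx], Y.loc[sampled_idx]

# The sampling variable s: 1 if a sample of (X, Y) is also in (X_s, Y_s), else 0
s = pd.Series(Y.index.isin(sampled_idx).astype(int))
```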
The probability $p$ that a model built on the original dataset would predict can be recovered from the probability $p_s$ predicted by a model built on the undersampled dataset as

$$
p=\frac{\beta p_s}{\beta p_s-p_s+1}
$$

where $\beta = N^+ / N^-$ ($N^+$ is the number of positive examples and $N^-$ the number of negative examples).
A detailed derivation of this formula follows; if you are not interested, feel free to skip ahead to the experiment.
By Bayes' theorem, together with the assumption that $s$ is independent of $x$ given $y$,

$$
p(y=1|x,s=1)=\frac{p(s=1|y=1)\,p(y=1|x)}{p(s=1|y=1)\,p(y=1|x)+p(s=1|y=0)\,p(y=0|x)}
$$
Now, since the number of positive examples is extremely small, all data with $y=1$ are kept by the undersampling, so $p(s=1|y=1)=1$ and this can be written as

$$
p(y=1|x,s=1)=\frac{p(y=1|x)}{p(y=1|x)+p(s=1|y=0)\,p(y=0|x)}
$$
Furthermore, writing $p = p(y=1|x)$, $p_s = p(y=1|x,s=1)$, and $\beta = p(s=1|y=0)$, this becomes

$$
p_s=\frac{p}{p+\beta(1-p)}
$$

Finally, multiplying both sides by the denominator and solving for $p$,
$$
p=\frac{\beta p_s}{\beta p_s-p_s+1}
$$

This last equation means that the biased probability $p_s$ predicted by the model built on the undersampled data can be corrected to recover the probability $p$ that a model built on the original data would predict.
Here $\beta = p(s=1|y=0)$ is the probability that a negative example is sampled. Since the negatives are undersampled to the same number as the positives, $\beta$ can be approximated by $N^+ / N^-$, where $N^+$ is the number of positive examples and $N^-$ the number of negative examples.
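Before moving on to the experiment, here is a minimal sketch of the correction formula as a standalone function (the name `correct_probability` is my own, not from the paper), together with a quick numerical sanity check:

```python
def correct_probability(p_s, beta):
    """Correct a probability predicted by a model trained on undersampled data.

    p_s:  probability predicted by the undersampled model
    beta: N+ / N-, the fraction of negatives kept by the undersampling
    """
    return beta * p_s / (beta * p_s - p_s + 1)

# Sanity check: with beta = 0.1 (positives are 10x rarer than negatives),
# an uncorrected prediction of 0.5 shrinks to about 0.09
print(correct_probability(0.5, 0.1))  # => 0.0909...
```

As expected, the correction pulls the inflated probability back down toward what a model trained on the original class balance would predict.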
In the following, we run an experiment to correct the predicted probabilities, showing the actual code along the way. (The code below was run with Python 3.7.3, pandas 0.24.2, and scikit-learn 0.20.3.)
The experiment proceeds as follows:

1. Build a model on the full training data and check its classification accuracy.
2. Build a model on undersampled training data and check its classification accuracy and the calibration of its predicted probabilities.
3. Correct the predicted probabilities with the formula above and confirm that the calibration improves.
Here we use the [Adult Dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/) published in the UCI Machine Learning Repository. The task is to classify whether an individual's annual income exceeds $50,000 from attributes such as gender and age.
First, load the data. Download adult.data and adult.test from the Adult Dataset page, save them locally as CSV files, and use the former as training data and the latter as validation data.
```python
import numpy as np
import pandas as pd

# Load the data
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'obj']
train_data = pd.read_csv('./adult_data.csv', names=columns)
test_data = pd.read_csv('./adult_test.csv', names=columns, skiprows=1)
data = pd.concat([train_data, test_data])

# Process the explanatory variables X and the objective variable Y
X = pd.get_dummies(data.drop('obj', axis=1))
Y = data['obj'].map(lambda x: 1 if x == ' >50K' or x == ' >50K.' else 0)  # objective variable is 1 or 0

# Split into training data and validation data
train_size = len(train_data)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
Y_train, Y_test = Y.iloc[:train_size], Y.iloc[train_size:]
```
Looking at the proportion of positive examples in the training data, it is only about 24%, far fewer than the negatives, so the data can be said to be imbalanced.
```python
print('positive ratio = {:.2f}%'.format((len(Y_train[Y_train==1]) / len(Y_train)) * 100))
# output => positive ratio = 24.08%
```
If a model is built on this training data as-is, the classification accuracy is low: AUC = 0.57 and recall = 0.26. Because negative examples dominate the training data, the predictions are mostly negative, and the recall (the rate at which positive data is correctly classified as positive) suffers.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, recall_score

# Build the model
lr = LogisticRegression(random_state=0)
lr.fit(X_train, Y_train)

# Evaluate the classification accuracy
prob = lr.predict_proba(X_test)[:, 1]  # predict the probability that the objective variable is 1
pred = lr.predict(X_test)              # classify as 1 or 0
auc = roc_auc_score(y_true=Y_test, y_score=prob)
print('AUC = {:.2f}'.format(auc))
recall = recall_score(y_true=Y_test, y_pred=pred)
print('recall = {:.2f}'.format(recall))
# output => AUC = 0.57
# output => recall = 0.26
```
Next, we undersample the negative examples in the training data down to the number of positive examples. Building a model on this data greatly improves the classification accuracy, to AUC = 0.90 and recall = 0.86.
```python
# Undersample the negatives to match the number of positives
pos_idx = Y_train[Y_train==1].index
neg_idx = Y_train[Y_train==0].sample(n=len(Y_train[Y_train==1]), replace=False, random_state=0).index
idx = np.concatenate([pos_idx, neg_idx])
X_train_sampled = X_train.iloc[idx]
Y_train_sampled = Y_train.iloc[idx]

# Build the model
lr = LogisticRegression(random_state=0)
lr.fit(X_train_sampled, Y_train_sampled)

# Evaluate the classification accuracy
prob = lr.predict_proba(X_test)[:, 1]
pred = lr.predict(X_test)
auc = roc_auc_score(y_true=Y_test, y_score=prob)
print('AUC = {:.2f}'.format(auc))
recall = recall_score(y_true=Y_test, y_pred=pred)
print('recall = {:.2f}'.format(recall))
# output => AUC = 0.90
# output => recall = 0.86
```
At this point, let's look at the accuracy of the predicted probabilities. The log loss is 0.41, and the calibration plot runs below the 45-degree line, which means the predicted probabilities are larger than the actual ones. Since the model was trained on undersampled data in which positive examples make up a much larger share than in the original data, its predicted probabilities come out too high.
```python
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.calibration import calibration_curve
from sklearn.metrics import log_loss

def calibration_plot(y_true, y_prob):
    prob_true, prob_pred = calibration_curve(y_true=y_true, y_prob=y_prob, n_bins=20)
    fig, ax1 = plt.subplots()
    ax1.plot(prob_pred, prob_true, marker='s', label='calibration plot', color='skyblue')  # calibration plot
    ax1.plot([0, 1], [0, 1], linestyle='--', label='ideal', color='limegreen')  # 45-degree line
    ax1.legend(bbox_to_anchor=(1.12, 1), loc='upper left')
    plt.xlabel('predicted probability')
    plt.ylabel('actual probability')
    ax2 = ax1.twinx()  # add a second y-axis
    ax2.hist(y_prob, bins=20, histtype='step', color='orangered')  # histogram of the predicted scores
    plt.ylabel('frequency')
    plt.show()

prob = lr.predict_proba(X_test)[:, 1]
loss = log_loss(y_true=Y_test, y_pred=prob)
print('log loss = {:.2f}'.format(loss))
calibration_plot(y_true=Y_test, y_prob=prob)
# output => log loss = 0.41
```
Now, let's remove the bias due to undersampling and correct the probabilities. Computing $\beta$ and correcting the probabilities according to $p = \beta p_s / (\beta p_s - p_s + 1)$, the log loss improves to 0.32 and the calibration plot lies almost exactly on the 45-degree line. Note that $\beta$ is computed from the positive/negative counts of the training data (the counts in the validation data are treated as unknown).
```python
beta = len(Y_train[Y_train==1]) / len(Y_train[Y_train==0])
prob_corrected = beta*prob / (beta*prob - prob + 1)
loss = log_loss(y_true=Y_test, y_pred=prob_corrected)
print('log loss = {:.2f}'.format(loss))
calibration_plot(y_true=Y_test, y_prob=prob_corrected)
# output => log loss = 0.32
```
This confirms that the bias introduced by undersampling can be removed and the probabilities corrected. That concludes the experiment.
In this article, we have briefly summarized how to correct the probabilities predicted by a model built using undersampled data. I would appreciate it if you could point out any mistakes.