A confusion matrix is a table that counts how many of a model's predictions matched the actual values and how many did not. It is generally used for binary classification.
For example, suppose you want to predict from a given image whether a person has cancer, and the actual values are: 98 out of 100 people do not have cancer (0) and 2 out of 100 do (1).
In this case, if the model simply predicts 0 for everyone, its accuracy is 98%. That looks like a good number, but is it really a good evaluation? Isn't missing the two people who actually have cancer a fatal mistake?
The confusion matrix makes it possible to evaluate a model properly even in cases like this.
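To see this concretely, here is a minimal sketch of the situation above; the arrays are made up to match the example:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 98 people without cancer (0) and 2 people with cancer (1)
y_true = np.array([0] * 98 + [1] * 2)
# A model that simply predicts "no cancer" for everyone
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.98 -- high accuracy, yet both cancer cases are missed
```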
In general, the columns hold the model's prediction results and the rows hold the actual values, summarized into the 2 × 2 = 4 combinations shown in the table below.

- True: the prediction matched the actual value
- False: the prediction did not match the actual value
- Positive: judged to have the disease (= 1)
- Negative: judged not to have the disease (= 0)

| | Predicted positive | Predicted negative |
|---|---|---|
| Actually positive | True Positive (TP) | False Negative (FN) |
| Actually negative | False Positive (FP) | True Negative (TN) |
```python:matrix.py
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Create the confusion matrix
# y_true: the objective (target) variable of the evaluation data
# y_pred: the results predicted from X_test with the predict() function
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)

# Convert the confusion matrix to a DataFrame
# sklearn orders the labels [0, 1], so rotate the matrix 180 degrees
# to put the positive class (1) in the top-left corner
df_cm = pd.DataFrame(np.rot90(cm, 2),
                     index=["actual_Positive", "actual_Negative"],
                     columns=["predict_Positive", "predict_Negative"])
print(df_cm)

# Visualize the confusion matrix as a heatmap
sns.heatmap(df_cm, annot=True, fmt="d", cmap="Blues")
plt.yticks(va="center")  # vertically center the y tick labels
plt.show()
```
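Incidentally, the rotation step can be avoided: `confusion_matrix` accepts a `labels` argument that controls the row/column order directly. A minimal alternative sketch, assuming the same `y_test` and `y_pred` as above:

```python
from sklearn.metrics import confusion_matrix

# Put the positive class (1) first so that the top-left cell is TP
cm = confusion_matrix(y_true=y_test, y_pred=y_pred, labels=[1, 0])
```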
From here, let us look at the evaluation metrics that measure the performance of a model based on the confusion matrix.
- Accuracy: how much of the whole data set is classified correctly. (TP + TN) / (TP + FP + FN + TN)
- Precision: of the results predicted to be positive (1), how many are actually positive. TP / (TP + FP)
- Recall (true positive rate): of the data that is actually positive (1), how much is correctly predicted to be positive. The higher this value, the better the performance and the fewer actual positives are missed. TP / (TP + FN)
- Specificity (true negative rate): of the data that is actually negative (0), how much is correctly predicted to be negative. The higher this value, the better the performance and the fewer false positives are made. TN / (FP + TN)
- False negative rate: of the data that is actually positive (1), how much is mistakenly predicted to be negative. The lower this value, the better the performance and the fewer actual positives are missed. FN / (TP + FN)
- False positive rate: of the data that is actually negative (0), how much is mistakenly predicted to be positive. The lower this value, the better the performance and the fewer false positives are made. FP / (FP + TN)
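For reference, here is a minimal sketch of computing these metrics from the four cells of the confusion matrix, reusing the made-up `y_true` and `y_pred` from the first sketch (sklearn also provides `accuracy_score`, `precision_score`, and `recall_score` for the first three):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Same made-up data as above: 98 actual negatives, 2 actual positives, all predicted 0
y_true = np.array([0] * 98 + [1] * 2)
y_pred = np.zeros(100, dtype=int)

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + fp + fn + tn)            # 0.98
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # undefined here: nothing is predicted positive
recall = tp / (tp + fn)                               # 0.0 -- both actual positives are missed
specificity = tn / (fp + tn)                          # 1.0 -- every actual negative is correct
fn_rate = fn / (tp + fn)                              # 1.0
fp_rate = fp / (fp + tn)                              # 0.0
print(accuracy, precision, recall, specificity, fn_rate, fp_rate)
```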
Going back to the cancer example, where the model predicts 0 for everyone, the confusion matrix looks like this:

| | Positive prediction results | Negative prediction results |
|---|---|---|
| Actual positive result | 0 | 2 |
| Actual negative result | 0 | 98 |
- Accuracy: (0 + 98) / 100 = 98%
- Specificity: 98 / 98 = 100% => every actual negative is classified correctly
- Recall: 0 / 2 = 0% => every actual positive, that is, both cancer patients, is missed

So the model looks good by accuracy alone but is useless for finding cancer, which is exactly what the confusion matrix makes visible.
To use a binary classification machine learning model in business, it is important to calculate these performance metrics, understand what each one measures, and choose the metric that suits your purpose.