In this post, I evaluate the performance of several machine-learning classification models, writing the code as I go.
The dataset is the breast cancer data bundled with sklearn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
# -----------Data set preparation--------------
dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.Series(dataset.target, name='y')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print('X_train.shape = ', X_train.shape)
print('X_test.shape = ', X_test.shape)
print(y_test.value_counts())
The dataset is split into training data and test data at a ratio of 7:3. The test set has 171 samples with 30 features; of those 171 samples, **108 are labeled 1 (normal patients) and 63 are labeled 0 (cancer patients)**.
Eight classification models are used this time. They are bundled into pipelines so that they are easy to handle later. Hyperparameters are mostly left at their defaults.
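As a quick sanity check on these counts, the sketch below computes the class shares in the test set and the accuracy that a trivial model would reach by always predicting the majority class. The counts 108 and 63 are taken from the output described above; the variable names are only illustrative.

```python
# Class counts in the test set, as reported by y_test.value_counts().
n_normal = 108   # label 1
n_cancer = 63    # label 0
n_test = n_normal + n_cancer

# Share of each class in the test set.
cancer_share = n_cancer / n_test
normal_share = n_normal / n_test

# Accuracy of a trivial model that always predicts the majority class.
# Any useful classifier should score clearly above this value.
baseline_accuracy = max(n_normal, n_cancer) / n_test

print(f"test samples      : {n_test}")
print(f"cancer share      : {cancer_share:.3f}")
print(f"normal share      : {normal_share:.3f}")
print(f"majority baseline : {baseline_accuracy:.3f}")
```

Because the classes are not evenly balanced, accuracy alone can look better than it is, which is one reason to check the other indicators as well.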
# ----------Pipeline settings----------
pipelines = {
    '1.KNN':
        Pipeline([('scl', StandardScaler()),
                  ('est', KNeighborsClassifier())]),
    '2.Logistic':
        Pipeline([('scl', StandardScaler()),
                  ('est', LogisticRegression(solver='lbfgs', random_state=1))]),
    '3.SVM':
        Pipeline([('scl', StandardScaler()),
                  ('est', SVC(C=1.0, kernel='linear', class_weight='balanced', random_state=1, probability=True))]),
    '4.K-SVM':
        Pipeline([('scl', StandardScaler()),
                  ('est', SVC(C=1.0, kernel='rbf', class_weight='balanced', random_state=1, probability=True))]),
    '5.Tree':
        Pipeline([('scl', StandardScaler()),
                  ('est', DecisionTreeClassifier(random_state=1))]),
    '6.Random':
        Pipeline([('scl', StandardScaler()),
                  ('est', RandomForestClassifier(random_state=1, n_estimators=100))]),
    '7.GBoost':
        Pipeline([('scl', StandardScaler()),
                  ('est', GradientBoostingClassifier(random_state=1))]),
    '8.MLP':
        Pipeline([('scl', StandardScaler()),
                  ('est', MLPClassifier(hidden_layer_sizes=(3, 3),
                                        max_iter=1000,
                                        random_state=1))])
}
1.KNN: The **k-nearest neighbor method**, which finds the k training samples closest to the data point to be classified and assigns the class by majority vote among those k samples.
2.Logistic: **Logistic Regression**, which converts the inner product of the feature vector and the weight vector into a probability and classifies on that basis.
3.SVM: A **Support Vector Machine**, which classifies by choosing the boundary that maximizes the margin.
4.K-SVM: A **kernel Support Vector Machine**, which maps the training data into a higher-dimensional feature space with a mapping function and classifies it there with an SVM.
5.Tree: A classification model based on a **Decision Tree**.
6.Random: A **Random Forest**, which builds multiple decision trees from randomly selected features and combines the predictions of all the trees.
7.GBoost: **Gradient Boosting**, which improves prediction accuracy by having each succeeding tree try to explain the information (residual) that the existing group of trees cannot explain.
8.MLP: A **multi-layer perceptron**, which is a type of feedforward neural network.
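To make the first description concrete, here is a minimal sketch of k-nearest-neighbor classification in plain Python. It is not how sklearn implements `KNeighborsClassifier`; the function name `knn_predict`, the toy data, and the choice of k=3 are assumptions made for illustration only.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample.
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest samples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those k samples.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D data (made up): label 0 clusters near the origin, label 1 near (5, 5).
train_X = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (5.0, 5.0), (4.5, 5.5), (5.5, 4.5)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.8, 0.8)))  # query close to the label-0 cluster
print(knn_predict(train_X, train_y, (4.8, 5.1)))  # query close to the label-1 cluster
```

Because the vote depends on raw distances, features on very different scales can dominate the result, which is why each pipeline above starts with `StandardScaler`.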
# -------- accuracy ---------
scores = {}
for pipe_name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    scores[(pipe_name, 'train')] = accuracy_score(y_train, pipeline.predict(X_train))
    scores[(pipe_name, 'test')] = accuracy_score(y_test, pipeline.predict(X_test))
print(pd.Series(scores).unstack())
The output shows the accuracy on the training data and on the test data. Looking at the test accuracy, which reflects generalization performance, **2.Logistic** is the best at 0.970760.
Classification results fall into four categories: **true positive (TP), false negative (FN), false positive (FP), and true negative (TN)**. The square matrix that tabulates them is called the **Confusion Matrix**.
The following is an example of cancer screening predictions expressed as a Confusion Matrix.
Accuracy is given by (TP + TN) / (TP + FN + FP + TN), but in cancer screening the important viewpoint is how far FN can be lowered, even if FP increases somewhat.
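The four categories and the accuracy formula can be sketched in a few lines of plain Python. The helper `confusion_counts` and the ten screening results are hypothetical and made up for illustration; cancer (label 0) is treated as the positive class, matching the labels of this dataset.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Return (TP, FN, FP, TN) with `positive` treated as the positive class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1   # cancer correctly detected
        elif t == positive:
            fn += 1   # cancer missed (predicted normal)
        elif p == positive:
            fp += 1   # normal patient flagged as cancer
        else:
            tn += 1   # normal patient correctly cleared
    return tp, fn, fp, tn


# Made-up screening results: 0 = cancer, 1 = normal.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print("TP, FN, FP, TN =", tp, fn, fp, tn)
print("accuracy =", accuracy)
```

In this toy example one cancer patient is missed (FN = 1) and one normal patient is flagged (FP = 1). Both errors lower accuracy by the same amount, even though the missed cancer is the more serious mistake in screening.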
# ---------- Confusion Matrix ---------
from sklearn.metrics import confusion_matrix
import seaborn as sns
for pipe_name, pipeline in pipelines.items():
    cmx_data = confusion_matrix(y_test, pipeline.predict(X_test))
    df_cmx = pd.DataFrame(cmx_data)
    plt.figure(figsize=(3, 3))
    sns.heatmap(df_cmx, fmt='d', annot=True, square=True)
    plt.title(pipe_name)
    plt.xlabel('predicted label')
    plt.ylabel('true label')
    plt.show()
The code produces eight plots. Taking the Confusion Matrix of 2.Logistic as a representative, you can see that two of the 63 cancer patients are mistakenly classified as normal.
6. accuracy, recall, precision, f1-score
The following four indicators can be obtained from the Confusion Matrix: accuracy, recall, precision, and f1-score. The code below displays all four. In this dataset, 0 is a cancer patient and 1 is a normal person, so pos_label=0 is passed to recall, precision, and f1-score.
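As a worked example, the sketch below computes the four indicators from confusion-matrix counts using their standard definitions, with cancer (label 0) as the positive class. The counts are an assumption reconstructed from figures quoted in this post for 2.Logistic (171 test samples, 63 cancer patients of whom 2 are missed, and a test accuracy of 0.970760, which corresponds to 5 errors in total); they are not copied from the actual output.

```python
# Counts with cancer (label 0) as the positive class.
# Assumed values reconstructed from the text, not taken from real output.
tp, fn = 61, 2     # 63 cancer patients, 2 of them missed
fp, tn = 3, 105    # 108 normal patients, 3 flagged as cancer (inferred)

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)       # share of cancer patients that are detected
precision = tp / (tp + fp)    # share of cancer predictions that are correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy  = {accuracy:.6f}")
print(f"recall    = {recall:.6f}")
print(f"precision = {precision:.6f}")
print(f"f1_score  = {f1:.6f}")
```

Recall depends only on TP and FN, so it is the indicator to watch when missed cancers matter most; precision instead reflects how many healthy patients are flagged unnecessarily.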
# ------- accuracy, precision, recall, f1_score for test_data------
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
scores = {}
for pipe_name, pipeline in pipelines.items():
    y_pred = pipeline.predict(X_test)  # predict once and reuse for every metric
    scores[(pipe_name, '1.accuracy')] = accuracy_score(y_test, y_pred)
    scores[(pipe_name, '2.recall')] = recall_score(y_test, y_pred, pos_label=0)
    scores[(pipe_name, '3.precision')] = precision_score(y_test, y_pred, pos_label=0)
    scores[(pipe_name, '4.f1_score')] = f1_score(y_test, y_pred, pos_label=0)
print(pd.Series(scores).unstack())
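In this comparison, models 2-4 are the first candidates because they rarely mistake cancer patients for normal patients (high recall). Among them, **2.Logistic**, which also rarely mistakes normal patients for cancer patients (high precision), seems to be the best.
First, I will explain the ROC curve and AUC with concrete examples. The ROC curve plots the true positive rate against the false positive rate as the classification threshold is varied, and AUC is the area under that curve: 1.0 for a perfect classifier and about 0.5 for random guessing.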
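One way to see how the curve is built is to compute it by hand on a small example. The sketch below sweeps a threshold over predicted cancer probabilities, records the false positive rate and true positive rate at each step, and sums the area with the trapezoid rule. The helper names `roc_points` and `auc_trapezoid` and the eight samples are hypothetical and made up for illustration; this is not sklearn's implementation.

```python
def roc_points(y_true, scores, positive=0):
    """Sweep the threshold over the scores and return a list of (FPR, TPR)."""
    n_pos = sum(1 for t in y_true if t == positive)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a sample is predicted positive
    # when its score is greater than or equal to the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == positive)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example: 0 = cancer (positive class), 1 = normal.
# `scores` is the predicted probability of cancer for each sample.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
print("AUC =", auc_trapezoid(points))
```

The closer the points hug the top-left corner (high TPR at low FPR), the closer the area gets to 1.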
# --------ROC curve, AUC -----------
for pipe_name, pipeline in pipelines.items():
    # Column 0 of predict_proba is the probability of class 0 (cancer patients)
    fpr, tpr, thresholds = metrics.roc_curve(y_test, pipeline.predict_proba(X_test)[:, 0], pos_label=0)
    auc = metrics.auc(fpr, tpr)
    plt.figure(figsize=(3, 3), dpi=100)
    plt.plot(fpr, tpr, label='ROC curve (AUC = %.4f)' % auc)
    x = np.arange(0, 1, 0.01)
    plt.plot(x, x, c='red', linestyle='--')
    plt.legend()
    plt.title(pipe_name)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.grid(True)
    plt.show()
The code produces eight plots. Taking the ROC curve and AUC of **2.Logistic** as a representative, you can see that it is quite close to ideal classification performance.