The k-nearest neighbor method (kNN) is a simple machine learning algorithm that predicts a value by taking a majority vote among the k training items closest to the point to be predicted. The following intuition makes it easy to understand.
In the figure, the smiley face marks the point whose value we want to predict, and the circle shows the neighborhood when k = 3. In this case, ◆ is the predicted class for the smiley face.
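To make the majority vote concrete, here is a minimal from-scratch sketch (the function name and interface are my own, for illustration only; the article itself uses scikit-learn below):

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x, k=3):
    """Illustrative sketch: predict the label of a single point x
    by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```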
In this article, I will rewrite the part of the teaching material on classification by the k-nearest neighbor method, originally implemented in R, into Python.
[High School Information Department "Information II" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/mext_00742.html) Chapter 3: Information and Data Science, Second Half (PDF: 7.6MB)
Learning 15, Prediction by classification: "3. Classification by the k-nearest neighbor method"
Download the digit-recognizer data from Kaggle in the same way as in the material; the file we use is "train.csv".
https://www.kaggle.com/c/digit-recognizer/data
train.csv stores 42,000 handwritten digits. Each row holds the correct label (the actual digit) in the first column (label), followed by 784 pixel columns containing the 256-level grayscale values (0–255) of the 28 × 28 pixel image.
Here, we use the first 1,000 rows as training data and the next 100 rows as test data.
```python
import numpy as np
import pandas as pd
from IPython.display import display

mnist = pd.read_csv('/content/train.csv')
mnist_light = mnist.iloc[:1000, :]
mnist_light_test = mnist.iloc[1000:1100, :]

# Training data
Y_mnist_light = mnist_light[['label']].values.ravel()
# display(Y_mnist_light)
X_mnist_light = mnist_light.drop('label', axis=1)
# display(X_mnist_light)

# Test data
Y_mnist_light_test = mnist_light_test[['label']].values.ravel()
# display(Y_mnist_light_test)
X_mnist_light_test = mnist_light_test.drop('label', axis=1)
# display(X_mnist_light_test)
```
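As a quick sanity check (my own addition, not in the teaching material), the shapes should match the split described above:

```python
# Expect (1000, 784) (1000,) for training and (100, 784) (100,) for test
print(X_mnist_light.shape, Y_mnist_light.shape)
print(X_mnist_light_test.shape, Y_mnist_light_test.shape)
```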
We train a k-nearest neighbor classifier with k = 3 on the training data, then obtain predictions for the 100 test samples. The accuracy is computed by comparing the predictions with the labels (correct values) of the test data.
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Use sklearn.neighbors.KNeighborsClassifier with k = 3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_mnist_light, Y_mnist_light)

# Run prediction on the test data
pred_y = knn.predict(X_mnist_light_test)
display(pred_y)

# Compare predictions with the correct labels
result_compare = pred_y == Y_mnist_light_test
display(result_compare)

# Accuracy
minist_accuracy_score = accuracy_score(Y_mnist_light_test, pred_y)
print(minist_accuracy_score)
```
The execution result is as follows.
```
array([1, 5, 1, 7, 9, 8, 9, 5, 7, 4, 7, 2, 8, 1, 4, 3, 8, 6, 2, 7, 2, 6,
       7, 8, 1, 8, 8, 1, 9, 0, 9, 4, 6, 6, 8, 2, 3, 5, 4, 5, 4, 1, 3, 7,
       1, 5, 0, 0, 9, 5, 5, 7, 6, 8, 2, 8, 4, 2, 3, 6, 2, 8, 0, 2, 4, 7,
       3, 4, 4, 5, 4, 3, 3, 1, 5, 1, 0, 2, 2, 2, 9, 5, 1, 6, 6, 9, 4, 1,
       7, 2, 2, 0, 7, 0, 6, 8, 0, 5, 7, 4])
array([ True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True, False,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])
0.89
```
The accuracy is 0.89.
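As an aside, the same value can be computed directly from the boolean comparison array, since accuracy is simply the fraction of True entries:

```python
# Fraction of matching predictions; should print 0.89
print(result_compare.mean())
```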
Among the test data, the fifth sample is misrecognized. Let us actually display the digits and check the misrecognized number.
```python
import matplotlib.pyplot as plt

# Display the first ten test images in a 2 x 5 grid
fig, axes = plt.subplots(2, 5)
fig.subplots_adjust(left=0, right=1, bottom=0, top=1.0, hspace=0.1, wspace=0.1)
for i in range(2):
    for j in range(5):
        axes[i, j].imshow(X_mnist_light_test.values[i * 5 + j].reshape((28, 28)), cmap='gray')
        axes[i, j].set_xticks([])
        axes[i, j].set_yticks([])
plt.show()
```
The execution result is as follows.
The handwritten digit at the far right of the top row (the fifth test sample) has the label 4, but it was predicted as 9. Visually, it does look like a digit that could be judged as a 9.
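Instead of scanning the boolean array by eye, the misclassified samples can also be listed programmatically (a small sketch using the variables defined above):

```python
# Indices where the prediction differs from the correct label
wrong = np.where(~result_compare)[0]
for i in wrong:
    print(f"index {i}: label={Y_mnist_light_test[i]}, predicted={pred_y[i]}")
```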
Next, we display a table (confusion matrix) counting the predictions: in scikit-learn's confusion_matrix, the rows are the correct labels and the columns are the predicted values.
```python
from sklearn.metrics import confusion_matrix

# Rows: correct labels, columns: predicted values
cfm = confusion_matrix(Y_mnist_light_test, pred_y)
print(cfm)
```
The execution result is as follows.
```
[[ 7  0  0  0  0  0  0  0  0  0]
 [ 0 10  0  0  0  0  0  0  0  0]
 [ 0  0 13  0  0  0  0  1  1  0]
 [ 0  0  0  5  0  1  0  0  0  0]
 [ 0  0  0  0 11  0  0  0  0  1]
 [ 0  0  0  0  0 10  0  0  0  0]
 [ 0  0  0  0  0  0  9  0  0  0]
 [ 1  0  0  0  0  0  0 10  0  1]
 [ 0  0  0  2  0  0  0  0 10  1]
 [ 0  1  0  0  1  0  0  0  0  4]]
```
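As an optional extra (not in the teaching material), recent versions of scikit-learn can also draw the same matrix as a heatmap via ConfusionMatrixDisplay:

```python
from sklearn.metrics import ConfusionMatrixDisplay

# Heatmap view of the confusion matrix (rows: correct labels, columns: predictions)
ConfusionMatrixDisplay(confusion_matrix=cfm).plot(cmap='Blues')
plt.show()
```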
Next, to see which value of k is suitable, we change k and plot the accuracy for each value.
```python
# Accuracy while changing n_neighbors
n_neighbors_chg_list = []
for i in range(1, 100):
    # Use sklearn.neighbors.KNeighborsClassifier with k = i
    knn_temp = KNeighborsClassifier(n_neighbors=i)
    knn_temp.fit(X_mnist_light, Y_mnist_light)
    # Run prediction
    pred_y_temp = knn_temp.predict(X_mnist_light_test)
    # Accuracy
    minist_accuracy_score_temp = accuracy_score(Y_mnist_light_test, pred_y_temp)
    # Store in the list
    n_neighbors_chg_list.append(minist_accuracy_score_temp)

# Plot accuracy against k (x values start at k = 1)
plt.plot(range(1, 100), n_neighbors_chg_list)
plt.show()
```
The execution result is as follows.
In general, a larger k makes the result less sensitive to individual outliers, so the effect of noise is reduced, but class boundaries tend to become less sharp. The appropriate value of k depends on the amount of training data and other factors; in this trial, the accuracy tended to decrease as k increased.
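Since a single 100-sample test set gives a fairly noisy estimate, one possible refinement (my own sketch, beyond the scope of the material) is to choose k by cross-validation on the training data:

```python
from sklearn.model_selection import cross_val_score

# Mean 5-fold cross-validation accuracy on the 1,000 training samples for each k
cv_scores = []
ks = range(1, 21)
for k in ks:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X_mnist_light, Y_mnist_light, cv=5).mean()
    cv_scores.append(score)

best_k = ks[int(np.argmax(cv_scores))]
print(best_k, max(cv_scores))
```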
https://gist.github.com/ereyester/01237a69f6b8ae73c55ccca33c931ade