The k-nearest neighbor method (kNN) is a simple machine learning algorithm that predicts a value by taking a majority vote among the k training items closest to the point to be predicted. The following intuition makes it easy to understand.
In the figure, the smiley face marks the point whose value we want to predict, and the circle shows the neighborhood when k = 3. In this case, ◆ is the predicted class for the smiley face.
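To make the majority vote concrete, here is a minimal from-scratch sketch (the function name and interface are my own, for illustration only; the article itself uses scikit-learn below):

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x, k=3):
    """Illustrative sketch: predict the label of a single point x
    by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```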
In this article, I will rewrite the part of the teaching material on classification by the k-nearest neighbor method, originally implemented in R, into Python.
[High School Information Department "Information II" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/mext_00742.html) Chapter 3: Information and Data Science, Second Half (PDF: 7.6MB)
Learning 15, Prediction by classification: "3. Classification by the k-nearest neighbor method"
Download the digit-recognizer data from Kaggle in the same way as in the material; the file we use is "train.csv".
https://www.kaggle.com/c/digit-recognizer/data
train.csv stores 42,000 handwritten digits. Each row holds the correct label (the actual digit) in the first column (label), followed by 784 pixel columns containing the 256-level grayscale values (0–255) of the 28 × 28 pixel image.
Here, we use the first 1,000 rows as training data and the next 100 rows as test data.
```python
import numpy as np
import pandas as pd
from IPython.display import display

mnist = pd.read_csv('/content/train.csv')
mnist_light = mnist.iloc[:1000, :]
mnist_light_test = mnist.iloc[1000:1100, :]

# Training data
Y_mnist_light = mnist_light[['label']].values.ravel()
# display(Y_mnist_light)
X_mnist_light = mnist_light.drop('label', axis=1)
# display(X_mnist_light)

# Test data
Y_mnist_light_test = mnist_light_test[['label']].values.ravel()
# display(Y_mnist_light_test)
X_mnist_light_test = mnist_light_test.drop('label', axis=1)
# display(X_mnist_light_test)
```
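As a quick sanity check (my own addition, not in the teaching material), the shapes should match the split described above:

```python
# Expect (1000, 784) (1000,) for training and (100, 784) (100,) for test
print(X_mnist_light.shape, Y_mnist_light.shape)
print(X_mnist_light_test.shape, Y_mnist_light_test.shape)
```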
We train a k-nearest neighbor classifier with k = 3 on the training data, then obtain predictions for the 100 test samples. The accuracy is computed by comparing the predictions with the labels (correct values) of the test data.
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Use sklearn.neighbors.KNeighborsClassifier with k = 3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_mnist_light, Y_mnist_light)

# Run prediction on the test data
pred_y = knn.predict(X_mnist_light_test)
display(pred_y)

# Compare predictions with the correct labels
result_compare = pred_y == Y_mnist_light_test
display(result_compare)

# Accuracy
minist_accuracy_score = accuracy_score(Y_mnist_light_test, pred_y)
print(minist_accuracy_score)
```
The execution result is as follows.
```
array([1, 5, 1, 7, 9, 8, 9, 5, 7, 4, 7, 2, 8, 1, 4, 3, 8, 6, 2, 7, 2, 6,
       7, 8, 1, 8, 8, 1, 9, 0, 9, 4, 6, 6, 8, 2, 3, 5, 4, 5, 4, 1, 3, 7,
       1, 5, 0, 0, 9, 5, 5, 7, 6, 8, 2, 8, 4, 2, 3, 6, 2, 8, 0, 2, 4, 7,
       3, 4, 4, 5, 4, 3, 3, 1, 5, 1, 0, 2, 2, 2, 9, 5, 1, 6, 6, 9, 4, 1,
       7, 2, 2, 0, 7, 0, 6, 8, 0, 5, 7, 4])
array([ True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True, False,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])
0.89
```
The accuracy is 0.89.
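As an aside, the same value can be computed directly from the boolean comparison array, since accuracy is simply the fraction of True entries:

```python
# Fraction of matching predictions; should print 0.89
print(result_compare.mean())
```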
Among the test data, the fifth sample is misrecognized. Let us actually display the digits and check the misrecognized number.
```python
import matplotlib.pyplot as plt

# Display the first ten test images in a 2 x 5 grid
fig, axes = plt.subplots(2, 5)
fig.subplots_adjust(left=0, right=1, bottom=0, top=1.0, hspace=0.1, wspace=0.1)
for i in range(2):
    for j in range(5):
        axes[i, j].imshow(X_mnist_light_test.values[i * 5 + j].reshape((28, 28)), cmap='gray')
        axes[i, j].set_xticks([])
        axes[i, j].set_yticks([])
plt.show()
```
The execution result is as follows.
The handwritten digit at the far right of the top row (the fifth test sample) has the label 4, but it was predicted as 9. Visually, it does look like a digit that could be judged as a 9.
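Instead of scanning the boolean array by eye, the misclassified samples can also be listed programmatically (a small sketch using the variables defined above):

```python
# Indices where the prediction differs from the correct label
wrong = np.where(~result_compare)[0]
for i in wrong:
    print(f"index {i}: label={Y_mnist_light_test[i]}, predicted={pred_y[i]}")
```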
Next, we display a table (confusion matrix) counting the predictions: in scikit-learn's confusion_matrix, the rows are the correct labels and the columns are the predicted values.
```python
from sklearn.metrics import confusion_matrix

# Rows: correct labels, columns: predicted values
cfm = confusion_matrix(Y_mnist_light_test, pred_y)
print(cfm)
```
The execution result is as follows.
```
[[ 7  0  0  0  0  0  0  0  0  0]
 [ 0 10  0  0  0  0  0  0  0  0]
 [ 0  0 13  0  0  0  0  1  1  0]
 [ 0  0  0  5  0  1  0  0  0  0]
 [ 0  0  0  0 11  0  0  0  0  1]
 [ 0  0  0  0  0 10  0  0  0  0]
 [ 0  0  0  0  0  0  9  0  0  0]
 [ 1  0  0  0  0  0  0 10  0  1]
 [ 0  0  0  2  0  0  0  0 10  1]
 [ 0  1  0  0  1  0  0  0  0  4]]
```
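As an optional extra (not in the teaching material), recent versions of scikit-learn can also draw the same matrix as a heatmap via ConfusionMatrixDisplay:

```python
from sklearn.metrics import ConfusionMatrixDisplay

# Heatmap view of the confusion matrix (rows: correct labels, columns: predictions)
ConfusionMatrixDisplay(confusion_matrix=cfm).plot(cmap='Blues')
plt.show()
```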
Next, to see which value of k is suitable, we change k and plot the accuracy for each value.
```python
# Accuracy while changing n_neighbors
n_neighbors_chg_list = []
for i in range(1, 100):
    # Use sklearn.neighbors.KNeighborsClassifier with k = i
    knn_temp = KNeighborsClassifier(n_neighbors=i)
    knn_temp.fit(X_mnist_light, Y_mnist_light)
    # Run prediction
    pred_y_temp = knn_temp.predict(X_mnist_light_test)
    # Accuracy
    minist_accuracy_score_temp = accuracy_score(Y_mnist_light_test, pred_y_temp)
    # Store in the list
    n_neighbors_chg_list.append(minist_accuracy_score_temp)

# Plot accuracy against k (x values start at k = 1)
plt.plot(range(1, 100), n_neighbors_chg_list)
plt.show()
```
The execution result is as follows.
In general, a larger k makes the result less sensitive to individual outliers, so the effect of noise is reduced, but class boundaries tend to become less sharp. The appropriate value of k depends on the amount of training data and other factors; in this trial, the accuracy tended to decrease as k increased.
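Since a single 100-sample test set gives a fairly noisy estimate, one possible refinement (my own sketch, beyond the scope of the material) is to choose k by cross-validation on the training data:

```python
from sklearn.model_selection import cross_val_score

# Mean 5-fold cross-validation accuracy on the 1,000 training samples for each k
cv_scores = []
ks = range(1, 21)
for k in ks:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X_mnist_light, Y_mnist_light, cv=5).mean()
    cv_scores.append(score)

best_k = ks[int(np.argmax(cv_scores))]
print(best_k, max(cv_scores))
```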
https://gist.github.com/ereyester/01237a69f6b8ae73c55ccca33c931ade