Suppose there are several groups whose member samples have known attributes. When a **new sample** is added, **k-NN (k-Nearest Neighbor method)** determines which group it belongs to.
Specifically, it is a classification method that picks, from the existing samples, the **k samples whose attributes are most similar to the new sample** and assigns the new sample to the majority group among them. These **k nearest neighbors** are what give **k-NN** its name.

For example, the same new sample may be classified into the blue group when k = 3 but into the green group when k = 12, so the result depends on the value of k.
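The dependence on k can be reproduced with a few lines of NumPy. The following is a minimal sketch of the majority-vote idea, not scikit-learn's implementation: the function name `knn_predict` and the toy data are made up for illustration, and Euclidean distance is assumed as the measure of similarity.

```python
import numpy as np

def knn_predict(X_known, y_known, x_new, k):
    """Classify x_new by majority vote among its k nearest known samples."""
    # Euclidean distance from the new sample to every known sample
    distances = np.linalg.norm(X_known - x_new, axis=1)
    # Indices of the k closest samples
    nearest = np.argsort(distances)[:k]
    # Count the votes per group and return the group with the most votes
    votes = np.bincount(y_known[nearest])
    return int(np.argmax(votes))

# Toy data (hypothetical): group 0 is close to the new sample,
# group 1 has more members but lies farther away.
X_known = np.array([[1.0, 0.0], [0.0, 1.0],                  # group 0
                    [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0],
                    [0.0, -2.0], [2.0, 2.0]])                # group 1
y_known = np.array([0, 0, 1, 1, 1, 1, 1])
x_new = np.array([0.0, 0.0])

print(knn_predict(X_known, y_known, x_new, k=3))  # 2 of the 3 nearest are group 0
print(knn_predict(X_known, y_known, x_new, k=7))  # 5 of the 7 nearest are group 1
```

With a small k the two close neighbors win the vote; with a large k the more numerous group takes over.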

**k-NN can also be used for regression, but here we cover the classification case.**

⑴ Import library

import numpy as np
import pandas as pd

from sklearn import datasets
# k-NN classifier from the sklearn.neighbors module
from sklearn.neighbors import KNeighborsClassifier
# Utility for splitting data into training and test sets
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
# Class for building a color map from a list of colors
from matplotlib.colors import ListedColormap

# Module for displaying Japanese text in matplotlib
!pip install japanize-matplotlib
import japanize_matplotlib

Prepare the data

Load the "iris" dataset bundled with scikit-learn.
It has 4 features, the length and width of the petals and of the sepals of iris flowers, and contains 150 samples in total: 3 species x 50 samples each. These three species correspond to the groups.

| | Variable name | Meaning | Note | Data type |
|---|---|---|---|---|
| 0 | species | Species | Setosa=0, Versicolour=1, Virginica=2 | int64 |
| 1 | sepal length | Sepal length | Continuous (cm) | float64 |
| 2 | sepal width | Sepal width | Continuous (cm) | float64 |
| 3 | petal length | Petal length | Continuous (cm) | float64 |
| 4 | petal width | Petal width | Continuous (cm) | float64 |
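The layout in the table can be reproduced by putting the dataset into a pandas DataFrame. This is only a sketch for inspection: the column name `species` and its position as the first column are chosen here to mirror the table and are not part of the scikit-learn dataset itself.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Features as columns, with the species code added as the first column
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.insert(0, "species", iris.target)

print(df.shape)                                    # rows x columns
print(df.dtypes)                                   # data type of each column
print(df["species"].value_counts().sort_index())   # samples per species
```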

⑵ Data acquisition

iris = datasets.load_iris()

Check the contents of the explanatory variables (features) and of the objective variable (species).

#Explanatory variable (feature)
print("label:\n", iris.feature_names)
print("shape:\n", iris.data.shape)
print("First 10 lines:\n", iris.data[0:10, :]) 

#Objective variable (type)
print("label:\n", iris.target_names)
print("shape:\n", iris.target.shape)
print("Full display:\n", iris.target)

⑶ Data division

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, 
    iris.target,
    stratify = iris.target, #Stratified sampling
    random_state = 0)

The argument stratify = iris.target specifies **stratified sampling** by species (iris.target). Without it the split is purely random, so here the data are divided in a way that keeps the proportions of the three species the same in both the training and the test set.
Check the contents of the objective variable for the training set only.

print("shape:", y_train.shape)

#Get the number of unique elements
np.unique(y_train, return_counts=True)

By default, 75% of the data is allocated to training, and the resulting 112 training samples are split almost evenly among the 3 species (37 or 38 each).
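To see what stratification buys, the species counts of a stratified split can be compared with those of a purely random split. The sketch below reuses the same `random_state` as the article; the variable names with `_s` and `_r` suffixes are introduced here only for the comparison.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Same split as in the article: stratified by species
_, _, y_train_s, y_test_s = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)

# For comparison: the default, purely random split
_, _, y_train_r, y_test_r = train_test_split(
    iris.data, iris.target, random_state=0)

# np.bincount counts how many samples carry each species code 0, 1, 2
print("stratified  train:", np.bincount(y_train_s), " test:", np.bincount(y_test_s))
print("random      train:", np.bincount(y_train_r), " test:", np.bincount(y_test_r))
```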

Determine the number of k

⑷ Execute k-NN while changing the k parameter

Run k-NN for each k from 3 to 20 and observe how the accuracy on the training data and on the test data changes.

# Lists to store the accuracy for each k
training_accuracy = []
test_accuracy = []

# Run k-NN while changing k and record the accuracy
for k in range(3,21):
    # Create an instance with the given k and fit it to the training data
    kNN = KNeighborsClassifier(n_neighbors = k)
    kNN.fit(X_train, y_train)
    # score returns the accuracy; append it to the lists
    training_accuracy.append(kNN.score(X_train, y_train))
    test_accuracy.append(kNN.score(X_test, y_test))

# Convert the accuracy lists to numpy arrays
training_accuracy = np.array(training_accuracy)
test_accuracy = np.array(test_accuracy)
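The value returned by `score` for a classifier is the mean accuracy, that is, the share of samples whose predicted species matches the true one. A small check under the same split as above (k = 3 is picked arbitrarily for the illustration):

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)

kNN = KNeighborsClassifier(n_neighbors=3)
kNN.fit(X_train, y_train)

# Accuracy computed by hand: share of test samples predicted correctly
manual_accuracy = np.mean(kNN.predict(X_test) == y_test)

# Both values should agree
print(manual_accuracy, kNN.score(X_test, y_test))
```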

⑸ Select the optimum k parameter

Plot how the accuracy on the training data and on the test data changes with k, and plot the difference between the two in a second graph.

# Accuracy on the training and test data as k changes
plt.figure(figsize=(6, 4))

plt.plot(range(3,21), training_accuracy, label='Training')
plt.plot(range(3,21), test_accuracy, label='Test')

plt.xticks(np.arange(2, 21, 1)) # x-axis ticks
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.title('Accuracy for each k')

plt.grid()
plt.legend()

# Absolute difference between training and test accuracy
plt.figure(figsize=(6, 4))

difference = np.abs(training_accuracy - test_accuracy) # absolute difference
plt.plot(range(3,21), difference, label='Difference')

plt.xticks(np.arange(2, 21, 1)) # x-axis ticks
plt.xlabel('k')
plt.ylabel('|train - test|')
plt.title('Difference in accuracy for each k')

plt.grid()
plt.legend()

plt.show()

As k varied from 3 to 20, the test accuracy stayed at 100% except for a drop at k = 14.
The training accuracy, on the other hand, rose gradually from k = 3 to 6, stayed flat until k = 11, and dropped at k = 12. After that it trended upward with some ups and downs, peaked at k = 15, and then turned downward.
The plot of the difference also shows that the **training and test accuracies are closest** at k = 15.
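Instead of reading the value off the graph, the k with the smallest gap can also be picked programmatically. This is a sketch of that one criterion only; the names `ks` and `best_k` are introduced here, and `np.argmin` returns the first minimum if several k values tie.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=0)

ks = np.arange(3, 21)
training_accuracy = []
test_accuracy = []
for k in ks:
    kNN = KNeighborsClassifier(n_neighbors=int(k)).fit(X_train, y_train)
    training_accuracy.append(kNN.score(X_train, y_train))
    test_accuracy.append(kNN.score(X_test, y_test))

# Absolute gap between training and test accuracy for each k
difference = np.abs(np.array(training_accuracy) - np.array(test_accuracy))

# k with the smallest gap (first one in case of ties)
best_k = int(ks[np.argmin(difference)])
print("k with the smallest gap:", best_k)
print("gap at that k:", difference.min())
```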

Execute and visualize k-NN

⑹ Re-execute k-NN with the optimum k parameter

Adopt k = 15 and run k-NN again.
Here only the first two of the four features (sepal length and sepal width) are used, so that the result can be drawn on a two-dimensional plane.

#Specify the number of k
k = 15

#Set explanatory variable X and objective variable y
X = iris.data[:, :2]
y = iris.target

#Create an instance, fit the data and generate a model
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X, y)
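Once fitted, the model can classify a new sample, and the neighbors behind the decision can be inspected. The sketch below assumes a hypothetical new flower with sepal length 6.0 cm and sepal width 3.0 cm; the measurement is made up for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

model = KNeighborsClassifier(n_neighbors=15)
model.fit(X, y)

# Hypothetical new sample: sepal length 6.0 cm, sepal width 3.0 cm
new_sample = np.array([[6.0, 3.0]])

pred = model.predict(new_sample)          # predicted species code
proba = model.predict_proba(new_sample)   # share of neighbor votes per species
dist, idx = model.kneighbors(new_sample)  # distances and indices of the 15 neighbors

print("predicted species:", iris.target_names[pred[0]])
print("share of votes per species:", proba[0])
print("species of the 15 neighbors:", y[idx[0]])
```

With the default uniform weights, the vote shares are simply the neighbor counts per species divided by k.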

⑺ Plot on contour diagram (isoline diagram)

Create mesh data Z to draw the boundaries of each group on a two-dimensional plane.

#Specify mesh spacing
h = 0.02

#Create a color map
cmap_surface = ListedColormap(['darkseagreen', 'mediumpurple', 'gold']) #For area charts
cmap_dot = ListedColormap(['darkgreen', 'darkslateblue', 'olive']) #For scatter plots

# Get the minimum and maximum values for the x- and y-axes, with a margin of 1
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
# Generate grid points at the specified mesh spacing
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict the group for every grid point
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape) # Reshape back to the grid shape

The grid arrays xx and yy are flattened to one dimension with ravel(), joined column-wise with numpy's np.c_ into an array of (x, y) pairs, and passed to the model for prediction.
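What this flattening and joining does is easiest to see on a tiny grid. The grid values below are arbitrary and only serve the illustration.

```python
import numpy as np

# A tiny 2 x 3 grid: x takes 0, 1, 2 and y takes 10, 11
xx, yy = np.meshgrid(np.arange(0, 3), np.arange(10, 12))
print(xx)
print(yy)

# ravel() flattens each grid to one dimension;
# np.c_ then places the two flat arrays side by side as columns
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points.shape)
print(grid_points)
```

Each row of `grid_points` is one (x, y) coordinate, which is the two-feature input shape the model expects.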
As an example, widening the mesh spacing to 0.8 makes Z small enough to display in full.
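A coarse Z of this kind can be produced with the sketch below, which repeats the model fit and mesh construction with the spacing set to 0.8. The exact printed array is not reproduced here; run the code to see it.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
model = KNeighborsClassifier(n_neighbors=15).fit(X, y)

# Coarse mesh spacing so that Z stays small enough to read
h = 0.8
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# One predicted species code (0, 1 or 2) per grid point
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

print(Z.shape)
print(Z)
```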

In this way, Z holds the predicted species (group) for each cell of the mesh at the specified spacing.
Based on Z, draw a contour diagram (isoline diagram) and plot the individual samples on top of it.

plt.figure(figsize=(6,5))

# Colored regions showing the predicted group for each mesh cell
plt.pcolormesh(xx, yy, Z, cmap=cmap_surface)
#Scatter plot
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_dot, s=30)

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xlabel('sepal length')
plt.ylabel('sepal width')

plt.show()

matplotlib's pcolormesh() function creates a pseudocolor plot on a non-regular rectangular grid.
In the call pcolormesh(xx, yy, Z, cmap=cmap_surface), the first two arguments are the mesh coordinates, Z holds the group predicted for each cell, and cmap assigns a color to each group.
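The role of each argument can be checked on a hand-made grid. The values of Z below are made up, and `shading='auto'` is passed explicitly here (an addition not present in the article's call) so that coordinates and values of the same shape are accepted across matplotlib versions.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Mesh coordinates: 4 points along x, 3 points along y
x, y = np.meshgrid(np.arange(0, 4), np.arange(0, 3))

# Hypothetical group code (0, 1 or 2) for each grid point
Z = np.array([[0, 0, 1, 1],
              [0, 1, 1, 2],
              [1, 1, 2, 2]])

# One color per group code
cmap = ListedColormap(['darkseagreen', 'mediumpurple', 'gold'])

fig, ax = plt.subplots(figsize=(4, 3))
mesh = ax.pcolormesh(x, y, Z, cmap=cmap, shading='auto')
fig.colorbar(mesh, ax=ax, ticks=[0, 1, 2])
# In a notebook the figure is rendered automatically;
# in a script, call plt.show() to display it.
```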

Afterword

In this example, groups 1 and 2 are intermingled, and enclaves can be seen. However, if the analysis axes (the combination of features) are changed, the map looks quite different, and the predicted regions and the data may separate clearly between the groups. In this way a more effective combination of features can be grasped visually.
As a procedure, it is important to determine the optimum value of k first.
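The effect of changing the analysis axes can be checked roughly by fitting the same model on two different feature pairs. This is only a sketch: the choice of pairs and the dictionary names are made for illustration, and the scores are computed on the same data used for fitting, so they indicate how cleanly the groups separate rather than how well the model generalizes.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
y = iris.target

# Column indices of two candidate feature pairs
pairs = {
    "sepal length / sepal width": [0, 1],
    "petal length / petal width": [2, 3],
}

scores = {}
for name, cols in pairs.items():
    X_pair = iris.data[:, cols]
    model = KNeighborsClassifier(n_neighbors=15).fit(X_pair, y)
    # Accuracy on the fitted data itself, used here only to compare separation
    scores[name] = model.score(X_pair, y)
    print(name, scores[name])
```

Passing another pair of columns as `X` to the plotting code above would draw the corresponding map.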

2. Multivariate analysis spelled out in Python 8-1. K-nearest neighbor method (scikit-learn)
