Understand the k-nearest neighbor algorithm and try it with scikit-learn.
The k-nearest neighbor method is one of the supervised classification algorithms that does not acquire a discriminative model and can also be used as supervised regression or unsupervised classification.
The algorithm for the k-nearest neighbor method is as follows.
The Minkowski distance is used as the distance index. Here, if the new input data is $ x'$, the training data is $ x $, and the number of features is $ m $,
d(x', x) = \sqrt[p]{\sum^m_{j=1} |x'_j - x_j|^p }
This is synonymous with Manhattan distance when $ p = 1 $ and Euclidean distance when $ p = 2 $.
While increasing the value of k can reduce the effects of noise, it can be difficult to determine the value of k because the boundaries of the categories may not be clear.
In the case shown below, when classifying new black data points, when $ k = 3 $, it is classified as an orange circle, but when $ k = 5 $, it is classified as a blue square, and $ k. When = 8 $, it is classified as an orange circle again.
Also, if implemented according to the algorithm, it is necessary to calculate the distance index for all the data held when predicting new data, so the amount of calculation will increase in proportion to the number of data. There are drawbacks.
As the subject of the k-nearest neighbor method, we use the iris dataset, which is a dataset of scikit-learn classification problems.
The number of data | Number of features | Number of categories |
---|---|---|
150 | 4 | 3 |
-CPU Intel (R) Core (TM) i7-6700K 4.00GHz
・ Windows 10 Pro 1909 ・ Python 3.6.6 ・ Matplotlib 3.1.1 ・ Numpy 1.19.2 ・ Scikit-learn 0.23.0
The implemented program is published on GitHub.
nearest_neighbor.py
The execution result when $ k = 15 $ is as follows.
The blue area is classified as setosa, the green area as versicolor, and the orange area as virginica.
The data of the regression problem was executed as $ k = 5 $ by adding a random number to the sine wave.
In a classification problem, we performed unsupervised classification at $ k = 15 $ without giving a category label.
Recommended Posts