Machine Learning: k-Nearest Neighbors

Target

Understand the k-nearest neighbor algorithm and try it with scikit-learn.

Theory

The k-nearest neighbor method is a supervised classification algorithm that does not learn an explicit discriminative model; it can also be applied to supervised regression and to unsupervised classification.

k-nearest neighbor algorithm

The algorithm of the k-nearest neighbor method is as follows (a minimal sketch in code follows the list).

  1. Choose the value of k and the distance metric.
  2. Compute the distance between the new input data and every training sample.
  3. Take the k training samples with the smallest distances, together with their categories.
  4. Predict the category of the new input data by majority vote among those k samples.
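As a reading aid, here is a minimal NumPy sketch of these four steps. This is not the published program; the toy data and the choice of Euclidean distance (the $p = 2$ case described next) are assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, y_train, k=3):
    """Classify one sample by majority vote among its k nearest neighbors."""
    # Step 2: distance between the new sample and every training sample (Euclidean, p = 2)
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote over the corresponding category labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features, two categories
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([1.1, 0.9]), X_train, y_train, k=3))  # -> 0
```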

The Minkowski distance is used as the distance metric. If the new input data is $x'$, a training sample is $x$, and the number of features is $m$, then

$$d(x', x) = \sqrt[p]{\sum^m_{j=1} |x'_j - x_j|^p}$$

This reduces to the Manhattan distance when $p = 1$ and to the Euclidean distance when $p = 2$.
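As an illustration, the Minkowski distance can be computed directly from this formula. The function below is a small sketch (the function name and test vectors are made up) that also checks the $p = 1$ and $p = 2$ cases against NumPy.

```python
import numpy as np

def minkowski_distance(x_new, x, p):
    """d(x', x) = (sum_j |x'_j - x_j|^p)^(1/p)"""
    return np.sum(np.abs(x_new - x) ** p) ** (1.0 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(minkowski_distance(a, b, p=1))   # 5.0, same as the Manhattan distance
print(np.sum(np.abs(a - b)))           # 5.0
print(minkowski_distance(a, b, p=2))   # ~3.606, same as the Euclidean distance
print(np.linalg.norm(a - b))           # ~3.606
```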

Disadvantages of the k-nearest neighbor method

Increasing the value of k reduces the effect of noise, but it also blurs the category boundaries, so it can be difficult to determine an appropriate value of k.

In the example shown below, when classifying the new black data point, it is classified as an orange circle at $k = 3$, as a blue square at $k = 5$, and as an orange circle again at $k = 8$.

104_nearest_neighbor.png
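This sensitivity to $k$ can be observed in scikit-learn by refitting KNeighborsClassifier with different n_neighbors values. The toy coordinates below are invented and will not reproduce the figure exactly; they only show that the prediction can change with $k$.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points: class 0 = "orange circle", class 1 = "blue square"
X = np.array([[1, 1], [1, 2], [2, 1], [4, 4], [4, 5], [5, 4], [5, 5], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
x_new = np.array([[3, 3]])  # the new data point to classify

for k in (3, 5, 8):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict(x_new))  # the predicted class can change with k
```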

Another drawback is that, if the algorithm is implemented as described, the distance to every stored training sample must be computed each time new data is predicted, so the computational cost grows in proportion to the number of training samples.
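scikit-learn mitigates this cost with tree-based neighbor searches. The algorithm parameter used below is part of KNeighborsClassifier; the random data is only there to show the option and is not from the article's experiments.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(10000, 4)                 # 10,000 samples, 4 features
y = rng.randint(0, 3, size=10000)      # 3 categories

# 'brute' computes every pairwise distance at query time;
# 'kd_tree' / 'ball_tree' pre-build an index; 'auto' chooses for you.
for algo in ("brute", "kd_tree", "ball_tree"):
    clf = KNeighborsClassifier(n_neighbors=15, algorithm=algo).fit(X, y)
    print(algo, clf.predict(X[:5]))
```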

Classification problem dataset

As the subject for the k-nearest neighbor method, we use the iris dataset, one of the classification datasets bundled with scikit-learn.

Number of data | Number of features | Number of categories
150            | 4                  | 3
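These figures can be confirmed with scikit-learn's bundled loader:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, 4 features
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']: 3 categories
```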

Implementation

Execution environment

Hardware

・CPU: Intel(R) Core(TM) i7-6700K 4.00GHz

Software

・Windows 10 Pro 1909
・Python 3.6.6
・Matplotlib 3.1.1
・Numpy 1.19.2
・Scikit-learn 0.23.0

Program to run

The implemented program is published on GitHub.

nearest_neighbor.py
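The published program is not reproduced here, but a minimal sketch of the supervised-classification part (two iris features, $k = 15$, decision regions drawn over a grid) might look as follows. The plotting details and variable names are assumptions, not the actual nearest_neighbor.py.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target          # first two features for a 2-D plot

clf = KNeighborsClassifier(n_neighbors=15).fit(X, y)

# Evaluate the classifier on a grid to draw the decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title("k-nearest neighbors (k = 15)")
plt.show()
```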


Result

Supervised classification

The execution result when $ k = 15 $ is as follows.

104_nearest_neighbor_classification.png

The blue area is classified as setosa, the green area as versicolor, and the orange area as virginica.

Supervised regression

For the regression problem, the data were generated by adding random noise to a sine wave, and the model was run with $k = 5$.

104_nearest_neighbor_regression.png
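A minimal sketch of such a regression experiment with KNeighborsRegressor follows; the number of points and the noise level are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)        # 40 points on [0, 5]
y = np.sin(X).ravel() + 0.1 * rng.randn(40)     # sine wave plus random noise

reg = KNeighborsRegressor(n_neighbors=5).fit(X, y)

X_test = np.linspace(0, 5, 500).reshape(-1, 1)
plt.scatter(X, y, label="noisy training data")
plt.plot(X_test, reg.predict(X_test), label="k-NN regression (k = 5)")
plt.legend()
plt.show()
```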

Unsupervised classification

Using the classification data without any category labels, we performed unsupervised classification with $k = 15$.

104_nearest_neighbor_unsupervised.png
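In scikit-learn, unsupervised nearest neighbors means indexing the data without labels and querying for the $k$ closest points; the sketch below assumes that usage of the NearestNeighbors class.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris().data                       # feature matrix only, no category labels

nn = NearestNeighbors(n_neighbors=15).fit(X)
distances, indices = nn.kneighbors(X[:1])  # 15 nearest neighbors of the first sample
print(indices)
print(distances)
```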

Reference

1.6. Nearest Neighbors
