While brushing up on my machine learning fundamentals, I studied the most basic method, k-NN (the k-nearest neighbor method), and implemented it myself.
First, let's review the outline of k-NN. It is a very simple and intuitive algorithm, and the figure below summarizes it completely. The data point of interest is the star in the center. First, compute the distance from this point to every other data point. Then, for k = 3 for example, look at the labels of the 3 nearest points. In this case there are two class-1 points and one class-2 point, so the point is classified as class 1, the mode of the labels. On the other hand, with k = 6 there are two class-1 points and four class-2 points, so the point is classified as class 2. In other words, a given data point is assigned the label chosen by a majority vote among its most similar neighbors. Incidentally, the case k = 1 is also called the nearest neighbor method.

A similarly named technique is the [k-means method](https://ja.wikipedia.org/wiki/K%E5%B9%B3%E5%9D%87%E6%B3%95), but that is an unsupervised clustering method, unlike k-NN, which is a supervised classification method.
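For reference only, scikit-learn already ships this algorithm as KNeighborsClassifier. The sketch below uses made-up toy points that roughly mirror the figure to show the majority-vote behavior; it is not the hand-made implementation described in the rest of this post.

```python
# Reference only: k-NN via scikit-learn's KNeighborsClassifier.
# The toy points below are made up for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],    # three class-1 points
              [1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])   # three class-2 points
y = np.array([1, 1, 1, 2, 2, 2])

clf = KNeighborsClassifier(n_neighbors=3)  # k = 3
clf.fit(X, y)
print(clf.predict([[0.15, 0.15]]))  # -> [1]: all 3 nearest neighbors are class 1
```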
Now let's actually implement it ourselves in Python.
We will use the famous Iris Data Set. Download iris.data and load it as follows.
knn.py
# Features: the four measurements in the first four columns
data = np.loadtxt('iris.data', delimiter=',', dtype='float64', usecols=[0, 1, 2, 3])
# Labels: the species name in the fifth column
labels = np.loadtxt('iris.data', delimiter=',', dtype='str', usecols=[4])
print(data.shape)
print(labels.shape)
(150, 4)
(150,)
This dataset consists of 150 samples, each with 4 features and a label (3 classes).
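As a quick, optional sanity check, we can look at the class distribution using the `labels` array loaded above; the Iris Data Set contains 50 samples of each of the three species.

```python
# Count how many samples belong to each class
classes, counts = np.unique(labels, return_counts=True)
print(classes)  # ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
print(counts)   # [50 50 50]
```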
The following libraries are used.
knn.py
import numpy as np
from sklearn import model_selection
from scipy.spatial import distance
from sklearn.metrics import accuracy_score
from statistics import mean, median, variance
First, compute the distance from each given (test) point to every training point.
knn.py
# Pairwise distances: rows are test samples, columns are training samples
distance_matrix = distance.cdist(test_data, train_data)
# For each test sample, training-sample indices sorted from nearest to farthest
indexes = np.argsort(distance_matrix, axis=1)
Next, for each given data point, count how many of its k nearest neighbors belong to each class.
knn.py
# Template dictionary with a zero count for every class
class_dict = {}
for label in labels:
    class_dict[label] = 0

# For each test sample, count the classes among its k nearest training samples
# (self._labels holds the training labels, self.k is the number of neighbors)
class_dict_list = []
for data_num, index in enumerate(indexes[:, :self.k]):
    class_dict_list.append(class_dict.copy())
    for i in index:
        class_dict_list[data_num][self._labels[i]] += 1
Finally, identify the label of the most common class.
knn.py
# Majority vote: the class with the highest count becomes the prediction
predict_class = []
for d in class_dict_list:
    max_class = max(d, key=d.get)
    predict_class.append(max_class)
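Incidentally, the counting and the majority vote can also be written more compactly with collections.Counter. This is just an equivalent alternative to the two snippets above and reuses the same names (indexes, self._labels, self.k).

```python
from collections import Counter

# For each test sample, take the most common label among its k nearest neighbors
predict_class = [
    Counter(self._labels[i] for i in index).most_common(1)[0][0]
    for index in indexes[:, :self.k]
]
```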
That's all for the k-NN algorithm itself.
This time, each class is randomly split in half into training and test data, and the training-and-evaluation cycle is repeated 20 times; the evaluation loop is sketched below, followed by the actual execution results. (Since the split is random, the accuracy will vary slightly from run to run.)
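The following is a minimal sketch of that loop. It assumes a hypothetical KNN class (not shown here) that wraps the distance, counting, and voting snippets above behind fit() and predict() methods; the stratified 50/50 split and the mean/variance bookkeeping follow from the description and the imports above.

```python
# Sketch of the evaluation loop (KNN is a hypothetical class wrapping the snippets above).
accuracies = []
for trial in range(20):
    print('training number', trial + 1, '...')
    # Stratified 50/50 split so that each class is divided in half
    train_data, test_data, train_labels, test_labels = model_selection.train_test_split(
        data, labels, test_size=0.5, stratify=labels)
    model = KNN(k=3)                      # hypothetical wrapper class
    model.fit(train_data, train_labels)
    predicted = model.predict(test_data)
    acc = accuracy_score(test_labels, predicted)
    print('knn accuracy :', acc)
    accuracies.append(acc)

print('=' * 49)
print('knn accuracy mean :', mean(accuracies))
print('knn accuracy variance :', variance(accuracies))
```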
training number 1 ...
knn accuracy : 0.9466666666666667
training number 2 ...
knn accuracy : 0.9333333333333333
training number 3 ...
knn accuracy : 0.9466666666666667
training number 4 ...
knn accuracy : 0.9466666666666667
training number 5 ...
knn accuracy : 0.9333333333333333
training number 6 ...
knn accuracy : 0.92
training number 7 ...
knn accuracy : 0.9466666666666667
training number 8 ...
knn accuracy : 0.9466666666666667
training number 9 ...
knn accuracy : 0.8933333333333333
training number 10 ...
knn accuracy : 0.9466666666666667
training number 11 ...
knn accuracy : 0.96
training number 12 ...
knn accuracy : 0.96
training number 13 ...
knn accuracy : 0.96
training number 14 ...
knn accuracy : 0.96
training number 15 ...
knn accuracy : 0.92
training number 16 ...
knn accuracy : 0.96
training number 17 ...
knn accuracy : 0.92
training number 18 ...
knn accuracy : 0.9866666666666667
training number 19 ...
knn accuracy : 0.9333333333333333
training number 20 ...
knn accuracy : 0.96
=================================================
knn accuracy mean : 0.944
knn accuracy variance : 0.00042292397660818664
As you can see, the classifier achieves high accuracy.
The algorithm itself is simple, and it was very easy to implement. It works very well when, as with this dataset, the dimensionality is low and the number of samples is small. In reality, however, for high-dimensional data the [curse of dimensionality](https://ja.wikipedia.org/wiki/%E6%AC%A1%E5%85%83%E3%81%AE%E5%91%AA%E3%81%84) makes Euclidean distance nearly meaningless, and for large datasets the method is impractical in both time and space complexity. To address these problems, methods that hash or quantize the vectors have been proposed to reduce memory consumption and computational cost. The full source code is available on GitHub: https://github.com/kotaYkw/machine_learning