This article is a memo I took while studying the videos of a great programmer, sentdex. https://www.youtube.com/user/sentdex
First of all, what is K-NN? According to Wikipedia: https://ja.wikipedia.org/wiki/K%E8%BF%91%E5%82%8D%E6%B3%95
The k-nearest neighbor method is among the simplest of all machine learning algorithms. The classification of an object is determined by a vote of its neighbors (that is, the object is assigned the most common class among its k nearest neighbors). k is a positive integer, generally small. If k = 1, the object is simply assigned the class of its single nearest neighbor. In binary classification, choosing an odd k avoids the problem of tied votes.
Yeah, a very democratic approach. First, look at the graph below. There are 6 data points. What if you want to group them?
Maybe it will look like this. But why? Because they are close to each other. What we are doing here is called "clustering".
All right. This time there are already two groups, + and -.
Now a red dot appears. Which group will this red dot belong to? You will probably say the + group. But why? What went through your head when you decided that?
The next red dot will probably end up in the - group. Again, why?
And the last red dot: if it is here, which group does it belong to? That depends on the algorithm you choose.
Let's get back to the story. Probably (at least for me) this red dot looks closer to the + group than to the - group, so I decide the red dot belongs to the + group. We call this "idea" K-nearest neighbors (K-NN); the Japanese name would be the k-nearest neighbor method (k近傍法). All K-NN has to do is measure the distance between the point we want to classify and the other points, and decide which group it joins. In the example above there are only 2 groups, but what if there are 100 points? What if there are 1,000? Yeah, I'm going to lose my mind ...
But think again. Do I really have to measure the distance to every single point? In most cases, I don't think so. That is where the "K" in K-NN comes in. For now, let's set K = 2.
Yeah, I see. The two points circled in orange are probably the closest to the red dot. Therefore, I judge that this red dot belongs to the + group.
This time a black dot appears. With K = 2, which are the two closest points? Yeah, probably these two. But now I have a problem: it's a draw! I can't decide. This is why K-NN basically doesn't allow k to be even.
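Just to illustrate the draw problem with a tiny, made-up example: with an even k and one vote for each group, a simple majority count cannot break the tie in any meaningful way.

from collections import Counter

# Hypothetical votes from the k = 2 nearest neighbors: one '+' and one '-'.
votes = ['+', '-']

# most_common(1) still returns an answer, but only because ties fall back to
# insertion order. The tie itself is never resolved, hence the odd k.
print(Counter(votes).most_common(1))   # [('+', 1)]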
So let's set K = 3. With K = 3 we need one more point. Maybe this one is closer? So this black dot belongs to the - group.
So, based on that vote, we have two - votes and one + vote. 2/3 ≈ 66%, and this 66% is the confidence of the decision. The distance measured between the points is called the Euclidean distance.
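Here is a minimal sketch of that whole decision done by hand. The coordinates below are made up for illustration; they are not the points in the figures above.

from math import sqrt
from collections import Counter

# Hypothetical labeled points: a '+' group and a '-' group.
points = {'+': [(5, 4), (6, 5), (7, 8)],
          '-': [(1, 2), (2, 4), (3, 5)]}
new_point = (4, 4)   # the black dot we want to classify
k = 3

# Distance from the new point to every labeled point.
distances = []
for label, group in points.items():
    for (x, y) in group:
        d = sqrt((x - new_point[0])**2 + (y - new_point[1])**2)
        distances.append((d, label))

# Vote with the k nearest points and compute the confidence.
votes = [label for _, label in sorted(distances)[:k]]
winner, count = Counter(votes).most_common(1)[0]
print(winner, count / k)   # '-' 0.666..., i.e. the - group with about 66% confidence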
Let's decide the flow first.
Let's implement it using sklearn.
First of all, we need to prepare a data set. This is the one I will use this time: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29 (breast cancer data).
I think the downloaded one will look like this.
Open breast-cancer-wisconsin.names for a description of the data. The meaning of each value is explained in section 7; from it, attributes 1 to 10 are the features and attribute 11 is the label. The actual data is in breast-cancer-wisconsin.data, but the important column names are not included in it. You could add them programmatically; for now, I enter them manually ...
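If you would rather add the column names programmatically instead of editing the file, something like the sketch below should work. The names are my reading of the attribute list in breast-cancer-wisconsin.names, so double-check them against your copy.

import pandas as pd

# Attribute names based on section 7 of breast-cancer-wisconsin.names.
columns = ['id', 'clump_thickness', 'unif_cell_size', 'unif_cell_shape',
           'marg_adhesion', 'single_epith_cell_size', 'bare_nuclei',
           'bland_chrom', 'norm_nucleoli', 'mitoses', 'class']

df = pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data',
                 names=columns)
print(df.head())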
import numpy as np
from sklearn import preprocessing,model_selection,neighbors
import pandas as pd
The id column doesn't seem to have anything to do with the classification, so drop it. The breast-cancer-wisconsin.names file says that missing values appear as "?". Since NaN values are basically a no-go for the algorithm, we replace them with -99999 here.
df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
Simply put, X is the data without the labels and y is the labels only. So X gets the DataFrame with the "class" column dropped, and y gets only the "class" column.
#features
x=np.array(df.drop(['class'],1))
#label
y=np.array(df['class'])
Split the data into 80% training data and 20% test data.
x_train,x_test,y_train,y_test=model_selection.train_test_split(x,y,test_size=0.2)
Since we are using K-NN, the classifier is KNeighborsClassifier (its default number of neighbors is 5).
clf=neighbors.KNeighborsClassifier()
Train
clf.fit(x_train,y_train)
Finally, check the accuracy on the test set. Yeah, it's about 96%, which is pretty good.
accuracy=clf.score(x_test,y_test)
print(accuracy)
>0.9642857142857143
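As a quick sanity check, you can also ask the trained classifier to predict a brand-new sample. The 9 feature values below are hypothetical, purely for illustration, not real patient data.

example = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1]])   # 9 made-up feature values
print(clf.predict(example))   # prints [2] (benign) or [4] (malignant)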
import numpy as np
from sklearn import preprocessing,model_selection,neighbors
import pandas as pd
df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
#features
x=np.array(df.drop(['class'],1))
#label
y=np.array(df['class'])
x_train,x_test,y_train,y_test=model_selection.train_test_split(x,y,test_size=0.2)
clf=neighbors.KNeighborsClassifier()
clf.fit(x_train,y_train)
accuracy=clf.score(x_test,y_test)
print(accuracy)
From here on, I will write K-NN myself without relying on sklearn. Let's look at the idea again first. What actually decides which group the black dot joins is distance, and this distance is called the "Euclidean distance".
The formula is as follows:

$$ d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} $$

where $i$ runs over the $n$ dimensions and $q$ and $p$ are the two points.
For example, with q = (1, 3) and p = (2, 5): $\sqrt{(1-2)^2 + (3-5)^2} = \sqrt{5} \approx 2.236$.
Let's write this formula in python first.
from math import sqrt
plot1=[1,3]
plot2=[2,5]
euclidean_distance=sqrt( (plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2 )
print(euclidean_distance)
Before writing the function, let's think through the idea. The code below first plots the data set and the new point. As for the function's arguments: data is the training data, predict is the point we want to classify (the test data), and k is how many nearest points take part in the vote (I think sklearn's default is k = 5 ...). At the end it performs some magic and produces the result!
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
style.use('fivethirtyeight')
data_set={'k':[[1,2],[2,3],[3,1]]
,'r':[[6,5],[7,7],[8,6]]}
new_features=[5,7]
[[plt.scatter(ii[0],ii[1],s=100,color=i) for ii in data_set[i]] for i in data_set]
plt.scatter(new_features[0],new_features[1])
def k_nearest_neighbors(data,predict,k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    #? make some magic
    return vote_result
What do we actually want this function to do? We need to compare the point we want to predict with every other data point and find the closest ones. Having to compare the predict point against all points is the weak spot of K-NN. Admittedly, you could save computation time by setting a radius and only looking at the points inside it.
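As a rough sketch of that radius idea (the max_radius parameter below is hypothetical; it is not part of the function we are building in this article):

import numpy as np

def neighbors_within_radius(data, predict, max_radius):
    # Keep only points whose distance to `predict` is at most `max_radius`.
    # (A real speed-up would also need a spatial index such as a k-d tree;
    # this sketch still scans every point.)
    kept = []
    for group in data:
        for features in data[group]:
            d = np.linalg.norm(np.array(features) - np.array(predict))
            if d <= max_radius:
                kept.append((d, group))
    return kept

# e.g. neighbors_within_radius(data_set, new_features, max_radius=4.0)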
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
style.use('fivethirtyeight')
data_set={'k':[[1,2],[2,3],[3,1]]
,'r':[[6,5],[7,7],[8,6]]}
new_features=[5,7]
[[plt.scatter(ii[0],ii[1],s=100,color=i) for ii in data_set[i]] for i in data_set]
plt.scatter(new_features[0],new_features[1])
def k_nearest_neighbors(data,predict,k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        # data_set={'k':[[1,2],[2,3],[3,1]],'r':[[6,5],[7,7],[8,6]]}
        # iterating over the group keys 'k' and 'r'
        for features in data[group]:
            # iterating over each point inside the group
            # not fast and only works for 2 dimensions:
            # euclidean_distance=sqrt( (features[0]-predict[0])**2 + (features[1]-predict[1])**2 )
            # better:
            # euclidean_distance=np.sqrt( np.sum((np.array(features)-np.array(predict))**2) )
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    print(Counter(votes).most_common(1))
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
result=k_nearest_neighbors(data_set,new_features,k=3)
print(result)
[[plt.scatter(ii[0],ii[1],s=100,color=i) for ii in data_set[i]] for i in data_set]
plt.scatter(new_features[0],new_features[1],color=result)
Yeah, it's the red group!
Now let's use the real data and compare with scikit-learn to see which is more accurate.
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
def k_nearest_neighbors(data,predict,k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        # data_set={'k':[[1,2],[2,3],[3,1]],'r':[[6,5],[7,7],[8,6]]}
        # iterating over the group keys (the classes)
        for features in data[group]:
            # iterating over each point inside the group
            # not fast and only works for 2 dimensions:
            # euclidean_distance=sqrt( (features[0]-predict[0])**2 + (features[1]-predict[1])**2 )
            # better:
            # euclidean_distance=np.sqrt( np.sum((np.array(features)-np.array(predict))**2) )
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    #print(Counter(votes).most_common(1))  # debug print; noisy when run over the whole test set
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
full_data=df.astype(float).values.tolist()
#random sort your data
random.shuffle(full_data)
test_size=0.2
train_set={2:[],4:[]}
test_set={2:[],4:[]}
train_data=full_data[:-int(test_size*len(full_data))]#first 80%
test_data=full_data[-int(test_size*len(full_data)):] #last 20%
for i in train_data:
    # train_set[i[-1]] picks the list for this row's class (2 or 4)
    # .append(i[:-1]) adds all the features (everything except the class)
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    # loop over the classes 2 and 4
    for data in test_set[group]:
        # take each data point from that class
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy:', correct/total)
>Accuracy: 0.9640287769784173
sklearn Function
The code is the same as before.
import numpy as np
from sklearn import preprocessing,model_selection,neighbors
import pandas as pd
df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
#features
x=np.array(df.drop(['class'],1))
#label
y=np.array(df['class'])
x_train,x_test,y_train,y_test=model_selection.train_test_split(x,y,test_size=0.4)
clf=neighbors.KNeighborsClassifier()
clf.fit(x_train,y_train)
accuracy=clf.score(x_test,y_test)
print('Accuracies:',accuracy)
>Accuracies: 0.9671428571428571
… The accuracy hardly changes, but the computation is clearly faster with sklearn.
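If you want to check the speed difference yourself, a rough timing sketch could look like this. It assumes the variables from both scripts above are still defined (train_set/test_set from the hand-written version, x_train/x_test/y_train/y_test from the sklearn version).

import time

# Time the hand-written K-NN over the whole test set.
start = time.perf_counter()
for group in test_set:
    for data in test_set[group]:
        k_nearest_neighbors(train_set, data, k=5)
print('custom K-NN :', time.perf_counter() - start, 'seconds')

# Time sklearn's KNeighborsClassifier on the same task.
start = time.perf_counter()
clf = neighbors.KNeighborsClassifier()
clf.fit(x_train, y_train)
clf.score(x_test, y_test)
print('sklearn K-NN:', time.perf_counter() - start, 'seconds')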
Well, there are of course many other ways to classify without using K-NN. For example, suppose you have linear data like this. You would probably fit a line first, and when a new red dot appears you could decide with some squared-error calculation. But now suppose you have non-linear data like this. If you force a line through it, it might end up like this, and it won't help much. In that case K-NN can tell you the answer right away.
It was tough going, but thank you for staying with me until the end! I look forward to seeing you again!