This article is a memo I took while studying the videos of a great programmer, sentdex. https://www.youtube.com/user/sentdex
First of all, what is K-NN? According to Wikipedia: https://ja.wikipedia.org/wiki/K%E8%BF%91%E5%82%8D%E6%B3%95
The k-nearest neighbor method is among the simplest of all machine learning algorithms. The classification of an object is determined by a vote of its neighbors (that is, the object is assigned the most common class among its k nearest neighbors). k is a positive integer, generally small. If k = 1, the object is simply assigned the class of its single nearest neighbor. In binary classification, choosing an odd k avoids the problem of tied votes.
Yeah, a very democratic approach. First, look at the graph below. There are 6 data points. What if you want to group them?
Maybe it will look like this. But why? Because they are close to each other. What we are doing here is called "clustering".
All right. This time there are already two groups, + and -.
Now a red dot appears. Which group will this red dot belong to? You will probably say the + group. But why? What went through your head when you decided that?
The next red dot will probably end up in the - group. Again, why?
And the last red dot: if it is here, which group does it belong to? That depends on the algorithm you choose.
Let's get back to the story. Probably (at least for me) this red dot looks closer to the + group than to the - group, so I decide the red dot belongs to the + group. We call this "idea" K-nearest neighbors (K-NN); the Japanese name would be the k-nearest neighbor method (k近傍法). All K-NN has to do is measure the distance between the point we want to classify and the other points, and decide which group it joins. In the example above there are only 2 groups, but what if there are 100 points? What if there are 1,000? Yeah, I'm going to lose my mind ...
But think again. Do I really have to measure the distance to every single point? In most cases, I don't think so. That is where the "K" in K-NN comes in. For now, let's set K = 2.
Yeah, I see. The two points circled in orange are probably the closest to the red dot. Therefore, I judge that this red dot belongs to the + group.
This time a black dot appears. With K = 2, which are the two closest points? Yeah, probably these two. But now I have a problem: it's a draw! I can't decide. This is why K-NN basically doesn't allow k to be even.
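Just to illustrate the draw problem with a tiny, made-up example: with an even k and one vote for each group, a simple majority count cannot break the tie in any meaningful way.

from collections import Counter

# Hypothetical votes from the k = 2 nearest neighbors: one '+' and one '-'.
votes = ['+', '-']

# most_common(1) still returns an answer, but only because ties fall back to
# insertion order. The tie itself is never resolved, hence the odd k.
print(Counter(votes).most_common(1))   # [('+', 1)]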
So let's set K = 3. With K = 3 we need one more point. Maybe this one is closer? So this black dot belongs to the - group.
So, based on that vote, we have two - votes and one + vote. 2/3 ≈ 66%, and this 66% is the confidence of the decision. The distance measured between the points is called the Euclidean distance.
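Here is a minimal sketch of that whole decision done by hand. The coordinates below are made up for illustration; they are not the points in the figures above.

from math import sqrt
from collections import Counter

# Hypothetical labeled points: a '+' group and a '-' group.
points = {'+': [(5, 4), (6, 5), (7, 8)],
          '-': [(1, 2), (2, 4), (3, 5)]}
new_point = (4, 4)   # the black dot we want to classify
k = 3

# Distance from the new point to every labeled point.
distances = []
for label, group in points.items():
    for (x, y) in group:
        d = sqrt((x - new_point[0])**2 + (y - new_point[1])**2)
        distances.append((d, label))

# Vote with the k nearest points and compute the confidence.
votes = [label for _, label in sorted(distances)[:k]]
winner, count = Counter(votes).most_common(1)[0]
print(winner, count / k)   # '-' 0.666..., i.e. the - group with about 66% confidence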
Let's decide the flow first.
Let's implement it using sklearn.
First of all, we need to prepare a data set. This is the one I will use this time: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29 (breast cancer data).
I think the downloaded one will look like this.
Open breast-cancer-wisconsin.names for a description of the data. The meaning of each value is explained in section 7; from it, attributes 1 to 10 are the features and attribute 11 is the label. The actual data is in breast-cancer-wisconsin.data, but the important column names are not included in it. You could add them programmatically; for now, I enter them manually ...
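If you would rather add the column names programmatically instead of editing the file, something like the sketch below should work. The names are my reading of the attribute list in breast-cancer-wisconsin.names, so double-check them against your copy.

import pandas as pd

# Attribute names based on section 7 of breast-cancer-wisconsin.names.
columns = ['id', 'clump_thickness', 'unif_cell_size', 'unif_cell_shape',
           'marg_adhesion', 'single_epith_cell_size', 'bare_nuclei',
           'bland_chrom', 'norm_nucleoli', 'mitoses', 'class']

df = pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data',
                 names=columns)
print(df.head())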
import numpy as np
from sklearn import preprocessing,model_selection,neighbors
import pandas as pd
The id column doesn't seem to have anything to do with the classification, so drop it. The breast-cancer-wisconsin.names file says that missing values appear as "?". Since NaN values are basically a no-go for the algorithm, we replace them with -99999 here.
df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
Simply put, X is the data without the labels and y is the labels only. So X gets the DataFrame with the "class" column dropped, and y gets only the "class" column.
#features
x=np.array(df.drop(['class'],1))
#label
y=np.array(df['class'])
Split the data into 80% training data and 20% test data.
x_train,x_test,y_train,y_test=model_selection.train_test_split(x,y,test_size=0.2)
Since we are using K-NN, the classifier is KNeighborsClassifier (its default number of neighbors is 5).
clf=neighbors.KNeighborsClassifier()
Train
clf.fit(x_train,y_train)
Finally, check the accuracy on the test set. Yeah, it's about 96%, which is pretty good.
accuracy=clf.score(x_test,y_test)
print(accuracy)
>0.9642857142857143
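As a quick sanity check, you can also ask the trained classifier to predict a brand-new sample. The 9 feature values below are hypothetical, purely for illustration, not real patient data.

example = np.array([[4, 2, 1, 1, 1, 2, 3, 2, 1]])   # 9 made-up feature values
print(clf.predict(example))   # prints [2] (benign) or [4] (malignant)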
import numpy as np
from sklearn import preprocessing,model_selection,neighbors
import pandas as pd
df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
#features
x=np.array(df.drop(['class'],1))
#label
y=np.array(df['class'])
x_train,x_test,y_train,y_test=model_selection.train_test_split(x,y,test_size=0.2)
clf=neighbors.KNeighborsClassifier()
clf.fit(x_train,y_train)
accuracy=clf.score(x_test,y_test)
print(accuracy)
From here on, I will write K-NN myself without relying on sklearn. Let's look at the idea again first. What actually decides which group the black dot joins is distance, and this distance is called the "Euclidean distance".
The formula is as follows:

$$ d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} $$

where $i$ runs over the $n$ dimensions and $q$ and $p$ are the two points.
For example, with q = (1, 3) and p = (2, 5): $\sqrt{(1-2)^2 + (3-5)^2} = \sqrt{5} \approx 2.236$.
Let's write this formula in python first.
from math import sqrt
plot1=[1,3]
plot2=[2,5]
euclidean_distance=sqrt( (plot1[0]-plot2[0])**2 + (plot1[1]-plot2[1])**2 )
print(euclidean_distance)
Before writing the function, let's think through the idea. The code below first plots the data set and the new point. As for the function's arguments: data is the training data, predict is the point we want to classify (the test data), and k is how many nearest points take part in the vote (I think sklearn's default is k = 5 ...). At the end it performs some magic and produces the result!
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
style.use('fivethirtyeight')
data_set={'k':[[1,2],[2,3],[3,1]]
,'r':[[6,5],[7,7],[8,6]]}
new_features=[5,7]
[[plt.scatter(ii[0],ii[1],s=100,color=i) for ii in data_set[i]] for i in data_set]
plt.scatter(new_features[0],new_features[1])
def k_nearest_neighbors(data,predict,k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    #? make some magic
    return vote_result
What do we actually want this function to do? We need to compare the point we want to predict with every other data point and find the closest ones. Having to compare the predict point against all points is the weak spot of K-NN. Admittedly, you could save computation time by setting a radius and only looking at the points inside it.
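As a rough sketch of that radius idea (the max_radius parameter below is hypothetical; it is not part of the function we are building in this article):

import numpy as np

def neighbors_within_radius(data, predict, max_radius):
    # Keep only points whose distance to `predict` is at most `max_radius`.
    # (A real speed-up would also need a spatial index such as a k-d tree;
    # this sketch still scans every point.)
    kept = []
    for group in data:
        for features in data[group]:
            d = np.linalg.norm(np.array(features) - np.array(predict))
            if d <= max_radius:
                kept.append((d, group))
    return kept

# e.g. neighbors_within_radius(data_set, new_features, max_radius=4.0)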
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
style.use('fivethirtyeight')
data_set={'k':[[1,2],[2,3],[3,1]]
,'r':[[6,5],[7,7],[8,6]]}
new_features=[5,7]
[[plt.scatter(ii[0],ii[1],s=100,color=i) for ii in data_set[i]] for i in data_set]
plt.scatter(new_features[0],new_features[1])
def k_nearest_neighbors(data,predict,k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        # data_set={'k':[[1,2],[2,3],[3,1]],'r':[[6,5],[7,7],[8,6]]}
        # iterating over the group keys 'k' and 'r'
        for features in data[group]:
            # iterating over each point inside the group
            # not fast and only works for 2 dimensions:
            # euclidean_distance=sqrt( (features[0]-predict[0])**2 + (features[1]-predict[1])**2 )
            # better:
            # euclidean_distance=np.sqrt( np.sum((np.array(features)-np.array(predict))**2) )
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    print(Counter(votes).most_common(1))
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
result=k_nearest_neighbors(data_set,new_features,k=3)
print(result)
[[plt.scatter(ii[0],ii[1],s=100,color=i) for ii in data_set[i]] for i in data_set]
plt.scatter(new_features[0],new_features[1],color=result)
Yeah, it's the red group!
Now let's use the real data and compare with scikit-learn to see which is more accurate.
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
def k_nearest_neighbors(data,predict,k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        # data_set={'k':[[1,2],[2,3],[3,1]],'r':[[6,5],[7,7],[8,6]]}
        # iterating over the group keys (the classes)
        for features in data[group]:
            # iterating over each point inside the group
            # not fast and only works for 2 dimensions:
            # euclidean_distance=sqrt( (features[0]-predict[0])**2 + (features[1]-predict[1])**2 )
            # better:
            # euclidean_distance=np.sqrt( np.sum((np.array(features)-np.array(predict))**2) )
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    #print(Counter(votes).most_common(1))  # debug print; noisy when run over the whole test set
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
full_data=df.astype(float).values.tolist()
#random sort your data
random.shuffle(full_data)
test_size=0.2
train_set={2:[],4:[]}
test_set={2:[],4:[]}
train_data=full_data[:-int(test_size*len(full_data))]#first 80%
test_data=full_data[-int(test_size*len(full_data)):] #last 20%
for i in train_data:
    # train_set[i[-1]] picks the list for this row's class (2 or 4)
    # .append(i[:-1]) adds all the features (everything except the class)
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    # loop over the classes 2 and 4
    for data in test_set[group]:
        # take each data point from that class
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy:', correct/total)
>Accuracy: 0.9640287769784173
sklearn Function
The code is the same as before.
import numpy as np
from sklearn import preprocessing,model_selection,neighbors
import pandas as pd
df=pd.read_csv('Data/Breast_Cancer_Wisconsin/breast-cancer-wisconsin.data')
df.replace('?',-99999,inplace=True)
df.drop(['id'],1,inplace=True)
#features
x=np.array(df.drop(['class'],1))
#label
y=np.array(df['class'])
x_train,x_test,y_train,y_test=model_selection.train_test_split(x,y,test_size=0.4)
clf=neighbors.KNeighborsClassifier()
clf.fit(x_train,y_train)
accuracy=clf.score(x_test,y_test)
print('Accuracies:',accuracy)
>Accuracies: 0.9671428571428571
… The accuracy hardly changes, but the computation is clearly faster with sklearn.
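If you want to check the speed difference yourself, a rough timing sketch could look like this. It assumes the variables from both scripts above are still defined (train_set/test_set from the hand-written version, x_train/x_test/y_train/y_test from the sklearn version).

import time

# Time the hand-written K-NN over the whole test set.
start = time.perf_counter()
for group in test_set:
    for data in test_set[group]:
        k_nearest_neighbors(train_set, data, k=5)
print('custom K-NN :', time.perf_counter() - start, 'seconds')

# Time sklearn's KNeighborsClassifier on the same task.
start = time.perf_counter()
clf = neighbors.KNeighborsClassifier()
clf.fit(x_train, y_train)
clf.score(x_test, y_test)
print('sklearn K-NN:', time.perf_counter() - start, 'seconds')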
Well, there are of course many other ways to classify without using K-NN. For example, suppose you have linear data like this. You would probably fit a line first, and when a new red dot appears you could decide with some squared-error calculation. But now suppose you have non-linear data like this. If you force a line through it, it might end up like this, and it won't help much. In that case K-NN can tell you the answer right away.
It was tough going, but thank you for staying with me until the end! I look forward to seeing you again!