(Added on 2020/02/25) TODO: The k-NN weights below are computed from the sum of the distances themselves, not from the reciprocals of the distances. This will be corrected to use the reciprocal of the distance. (The calculation KNeighborsClassifier itself performs is not wrong; the calculation in my own weight function is.)
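For reference, a minimal sketch of the intended fix (my own illustration, not the article's code: `dist` is the neighbor-distance array sklearn passes to a `weights` callable, and `class_weights` stands for the per-neighbor class weights built later in this article):

```python
import numpy as np

def reciprocal_weights(dist, class_weights):
    # Weight each neighbor by the reciprocal of its distance, scaled by
    # its class weight; the small epsilon avoids division by zero.
    return class_weights / (dist + 1e-9)
```

Since sklearn calls the `weights` callable with the distance array only, `class_weights` would have to be bound beforehand (e.g. via a closure), much as the `tmp` function below does with a global.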
- With sklearn's KNeighborsClassifier, it was possible to put a heavier weight on the minority (smaller-sample) side of imbalanced data.
- Result: the recall of the minority class was raised.
- Before: confusion matrix

```
[[2641   67]
 [ 167  125]]
```

- After: confusion matrix

```
[[2252  456]
 [  80  212]]
```
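Reading the class 1 (minority) recall off these matrices confirms the improvement: before, 125 / (167 + 125) ≈ 0.43; after, 212 / (80 + 212) ≈ 0.73.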
(Figure: conceptual diagram of the result after weighting)
- The behavior of the `weights` argument of `sklearn.neighbors.KNeighborsClassifier` was unclear, so I checked it.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
%matplotlib inline

from sklearn.datasets import make_classification

# Generate an imbalanced two-class dataset (9:1)
data_base = make_classification(
    n_samples=10000, n_features=2, n_informative=2, n_redundant=0,
    n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=[0.9, 0.1],
    flip_y=0, class_sep=0.5, hypercube=True, shift=0.0,
    scale=1.0, shuffle=True, random_state=5)

df = pd.DataFrame(data_base[0], columns=['f1', 'f2'])
df['class'] = data_base[1]

# Scatter plot, one color per class
fig = plt.figure()
ax = fig.add_subplot()
for _, cls in df.groupby('class'):
    ax.plot(cls['f1'], cls['f2'], 'o', ms=2)
plt.show()
```
```python
X = df[["f1", "f2"]]
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
print("train", X_train.shape, y_train.shape)
print("test", X_test.shape, y_test.shape)
```
```
train (7000, 2) (7000,)
test (3000, 2) (3000,)
```
```python
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
result = confusion_matrix(y_test, pred)
result2 = accuracy_score(y_test, pred)
print("confusion matrix \n", result)
print("accuracy \n", result2)
```
```
confusion matrix 
 [[2641   67]
 [ 167  125]]
accuracy 
 0.922
```
- First, compute each class's weight as the reciprocal of its share of the sample size.
```python
size_and_weight = pd.DataFrame({
    'class0': [sum(clf._y == 0), 1 / (sum(clf._y == 0) / len(clf._y))],
    'class1': [sum(clf._y == 1), 1 / (sum(clf._y == 1) / len(clf._y))]}).T
size_and_weight.columns = ['sample_size', 'weight']
size_and_weight
```
|        | sample_size | weight   |
|--------|-------------|----------|
| class0 | 6292.0      | 1.112524 |
| class1 | 708.0       | 9.887006 |
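As a side note (my addition, not in the original article), sklearn's `compute_class_weight` helper yields the same ratios up to a constant factor of `n_classes` (here, 2):

```python
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights are n_samples / (n_classes * bincount(y)), i.e. the
# weights above divided by 2 in this two-class problem.
compute_class_weight('balanced', classes=np.array([0, 1]), y=y_train)
```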
- Fit on the training data, then compute the neighbor distances for the test data.
```python
# Map each training label to its class weight via a categorical
weights_array = pd.Categorical(clf._y).rename_categories(
    [size_and_weight.loc['class0', 'weight'],
     size_and_weight.loc['class1', 'weight']])

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
neigh_dist, neigh_ind = clf.kneighbors(X_test)  # these data frames are described later

# Pick up, for each test sample, the class weight of each of its 5 neighbors
weights_array = np.array(weights_array).reshape((-1, 1))[neigh_ind, 0]
pd.DataFrame(weights_array).head()
```
|   | 0        | 1        | 2        | 3        | 4        |
|---|----------|----------|----------|----------|----------|
| 0 | 1.112524 | 1.112524 | 1.112524 | 1.112524 | 1.112524 |
| 1 | 1.112524 | 1.112524 | 1.112524 | 1.112524 | 1.112524 |
| 2 | 1.112524 | 9.887006 | 1.112524 | 1.112524 | 1.112524 |
| 3 | 1.112524 | 1.112524 | 1.112524 | 1.112524 | 1.112524 |
| 4 | 1.112524 | 1.112524 | 1.112524 | 1.112524 | 1.112524 |
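A quick sanity check (my addition; it relies on exact float equality, which holds here because both arrays are built from the same `size_and_weight` values):

```python
# Each cell should hold the weight of that neighbor's class
expected = np.where(clf._y[neigh_ind] == 1,
                    size_and_weight.loc['class1', 'weight'],
                    size_and_weight.loc['class0', 'weight'])
print((weights_array == expected).all())  # expect True
```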
- ↑ The weights for handling the imbalanced data are now ready.
- Pass a function to the `weights` argument so the weights are taken into account, and run through to prediction.
```python
def tmp(array_):
    # sklearn calls this with the neighbor-distance array; we multiply the
    # distances by the per-neighbor class weights computed above.
    # (See the TODO at the top: the reciprocal of the distance should be
    # used here instead of the distance itself.)
    global weights_array
    array_ = array_ * weights_array
    return array_

clf = KNeighborsClassifier(n_neighbors=5, weights=tmp)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
result = confusion_matrix(y_test, pred)
result2 = accuracy_score(y_test, pred)
print("confusion matrix \n", result)
print("accuracy \n", result2)
```
```
confusion matrix 
 [[2252  456]
 [  80  212]]
accuracy 
 0.8213333333333334
```
- KNeighborsClassifier classifies a test sample into the class with the largest total of the values returned by the weights function. (It is a slightly counterintuitive algorithm in that the class with the larger distance total wins, but since the distance calculation only covers the n closest points, the estimate without weights is closer to a majority vote than to a sum of distances.)
- By replacing that sum with the element-wise product of the distances and `weights_array`, we customized the method to deal with imbalanced data.

To do so, it is necessary to understand the correspondence between the following data frames.
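As a hedged sketch (based on my reading of sklearn's source; `clf._y` is a private attribute, not public API), the weighted vote that `predict` performs can be reproduced by hand:

```python
# Reproduce the weighted vote manually
neigh_dist, neigh_ind = clf.kneighbors(X_test)
w = tmp(neigh_dist)                    # the values our weights callable returns
neigh_cls = clf._y[neigh_ind]          # class of each of the 5 neighbors
score0 = np.where(neigh_cls == 0, w, 0).sum(axis=1)  # total weight voting 0
score1 = np.where(neigh_cls == 1, w, 0).sum(axis=1)  # total weight voting 1
manual_pred = (score1 > score0).astype(int)
print((manual_pred == clf.predict(X_test)).all())    # expect True
```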
```python
neigh_dist, neigh_ind = clf.kneighbors(X_test)
pd.DataFrame(neigh_dist).tail(5)
pd.DataFrame(neigh_ind).tail(5)
pd.DataFrame(clf._y.reshape((-1, 1))[neigh_ind, 0]).tail(5)
```
↑ In the order shown, these are `neigh_dist`, `neigh_ind`, and the classes of the `neigh_ind` entries.
- Observation 1: the three tables above all have the same shape.
- Observation 2: [number of rows of the three tables above] = [number of rows of test data].
- Observation 3: number of columns = [n_neighbors = 5].
- Observation 4: within each row, `neigh_dist` increases to the right. In other words, the five points closest to each test sample were extracted, sorted by distance.
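These observations can be verified directly (my addition):

```python
print(neigh_dist.shape, neigh_ind.shape)         # both (3000, 5)
print((np.diff(neigh_dist, axis=1) >= 0).all())  # True: sorted ascending per row
```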
- DataFrame: reproduce the following value of `neigh_dist`:
  - index = 2998 (the 2998th test sample)
  - value = 0.015318 (the distance between [row 1374 of X_train, judged to be the nearest] and [the test sample at the index above])
```python
test_index = 2998
tmp1 = pd.DataFrame(X_test.iloc[test_index])
display(tmp1.T)

train_index = 1374
tmp2 = pd.DataFrame(X_train.iloc[train_index])
display(tmp2.T)

# Compute the Euclidean distance by hand
(
    sum((tmp1.values - tmp2.values) ** 2)
    ** (1 / 2)
)
```
```
array([0.01531811])
```
- About `neigh_dist[2998, 0]`: it could be reproduced from the training data and the test data.
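The hand-computed value can be cross-checked against what `kneighbors` returned (my addition; `np.linalg.norm` is simply another way to compute the Euclidean distance):

```python
print(neigh_dist[test_index, 0])                  # expect 0.01531811...
print(np.linalg.norm(tmp1.values - tmp2.values))  # same distance via numpy
```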
```python
sum(clf._y == y_train) == len(y_train)
```

```
True
```
- It turns out that `y_train` and `clf._y` match.
```python
index_ = neigh_ind[2998, :]
pd.DataFrame(clf._y[index_]).T
```
--About "neigh_ind class" line 2998: I was able to create a "neigh_ind class" from night_ind and y_train.
- The chapter [Details: about the customization argument] only explains the source of the `KNeighborsClassifier.predict` part, so it may be faster to read the source code on GitHub directly. Reference: sklearn on GitHub