Machine learning with imbalanced data: k-NN in sklearn

(Added 2020/02/25) TODO: In the custom weight function below, the k-NN weights are computed from the distances themselves rather than from the reciprocals of the distances. This will be corrected to use the reciprocal of the distance. (The calculation inside KNeighborsClassifier itself is not wrong; only the calculation in my own function is.)
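For reference, a minimal sketch of the planned fix, mirroring the custom tmp function defined later in this post (the eps guard against division by zero is my own assumption, not part of the original code):

def reciprocal_tmp(array_):
    # Weight each neighbor by the reciprocal of its distance
    # instead of by the distance itself, then apply the class weights.
    global weights_array
    eps = 1e-10  # assumption: avoid division by zero for exact matches
    return (1.0 / (array_ + eps)) * weights_array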

Conclusion

--With sklearn's KNeighborsClassifier, we were able to put a heavier weight on the minority class, the side of the imbalanced data with fewer samples.

--Result: we were able to raise the recall on the minority-class side, from 125/292 ≈ 0.43 to 212/292 ≈ 0.73, at the cost of some overall accuracy.

before: confusion matrix
[[2641   67]
 [ 167  125]]

after: confusion matrix
[[2252  456]
 [  80  212]]


Image (before weighting): スクリーンショット 2020-02-24 12.39.05.png

Image (after weighting): スクリーンショット 2020-02-24 12.38.55.png

Background / Issues

--The behavior of the weights argument of sklearn.neighbors.KNeighborsClassifier was unclear, so I checked it.

Method

Without any handling for the imbalanced data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

%matplotlib inline
from sklearn.datasets import make_classification

# Generate a 2-feature, 2-class dataset with a 9:1 class imbalance
data_base = make_classification(
    n_samples=10000, n_features=2, n_informative=2, n_redundant=0,
    n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=[0.9, 0.1],
    flip_y=0, class_sep=0.5, hypercube=True, shift=0.0,
    scale=1.0, shuffle=True, random_state=5)

df = pd.DataFrame(data_base[0], columns=['f1', 'f2'])
df['class'] = data_base[1]

# Scatter plot, one color per class
fig = plt.figure()
ax = fig.add_subplot()
for label, cls in df.groupby('class'):
    ax.plot(cls['f1'], cls['f2'], 'o', ms=2)

plt.show()

Image: scatter plot of the two classes (image.png)

X = df[["f1","f2"]]
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

print("train", X_train.shape, y_train.shape)
print("test", X_test.shape, y_test.shape)

train (7000, 2) (7000,)
test (3000, 2) (3000,)


from sklearn.neighbors import KNeighborsClassifier

# Baseline: plain 5-NN with the default uniform weights
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)

result = confusion_matrix(y_test, pred)
result2 = accuracy_score(y_test, pred)

print("confusion matrix \n", result)
print("accuracy \n", result2)

confusion matrix 
 [[2641   67]
 [ 167  125]]
accuracy 
 0.922
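Since the point of this post is the minority-class recall, it is worth computing it explicitly (recall_score is my addition; it does not appear in the original snippet):

from sklearn.metrics import recall_score

# Recall on the minority class (class 1): 125 / (167 + 125)
print("recall (class 1)", recall_score(y_test, pred))  # ≈ 0.428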

With handling for the imbalanced data

--First, compute each class's weight as the reciprocal of its share of the sample size.

size_and_weight = pd.DataFrame({
    'class0': [sum(clf._y == 0), 1 / (sum(clf._y == 0) / len(clf._y))],
    'class1': [sum(clf._y == 1), 1 / (sum(clf._y == 1) / len(clf._y))]}).T
size_and_weight.columns = ['sample_size', 'weight']
size_and_weight

        sample_size    weight
class0       6292.0  1.112524
class1        708.0  9.887006
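As a cross-check, sklearn's compute_class_weight with the 'balanced' option implements essentially the same idea; note that it also divides by the number of classes, so the values come out at half of ours, but the ratio between the classes (which is all the vote depends on) is identical:

from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weight = n_samples / (n_classes * class_count)
compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
# -> approximately array([0.5563, 4.9435])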

--Fit the model on the train data, then compute the neighbor distances for the test data.



# Map each training label to its class weight (0 -> 1.112524, 1 -> 9.887006)
weights_array = pd.Categorical(clf._y).rename_categories(
    [size_and_weight.loc['class0', 'weight'],
     size_and_weight.loc['class1', 'weight']])

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
neigh_dist, neigh_ind = clf.kneighbors(X_test)  # the contents of these arrays are examined later

# Reorder the per-sample weights to match the neighbor indices of each test point
weights_array = np.array(weights_array).reshape((-1, 1))[neigh_ind, 0]
pd.DataFrame(weights_array).head()
          0         1         2         3         4
0  1.112524  1.112524  1.112524  1.112524  1.112524
1  1.112524  1.112524  1.112524  1.112524  1.112524
2  1.112524  9.887006  1.112524  1.112524  1.112524
3  1.112524  1.112524  1.112524  1.112524  1.112524
4  1.112524  1.112524  1.112524  1.112524  1.112524

--↑ The weights needed to handle the imbalanced data are now ready.

--Pass the weights in through the weights argument and run everything through to prediction.

def tmp(array_):
    # Custom weights callable: KNeighborsClassifier passes in the
    # neighbor distances, and we multiply them elementwise by the
    # class weights prepared above.
    global weights_array
    array_ = array_ * weights_array
    return array_

clf = KNeighborsClassifier(n_neighbors=5, weights=tmp)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

result = confusion_matrix(y_test, pred)
result2 = accuracy_score(y_test, pred)

print("confusion matrix \n", result)
print("accuracy \n", result2)

confusion matrix 
 [[2252  456]
 [  80  212]]
accuracy 
 0.8213333333333334

Discussion

--With this customization, KNeighborsClassifier assigns each test sample to the class with the largest total of (weighted) distances among its neighbors. (It is a slightly counterintuitive algorithm in that the estimate goes to the "farther" class, but since the distance calculation only covers the n_neighbors closest points, the prediction without weights behaves much more like a majority vote than like a true sum of distances.)
--By replacing that sum with the elementwise product of the distances and weights_array, we customized the method to cope with the imbalanced data.
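To make the vote concrete, here is a minimal reconstruction of what predict does with our callable (my own sketch, not sklearn's actual implementation, which uses a weighted mode internally); it reuses neigh_dist, neigh_ind, and weights_array from above:

def manual_predict(neigh_dist, neigh_ind, train_y, weights_array):
    # Our tmp() callable turns each distance into distance * class weight;
    # predict then sums these per class over the k neighbors and picks
    # the class with the largest total.
    w = neigh_dist * weights_array
    neigh_classes = train_y[neigh_ind]  # class of each neighbor
    preds = np.empty(len(neigh_ind), dtype=train_y.dtype)
    for i in range(len(neigh_ind)):
        score0 = w[i][neigh_classes[i] == 0].sum()
        score1 = w[i][neigh_classes[i] == 1].sum()
        preds[i] = 0 if score0 > score1 else 1
    return preds

# Should agree with clf.predict(X_test) above
manual_pred = manual_predict(neigh_dist, neigh_ind, clf._y, weights_array)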

Details: about the customized weights argument

Replacing the sum with the elementwise product of the distances and weights_array

To do this, we need to understand how the following data frames correspond to one another.


Understanding the correspondence between neigh_dist, neigh_ind, and y_train (clf._y) from the tables

neigh_dist, neigh_ind = clf.kneighbors(X_test)
display(pd.DataFrame(neigh_dist).tail(5))
display(pd.DataFrame(neigh_ind).tail(5))
display(pd.DataFrame(clf._y.reshape((-1, 1))[neigh_ind, 0]).tail(5))

Image: the three tables side by side (image.png)

From left to right, the tables above are neigh_dist, neigh_ind, and the class of each neighbor ("neigh_ind class").

--Observation 1: the three tables above all have the same shape.
--Observation 2: their number of rows equals the number of test samples.
--Observation 3: their number of columns equals n_neighbors = 5.
--Observation 4: within each row, neigh_dist increases from left to right; in other words, the five points closest to each test sample were extracted, sorted by distance. (Verified below.)
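These observations are easy to verify directly on the arrays themselves:

# Observations 1-3: both arrays are (n_test_samples, n_neighbors)
print(neigh_dist.shape, neigh_ind.shape)  # (3000, 5) (3000, 5)

# Observation 4: every row of neigh_dist is sorted in ascending order
print(np.all(np.diff(neigh_dist, axis=1) >= 0))  # True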

Understanding the correspondence between neigh_dist, X_train, and X_test by recomputing neigh_dist

--Let's recompute the following entry of neigh_dist:
--index = 2998 # the 2998th test sample
--values = 0.015318 # the distance between [the 1374th sample of X_train, judged to be the nearest] and [the test sample at the index above]

test_index = 2998
tmp1 = pd.DataFrame(X_test.iloc[test_index])    # the test sample
display(tmp1.T)

train_index = 1374
tmp2 = pd.DataFrame(X_train.iloc[train_index])  # its nearest training sample
display(tmp2.T)

Image: the feature values of the two samples (image.png)

# Compute the Euclidean distance by hand
sum((tmp1.values - tmp2.values) ** 2) ** (1 / 2)

array([0.01531811])

--About neigh_dist[2998, 0]: we could reproduce it from the training data and the test data, as checked below.
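As a final cross-check, the hand-computed value can be compared against the entry returned by kneighbors:

# The manual distance matches neigh_dist[2998, 0]
d_manual = sum((tmp1.values - tmp2.values) ** 2)[0] ** (1 / 2)
print(np.isclose(d_manual, neigh_dist[test_index, 0]))  # True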

Correspondence between y_train and clf._y

sum(clf._y == y_train) == len(y_train)

True

--This shows that y_train and clf._y match exactly.

Correspondence between neigh_ind and "neigh_ind class"

index_ = neigh_ind[2998, :]      # indices of the 5 nearest training samples
pd.DataFrame(clf._y[index_]).T   # their classes

Image: row 2998 of "neigh_ind class" (image.png)

--About "neigh_ind class" line 2998: I was able to create a "neigh_ind class" from night_ind and y_train.

Finally

--The chapter [Details: about the customized weights argument] only explains the KNeighborsClassifier.predict part of the source, so it may be faster to read the source code directly. Reference: sklearn on GitHub.
