Please note: this article is written by a beginner in machine learning.
The previous article is here. The next article is here.
I've linked the articles together on my own. For more background (though it isn't very detailed), please refer to the previous article (https://qiita.com/sorax/items/8663906fae41798a00b8). The one-line summary is: "I tried using kernel density estimation as a classifier for supervised learning!".
I reworked the script from the previous article into object-oriented form. I call it the "Gaussian kernel-density estimate classifier", or "GKDE Classifier" for short. The name is just my own invention.
↓ Script ↓
import numpy as np
from scipy.stats import gaussian_kde

class GKDEClassifier(object):

    def __init__(self, bw_method=None, weights=None):
        # Kernel bandwidth (passed through to gaussian_kde)
        self.bw_method = bw_method
        # Kernel weights (passed through to gaussian_kde)
        self.weights = weights

    def fit(self, X, y):
        # Number of classes in y
        self.y_num = len(np.unique(y))
        # List holding the estimated probability density functions
        self.kernel_ = []
        # Estimate and store one density function per class
        for i in range(self.y_num):
            kernel = gaussian_kde(X[y == i].T,
                                  bw_method=self.bw_method,
                                  weights=self.weights)
            self.kernel_.append(kernel)
        return self

    def predict(self, X):
        # List to store the predicted labels
        pred = []
        # Per-class probability densities of the test data
        self.p_ = []
        # Evaluate each class's density at the test points
        for i in range(self.y_num):
            self.p_.append(self.kernel_[i].evaluate(X.T).tolist())
        # Convert to ndarray of shape (n_classes, n_samples)
        self.p_ = np.array(self.p_)
        # Assign each sample the label with the highest density
        for j in range(self.p_.shape[1]):
            pred.append(np.argmax(self.p_.T[j]))
        return pred
Note that the class labels must be consecutive non-negative integers starting from 0 (0, 1, 2, ...). If yours aren't, scikit-learn's LabelEncoder can convert them; see the sketch below.
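For example, LabelEncoder maps arbitrary labels to consecutive integers starting at 0 (the string labels here are made up just for illustration):
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, for illustration only
y_raw = ["cat", "dog", "cat", "bird"]

le = LabelEncoder()
y_encoded = le.fit_transform(y_raw)
print(y_encoded)  # [1 2 1 0] -- classes are sorted: bird=0, cat=1, dog=2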
(Added on 2020/8/5: Part 3 has released the modified code)
__init__ initializes the object. Here you specify the parameters needed for the kernel density estimation, that is, the arguments passed on to SciPy's gaussian_kde. This time I kept them at gaussian_kde's own default values.
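For reference, a minimal sketch of the bandwidth options gaussian_kde accepts, as far as I know (None for Scott's rule, "scott", "silverman", a fixed scalar, or a callable); the toy data is made up:
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=200)  # toy one-dimensional sample

kde_default = gaussian_kde(data)                           # None -> Scott's rule
kde_silverman = gaussian_kde(data, bw_method="silverman")  # Silverman's rule
kde_fixed = gaussian_kde(data, bw_method=0.1)              # fixed scalar factor
print(kde_default.factor, kde_silverman.factor, kde_fixed.factor)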
fit trains on the training data. A kernel density is estimated with gaussian_kde for each class, and the estimated density functions are stored in order: label 0 first, then label 1, and so on.
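As a quick sanity check (toy data made up for illustration), after fit the kernel_ list should hold one estimated density function per class:
import numpy as np

X_toy = np.random.randn(60, 2)                    # 60 samples, 2 features
y_toy = np.array([0] * 20 + [1] * 20 + [2] * 20)  # 3 classes

clf = GKDEClassifier().fit(X_toy, y_toy)
print(len(clf.kernel_))  # 3: one gaussian_kde object per class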
predict classifies the test data.
for i in range(self.y_num):
    self.p_.append(self.kernel_[i].evaluate(X.T).tolist())
Here the estimated density functions are taken out of kernel_ one by one, and each is evaluated on the test data to get its probability density.
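To make the shapes concrete: gaussian_kde expects data of shape (n_features, n_samples), which is why X is transposed, and evaluate returns one density value per sample. A minimal sketch with made-up data:
import numpy as np
from scipy.stats import gaussian_kde

X_train_toy = np.random.randn(50, 2)  # 50 samples, 2 features
kernel = gaussian_kde(X_train_toy.T)  # transpose to (n_features, n_samples)

X_test_toy = np.random.randn(5, 2)
densities = kernel.evaluate(X_test_toy.T)
print(densities.shape)  # (5,): one probability density per test sample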
The script after this point is a mess. I wanted to write it more concisely, but it didn't behave the way I expected... I'm a beginner at coding. It works, though. As long as it works, that's fine.
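For what it's worth, predict could probably be condensed to something like the following (my untested sketch; the subclass name is made up). np.argmax along the class axis picks the highest-density label for all samples at once:
import numpy as np

class GKDEClassifierConcise(GKDEClassifier):
    def predict(self, X):
        # Stack per-class densities into shape (n_classes, n_samples)
        self.p_ = np.array([k.evaluate(X.T) for k in self.kernel_])
        # For each column (sample), take the class with the highest density
        return np.argmax(self.p_, axis=0).tolist()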
And with that, the object-oriented Gaussian kernel density estimation classifier is complete.
The wine dataset has 13 features. We standardize them and then reduce them to 4 dimensions with PCA. Let's train and classify on the dimensionality-reduced data.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Data set loading
wine = datasets.load_wine()
X = wine.data
y = wine.target
# Data split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
# Standardization
sc = StandardScaler()
sc = sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
# Dimensionality reduction
pca = PCA(n_components=4)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
# Learning and prediction
f = GKDEClassifier()
f.fit(X_train_pca, y_train)
y_pred = f.predict(X_test_pca)
And the result is……?
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.9722222222222222
Hooray. There are 36 test samples, so this accuracy works out to 35/36 correct. Pretty good.
What happens if we skip the dimensionality reduction and train on all 13 standardized features?
# Learning and prediction
f = GKDEClassifier()
f.fit(X_train_std, y_train)
y_pred = f.predict(X_test_std)
print(accuracy_score(y_test, y_pred))
0.9722222222222222
Result: Same.
I made a circular dataset.
from sklearn.datasets import make_circles
from matplotlib import pyplot as plt
X, y = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.2)
plt.scatter(X[y==0, 0], X[y==0, 1], c="red", marker="^", alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], c="blue", marker="o", alpha=0.5)
plt.show()
The center cluster and the outer ring carry different labels. Can the classifier separate them correctly?
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
f = GKDEClassifier()
f.fit(X_train, y_train)
y_pred = f.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.9933333333333333
Conclusion: Great victory.
The classifier performs well, but I've been putting off something important: whether this classification method is academically sound. I'll take that up next time.
Continued to Part 3