Please note that this article was written by a machine learning beginner.
A working example is here. The background of the idea and a revised version are here.
Frankly, it is faster to just look at [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%AB%E3%83%BC%E3%83%8D%E3%83%AB%E5%AF%86%E5%BA%A6%E6%8E%A8%E5%AE%9A).
Imagine a simple histogram. Where the histogram is high, values are ***relatively likely to occur***, and where it is low, values are ***relatively unlikely to occur***. Have you heard a similar story somewhere?
This is the same idea as a probability density function. A histogram is, in a sense, an estimate of the ***true probability density function*** built from ***measured values***. ***Kernel density estimation*** uses kernel functions to produce a more continuous, smoother estimate.
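As a rough illustration (a minimal sketch with made-up sample data, not from this article), a histogram and a kernel density estimate of the same data can be compared like this:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)  # made-up measured values

# Histogram: a piecewise-constant estimate of the underlying density
hist, bin_edges = np.histogram(samples, bins=20, density=True)
print(hist)                # heights of the histogram bins

# Kernel density estimation: a smooth, continuous estimate of the same density
kde = gaussian_kde(samples)
grid = np.linspace(-4, 4, 9)
print(kde.evaluate(grid))  # smooth density values on a few grid points
```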
For supervised learning, see [Wikipedia](https://ja.wikipedia.org/wiki/%E6%95%99%E5%B8%AB%E3%81%82%E3%82%8A%E5%AD%A6%E7%BF%92) or read someone else's Qiita article.
A "teacher" in supervised learning is a set of "data" and "correct labels".
Consider a dataset whose correct labels are 0, 1, and 2. Split it into label-0 data, label-1 data, and label-2 data. If you perform kernel density estimation on the teacher data whose correct label is 0, you obtain an estimate of the probability density function for data belonging to label 0.
Estimate the probability density function for every label from the teacher data, compute each test point's probability density under each of them, and classify by whichever value is largest. That is what this article attempts.
Strictly speaking, we should also take into account the proportion of each label in the population ... I would like to cover that more rigorous story another time.
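For reference, a minimal sketch of what that correction would look like (all numbers below are made up): the estimated density for each label is multiplied by that label's share of the population before comparing.

```python
import numpy as np

# Made-up class-conditional densities p(x | label) for one test point
densities = np.array([0.12, 0.05, 0.30])
# Made-up class priors P(label), e.g. each label's share of the training data
priors = np.array([0.4, 0.35, 0.25])

# Bayes' rule up to a constant: P(label | x) is proportional to p(x | label) * P(label)
scores = densities * priors
print(int(np.argmax(scores)))  # label with the largest score
```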
What a wonderful world this is: kernel density estimation with a Gaussian kernel is already implemented in SciPy.
Here is a brief summary of how to use SciPy's gaussian_kde.
kernel = gaussian_kde(X, bw_method=None, weights=None)
- X: The dataset used for kernel density estimation.
- bw_method: The kernel bandwidth. Scott's rule (scotts_factor) is used if not specified.
- weights: Weights for the kernel density estimation. All points are weighted equally if not specified.
Feed new data into the estimated probability density function to calculate its probability density.
pd = kernel.evaluate(Z)
- Z: The data point(s) at which you want to evaluate the probability density.
The probability densities at the points in Z are returned as an array.
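Putting the two calls together, a minimal sketch with made-up one-dimensional data looks like this:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
X = rng.normal(size=500)        # made-up data for the estimate

kernel = gaussian_kde(X)        # bandwidth defaults to Scott's rule
Z = np.array([-1.0, 0.0, 1.0])  # new points to evaluate
pd = kernel.evaluate(Z)         # probability densities at the points in Z
print(pd)
```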
Let's try it with scikit-learn's iris dataset!
The flow is as follows: load the iris dataset → split into training and test data with train_test_split → standardize the training and test data → perform kernel density estimation for each label using the training data → calculate the probability density of the test data under each label's estimate → output the label with the largest value.
↓ Script ↓
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy.stats import gaussian_kde
# Loading iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Division of training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=1, stratify=y)
# Standardization
sc = StandardScaler()
sc = sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
# Kernel density estimation
kernel0 = gaussian_kde(X_train_std[y_train==0].T)
kernel1 = gaussian_kde(X_train_std[y_train==1].T)
kernel2 = gaussian_kde(X_train_std[y_train==2].T)
# Calculate the probability density of test data
p0s = kernel0.evaluate(X_test_std.T)
p1s = kernel1.evaluate(X_test_std.T)
p2s = kernel2.evaluate(X_test_std.T)
# Prediction label output
y_pred = []
for p0, p1, p2 in zip(p0s, p1s, p2s):
    if max(p0, p1, p2) == p0:
        y_pred.append(0)
    elif max(p0, p1, p2) == p1:
        y_pred.append(1)
    else:
        y_pred.append(2)
The test data is standardized using the mean and standard deviation of the training data. If the two sets were standardized separately, they would be scaled inconsistently and the comparison would be skewed.
If you feed the dataset to gaussian_kde as is, it treats each ***column vector as one sample***. In the iris dataset, however, each ***row vector is one sample***, so the data is transposed. The same applies when calculating the probability density of the test data.
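To illustrate with made-up data that has four features like iris, the transposition looks like this:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 4))    # one sample per row, as in the iris dataset

# gaussian_kde treats each column as one sample, so pass shape (n_features, n_samples)
kde = gaussian_kde(data.T)
print(kde.evaluate(data.T[:, :3]))  # densities of the first three samples
```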
y_pred = []
for p0, p1, p2 in zip(p0s, p1s, p2s):
    if max(p0, p1, p2) == p0:
        y_pred.append(0)
    elif max(p0, p1, p2) == p1:
        y_pred.append(1)
    else:
        y_pred.append(2)
The probability densities of the test data are stored in p0s, p1s, and p2s, one array per label. Taking one value from each per test point, we assign:
- 0 if the value for label 0 is the largest
- 1 if the value for label 1 is the largest
- 2 otherwise
The results are stored in the list y_pred in the same order as the test data.
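Incidentally, the same selection could probably be written more compactly with NumPy's argmax; a sketch reusing the p0s, p1s, and p2s from the script above:

```python
import numpy as np

# Stack the per-label densities into shape (3, n_test) and take the argmax per test point
y_pred = list(np.argmax(np.vstack([p0s, p1s, p2s]), axis=0))
```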
Let's check the accuracy of the predicted labels with scikit-learn's accuracy_score. Fingers crossed.
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
1.0
Hooray.
I tried using the results of kernel density estimation as a classifier for supervised learning. In practice, this kind of technique is rarely used: it is computationally expensive, and depending on the data the accuracy can drop significantly. Still, as this trial shows, some datasets can be classified quite quickly and neatly this way.
Continue to Part 2