I started writing up my study notes on another blog while stuck at home, but the TeX support there was awful, so I am moving some of the articles over to Qiita.
I studied k-means, the most basic [citation needed] unsupervised learning algorithm, and gave it a quick try.
k-means
A method for separating given data into multiple clusters.
Advantages
Points are assigned by their distance to the cluster centers, so the result is almost always the intuitive one when the data are plotted.
Disadvantages
It only works well when the clusters are roughly linearly separable (see the results below).
Given
The data points and the number of clusters $k$.
Method of calculation
Start from a random assignment of the points to $k$ clusters, then repeat the following until the assignment stops changing: take each cluster center to be the mean of its current members, and reassign every point to the cluster with the nearest center. The center of cluster $k$ is

\displaystyle{
\begin{aligned}
m_k = \frac{1}{|N_k|} \sum_{i \in N_k} x_i,
\end{aligned}
}

where $N_k$ is the set of data contained in cluster $k$ at that time.
The experiment uses data from sklearn's datasets.make_circles: two clusters, a circle of radius 0.1 and a circle of radius 1, generated with noise of 0.01.
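As a sketch of the data generation (the sample size here is an assumed value; the radius ratio factor=0.1 and noise=0.01 follow the description above):

```python
from sklearn import datasets

# Inner circle of radius 0.1 (factor=0.1), outer circle of radius 1, noise 0.01.
# n_samples=300 is an assumed value, not from the original post.
data, _ = datasets.make_circles(n_samples=300, factor=0.1, noise=0.01)
```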
```python
import numpy as np
import pandas as pd

# k: number of clusters
def kmeans(data, k):
    df = pd.DataFrame()
    df['x'] = data.T[0]
    df['y'] = data.T[1]
    # Start from a random cluster assignment
    clusters = np.random.randint(0, k, len(df))
    df['c'] = clusters
    while True:
        before_clusters = list(df.c.copy())
        # Each cluster center is the mean of its members
        centers = df.groupby('c').mean().reset_index().drop(columns=['c']).T
        tmp = df.drop(columns=['c']).T
        # Reassign every point to the nearest center
        new_clusters = [min(centers.columns, key=lambda c: ((tmp[col] - centers[c]) ** 2).sum())
                        for col in tmp.columns]
        if before_clusters == new_clusters:
            break
        else:
            df['c'] = new_clusters
    return df.c
```
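For reference, a minimal way to run this and plot the labels might look like the following; the matplotlib plotting is my addition, not something from the original post:

```python
import matplotlib.pyplot as plt

labels = kmeans(data, 2)
plt.scatter(data.T[0], data.T[1], c=labels)
plt.show()
```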
The result of running k-means is shown below. Since the two circles are clearly not linearly separable, the points are not clustered correctly, as expected.
kernel k-means
Advantages
Can handle clusters that are not linearly separable, by measuring distances in a feature space defined through a kernel.
Disadvantages
The result depends heavily on the choice of kernel and its parameters (see the discussion below).
Given
The data points, the number of clusters $k$, and a kernel function $K(x, y)$.
The procedure is almost the same as ordinary k-means, except that the cluster centers live in the feature space. That is, for cluster $k$ we define its center as the vector $m_k$ that minimizes the following quantity.
\displaystyle{
\begin{aligned}
\sum_{i \in N_k} (\phi(x_i) - m_k)^2.
\end{aligned}
}
The $m_k$ that minimizes this (obtained by setting the derivative with respect to $m_k$ to zero) is as follows.
\displaystyle{
\begin{aligned}
m_k = \frac{1}{|N_k|} \sum_{i \in N_k} \phi(x_i).
\end{aligned}
}
Using this, the squared distance between a data point $x_i$ and the center of cluster $k$ can be written entirely in terms of the kernel:
\displaystyle{
\begin{aligned}
(\phi(x_i) - m_k)^2 &= K(x_i, x_i) - \frac{2}{|N_k|} \sum_{j \in N_k} K(x_i, x_j) + \frac{1}{|N_k|^2} \sum_{m, n \in N_k} K(x_m, x_n)
\end{aligned}
}
where $K(x, y)$ is the kernel function.
The experiment again uses data from sklearn's datasets.make_circles, with the same settings as before.
The kernel function uses the following RBF kernel.
\displaystyle{
\begin{aligned}
K(x, y) = \exp \left(-\frac{(x-y)^2}{\sigma^2} \right)
\end{aligned}
}
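As a small sketch, this kernel can be wrapped as a Python helper; the factory form and the name make_rbf_kernel are my own, chosen so that the two-argument kernel expected by the code below can be produced for a given $\sigma$:

```python
import numpy as np

def make_rbf_kernel(sigma):
    # Returns K(x, y) = exp(-(x - y)^2 / sigma^2) for vector inputs x and y.
    def kernel(x, y):
        return np.exp(-((x - y) ** 2).sum() / sigma ** 2)
    return kernel
```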
```python
# kernel: a kernel function taking two arguments
def kernel_kmeans(data, k, kernel):
    df = pd.DataFrame()
    df['x'] = data.T[0]
    df['y'] = data.T[1]
    clusters = np.random.randint(0, k, len(df))
    df['c'] = clusters
    self_corr = {}
    # The within-cluster kernel average is expensive to recompute, so cache it once per iteration
    def compute_self_corr():
        for j in range(0, k):
            sub_df = df[df.c == j].drop(columns=['c']).T
            ret = sub_df.apply(lambda col1: sub_df.apply(lambda col2: kernel(col1, col2), axis=0).mean(), axis=0).mean()
            self_corr[j] = ret
    # Squared distance to the center of cluster j in the feature space
    def compute_distance_from_center_j(x, j):
        sub_df = df[df.c == j].drop(columns=['c']).T
        ret = kernel(x, x)
        ret += sub_df.apply(lambda col: -2 * kernel(x, col), axis=0).mean()
        ret += self_corr[j]
        return ret
    while True:
        compute_self_corr()
        before_clusters = list(df.c.copy())
        tmp = df.drop(columns=['c']).T
        # Assign every point to the cluster whose feature-space center is closest
        new_clusters = [min(range(0, k), key=lambda j: compute_distance_from_center_j(tmp[col], j))
                        for col in tmp.columns]
        if before_clusters == new_clusters:
            break
        else:
            df['c'] = new_clusters
    return df.c
```
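A minimal run, reusing the make_rbf_kernel helper sketched earlier (again, the plotting lines are my addition):

```python
labels = kernel_kmeans(data, 2, make_rbf_kernel(0.5))
plt.scatter(data.T[0], data.T[1], c=labels)
plt.show()
```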
The result of running kernel k-means is shown below. The kernel parameter $\sigma$ is $0.5$.
Consider the following idealized situation: the points of cluster $c_r$ lie uniformly and densely on a circle of radius $r$, and the points of cluster $c_R$ lie uniformly and densely on a circle of radius $R$, with $r \lt R$.
In this situation, a new data point $D = (x, y) = (R, 0)$ is added and classified. Writing the squared feature-space distance from this point to the center of class $c$ as $C(c)$, the distances to classes $c_r$ and $c_R$ are as follows.
\displaystyle{
\begin{aligned}
C(c_r) &= 1 - 2 \exp \left(-\frac{r^2 + R^2}{\sigma^2}\right) I_0 \left(\frac{2rR}{\sigma^2}\right) + \exp \left(-\frac{2r^2}{\sigma^2}\right) I_0 \left(\frac{2r^2}{\sigma^2}\right), \\
C(c_R) &= 1 - \exp \left(-\frac{2R^2}{\sigma^2}\right) I_0 \left(\frac{2R^2}{\sigma^2}\right).
\end{aligned}
}
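These expressions come from averaging the RBF kernel over a circle. As a sketch of the key identity behind them (my own check, under the uniform-circle assumption above), for a point at distance $\rho$ from the origin:

\displaystyle{
\begin{aligned}
\frac{1}{2\pi} \int_0^{2\pi} \exp \left(-\frac{(\rho - r\cos\theta)^2 + (r\sin\theta)^2}{\sigma^2}\right) d\theta = \exp \left(-\frac{\rho^2 + r^2}{\sigma^2}\right) I_0 \left(\frac{2\rho r}{\sigma^2}\right),
\end{aligned}
}

which gives the cross terms with $\rho = R$ (or $\rho = r$), while the within-cluster double average reduces by symmetry to the same integral with $\rho$ on the same circle.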
Here $I_\alpha(x)$ is the modified Bessel function of the first kind. Evaluating these when $r \ll \sigma \ll R$,
\displaystyle{
\begin{aligned}
C(c_r) &= 2, \\
C(c_R) &= 1 - \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}
Clearly $C(c_R) \lt C(c_r)$, so the point is assigned to the correct cluster. Also, for $\sigma \ll r, R$,
\displaystyle{
\begin{aligned}
C(c_r) &= 1 - 2 \sqrt{\frac{\sigma^2}{2\pi r R}} \exp \left(-\frac{(r - R)^2}{\sigma^2}\right) + \sqrt{\frac{\sigma^2}{2\pi r^2}}, \\
C(c_R) &= 1 - \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}
Examining where the two are equal, one finds only $r = R$; otherwise $C(c_R) \lt C(c_r)$, so the point is again classified correctly. Finally, when $r, R \ll \sigma$,
\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 0.
\end{aligned}
}
The two distances coincide, so the point clearly cannot be classified correctly. Indeed, expanding to leading order in $r/\sigma$ and $R/\sigma$ gives the same expression as ordinary k-means, whose centroids both sit at the origin here, which also shows that correct classification is impossible.
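As a sketch of that leading-order claim (my own check, using $e^{-u} \approx 1 - u$ for $u \ll 1$ and writing $\mu_c$ for the ordinary centroid of cluster $c$, with $D$ the added point):

\displaystyle{
\begin{aligned}
K(x, y) &\approx 1 - \frac{(x - y)^2}{\sigma^2}, \\
C(c) &\approx \frac{2}{|N_c| \sigma^2} \sum_{j \in N_c} (D - x_j)^2 - \frac{1}{|N_c|^2 \sigma^2} \sum_{m, n \in N_c} (x_m - x_n)^2 = \frac{2}{\sigma^2} (D - \mu_c)^2,
\end{aligned}
}

which is, up to the factor $2/\sigma^2$, exactly the quantity ordinary k-means compares; since both circles have their centroid at the origin, $C(c_r)$ and $C(c_R)$ agree at this order.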
Similarly, consider the case where a data point $d = (x, y) = (r, 0)$ is added. The distances to the class centers are:
\displaystyle{
\begin{aligned}
C(c_r) &= 1 - \exp \left(-\frac{2r^2}{\sigma^2}\right) I_0 \left(\frac{2r^2}{\sigma^2}\right), \\
C(c_R) &= 1 - 2 \exp \left(-\frac{r^2 + R^2}{\sigma^2}\right) I_0 \left(\frac{2rR}{\sigma^2}\right) + \exp \left(-\frac{2R^2}{\sigma^2}\right) I_0 \left(\frac{2R^2}{\sigma^2}\right).
\end{aligned}
}
Evaluating these when $r \ll \sigma \ll R$,
\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 1 + \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}
Clearly $C(c_r) \lt C(c_R)$, so the point is assigned to the correct cluster. Also, for $\sigma \ll r, R$,
\displaystyle{
\begin{aligned}
C(c_r) &= 1 - \sqrt{\frac{\sigma^2}{2\pi r^2}}, \\
C(c_R) &= 1 - 2 \sqrt{\frac{\sigma^2}{2\pi r R}} \exp \left(-\frac{(r - R)^2}{\sigma^2}\right) + \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}
In this case as well, examining where the two are equal gives only $r = R$; otherwise $C(c_r) \lt C(c_R)$, so the point is classified correctly. Finally, when $r, R \ll \sigma$,
\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 0.
\end{aligned}
}
Again the two distances coincide, so the point clearly cannot be classified correctly; as before, expanding to leading order in $r/\sigma$ and $R/\sigma$ gives the same expression as ordinary k-means.
As these results show, classification works well when the resolution of the kernel function, $\sigma$, lies between the scales of the two clusters. Roughly speaking, the RBF kernel cannot see direction, only size, and in particular only whether things are larger or smaller than the given scale $\sigma$; so if the kernel scale does not sit between the scales of the two classes, it will not behave as expected.
Let's do a concrete numerical experiment: keeping the same parameters $r = 0.1, R = 1$ as before, we vary $\sigma$ and look at how the method behaves in the regimes $\sigma \ll r, R$, then $r \ll \sigma \ll R$, then $r, R \ll \sigma$; a sketch of such a sweep is shown below.
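A minimal sweep might look like this; the specific $\sigma$ values are my assumptions, chosen only to fall in the three regimes:

```python
# Sweep sigma across the three regimes (values are illustrative assumptions).
for sigma in [0.01, 0.5, 10.0]:
    labels = kernel_kmeans(data, 2, make_rbf_kernel(sigma))
    plt.scatter(data.T[0], data.T[1], c=labels)
    plt.title('sigma = {}'.format(sigma))
    plt.show()
```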
Kernel k-means is not easy to use, since the result depends heavily on the kernel and its parameters. Then again, ordinary k-means only works for linearly separable data, so it is not exactly easy to use either...
- Hajipata
- Castella book