k-means and kernel k-means

I started writing my study notes on another blog during the stay-at-home period, but since the TeX rendering there was awful, I will move some of the articles to Qiita...

I studied k-means, the most basic [citation needed] algorithm for unsupervised learning, and tried it out on a simple example.

k-means is a method for separating given data into multiple clusters.

Advantages, etc.

Advantage

Because points are assigned by their distance from the cluster centers, the result usually matches the intuitive grouping you would draw when plotting the data.

Disadvantage

As the experiment below shows, it only separates clusters well when they are (roughly) linearly separable.

algorithm

given

Data points $x_1, \dots, x_N$ and the number of clusters $k$.

Method of calculation

  1. Randomly assign a cluster $k_i$ to every data point $x_i$, giving pairs $(x_i, k_i)$.
  2. Compute the center $m_k$ of each cluster: $m_k |N_k| = \sum_{i \in N_k} x_i$.
  3. For every data point, compute the distance to every cluster center and reassign the point to the nearest cluster.
  4. Repeat steps 2-3 until no assignments change.

Here $N_k$ is the set of data points contained in cluster $k$ at that time.
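In formulas, the reassignment in step 3 sends each point to the cluster whose center is nearest:

\displaystyle{
\begin{aligned}
k_i = \mathop{\mathrm{arg\,min}}_{k} \, (x_i - m_k)^2.
\end{aligned}
}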

Example of use

Experiment with sklearn's datasets.make_circles using the following parameters: factor = 0.1, noise = 0.01.

With these settings we obtain two clusters under a noise of 0.01: a circle of radius 0.1 and a circle of radius 1.
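A minimal sketch of the data generation under these assumptions (the sample count is arbitrary):

# assumed setup: two concentric circles with radius ratio 0.1 and Gaussian noise 0.01
import numpy as np
from sklearn.datasets import make_circles

data, _ = make_circles(n_samples=500, factor=0.1, noise=0.01)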

code

import numpy as np
import pandas as pd

# k: number of clusters
def kmeans(data, k):
  df = pd.DataFrame()
  df['x'] = data.T[0]
  df['y'] = data.T[1]
  # start from a random cluster assignment
  clusters = np.random.randint(0, k, len(df))
  df['c'] = clusters

  while True:
    before_clusters = list(df.c.copy())
    # cluster centers: one column per cluster, rows are the mean x and y
    centers = df.groupby('c').mean().reset_index().drop(columns=['c']).T

    # reassign every point to the nearest center (squared Euclidean distance)
    tmp = df.drop(columns=['c']).T
    new_clusters = [min(centers.columns, key=lambda c: ((tmp[col] - centers[c]) ** 2).sum()) for col in tmp.columns]
    if before_clusters == new_clusters:
      break
    else:
      df['c'] = new_clusters

  return df.c
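A usage sketch, assuming the make_circles data generated above:

# cluster the two-circles data into k = 2 clusters and plot the assignment
import matplotlib.pyplot as plt

labels = kmeans(data, 2)
plt.scatter(data.T[0], data.T[1], c=labels)
plt.show()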

Experimental result

The result of running k-means is shown below. Since the data are clearly not linearly separable, the points are not classified in the way one would hope. k-means.png

kernel k-means

Advantages, etc.

Advantage

Clusters that are not linearly separable in the original space can still be separated, provided a suitable kernel is chosen.

Disadvantage

As the experiments below show, the result depends heavily on the kernel and its parameters.

algorithm

given

Data points $x_1, \dots, x_N$, the number of clusters $k$, and a kernel function $K$.

Almost the same as ordinary k-means, except that the center of each cluster is taken in the feature space. That is, for cluster $k$ we define its center as the vector $m_k$ that minimizes the following quantity.

\displaystyle{
\begin{aligned}
\sum_{i \in N_k} (\phi(x_i) - m_k)^2.
\end{aligned}
}

The $m_k$ that minimizes this is obtained as follows.

\displaystyle{
\begin{aligned}
m_k = \frac{1}{|N_k|} \sum_{i \in N_k} \phi(x_i).
\end{aligned}
}

Using this, the squared distance between a data point $x_i$ and the center of cluster $k$ can be written entirely in terms of kernel evaluations:

\displaystyle{
\begin{aligned}
(\phi(x_i) - m_k)^2 &= K(x_i, x_i) - \frac{2}{|N_k|} \sum_{j \in N_k} K(x_i, x_j) + \frac{1}{|N_k|^2} \sum_{m, n \in N_k} K(x_m, x_n)
\end{aligned}
}

Here $K(x, y) = \phi(x) \cdot \phi(y)$ is the kernel function.
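For reference, this expression follows by substituting $m_k$ into the squared distance and expanding, then replacing every inner product $\phi(x) \cdot \phi(y)$ by $K(x, y)$:

\displaystyle{
\begin{aligned}
(\phi(x_i) - m_k)^2 &= \phi(x_i) \cdot \phi(x_i) - \frac{2}{|N_k|} \sum_{j \in N_k} \phi(x_i) \cdot \phi(x_j) + \frac{1}{|N_k|^2} \sum_{m, n \in N_k} \phi(x_m) \cdot \phi(x_n).
\end{aligned}
}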

Example of use

Experiment with sklearn's datasets.make_circles, using the same parameters as before.

The kernel function is the following RBF kernel.

\displaystyle{
\begin{aligned}
K(x, y) = \exp \left(-\frac{(x-y)^2}{\sigma^2} \right)
\end{aligned}
}
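A minimal sketch of this kernel as a two-argument function that can be passed to the code below (make_rbf_kernel is an illustrative helper name; sigma corresponds to $\sigma$):

# returns K(x, y) = exp(-|x - y|^2 / sigma^2) as a function of two points
def make_rbf_kernel(sigma):
  def rbf(x, y):
    return np.exp(-((x - y) ** 2).sum() / sigma ** 2)
  return rbf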

code

# kernel: a kernel function taking two points
def kernel_kmeans(data, k, kernel):
  df = pd.DataFrame()
  df['x'] = data.T[0]
  df['y'] = data.T[1]
  clusters = np.random.randint(0, k, len(df))
  df['c'] = clusters

  self_corr = {}
  # The within-cluster term is expensive to recompute for every point, so cache it once per iteration
  def compute_self_corr():
    for j in range(0, k):
      sub_df = df[df.c == j].drop(columns=['c']).T
      ret = sub_df.apply(lambda col1: sub_df.apply(lambda col2: kernel(col1, col2), axis=0).mean(), axis=0).mean()
      self_corr[j] = ret

  # Squared distance from point x to the center of cluster j in feature space
  def compute_distance_from_center_j(x, j):
    sub_df = df[df.c == j].drop(columns=['c']).T
    ret = kernel(x, x)
    ret += sub_df.apply(lambda col: -2 * kernel(x, col), axis=0).mean()
    ret += self_corr[j]
    return ret

  while True:
    compute_self_corr()
    before_clusters = list(df.c.copy())
    tmp = df.drop(columns=['c']).T
    # reassign every point to the cluster with the nearest center in feature space
    new_clusters = [min(range(0, k), key=lambda j: compute_distance_from_center_j(tmp[col], j)) for col in tmp.columns]
    if before_clusters == new_clusters:
      break
    else:
      df['c'] = new_clusters
  return df.c
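A usage sketch, assuming the make_circles data and the make_rbf_kernel helper from above:

# kernel k-means with k = 2 and an RBF kernel with sigma = 0.5
labels = kernel_kmeans(data, 2, make_rbf_kernel(0.5))
plt.scatter(data.T[0], data.T[1], c=labels)
plt.show()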

Experimental result

The result of running kernel k-means is shown below. The kernel parameter $\sigma$ is $0.5$. kernel-k-means-sig0.5.png

When kernel k-means works (and when it doesn't)

Consider the following situation: the data form two clusters, $c_r$ and $c_R$, whose points lie (roughly uniformly) on circles of radius $r$ and $R$ ($r < R$) centered at the origin, as in the make_circles data above.

Now a new data point $D = (x, y) = (R, 0)$ is added and classified. Writing the squared distance from a point to the center of class $c$ as $C(c)$, the distances to the classes $c_r$ and $c_R$ are as follows.

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - 2 \exp \left(-\frac{r^2 + R^2}{\sigma^2}\right) I_0 \left(\frac{2rR}{\sigma^2}\right) + \exp \left(-\frac{2r^2}{\sigma^2}\right) I_0 \left(\frac{2r^2}{\sigma^2}\right), \\
C(c_R) &= 1 - \exp \left(-\frac{2R^2}{\sigma^2}\right) I_0 \left(\frac{2R^2}{\sigma^2}\right).
\end{aligned}
}
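As a quick numerical sanity check of these two expressions, a sketch using scipy.special.i0e (the exponentially scaled Bessel function, which avoids overflow for large arguments); numpy is assumed to be imported as np:

from scipy.special import i0e  # i0e(z) = exp(-z) * I_0(z)

def C_r(r, R, sigma):
  # distance from the new point D = (R, 0) to the center of cluster c_r
  return (1
          - 2 * np.exp(-(r - R) ** 2 / sigma ** 2) * i0e(2 * r * R / sigma ** 2)
          + i0e(2 * r ** 2 / sigma ** 2))

def C_R(r, R, sigma):
  # distance from the new point D = (R, 0) to the center of cluster c_R
  return 1 - i0e(2 * R ** 2 / sigma ** 2)

print(C_r(0.1, 1.0, 0.5), C_R(0.1, 1.0, 0.5))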

Here $I_\alpha(x)$ is the modified Bessel function of the first kind. In the regime $r \ll \sigma \ll R$, these become

\displaystyle{
\begin{aligned}
C(c_r) &= 2, \\
C(c_R) &= 1 - \sqrt \frac{\sigma^2}{2\pi R^2}.
\end{aligned}
}

Since clearly $C(c_R) \lt C(c_r)$, the point is assigned to the correct cluster. Next, for $\sigma \ll r, R$,

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - 2 \sqrt \frac{\sigma^2}{2\pi r R} \exp \left(-\frac{(r - R)^2}{\sigma^2}\right) + \sqrt \frac{\sigma^2}{2\pi r^2}, \\
C(c_R) &= 1 - \sqrt \frac{\sigma^2}{2\pi R^2}.
\end{aligned}
}

Examining where the two are equal, the only solution is $r = R$; otherwise $C(c_r) \lt C(c_R)$ and the point is classified incorrectly. The important point here is that this holds regardless of how large or small $\sigma$ is compared with $\delta r := |R - r|$. Finally, when $r, R \ll \sigma$,

\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 0.
\end{aligned}
}

Since both distances vanish to this order, the point clearly cannot be classified correctly. Indeed, to leading order in $r/\sigma$ and $R/\sigma$ the expression reduces to the same form as ordinary k-means, so the classification fails for the same reason.
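One way to see this last point is a sketch of the expansion for $|x - y| \ll \sigma$:

\displaystyle{
\begin{aligned}
K(x, y) = \exp \left(-\frac{(x-y)^2}{\sigma^2}\right) \approx 1 - \frac{(x-y)^2}{\sigma^2},
\end{aligned}
}

so, up to a constant and the overall factor $1/\sigma^2$, the kernel distance $C(c)$ becomes proportional to the ordinary squared Euclidean distance to the cluster mean, exactly as in plain k-means.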

Similarly, consider the case where a data point $d = (x, y) = (r, 0)$ is added. The distances to the class centers are

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - \exp \left(-\frac{2r^2}{\sigma^2}\right) I_0 \left(\frac{2r^2}{\sigma^2}\right), \\
C(c_R) &= 1 - 2 \exp \left(-\frac{r^2 + R^2}{\sigma^2}\right) I_0 \left(\frac{2rR}{\sigma^2}\right) + \exp \left(-\frac{2R^2}{\sigma^2}\right) I_0 \left(\frac{2R^2}{\sigma^2}\right).
\end{aligned}
}

In the regime $r \ll \sigma \ll R$, these become

\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 1 + \sqrt \frac{\sigma^2}{2\pi R^2}.
\end{aligned}
}

Since clearly $C(c_r) \lt C(c_R)$, the point is assigned to the correct cluster. Next, for $\sigma \ll r, R$,

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - \sqrt \frac{\sigma^2}{2\pi r^2}, \\
C(c_R) &= 1 - 2 \sqrt \frac{\sigma^2}{2\pi r R} \exp \left(-\frac{(r - R)^2}{\sigma^2}\right) + \sqrt \frac{\sigma^2}{2\pi R^2}.
\end{aligned}
}

In this case as well, examining where the two are equal gives only $r = R$; otherwise $C(c_r) \lt C(c_R)$, an incorrect classification. Finally, when $r, R \ll \sigma$,

\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 0.
\end{aligned}
}

Again both distances vanish, so the point clearly cannot be classified correctly; as before, to leading order in $r/\sigma$ and $R/\sigma$ the expression reduces to the same form as ordinary k-means.

As these results show, the classification works well when the resolution of the kernel function, $\sigma$, lies between the scales of the two clusters. Presumably the RBF kernel cannot see direction, only whether things are larger or smaller than the given scale $\sigma$, so if the kernel scale does not lie between the scales of the two classes, it does not work as expected.

Let's do a concrete numerical experiment. We vary $\sigma$ under the same parameters as before, $r = 0.1, R = 1$, and look in particular at how it behaves in the regimes $\sigma \ll r, R$; $r \ll \sigma \ll R$; and $r, R \ll \sigma$.
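A sketch of this sweep, assuming the data, kernel_kmeans, and make_rbf_kernel defined above (and matplotlib imported as plt):

# run kernel k-means for each sigma regime and plot the resulting assignment
for sigma in [0.01, 0.5, 2.0]:
  labels = kernel_kmeans(data, 2, make_rbf_kernel(sigma))
  plt.scatter(data.T[0], data.T[1], c=labels)
  plt.title(f'sigma = {sigma}')
  plt.show()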

$\sigma = 0.01, r=0.1, R=1$ kernel-k-means-sig0.01.png

$\sigma = 0.5, r=0.1, R=1$ kernel-k-means-sig0.5.png

$\sigma = 2, r=0.1, R=1$ kernel-k-means-sig2.0.png

Impressions

kernel k-means is not easy to use because it depends heavily on the kernel and its parameters. Then again, ordinary k-means only works for linearly separable clusters, so it is not exactly easy to use either...

reference

- Hajipata
- Castella book
