Understanding k-means++

Introduction

I previously posted an article about k-means. Since k-means suffers from a dependency on its initial values, an algorithm called k-means++ was developed to overcome this problem. In this post, I summarize what I learned about k-means++.

References

In studying k-means++, I referred to the following sources.

- Introduction to Machine Learning for Language Processing (Natural Language Processing Series), Daiya Takamura (Author), Manabu Okumura (Supervisor), Corona Publishing
- Essence of Machine Learning, Koichi Kato (Author), SB Creative Co., Ltd.
- k-means++ method - Wikipedia

About k-means++

Review of k-means

Overview of k-means

k-means is an algorithm that first assigns the data to provisional clusters and then iteratively refines the assignment using the mean of each cluster so that the data ends up well partitioned. It is called the k-means method because it builds a user-specified number k of clusters from their means.

k-means algorithm

Specifically, k-means follows the steps below (a minimal NumPy sketch follows the list).

  1. Randomly assign a cluster to each point $x_{i}$
  2. Compute the centroid of the points assigned to each cluster
  3. For each point, compute the distance to each centroid calculated above and reassign the point to the cluster with the nearest centroid
  4. Repeat steps 2 and 3 until the cluster assignments no longer change
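
To make these steps concrete, the following is a minimal NumPy sketch of plain k-means following steps 1-4. This is only an illustration (separate from the k-means++ implementation shown later) and assumes no cluster becomes empty during the updates.

import numpy as np

def kmeans_naive(X, k, max_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Step 1: assign each point to a random cluster
    labels = rng.randint(0, k, size=X.shape[0])
    for _ in range(max_iter):
        # Step 2: centroid of the points assigned to each cluster
        centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 3: reassign each point to the nearest centroid
        dist = ((X[:, np.newaxis, :] - centers[np.newaxis, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        # Step 4: stop when the assignments no longer change
        if (new_labels == labels).all():
            break
        labels = new_labels
    return labels, centers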

Problems with k-means

**Since k-means assigns clusters randomly at the start, the initial assignment can lead to clustering that is far from optimal. In addition, convergence can take a long time depending on the initial values.**

k-means++

Overview of k-means++

k-means++ is an algorithm that aims to overcome the initial-value dependency problem described above. **k-means++ is designed around the idea that the initial cluster centers should be far apart from each other**, and it chooses the initial cluster centers probabilistically according to the distances between data points.

k-means++ algorithm

  1. Randomly select one point from the data points $x_{i}$ and make it the first cluster center.
  2. For each point $x_{i}$, compute the distance $D(x)$ to the nearest existing cluster center.
  3. Choose a new cluster center at random, selecting each point $x_{i}$ with the weighted probability $\frac{D(x)^2}{\sum D(x)^2}$.
  4. Repeat steps 2 and 3 until k cluster centers have been selected.

As described above, k-means++ tries to mitigate the initial-value dependence problem by choosing the initial cluster centers probabilistically based on the distances between data points.
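
To make step 3 concrete, here is a rough sketch of just the weighted selection of the next center. The names X, centers, and rng are chosen for this illustration and are not part of the implementation below.

import numpy as np

def choose_next_center(X, centers, rng):
    # D(x)^2: squared distance from each point to its nearest existing center
    d2 = ((X[:, np.newaxis, :] - centers[np.newaxis, :, :]) ** 2).sum(axis=2).min(axis=1)
    # Weighted probability distribution D(x)^2 / sum D(x)^2
    p = d2 / d2.sum()
    # Points far from every existing center are more likely to be chosen
    idx = rng.choice(X.shape[0], p=p)
    return X[idx]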

Implementing k-means++

Implementation of k-means++ without using a library

The following is an implementation of k-means++ without using a library. I based it on the k-means code described in Essence of Machine Learning and modified it myself to turn it into k-means++.

import numpy as np

class KMeans_pp:
    def __init__(self, n_clusters, max_iter = 1000, random_seed = 0):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = np.random.RandomState(random_seed)
        
    def fit(self, X):
        # Randomly choose the first cluster center from the data points
        tmp = self.random_state.choice(np.array(range(X.shape[0])))
        first_cluster = X[tmp]
        first_cluster = first_cluster[np.newaxis,:]
        
        # Squared distance between the first cluster center and every point,
        # divided by the total, giving the selection probabilities
        p = ((X - first_cluster)**2).sum(axis = 1) / ((X - first_cluster)**2).sum()
        
        r = self.random_state.choice(np.array(range(X.shape[0])), size = 1, replace = False, p = p)
        
        first_cluster = np.r_[first_cluster ,X[r]]
        
        # When the number of clusters is 3 or more
        if self.n_clusters >= 3:
            # Repeat until the specified number of cluster centers has been chosen
            while first_cluster.shape[0] < self.n_clusters:
                # Squared distance between each data point and each existing cluster center
                dist_f = ((X[:, :, np.newaxis] - first_cluster.T[np.newaxis, :, :])**2).sum(axis = 1)
                # Index of the closest cluster center for each data point
                f_argmin = dist_f.argmin(axis = 1)
                # Keep only the squared distance to the closest cluster center
                for i in range(dist_f.shape[1]):
                    dist_f.T[i][f_argmin != i] = 0
                    
                # Probabilistically choose a new cluster center
                pp = dist_f.sum(axis = 1) / dist_f.sum()
                rr = self.random_state.choice(np.array(range(X.shape[0])), size = 1, replace = False, p = pp)
                # Add the new cluster center to the set of initial centers
                first_cluster = np.r_[first_cluster ,X[rr]]
        
        # Initial labeling based on the chosen centers
        dist = (((X[:, :, np.newaxis] - first_cluster.T[np.newaxis, :, :]) ** 2).sum(axis = 1))
        self.labels_ = dist.argmin(axis = 1)
        labels_prev = np.zeros(X.shape[0])
        count = 0
        self.cluster_centers_ = np.zeros((self.n_clusters, X.shape[1]))
        
        # Stop when the cluster of every data point no longer changes or the iteration limit is reached
        while (not (self.labels_ == labels_prev).all() and count < self.max_iter):
            # Compute the centroid of each cluster at this point
            for i in range(self.n_clusters):
                XX = X[self.labels_ == i, :]
                self.cluster_centers_[i, :] = XX.mean(axis = 0)
            # Compute the squared distance between each data point and each cluster centroid
            dist = ((X[:, :, np.newaxis] - self.cluster_centers_.T[np.newaxis, :, :]) ** 2).sum(axis = 1)
            # Remember the previous labels; the loop ends when they stop changing
            labels_prev = self.labels_
            # Reassign each point to the cluster with the nearest centroid
            self.labels_ = dist.argmin(axis = 1)
            count += 1
            self.count = count
            
    def predict(self, X):
        dist = ((X[:, :, np.newaxis] - self.cluster_centers_.T[np.newaxis, :, :]) ** 2).sum(axis = 1)
        labels = dist.argmin(axis = 1)
        return labels

Verification

Let's verify that clustering actually works with this algorithm.

import matplotlib.pyplot as plt

#Create a suitable dataset
np.random.seed(0)
points1 = np.random.randn(80, 2)
points2 = np.random.randn(80, 2) + np.array([4,0])
points3 = np.random.randn(80, 2) + np.array([5,8])

points = np.r_[points1, points2, points3]
np.random.shuffle(points)

#Create a model to divide into 3 clusters
model =  KMeans_pp(3)
model.fit(points)

print(model.labels_)

The output is as follows. You can see that the points have been cleanly assigned to three labels.

[2 1 0 2 1 1 0 1 2 0 1 1 0 1 0 0 1 1 0 2 0 1 2 0 1 2 0 2 1 2 1 1 1 0 1 0 1
 2 2 1 1 1 1 2 0 1 1 1 0 2 1 0 2 1 0 1 0 2 2 2 2 2 1 0 1 0 0 1 1 1 1 1 0 1
 0 0 0 2 1 0 2 0 1 1 0 1 2 0 2 2 2 0 0 0 2 0 0 0 2 0 2 1 1 1 1 1 0 1 2 1 2
 0 1 2 2 1 2 0 2 2 2 0 0 2 0 2 1 2 2 0 1 2 1 2 2 2 1 0 2 1 1 2 0 0 0 2 1 1
 1 0 0 0 1 1 2 0 1 0 0 0 2 0 0 0 0 0 2 2 1 2 0 2 2 0 1 2 2 2 2 1 0 2 1 2 2
 0 2 0 0 0 2 0 1 2 0 0 0 1 1 1 1 2 2 0 2 2 2 0 0 1 1 2 2 0 1 1 2 1 2 2 0 2
 1 2 0 2 1 2 0 0 2 2 0 2 0 2 1 2 2 0]

Let's illustrate this with matplotlib.

markers = ["+", "*", "o"]
color = ['r', 'b', 'g']
for i in range(3):
    p = points[model.labels_ == i, :]
    plt.scatter(p[:, 0], p[:, 1], marker = markers[i], color = color[i])
    
plt.show()

Here is the output. You can see that clustering is completed without any problems.

(Figure: scatter plot of the three clusters, each shown with a different marker and color)

Let's also output the number of iterations it took for the clustering to converge.

print(model.count)
4

This time, k-means++ converged after four iterations. When I counted the iterations in the same way with ordinary k-means, the result was three. k-means++ is generally said to take less time to converge, so I would like to verify this myself; a small sketch using scikit-learn follows.
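
For reference, one way to check this claim is with scikit-learn's KMeans, which supports both k-means++ and purely random initialization and exposes the number of iterations via the n_iter_ attribute. A minimal sketch, reusing the points array from above:

from sklearn.cluster import KMeans

# Compare the number of iterations for k-means++ vs. random initialization
# (n_init=1 so each model runs from a single initialization)
for init in ["k-means++", "random"]:
    km = KMeans(n_clusters=3, init=init, n_init=1, random_state=0).fit(points)
    print(init, "iterations:", km.n_iter_)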

Next, I would like to take on clustering with Gaussian mixture models and the EM algorithm.
