This article is the entry for day 6 of the Freee Data People Advent Calendar 2019. I started writing in the middle of the night before the deadline, giggling to myself as I went.
Do you know a library called PyClustering? PyClustering is a clustering-focused library usable from both Python and C++. An algorithm called G-means was newly implemented in PyClustering v0.9.2. It was the first time I had seen the name G-means, and I couldn't find any article about it in Japanese, so I looked it up and summarized it here. The algorithm itself is simple, so it may actually be easiest to read the paper directly.
G-means
G-means is an extension of K-means: it automatically determines the number of clusters, which is a parameter you have to supply for K-means. A similar method called X-means exists, and the article "I examined the X-means method that automatically estimates the number of clusters" introduces it clearly together with code. (Incidentally, X-means is also available in PyClustering.)
The algorithm is as follows. Like X-means, it starts from a small number of clusters and keeps subdividing them; only the stopping condition differs.
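Roughly speaking, G-means repeats the following: run K-means with the current centers, then for each cluster test whether its samples look Gaussian using the Anderson-Darling test (on a one-dimensional projection, described later); clusters that pass keep their center, clusters that fail have their center replaced by two new centers, and the loop stops once every cluster passes. Below is a minimal sketch of that loop, written with scikit-learn and scipy purely for illustration; it is not PyClustering's implementation, and the helper names, the `min_obs` guard, the `max_k` cap, and the use of scipy's fixed 1% critical value instead of the paper's $\alpha$ are my own choices.

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans


def looks_gaussian(points, critical_idx=4):
    """Anderson-Darling normality test on a 1-D projection of the cluster.

    critical_idx=4 picks the 1% level from scipy's fixed critical values
    (15%, 10%, 5%, 2.5%, 1%).
    """
    # Tentatively split the cluster in two and project every sample onto the
    # line connecting the two tentative centers.
    two = KMeans(n_clusters=2, n_init=10).fit(points)
    c1, c2 = two.cluster_centers_
    v = c1 - c2
    projected = points @ v / (v @ v)
    result = anderson(projected, dist="norm")
    return result.statistic < result.critical_values[critical_idx]


def gmeans_sketch(X, max_k=20, min_obs=8):
    centers = [X.mean(axis=0)]  # start with a single cluster
    while len(centers) <= max_k:
        km = KMeans(n_clusters=len(centers), init=np.array(centers), n_init=1).fit(X)
        new_centers, split_happened = [], False
        for i, c in enumerate(km.cluster_centers_):
            members = X[km.labels_ == i]
            if len(members) < min_obs or looks_gaussian(members):
                new_centers.append(c)  # looks Gaussian: keep this center
            else:
                # Not Gaussian: replace the center with two child centers.
                child = KMeans(n_clusters=2, n_init=10).fit(members)
                new_centers.extend(child.cluster_centers_)
                split_happened = True
        centers = new_centers
        if not split_happened:
            break
    return KMeans(n_clusters=len(centers), init=np.array(centers), n_init=1).fit(X)
```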
[^1]: The G in G-means apparently stands for Gaussian.
[^2]: The paper uses $\alpha = 0.0001$, but PyClustering seems to use $\alpha = 0.01$. Since the paper argues for keeping the probability of a Type I error (false positive) small, I do wonder whether this implementation choice is fine.
That's all for the algorithm itself, but the paper also covers:

- how to choose the new centers when a cluster is split in step 4, and
- a trick that makes the Anderson-Darling test easier to apply.
How to choose the initial values of the new centers when splitting a cluster

- If the current center of the cluster is $\mathbf{c}$, the two points $\mathbf{c} \pm \mathbf{m}$ become the initial values of the new centers.
- Two ways of choosing $\mathbf{m}$ are proposed:
  - pick it at random, or
  - run principal component analysis on the samples in the cluster and, with the first principal component $\mathbf{s}$ and its eigenvalue (the variance along that component) $\lambda$, set $\mathbf{m} = \mathbf{s} \sqrt{2 \lambda / \pi}$ (see the sketch below). [^3]
[^3]: The intuition, I think, is to split along the axis in which the samples are most spread out.
The paper adopted the second method (the one using principal component analysis), but PyClustering [seems to adopt the first one](https://github.com/annoviko/pyclustering/blob/3457a2457c9aa385f625d628cec19dd6fada8326/pyclustering/cluster/gmeans.py#L331).
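As a rough illustration of the PCA-based choice, the split centers could be computed like the following sketch (using scikit-learn's PCA; the function name is mine and this is not PyClustering's code):

```python
import numpy as np
from sklearn.decomposition import PCA


def split_centers(points):
    """Initial centers c + m and c - m for a cluster that is about to be split.

    m = s * sqrt(2 * lambda / pi), where s is the first principal component
    (a unit vector) and lambda is its eigenvalue (the variance along s).
    """
    c = points.mean(axis=0)
    pca = PCA(n_components=1).fit(points)
    s = pca.components_[0]            # first principal component
    lam = pca.explained_variance_[0]  # its eigenvalue / variance
    m = s * np.sqrt(2.0 * lam / np.pi)
    return c + m, c - m
```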
Making the Anderson-Darling test easier to apply

Running a statistical test on the samples as-is is hard in high dimensions, so they are projected down to one dimension. Specifically, a sample $\mathbf{x}$ is projected onto the one-dimensional value $x^{\prime}$ as follows.
$$
x^{\prime} = \frac{\langle \mathbf{x}, \mathbf{v} \rangle}{\| \mathbf{v} \|^2}
$$
Here $\mathbf{v} = \mathbf{c}_1 - \mathbf{c}_2$, where $\mathbf{c}_1$ and $\mathbf{c}_2$ are the two centers chosen as described above in "How to choose the initial values of the new centers when splitting a cluster".
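For concreteness, here is a self-contained sketch of this projection followed by the Anderson-Darling test in scipy (the data and the centers `c1`, `c2` are made up for illustration; note also that scipy only exposes critical values at fixed significance levels rather than an arbitrary $\alpha$):

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
# Made-up example: samples of one cluster plus two tentative centers c1, c2.
X_cluster = rng.normal(size=(200, 2))
c1 = np.array([0.5, 0.0])
c2 = np.array([-0.5, 0.0])

v = c1 - c2                             # v = c1 - c2
x_proj = X_cluster @ v / (v @ v)        # project each sample onto one dimension
result = anderson(x_proj, dist="norm")  # Anderson-Darling test for normality
# The cluster is treated as Gaussian (and not split) when the statistic stays
# below the critical value at the chosen significance level.
print(result.statistic)
print(result.critical_values)    # at 15%, 10%, 5%, 2.5%, 1%
print(result.significance_level)
```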
Now that we know the algorithm, let's actually run clustering with PyClustering. For now I used the same data as [this article](https://qiita.com/deaikei/items/8615362d320c76e2ce0b#%E3%81%95%E3%82%89%E3%81%AB%E3%83%A2%E3%83%A4%E3%83%83%E3%81%A8%E3%81%97%E3%81%9F%E3%83%87%E3%83%BC%E3%82%BF%E3%82%92%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0%E3%81%95%E3%81%9B%E3%81%A6%E3%81%BF%E3%82%8B).
from sklearn.datasets import make_blobs

# Generate toy data: 500 samples in 2 dimensions drawn from 8 blobs.
X, Y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=8,
                  cluster_std=1.5,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)

import seaborn as sns

sns.set(style="whitegrid")
sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=Y, legend="full"
);
Since `centers=8` was specified, naturally eight colors show up.
Trying G-means
Now for the main subject: what happens when we actually use G-means?
from pyclustering.cluster import gmeans
import numpy as np
import itertools
gmeans_instance = gmeans.gmeans(X).process()
clusters = gmeans_instance.get_clusters()
centers = gmeans_instance.get_centers()
labels_size = len(
    list(itertools.chain.from_iterable(clusters))
)

# Build a flat label array: labels[i] = index of the cluster containing sample i.
labels = np.zeros(labels_size)
for n, n_th_cluster in enumerate(clusters):
    for img_num in n_th_cluster:
        labels[img_num] = n

sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=labels, legend="full"
)
Oh. The number of clusters isn't eight, but the data is grouped in a way that feels quite natural. Isn't this pretty good?
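By the way, if you want to check exactly how many clusters G-means settled on, the `clusters` (and `centers`) lists obtained in the snippet above tell you directly:

```python
# Continuing from the G-means snippet above: how many clusters were found?
print(len(clusters), len(centers))
```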
If you want to cluster in a situation where you have no idea how many clusters you have, using G-means seems like a good option.
Finally, let's pit PyClustering's X-means against it and compare the results.
from pyclustering.cluster import xmeans
import numpy as np
import itertools
xmeans_instance = xmeans.xmeans(X).process()
clusters = xmeans_instance.get_clusters()
centers = xmeans_instance.get_centers()
labels_size = len(
    list(itertools.chain.from_iterable(clusters))
)

# Same as before: labels[i] = index of the cluster containing sample i.
labels = np.zeros(labels_size)
for n, n_th_cluster in enumerate(clusters):
    for img_num in n_th_cluster:
        labels[img_num] = n

sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=labels, legend="full"
)
Hmm? This is completely different from the [X-means article I referenced](https://qiita.com/deaikei/items/8615362d320c76e2ce0b#%E3%81%95%E3%82%89%E3%81%AB%E3%83%A2%E3%83%A4%E3%83%83%E3%81%A8%E3%81%97%E3%81%9F%E3%83%87%E3%83%BC%E3%82%BF%E3%82%92%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0%E3%81%95%E3%81%9B%E3%81%A6%E3%81%BF%E3%82%8B).
The G-means paper also reports experiments in which X-means overfits and overestimates the number of clusters, so this result feels a bit off... (I haven't dug deeper here.) At least as far as PyClustering is concerned, it seems better to use G-means than X-means.
I have summarized the G-means algorithm that was recently(?) implemented in PyClustering, along with the results of actually trying it out. In this quick test G-means gave better results than X-means, so when you just want to "try clustering for now!", G-means may be the better choice.
Phew. I somehow made the Advent Calendar deadline... Time for bed... wait, it's already morning!!!