This article is the entry for day 6 of the Freee Data People Advent Calendar 2019. I started writing in the middle of the night before the deadline, giggling to myself as I went.
Do you know a library called PyClustering? PyClustering is a clustering-focused library usable from both Python and C++. An algorithm called G-means was newly implemented in PyClustering v0.9.2. It was the first time I had seen the name G-means, and I couldn't find any article about it in Japanese, so I looked it up and summarized it here. The algorithm itself is simple, so it may actually be easiest to read the paper directly.
G-means
G-means is an extension of K-means: it automatically determines the number of clusters, which is a parameter you have to supply for K-means. A similar method called X-means exists, and the article "I examined the X-means method that automatically estimates the number of clusters" introduces it clearly together with code. (Incidentally, X-means is also available in PyClustering.)
The algorithm is as follows. Like X-means, it starts from a small number of clusters and keeps subdividing them; only the stopping condition differs.
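Roughly speaking, G-means repeats the following: run K-means with the current centers, then for each cluster test whether its samples look Gaussian using the Anderson-Darling test (on a one-dimensional projection, described later); clusters that pass keep their center, clusters that fail have their center replaced by two new centers, and the loop stops once every cluster passes. Below is a minimal sketch of that loop, written with scikit-learn and scipy purely for illustration; it is not PyClustering's implementation, and the helper names, the `min_obs` guard, the `max_k` cap, and the use of scipy's fixed 1% critical value instead of the paper's $\alpha$ are my own choices.

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans


def looks_gaussian(points, critical_idx=4):
    """Anderson-Darling normality test on a 1-D projection of the cluster.

    critical_idx=4 picks the 1% level from scipy's fixed critical values
    (15%, 10%, 5%, 2.5%, 1%).
    """
    # Tentatively split the cluster in two and project every sample onto the
    # line connecting the two tentative centers.
    two = KMeans(n_clusters=2, n_init=10).fit(points)
    c1, c2 = two.cluster_centers_
    v = c1 - c2
    projected = points @ v / (v @ v)
    result = anderson(projected, dist="norm")
    return result.statistic < result.critical_values[critical_idx]


def gmeans_sketch(X, max_k=20, min_obs=8):
    centers = [X.mean(axis=0)]  # start with a single cluster
    while len(centers) <= max_k:
        km = KMeans(n_clusters=len(centers), init=np.array(centers), n_init=1).fit(X)
        new_centers, split_happened = [], False
        for i, c in enumerate(km.cluster_centers_):
            members = X[km.labels_ == i]
            if len(members) < min_obs or looks_gaussian(members):
                new_centers.append(c)  # looks Gaussian: keep this center
            else:
                # Not Gaussian: replace the center with two child centers.
                child = KMeans(n_clusters=2, n_init=10).fit(members)
                new_centers.extend(child.cluster_centers_)
                split_happened = True
        centers = new_centers
        if not split_happened:
            break
    return KMeans(n_clusters=len(centers), init=np.array(centers), n_init=1).fit(X)
```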
[^1]: The G in G-means apparently stands for Gaussian.
[^2]: The paper uses $\alpha = 0.0001$, but PyClustering seems to use $\alpha = 0.01$. Since the paper argues for keeping the probability of a Type I error (false positive) small, I do wonder whether this implementation choice is fine.
That's all for the algorithm itself, but the paper also covers:

- how to choose the new centers when a cluster is split in step 4, and
- a trick that makes the Anderson-Darling test easier to apply.
How to choose the initial values of the new centers when splitting a cluster

- If the current center of the cluster is $\mathbf{c}$, the two points $\mathbf{c} \pm \mathbf{m}$ become the initial values of the new centers.
- Two ways of choosing $\mathbf{m}$ are proposed:
  - pick it at random, or
  - run principal component analysis on the samples in the cluster and, with the first principal component $\mathbf{s}$ and its eigenvalue (the variance along that component) $\lambda$, set $\mathbf{m} = \mathbf{s} \sqrt{2 \lambda / \pi}$ (see the sketch below). [^3]
[^3]: The intuition, I think, is to split along the axis in which the samples are most spread out.
The paper adopted the second method (the one using principal component analysis), but PyClustering [seems to adopt the first one](https://github.com/annoviko/pyclustering/blob/3457a2457c9aa385f625d628cec19dd6fada8326/pyclustering/cluster/gmeans.py#L331).
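As a rough illustration of the PCA-based choice, the split centers could be computed like the following sketch (using scikit-learn's PCA; the function name is mine and this is not PyClustering's code):

```python
import numpy as np
from sklearn.decomposition import PCA


def split_centers(points):
    """Initial centers c + m and c - m for a cluster that is about to be split.

    m = s * sqrt(2 * lambda / pi), where s is the first principal component
    (a unit vector) and lambda is its eigenvalue (the variance along s).
    """
    c = points.mean(axis=0)
    pca = PCA(n_components=1).fit(points)
    s = pca.components_[0]            # first principal component
    lam = pca.explained_variance_[0]  # its eigenvalue / variance
    m = s * np.sqrt(2.0 * lam / np.pi)
    return c + m, c - m
```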
Making the Anderson-Darling test easier to apply

Running a statistical test on the samples as-is is hard in high dimensions, so they are projected down to one dimension. Specifically, a sample $\mathbf{x}$ is projected onto the one-dimensional value $x^{\prime}$ as follows.
$$
x^{\prime} = \frac{\langle \mathbf{x}, \mathbf{v} \rangle}{\| \mathbf{v} \|^2}
$$
Here $\mathbf{v} = \mathbf{c}_1 - \mathbf{c}_2$, where $\mathbf{c}_1$ and $\mathbf{c}_2$ are the two centers chosen as described above in "How to choose the initial values of the new centers when splitting a cluster".
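For concreteness, here is a self-contained sketch of this projection followed by the Anderson-Darling test in scipy (the data and the centers `c1`, `c2` are made up for illustration; note also that scipy only exposes critical values at fixed significance levels rather than an arbitrary $\alpha$):

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
# Made-up example: samples of one cluster plus two tentative centers c1, c2.
X_cluster = rng.normal(size=(200, 2))
c1 = np.array([0.5, 0.0])
c2 = np.array([-0.5, 0.0])

v = c1 - c2                             # v = c1 - c2
x_proj = X_cluster @ v / (v @ v)        # project each sample onto one dimension
result = anderson(x_proj, dist="norm")  # Anderson-Darling test for normality
# The cluster is treated as Gaussian (and not split) when the statistic stays
# below the critical value at the chosen significance level.
print(result.statistic)
print(result.critical_values)    # at 15%, 10%, 5%, 2.5%, 1%
print(result.significance_level)
```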
Now that we know the algorithm, let's actually run clustering with PyClustering. For now I used the same data as [this article](https://qiita.com/deaikei/items/8615362d320c76e2ce0b#%E3%81%95%E3%82%89%E3%81%AB%E3%83%A2%E3%83%A4%E3%83%83%E3%81%A8%E3%81%97%E3%81%9F%E3%83%87%E3%83%BC%E3%82%BF%E3%82%92%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0%E3%81%95%E3%81%9B%E3%81%A6%E3%81%BF%E3%82%8B).
from sklearn.datasets import make_blobs

# Generate toy data: 500 samples in 2 dimensions drawn from 8 blobs.
X, Y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=8,
                  cluster_std=1.5,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)

import seaborn as sns

sns.set(style="whitegrid")
sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=Y, legend="full"
);
Since `centers=8` was specified, naturally eight colors show up.
Trying G-means
Now for the main subject: what happens when we actually use G-means?
from pyclustering.cluster import gmeans
import numpy as np
import itertools
gmeans_instance = gmeans.gmeans(X).process()
clusters = gmeans_instance.get_clusters()
centers = gmeans_instance.get_centers()
labels_size = len(
    list(itertools.chain.from_iterable(clusters))
)

# Build a flat label array: labels[i] = index of the cluster containing sample i.
labels = np.zeros(labels_size)
for n, n_th_cluster in enumerate(clusters):
    for img_num in n_th_cluster:
        labels[img_num] = n

sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=labels, legend="full"
)
Oh. The number of clusters isn't eight, but the data is grouped in a way that feels quite natural. Isn't this pretty good?
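By the way, if you want to check exactly how many clusters G-means settled on, the `clusters` (and `centers`) lists obtained in the snippet above tell you directly:

```python
# Continuing from the G-means snippet above: how many clusters were found?
print(len(clusters), len(centers))
```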
If you want to cluster in a situation where you have no idea how many clusters you have, using G-means seems like a good option.
Finally, let's pit PyClustering's X-means against it and compare the results.
from pyclustering.cluster import xmeans
import numpy as np
import itertools
xmeans_instance = xmeans.xmeans(X).process()
clusters = xmeans_instance.get_clusters()
centers = xmeans_instance.get_centers()
labels_size = len(
    list(itertools.chain.from_iterable(clusters))
)

# Same as before: labels[i] = index of the cluster containing sample i.
labels = np.zeros(labels_size)
for n, n_th_cluster in enumerate(clusters):
    for img_num in n_th_cluster:
        labels[img_num] = n

sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=labels, legend="full"
)
Hmm? This is completely different from the [X-means article I referenced](https://qiita.com/deaikei/items/8615362d320c76e2ce0b#%E3%81%95%E3%82%89%E3%81%AB%E3%83%A2%E3%83%A4%E3%83%83%E3%81%A8%E3%81%97%E3%81%9F%E3%83%87%E3%83%BC%E3%82%BF%E3%82%92%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0%E3%81%95%E3%81%9B%E3%81%A6%E3%81%BF%E3%82%8B).
The G-means paper also reports experiments in which X-means overfits and overestimates the number of clusters, so this result feels a bit off... (I haven't dug deeper here.) At least as far as PyClustering is concerned, it seems better to use G-means than X-means.
I have summarized the G-means algorithm that was recently(?) implemented in PyClustering, along with the results of actually trying it out. In this quick test G-means gave better results than X-means, so when you just want to "try clustering for now!", G-means may be the better choice.
Phew. I somehow made the Advent Calendar deadline... Time for bed... wait, it's already morning!!!