Clustering with scikit-learn (2)

Significance and purpose of clustering

Yesterday, I gave an overview of clustering and walked through the actual steps of clustering with scikit-learn.

Let's go back to the basics and explore what clustering is in the first place.

In many machine learning algorithms, features are represented as vectors. In linear algebra, a set on which addition and scalar multiplication are defined is called a vector space, and its elements are called vectors.

Roughly speaking, clustering is a method of calculating how similar feature vectors are to one another and grouping the similar ones together.
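
To make "how similar features are" concrete, here is a minimal sketch of two common ways to compare feature vectors: Euclidean distance (smaller means more similar) and cosine similarity (closer to 1 means more similar). The helper names and the three sample vectors are made up for illustration and do not come from yesterday's code.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature vectors (illustrative values only)
p = [1.0, 2.0]
q = [1.5, 2.5]
r = [8.0, 0.5]

# p and q are near each other, r is far from both
print(euclidean_distance(p, q))
print(euclidean_distance(p, r))
print(cosine_similarity(p, q))
print(cosine_similarity(p, r))
```

A clustering algorithm would tend to put p and q in the same group and r in a different one, because the distance between p and q is much smaller than the distance from either of them to r.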

Regardless of whether the original data is text or images, once its patterns have been recognized and reduced to features, grouping can be performed without supplying labeled training data (that is, without supervision).

For example, it can be applied to a variety of tasks, such as grouping similar respondents among questionnaire answers collected from a large, unspecified set of people, or extracting the skin-colored regions of an image.

Calculation of similarity

Having read this far, you can see that the key to clustering is how to measure the similarity of sets.

I'll walk through the code, following the Clustering section of the scikit-learn tutorial.


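from sklearn import metrics

# Adjusted mutual information between the true labels and the predicted labels.
# The outputs below are from the original run; the exact value can differ
# between scikit-learn versions.
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))
# => 0.225042310598

labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))
# => -0.105263430575

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Homogeneity: does each predicted cluster contain members of only one true class?
print(metrics.homogeneity_score(labels_true, labels_pred))
# => 0.666666666667

# Completeness: are all members of a true class assigned to the same cluster?
print(metrics.completeness_score(labels_true, labels_pred))
# => 0.420619835714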

As you can see, scikit-learn provides a variety of scores for measuring how similar two groupings are.
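
One property of these scores is worth seeing in code: they compare the groupings themselves, not the label numbers. The sketch below swaps the cluster names and checks that the scores still report a perfect match. It also uses adjusted_rand_score and homogeneity_completeness_v_measure, which are not used elsewhere in this post but belong to the same sklearn.metrics module.

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
# Same grouping as labels_true, but with the cluster names swapped
labels_swapped = [1, 1, 1, 0, 0, 0]

ari = metrics.adjusted_rand_score(labels_true, labels_swapped)
hom, com, v = metrics.homogeneity_completeness_v_measure(labels_true, labels_swapped)
print(ari, hom, com, v)  # every score is 1.0 for an identical grouping

# The V-measure is the harmonic mean of homogeneity and completeness
labels_pred = [0, 0, 1, 1, 2, 2]
h2, c2, v2 = metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)
print(h2, c2, v2)
```

This is why these scores suit clustering: a clustering algorithm has no way of knowing which number the "true" labels assigned to each group.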

Clustering

Let's try clustering with yesterday's code. scikit-learn ships with sample datasets, so we will use one of them as is. First, prepare the dataset.


from sklearn import metrics
from sklearn.metrics import pairwise_distances
from sklearn import datasets

dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

# Take a peek at the contents
print(X)
print(y)
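
Printing the raw arrays produces a lot of output, so a short summary can be easier to read. The following sketch loads the same iris dataset and prints its shape, feature names, class names, and the number of samples per class.

```python
import numpy as np
from sklearn import datasets

dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

print(X.shape)                # (150, 4): 150 samples, 4 features each
print(dataset.feature_names)  # names of the 4 measured features
print(dataset.target_names)   # names of the 3 iris species
print(np.bincount(y))         # [50 50 50]: samples per class
```

Each row of X is one feature vector, and y holds the true class of each row. The clustering step does not use y; it is useful only for checking the result afterwards.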

Let's cluster with yesterday's code.


import numpy as np
from sklearn.cluster import KMeans

# Cluster the iris data into 3 groups with k-means
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_

# Compute the mean silhouette coefficient, using Euclidean distance
print(metrics.silhouette_score(X, labels, metric='euclidean'))

# Cluster using yesterday's code
# (make_cluster and write_cluster are the functions defined in yesterday's post)
clusters = make_cluster(X)

# Output the result to a file
write_cluster(clusters, 'out.txt')

# Take a peek at the generated clusters
print(clusters)
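
Because the iris dataset comes with true class labels, the scores introduced earlier can also be used to check how closely the k-means result matches the actual species. The following is a minimal sketch, not part of yesterday's code. Passing n_init explicitly is an assumption made here so that the run behaves the same across scikit-learn versions.

```python
from sklearn import datasets, metrics
from sklearn.cluster import KMeans

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

# n_init is set explicitly (an assumption of this sketch) for consistent behavior
kmeans_model = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X)
labels = kmeans_model.labels_

# Compare the clusters found by k-means with the true species labels
ari = metrics.adjusted_rand_score(y, labels)
hom = metrics.homogeneity_score(y, labels)
com = metrics.completeness_score(y, labels)
print(ari, hom, com)
```

Values close to 1 mean the clusters line up well with the species. The true labels are used here only for evaluation, never for the clustering itself.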

Consideration

With a powerful clustering library, once the features of the target have been extracted through pattern recognition, grouping is easy to perform and can be applied in many fields.