Clustering with scikit-learn (2)

Significance and purpose of clustering

Yesterday, I gave an overview of clustering and walked through the steps of actually running it with scikit-learn.

Clustering by scikit-learn (1)

Let's go back to the basics and explore what clustering is in the first place.

In many machine learning algorithms, features are represented as vectors. In linear algebra, a set on which vector addition and scalar multiplication are defined is called a vector space, and its elements are called vectors.

Roughly speaking, clustering is a method of computing how similar feature vectors are to one another and grouping the similar ones together.
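
To make "how similar features are" concrete, here is a minimal sketch that compares feature vectors with two common measures, Euclidean distance and cosine similarity. The vectors p, q, and r and both helper functions are made up for illustration; they do not come from yesterday's code.

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors (smaller = more similar)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (closer to 1 = more similar)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical feature vectors: p and q are nearly identical, r is very different
p = [1.0, 2.0, 3.0]
q = [1.1, 2.1, 2.9]
r = [9.0, 0.5, 0.1]

d_pq = euclidean_distance(p, q)
d_pr = euclidean_distance(p, r)
print(d_pq, d_pr)

s_pq = cosine_similarity(p, q)
s_pr = cosine_similarity(p, r)
print(s_pq, s_pr)
```

A clustering algorithm would tend to put p and q in the same group, because under both measures they are closer to each other than either is to r.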

Whether the original data is text or images, once its patterns have been recognized and reduced to features, the data can be grouped without labeled training data (that is, without supervision).

For example, it can be applied to tasks such as grouping questionnaire answers from a large number of anonymous respondents by similarity, or extracting the skin-colored regions of an image.

Calculation of similarity

As you can see from the discussion so far, the key to clustering is how to measure the similarity between sets.

I'll walk you through the code, following the clustering section of the scikit-learn tutorial.


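from sklearn import metrics

# Compare ground-truth labels with predicted cluster labels.
# The values shown are from the author's run; newer scikit-learn releases changed
# the default average_method of adjusted_mutual_info_score, so that number may differ.
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))
# => 0.225042310598

# A poor labeling can score below zero
labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))
# => -0.105263430575

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Homogeneity: each cluster contains only members of a single class
print(metrics.homogeneity_score(labels_true, labels_pred))
# => 0.666666666667

# Completeness: all members of a class are assigned to the same cluster
print(metrics.completeness_score(labels_true, labels_pred))
# => 0.420619835714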

As you can see, scikit-learn provides a variety of scores for measuring how similar two labelings are.
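
As a small follow-up sketch, scikit-learn also offers metrics.v_measure_score, which with its default settings is the harmonic mean of the homogeneity and completeness shown above. The example below checks that relationship, and also checks that renaming the cluster IDs does not change the score. The relabeled list is made up for illustration.

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

h = metrics.homogeneity_score(labels_true, labels_pred)
c = metrics.completeness_score(labels_true, labels_pred)
v = metrics.v_measure_score(labels_true, labels_pred)

# With the default beta=1.0, V-measure is the harmonic mean of h and c
harmonic_mean = 2 * h * c / (h + c)
print(v, harmonic_mean)

# Same grouping as labels_pred, but with the cluster IDs renamed (0->2, 1->0, 2->1)
relabeled = [2, 2, 0, 0, 1, 1]
v_relabeled = metrics.v_measure_score(labels_true, relabeled)
print(v_relabeled)
```

Because these scores depend only on which samples are grouped together, not on the label numbers themselves, they are convenient for checking clustering output, where cluster IDs are arbitrary.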

Clustering

Let's try clustering with yesterday's code. scikit-learn ships with sample datasets, so we will use one of them as is. First, prepare the dataset.


from sklearn import metrics
from sklearn.metrics import pairwise_distances
from sklearn import datasets

dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

# Take a peek at the contents
print(X)
print(y)
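
Printing the raw arrays produces a lot of output, so as a supplementary sketch, here is a more compact way to look at the same dataset. It uses only attributes of the object returned by load_iris; the variable names are my own.

```python
import numpy as np
from sklearn import datasets

dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

# Number of samples and number of features per sample
print(X.shape)

# Names of the four measured features and the three species
print(dataset.feature_names)
print(dataset.target_names)

# How many samples belong to each class
class_counts = np.bincount(y)
print(class_counts)
```

Each row of X is one feature vector, and y holds the true species label. Clustering itself never uses y; it only serves to evaluate the result afterwards.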

Let's cluster with yesterday's code.


import numpy as np
from sklearn.cluster import KMeans

# Fit k-means with 3 clusters; random_state fixes the random initialization
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_

# Compute the silhouette coefficient, using Euclidean distance between samples
print(metrics.silhouette_score(X, labels, metric='euclidean'))

# Cluster using yesterday's code
# (make_cluster and write_cluster are yesterday's helpers and must already be defined)
clusters = make_cluster(X)

# Output the result to a file
write_cluster(clusters, 'out.txt')

# Take a peek at the generated clusters
print(clusters)
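
Since make_cluster and write_cluster come from yesterday's post, here is a self-contained sketch that runs without them. It clusters the iris data with KMeans and then scores the result in two ways: against the true labels with the adjusted Rand index, and without labels using the silhouette coefficient. Passing n_init=10 explicitly and choosing the adjusted Rand index are my own choices for this sketch, not part of the original code.

```python
from sklearn import datasets, metrics
from sklearn.cluster import KMeans

# Load features and true species labels
X, y = datasets.load_iris(return_X_y=True)

# n_init=10 runs k-means from 10 initializations and keeps the best one
model = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X)
labels = model.labels_

# External evaluation: agreement with the true labels (1.0 = identical grouping)
ari = metrics.adjusted_rand_score(y, labels)

# Internal evaluation: cohesion vs. separation, no true labels needed (range -1 to 1)
sil = metrics.silhouette_score(X, labels, metric='euclidean')

print(ari, sil)
```

In real clustering work the true labels are usually unavailable, so a label-free score such as the silhouette coefficient is often the only practical option.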

Discussion

With a powerful clustering library, once the features of the target have been extracted through pattern recognition, grouping becomes easy, and the approach can be applied to a wide range of fields.
