Hierarchical clustering is a method that repeatedly finds the most similar combinations in the data and merges them in order. Because the clusters are built up step by step, the intermediate results form a hierarchical structure.
See the figure below for a concrete example. There are five data points A, B, C, D and E. We pick the closest of these five points and merge them into one cluster.
A,B,C,D,E
→ (A,B), C,D,E
In this case, the calculation judged A and B to be the closest pair, so they were merged into one cluster (A, B).
Next, treat the newly created cluster as a single data point and repeat the same process.
→ (A,B), (C,D), E
→ (A,B,C,D), E
→ (A,B,C,D,E)
Finally, when you reach a single cluster that contains all the data, the process is finished. The diagram that shows which clusters the data points were grouped into, as in the figure on the right below, is called a tree diagram (dendrogram).
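As a concrete illustration (a minimal sketch added here using SciPy; the point coordinates are made up for the example), five two-dimensional points can be merged step by step and the result drawn as a dendrogram:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five sample points corresponding to A, B, C, D, E
points = np.array([[0.0, 0.0],   # A
                   [0.3, 0.1],   # B
                   [2.0, 2.0],   # C
                   [2.2, 1.9],   # D
                   [5.0, 0.5]])  # E

Z = linkage(points, method="single")  # repeatedly merge the closest pair
dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.show()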
Non-hierarchical clustering, like hierarchical clustering, finds groups of similar data and creates clusters, but it does not produce a hierarchical structure.
Instead, the developer decides in advance how many clusters to divide the data into, and that number of clusters is created from the data.
However, the optimal number of clusters is not determined automatically for each data set. On the other hand, because no hierarchical structure is built, the amount of computation is small compared to hierarchical clustering, which makes this an effective method when the amount of data is large.
A typical method of non-hierarchical clustering is the k-means method.
To prepare sample data, let us first introduce the make_blobs function in sklearn.datasets, which generates data grouped into a specified number of clusters.
# Import the make_blobs function from sklearn.datasets
from sklearn.datasets import make_blobs

# X holds the plotted points (x, y); Y holds the cluster number each point belongs to
X, Y = make_blobs(n_samples=150,   # total number of data points
                  n_features=2,    # number of features (dimensions), default: 2
                  centers=3,       # number of clusters
                  cluster_std=0.5, # standard deviation within each cluster
                  shuffle=True,    # shuffle the samples
                  random_state=0)  # state of the random number generator
With the above code, X holds the data points and Y holds the label of the cluster each data point belongs to.
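As a quick check (a sketch added here, assuming the X and Y returned by make_blobs above), the generated points can be drawn as a scatter plot coloured by cluster label:

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=Y)  # colour each point by its cluster number
plt.show()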
A typical example of non-hierarchical clustering is the "k-means method".
The k-means method is a technique that divides the data into n clusters of equal variance. Each cluster is assigned a mean value μ, which is the center of gravity of its data.
This center of gravity is called a "centroid". To divide the data into clusters of equal variance,
an index called "SSE" is used.
SSE is the sum of the squared differences between the data points contained in each cluster and its centroid (it corresponds to the variance). The k-means method chooses the centroids so that this SSE is equalized across all clusters and minimized.
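One common way to write this (notation introduced here for reference) is

SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where C_i is the set of data points assigned to cluster i and \mu_i is its centroid.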
The k-means algorithm consists of three steps.
First, k data points (an arbitrary number) are extracted from the data set and used as the initial centroids. After initializing the centroids, the following two steps are repeated.
Assign all data points to the nearest centroid.
Next, for each of the k centroids, calculate the center of gravity of the data points assigned to it and use it as the new centroid.
At the end of step 3, calculate the distance between the previous centroids and the new centroids. When this distance has become sufficiently small, the iteration ends. In other words, the process is repeated until the centroids barely move when updated. A rough sketch of these steps in code is shown below.
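The following is a minimal NumPy sketch of the three steps above (written for illustration; the function name simple_kmeans is an assumption made here, not part of scikit-learn):

import numpy as np

def simple_kmeans(X, k, max_iter=300, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every data point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the center of gravity of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop when the centroids barely move any more
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids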
Below is an example of data clustered by the k-means method.
The KMeans class of sklearn.cluster finds the clusters in the data and assigns a cluster number to each data point.
# Import the KMeans class from sklearn.cluster
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3,      # number of clusters
            init="random",     # set the initial centroids randomly, default: "k-means++"
            n_init=10,         # number of k-means runs with different initial centroids
            max_iter=300,      # maximum number of iterations of the k-means algorithm
            tol=1e-04,         # relative tolerance used to decide convergence
            random_state=0)    # initialization of the random number generator
Y_km = km.fit_predict(X)       # pass the data and obtain the cluster number of each sample
With the above code, the specified number of clusters is found from the data, and the cluster number of each sample is automatically stored in Y_km. The KMeans class also has many other methods, as shown below.
#Perform clustering calculations
km.fit(X[, y])
#Performs clustering calculations, converts X to the metric space used for analysis, and returns
km.fit_transform(X[, y])
#Returns the parameters used in the calculation
km.get_params([deep])
#Returns the cluster number to which the X sample belongs
km.predict(X)
#Set parameters
km.set_params(**params)
#Convert X to the metric space used for analysis and return
km.transform(X[, y])
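For example (a usage sketch, assuming the km and X defined above), the model can be fitted once and then reused:

km.fit(X)                              # perform the clustering calculation
print(km.cluster_centers_)             # coordinates of the 3 centroids found
print(km.predict(X[:5]))               # cluster numbers of the first 5 samples
print(km.get_params()["n_clusters"])   # one of the parameters used in the calculation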
One of the performance evaluation metrics for clustering is
SSE (the within-cluster sum of squared errors).
By using SSE, the performance of different k-means clusterings can be evaluated. The SSE value indicates how far the values within each cluster are spread out.
In sklearn, you can obtain the SSE value through the inertia_ attribute of the KMeans class.
Since SSE is the sum of how far each data point deviates from the center of gravity (centroid) of the cluster it belongs to (i.e., the dispersion), the smaller the SSE value, the better the clustering model.
# Access the within-cluster sum of squared errors
print("Distortion: %.2f" % km.inertia_)
In k-means clustering there is the problem of how to decide the number of clusters to specify.
A technique that can help when determining the number of clusters is
the elbow method:
plot how the SSE changes as the number of clusters is increased,
and determine the number of clusters for k-means from the result.
As you can see by executing the code below, there is a point where the SSE value bends sharply. The number of clusters at this point can be regarded as optimal.
The method is called the elbow method because the shape of the plot looks like a bent elbow.
In practice, however, it is rare to obtain an elbow plot where the graph bends as cleanly as in the result of this problem.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate sample data
X, Y = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

distortions = []
for i in range(1, 11):  # compute for 1 to 10 clusters in one go
    km = KMeans(n_clusters=i,
                init="k-means++",  # choose the cluster centers by the k-means++ method
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(X)                        # perform the clustering
    distortions.append(km.inertia_)  # km.inertia_ is available after km.fit

# Plot the graph
plt.plot(range(1, 11), distortions, marker="o")
plt.xticks(np.arange(1, 11, 1))
plt.xlabel("Number of clusters")
plt.ylabel("Distortion")
plt.show()
DBSCAN
In contrast to the k-means method, there is another non-hierarchical clustering algorithm called
"DBSCAN".
The "DBSCAN" algorithm captures the positions of dense clusters (agglomerations of data) separately from low-density areas. It shows its true value when the cluster sizes and shapes are uneven.
The k-means method clusters the data so that it gathers as much as possible around the center of each cluster, so the clusters inevitably take on a roughly circular (spherical) shape. This works well when the sizes and shapes of the clusters are not biased, but when they are, k-means tends to be unable to produce a good clustering.
"DBSCAN" is defined by two parameters:
min_samples and eps.
The "DBSCAN" algorithm classifies data points into the following three types:
1.Core point
Min within radius eps of some data_Data points when there are as many data as the number of samples
2.Border point
Data that is not a core point but is within radius eps from the core point
3.Noise point
Data points that do not meet either
A cluster is formed from a collection of core points.
Border points are assigned to the cluster to which the closest core point belongs.
In this way, with the "DBSCAN" algorithm By classifying all data into 3 data Biased data and non-average clusters can now be categorized You can also remove noise correctly.
"DBSCAN" can use the DBSCAN class of sklearn.cluster. As the main parameter
eps、min_Specify the distance calculation method by sample and metric.
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.2,             # radius used to look for neighbouring points
            min_samples=5,       # minimum number of points within eps for a core point
            metric="euclidean")  # distance calculation method
Y_db = db.fit_predict(X)         # cluster number of each sample
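To see the three point types on the fitted model (a sketch, assuming the db and Y_db from above; in scikit-learn noise points receive the label -1 and the indices of the core points are stored in core_sample_indices_):

import numpy as np

print("core points :", len(db.core_sample_indices_))
print("noise points:", int(np.sum(Y_db == -1)))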
While the k-means method struggles with data that has complex shapes like this, DBSCAN can cluster data of arbitrary shape.
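As an illustration of this difference (a sketch added here; the half-moon data is generated with make_moons and is not the article's dataset), k-means splits the two crescent-shaped clusters poorly while DBSCAN separates them cleanly:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two crescent-shaped clusters that are not simple round blobs
X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

Y_kmeans = KMeans(n_clusters=2, random_state=0).fit_predict(X_moons)  # tends to cut the moons in half
Y_dbscan = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)        # recovers the two moons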