I implemented the K-means method (clustering method)

What is clustering?

Clustering is the task of dividing a set of samples into groups according to some rule, so that similar samples end up in the same group. In machine learning, clustering is categorized as "unsupervised learning."

There are several clustering algorithms, but all of them group samples based on the similarity between them. They can be broadly divided into "hierarchical clustering" and "non-hierarchical clustering". The K-means method implemented here is a "non-hierarchical clustering" method.
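To make the two families concrete, here is a small sketch that runs one method of each kind on the same toy data. The toy array and variable names are assumptions made for this illustration; `AgglomerativeClustering` stands in for hierarchical clustering and `KMeans` for non-hierarchical clustering.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Toy data (assumed for illustration): two well-separated groups of three points
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [4.0, 4.0], [4.2, 4.1], [4.1, 3.9]])

# Hierarchical: repeatedly merges the closest clusters, then cuts the tree at 2 clusters
hier_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# Non-hierarchical: partitions the data directly into a fixed number of clusters
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(hier_labels)
print(km_labels)
```

On data this clearly separated both approaches should find the same two groups, although the cluster numbers themselves are arbitrary and may be swapped.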

What is the K-means method?

It is a method that partitions the data into a given number of clusters, k, using the mean (centroid) of each cluster. The outline of the K-means algorithm is as follows.

  1. Choose k initial values for the cluster centers.
  2. Compute the distance from every sample to each of the k cluster centers, and assign each sample to the cluster with the nearest center.
  3. Recompute the center of each of the k clusters as the mean of its samples.
  4. Repeat steps 2 and 3 until the centers no longer change.
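The four steps above can be sketched directly in NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this post; the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch following the four steps (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance from every sample to every center, then nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new center = mean of the samples assigned to it
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (assumed): two well-separated blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = kmeans_sketch(X, k=2)
print(labels)
print(centers)
```

Random initialization from the samples is the simplest choice; scikit-learn defaults to `k-means++`, which spreads the initial centers out and usually converges to a better solution.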

[Figure: Screenshot 2021-01-06 12.58.05.png]

Implementation of K-means method

The Python code is below.

# Import the required libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%matplotlib inline
sns.set_style('whitegrid')

# Class for standardization
from sklearn.preprocessing import StandardScaler

# Import KMeans for the k-means method
from sklearn.cluster import KMeans

First, import the required libraries. For this implementation, I will use the iris dataset.

# iris data
from sklearn.datasets import load_iris

# Load the data
iris = load_iris()
iris.keys()

# Store in a DataFrame
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target  # Species of iris (ground-truth label)
df_iris.head()

# Scatter plot of two variables (color-coded by ground-truth label)
plt.scatter(df_iris['petal length (cm)'], df_iris['petal width (cm)'], c=df_iris.target, cmap=mpl.cm.jet)
plt.xlabel('petal_length')
plt.ylabel('petal_width')

[Figure: scatter plot of two variables]

I visualized the data with two variables, "petal_length" and "petal_width". Next, I will visualize all the variables with a scatter plot matrix.

# Scatter plot matrix (color-coded by ground-truth label)
sns.pairplot(df_iris, hue='target', height=1.5)

[Figure: scatter plot matrix]

Next, I would like to determine the number of clusters using the elbow method. We already know that the iris data should be divided into three groups, but in practice clustering is unsupervised learning, so you have to decide the number of clusters yourself. The elbow method is one way to do this: plot the within-cluster sum of squares (WCSS) against the number of clusters and look for the point (the "elbow") where the decrease levels off.

# Elbow method
wcss = []

for i in range(1, 10):
    # Fit K-means with i clusters on petal length and petal width
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=30, random_state=0)
    kmeans.fit(df_iris.iloc[:, 2:4])
    # inertia_ is the within-cluster sum of squares (WCSS)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 10), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

[Figure: elbow method]

Looking at the result of the elbow method, the WCSS drops sharply up to 3 clusters and decreases only slightly after that, so there is little benefit in using more than 3 clusters.
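For reference, the WCSS plotted by the elbow method is the sum, over all samples, of the squared distance between a sample and the center of the cluster it was assigned to. The following sketch recomputes that value by hand and compares it with scikit-learn's `inertia_`. The variable names and the choice of `n_init=10` are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, 2:4]  # petal length and petal width

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the center assigned to each sample, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
wcss_manual = float(np.sum((X - assigned_centers) ** 2))

# The two values should agree up to floating-point error
print(wcss_manual, km.inertia_)
```

Because adding clusters can only keep the WCSS the same or lower it, the curve always trends downward; the elbow is a judgment about where the improvement stops being worthwhile.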

Now let's move on to modeling.

# Modeling
clf = KMeans(n_clusters=3, random_state=1)
clf.fit(df_iris.iloc[:, 2:4])

# Cluster number assigned to each training sample
clf.labels_

# Assign cluster numbers to unknown data
# Here we predict on the training data, so the result is the same as `clf.labels_`
y_pred = clf.predict(df_iris.iloc[:, 2:4])
y_pred
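`StandardScaler` is imported at the top of this post but is not used in the modeling code. As an optional variation (not what this post does), the features can be standardized before clustering, which matters when the variables are on very different scales because K-means relies on Euclidean distance. A minimal sketch, with variable names assumed for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data[:, 2:4]  # petal length and petal width

# Rescale each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Cluster the standardized features instead of the raw ones
km_scaled = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_scaled)
print(km_scaled.labels_[:10])
```

Both petal measurements here are in centimeters, so the effect on this dataset is likely modest; the step becomes more important when features use different units.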

# Compare the actual species with the clustering result
# Note: cluster numbers are arbitrary, so the colors in the two plots may not match
fig, (ax1, ax2) = plt.subplots(figsize=(16, 4), ncols=2)

# Distribution of the actual species
ax1.scatter(df_iris['petal length (cm)'], df_iris['petal width (cm)'], c=df_iris.target, cmap=mpl.cm.jet)
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')
ax1.set_title('Actual')
# Distribution of the clusters found by K-means
ax2.scatter(df_iris['petal length (cm)'], df_iris['petal width (cm)'], c=y_pred, cmap=mpl.cm.jet)
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')
ax2.set_title('Predict')

[Figure: result.png]
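Besides comparing the two plots by eye, the agreement between the clusters and the actual species can be checked numerically. Cluster numbers are arbitrary, so comparing them to the labels element by element is not meaningful; a cross-tabulation and the adjusted Rand index are two options that do not depend on the numbering. This is a sketch under those assumptions, with `n_init=10` chosen for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X = iris.data[:, 2:4]  # petal length and petal width
y_pred = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Rows: actual species, columns: cluster numbers (arbitrary IDs)
table = pd.crosstab(iris.target, y_pred, rownames=['species'], colnames=['cluster'])
print(table)

# Adjusted Rand index: 1.0 means the two partitions are identical,
# values near 0 mean agreement no better than chance
ari = adjusted_rand_score(iris.target, y_pred)
print(ari)
```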

In conclusion

Thank you for reading to the end. This time, I implemented the K-means method.

If you find anything that needs correcting, I would appreciate it if you let me know.
