Clustering groups the samples of a data set according to some rule, without using correct labels. In machine learning, clustering is categorized as "unsupervised learning."
There are several ways to perform clustering, but all of them group samples based on the similarity between them. Clustering methods can be broadly divided into "hierarchical clustering" and "non-hierarchical clustering". The K-means method implemented this time is classified as "non-hierarchical clustering".
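It partitions the data into a number of clusters, k, that you specify in advance, using the mean (centroid) of each cluster. The outline of the algorithm of the K-means method is as follows.
1. Choose k initial cluster centers (centroids).
2. Assign each sample to the nearest centroid.
3. Recompute each centroid as the mean of the samples assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.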
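To make the iteration concrete, here is a minimal pure-Python sketch of the K-means loop. It is for illustration only and is not how scikit-learn implements it. The function names (`kmeans_sketch`, `squared_distance`, `mean_point`), the `init` parameter, and the toy data are all assumptions introduced for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points (sequences of numbers)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans_sketch(points, k, init=None, max_iter=100, seed=0):
    """Illustrative K-means loop; not scikit-learn's implementation."""
    # Start from the given centroids, or from k samples drawn at random.
    if init is not None:
        centroids = list(init)
    else:
        centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every sample to the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once no sample changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the samples assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids

# Toy data (made up): two well-separated groups of 2-D points.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2, init=[(1.0, 1.0), (5.0, 5.0)])
print(toy_labels)  # [0, 0, 0, 1, 1, 1]
```

The result of K-means depends on the initial centroids, which is why the scikit-learn calls later in this post set `init='k-means++'`, `n_init`, and `random_state`.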
The Python code using scikit-learn is below.
# Import the required libraries
import numpy as np
import pandas as pd
#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%matplotlib inline
sns.set_style('whitegrid')
#Class for normalization
from sklearn.preprocessing import StandardScaler
# Import KMeans for the k-means method
from sklearn.cluster import KMeans
The code above imports the required libraries. This time I will implement K-means using the iris data set.
#iris data
from sklearn.datasets import load_iris
# Load the data
iris = load_iris()
iris.keys()
#Store in data frame
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target #Types of irises (correct label)
df_iris.head()
#Scatter plot of 2 variables (color coded by correct label)
plt.scatter(df_iris['petal length (cm)'], df_iris['petal width (cm)'], c=df_iris.target, cmap=mpl.cm.jet)
plt.xlabel('petal_length')
plt.ylabel('petal_width')
The scatter plot above visualizes the data with two variables, "petal_length" and "petal_width". Next, I would like to visualize all the variables with a scatter plot matrix.
#Scatterplot matrix (color coded by correct label)
sns.pairplot(df_iris, hue='target', height=1.5)
Next, I would like to determine the number of clusters using the elbow method. We already know that the iris data should be divided into three groups, but in practice clustering is unsupervised learning, so you have to decide the number of clusters yourself. The elbow method is one way to do this: fit the model for a range of cluster counts, plot the within-cluster sum of squares (WCSS) against the number of clusters, and choose the point where the curve bends and further clusters bring little improvement.
# Elbow Method
wcss = []
for i in range(1, 10):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=30, random_state=0)
    # Columns 2:4 are petal length and petal width
    kmeans.fit(df_iris.iloc[:, 2:4])
    # inertia_ is the WCSS of the fitted model
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 10), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Looking at the results of the elbow method, you can see that the WCSS decreases only slightly once the number of clusters exceeds 3, so there is little benefit in using more than 3 clusters.
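For reference, the WCSS used in the elbow plot (exposed by scikit-learn as `inertia_`) is the sum of squared distances from each sample to the centroid of its cluster. The following is a small sketch of that calculation in plain Python. The function name `wcss_sketch` and the one-dimensional data are made up for illustration.

```python
def wcss_sketch(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total

# Made-up 1-D data with two obvious groups.
points = [(0.0,), (2.0,), (9.0,), (11.0,)]

# One cluster: every point belongs to the overall mean, 5.5.
print(wcss_sketch(points, [0, 0, 0, 0], [(5.5,)]))           # 85.0

# Two clusters with centroids at 1.0 and 10.0.
print(wcss_sketch(points, [0, 0, 1, 1], [(1.0,), (10.0,)]))  # 4.0
```

Going from one cluster to two reduces the WCSS sharply here. The elbow is the point after which adding more clusters no longer produces a drop of that kind.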
I would like to start modeling from here.
#modeling
clf = KMeans(n_clusters=3, random_state=1)
clf.fit(df_iris.iloc[:, 2:4])
# Cluster numbers of the training data
clf.labels_
# Assign a cluster number to unknown data
# This time we are predicting on the training data, so the result is the same as `clf.labels_`
y_pred = clf.predict(df_iris.iloc[:, 2:4])
y_pred
#Compare the actual type with the result of clustering
fig, (ax1, ax2) = plt.subplots(figsize=(16, 4), ncols=2)
#Actual type distribution
ax1.scatter(df_iris['petal length (cm)'], df_iris['petal width (cm)'], c=df_iris.target, cmap=mpl.cm.jet)
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')
ax1.set_title('Actual')
#Distribution of clusters classified by cluster analysis
ax2.scatter(df_iris['petal length (cm)'], df_iris['petal width (cm)'], c=y_pred, cmap=mpl.cm.jet)
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')
ax2.set_title('Predict')
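Note that the cluster numbers K-means returns are arbitrary: cluster 0 does not necessarily correspond to target 0, so the colors in the two plots may not match even when the grouping itself agrees. One possible way to compare them is to map each cluster to the most frequent correct label inside it. The sketch below uses only the standard library. The helper `map_clusters_to_labels` and the example labels are hypothetical and are not actual output from the iris data.

```python
from collections import Counter

def map_clusters_to_labels(true_labels, cluster_ids):
    """Map each cluster ID to the most frequent true label inside that cluster."""
    counts = Counter(zip(cluster_ids, true_labels))
    mapping = {}
    for cluster in sorted(set(cluster_ids)):
        # Count of each true label among the samples in this cluster.
        candidates = {lab: n for (c, lab), n in counts.items() if c == cluster}
        mapping[cluster] = max(candidates, key=candidates.get)
    return mapping

# Hypothetical labels for illustration only.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cluster_ids = [1, 1, 1, 0, 0, 2, 2, 2, 2]

mapping = map_clusters_to_labels(true_labels, cluster_ids)
print(mapping)  # {0: 1, 1: 0, 2: 2}

# Relabel the clusters, then measure the fraction of samples that agree.
aligned = [mapping[c] for c in cluster_ids]
agreement = sum(a == t for a, t in zip(aligned, true_labels)) / len(true_labels)
print(agreement)
```

With the objects in this post, the same idea could be applied by passing `df_iris['target']` and `y_pred`; `pd.crosstab(df_iris['target'], y_pred)` would show the same counts as a table.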
Thank you for reading to the end. This time, I implemented the K-means method.
If you find anything that needs correcting, I would appreciate it if you could let me know.