k-means, a cluster analysis method, run with scikit-learn and implemented without scikit-learn

K-means is a standard clustering method. I have written about it before, but this time I'd like to show both how to run it with scikit-learn and, for study purposes, how to implement it without scikit-learn.

Data acquisition

We use the Iris dataset, which is commonly used as practice data for machine learning.

import urllib.request

url = "https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/iris.txt"

filename = url.split("/")[-1]
urllib.request.urlretrieve(url, filename)
('iris.txt', <http.client.HTTPMessage at 0x7fac1779a470>)
import pandas as pd

df = pd.read_csv(filename, delimiter="\t", index_col=0)
df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 0
2 4.9 3.0 1.4 0.2 0
3 4.7 3.2 1.3 0.2 0
4 4.6 3.1 1.5 0.2 0
5 5.0 3.6 1.4 0.2 0
... ... ... ... ... ...
146 6.7 3.0 5.2 2.3 2
147 6.3 2.5 5.0 1.9 2
148 6.5 3.0 5.2 2.0 2
149 6.2 3.4 5.4 2.3 2
150 5.9 3.0 5.1 1.8 2

150 rows × 5 columns

Let's use only the two leftmost columns (Sepal.Length and Sepal.Width).

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], s=50)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

kmeans_4_0.png

k-means with scikit-learn

scikit-learn, a well-known machine learning library, provides a class that performs k-means: sklearn.cluster.KMeans.

In k-means, the number of clusters n_clusters must be specified in advance.

from sklearn.cluster import KMeans

# Fit k-means with 3 clusters on the two leftmost columns
kmeans_model = KMeans(n_clusters=3).fit(df.iloc[:, :2])

That is all it takes to train the model. You can check the clustering result as follows.

kmeans_model.labels_
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1,
       2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1,
       1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2], dtype=int32)
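
Besides labels_, a fitted KMeans object also exposes the centroid coordinates as cluster_centers_ and can assign new points with predict. The following is a minimal sketch on a small made-up array (the points and the random_state value are assumptions for illustration, not the iris data), with n_init set explicitly so that the behavior does not depend on the default of the installed library version.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated groups of two points each
X = np.array([
    [0.0, 0.0], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9],
    [0.0, 5.0], [0.1, 5.2],
])

# random_state fixes the initialization so reruns give the same labels
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(model.labels_)           # cluster index assigned to each row of X
print(model.cluster_centers_)  # one (x, y) centroid per cluster

# predict() assigns unseen points to the nearest learned centroid
new_points = np.array([[0.1, 0.0], [5.0, 5.1]])
print(model.predict(new_points))
```

The label numbers themselves are arbitrary: which group ends up being called 0, 1, or 2 depends on the initialization.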

Let's draw the samples belonging to the same cluster in the same color.

%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=kmeans_model.labels_, s=50)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

kmeans_10_0.png

k-means without scikit-learn

scikit-learn is very convenient because you can use it without understanding what it does internally. But if you use it without understanding, you may not notice when you use it incorrectly, and you risk misinterpreting the results. So let's take a look at the basics of what is going on. The calculation follows the procedure below.

Assign labels at random

The final goal is for each label to represent a cluster, but we start with random labels.

import numpy as np

n_clusters = 3

# One random label in {0, ..., n_clusters - 1} per row
df['labels'] = np.random.randint(0, n_clusters, size=len(df))
df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species labels
1 5.1 3.5 1.4 0.2 0 2
2 4.9 3.0 1.4 0.2 0 0
3 4.7 3.2 1.3 0.2 0 2
4 4.6 3.1 1.5 0.2 0 1
5 5.0 3.6 1.4 0.2 0 2
... ... ... ... ... ... ...
146 6.7 3.0 5.2 2.3 2 0
147 6.3 2.5 5.0 1.9 2 0
148 6.5 3.0 5.2 2.0 2 2
149 6.2 3.4 5.4 2.3 2 2
150 5.9 3.0 5.1 1.8 2 2

150 rows × 6 columns
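
Because the initial labels are random, the table above will differ from run to run. If repeatable results are wanted, one option is a seeded random generator. This is a minimal sketch; the seed value 0 and the variable names are assumptions, not part of the original code.

```python
import numpy as np

n_clusters = 3
n_samples = 150  # same number of rows as the iris table

# Hypothetical seed; any fixed integer makes the labels repeatable
rng = np.random.default_rng(0)
labels = rng.integers(0, n_clusters, size=n_samples)

# Re-creating the generator with the same seed reproduces the same labels
labels_again = np.random.default_rng(0).integers(0, n_clusters, size=n_samples)

print(labels[:10])
print(np.array_equal(labels, labels_again))  # True
```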

%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

kmeans_13_0.png

Find the centroids

Compute the centroid (center of gravity) of the samples that share each label.

centroids = []
for i in range(n_clusters):
    # Mean of the two leftmost columns over the samples labeled i
    df_mean = df[df['labels'] == i].iloc[:, :2].mean()
    centroids.append(df_mean.tolist())

centroids
[[5.7, 2.9851851851851845],
 [5.765789473684211, 3.189473684210527],
 [5.968749999999999, 3.014583333333333]]
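
To make "center of gravity" concrete: the centroid of a cluster is simply the mean of each coordinate over the samples carrying that label. Here is a small worked example with made-up points (hypothetical values, not the iris data) that can be checked by hand.

```python
import numpy as np

# Hypothetical toy data: four 2-D points and their current labels
points = np.array([[1.0, 1.0], [3.0, 1.0], [10.0, 2.0], [12.0, 4.0]])
labels = np.array([0, 0, 1, 1])
n_clusters = 2

# The centroid of a cluster is the coordinate-wise mean of its members
centroids = np.array([points[labels == i].mean(axis=0)
                      for i in range(n_clusters)])

print(centroids)
# cluster 0: ((1 + 3) / 2, (1 + 1) / 2) = (2, 1)
# cluster 1: ((10 + 12) / 2, (2 + 4) / 2) = (11, 3)
```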

Let's plot each centroid as a star.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], 
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

kmeans_17_0.png

Redistribute labels

For each sample, find the nearest centroid and relabel the sample with that centroid's label.

def nearest_centroid(centroids, x, y):
    """Return the index of the centroid closest to the point (x, y)."""
    min_dist = float('inf')
    min_id = None
    for i, xy in enumerate(centroids):
        # Squared Euclidean distance is enough for comparing distances
        dist = (xy[0] - x)**2 + (xy[1] - y)**2
        if dist < min_dist:
            min_dist = dist
            min_id = i

    return min_id

# Relabel every sample with the index of its nearest centroid
df['labels'] = [
    nearest_centroid(centroids, x, y)
    for x, y in zip(df.iloc[:, 0], df.iloc[:, 1])
]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], 
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

kmeans_21_0.png
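
The same nearest-centroid assignment can also be written without an explicit Python loop by using NumPy broadcasting. The following is a sketch on made-up points and centroids (hypothetical values, not the iris data). Like nearest_centroid, it compares squared distances, since taking the square root would not change which centroid is closest.

```python
import numpy as np

# Hypothetical toy data: four 2-D points and two centroids
points = np.array([[1.0, 1.0], [3.0, 1.0], [10.0, 2.0], [12.0, 4.0]])
centroids = np.array([[2.0, 1.0], [11.0, 3.0]])

# diff[i, j] is the vector from centroid j to point i (shape: 4 x 2 x 2)
diff = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]

# Squared Euclidean distance from every point to every centroid (4 x 2)
sq_dist = (diff ** 2).sum(axis=2)

# For each point, the index of the closest centroid
labels = sq_dist.argmin(axis=1)
print(labels)  # [0 0 1 1]
```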

Find the centroids

Recompute the centroid of each label from the redistributed labels.

centroids = []
for i in range(n_clusters):
    # Mean of the two leftmost columns over the samples labeled i
    df_mean = df[df['labels'] == i].iloc[:, :2].mean()
    centroids.append(df_mean.tolist())

centroids
[[5.20204081632653, 2.808163265306123],
 [5.239393939393939, 3.6272727272727274],
 [6.598529411764702, 2.9602941176470594]]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], 
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

kmeans_24_0.png

Redistribute labels

Redistribute the labels based on the recalculated centroids.

# Relabel every sample with the index of its nearest centroid
df['labels'] = [
    nearest_centroid(centroids, x, y)
    for x, y in zip(df.iloc[:, 0], df.iloc[:, 1])
]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], 
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

kmeans_27_0.png

Find the centroids

Recalculate the centroids based on the redistributed labels.

centroids = []
for i in range(n_clusters):
    # Mean of the two leftmost columns over the samples labeled i
    df_mean = df[df['labels'] == i].iloc[:, :2].mean()
    centroids.append(df_mean.tolist())

centroids
[[5.219148936170212, 2.7851063829787237],
 [5.16969696969697, 3.6303030303030304],
 [6.579999999999997, 2.9700000000000006]]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], 
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

kmeans_30_0.png

Redistribute labels

Redistribute the labels based on the recalculated centroids.

# Relabel every sample with the index of its nearest centroid
df['labels'] = [
    nearest_centroid(centroids, x, y)
    for x, y in zip(df.iloc[:, 0], df.iloc[:, 1])
]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], 
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

kmeans_33_0.png

Find the centroids

Recalculate the centroids based on the redistributed labels.

centroids = []
for i in range(n_clusters):
    # Mean of the two leftmost columns over the samples labeled i
    df_mean = df[df['labels'] == i].iloc[:, :2].mean()
    centroids.append(df_mean.tolist())

centroids
[[5.283333333333333, 2.7357142857142853],
 [5.105263157894737, 3.573684210526316],
 [6.579999999999997, 2.9700000000000006]]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], 
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()

kmeans_36_0.png

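Repeat these two steps (redistributing the labels and recalculating the centroids) until the calculation settles down, that is, until the labels no longer change.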
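
What "settles down" means can be made concrete: the two steps are repeated until a pass leaves every label unchanged. The following is a minimal sketch on a handful of made-up points, using plain NumPy arrays rather than the DataFrame above. The data, the starting labels, and the cap of 100 iterations are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy data: two loose groups of 2-D points
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
n_clusters = 2

# Deliberately poor starting labels (assumed here instead of random ones
# so that the example is repeatable and no cluster starts out empty)
labels = np.array([0, 1, 0, 1, 0, 1])

for step in range(100):  # safety cap on the number of iterations
    # Centroid update: mean of the points carrying each label
    centroids = np.array([points[labels == i].mean(axis=0)
                          for i in range(n_clusters)])
    # Label update: index of the nearest centroid for every point
    sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    new_labels = sq_dist.argmin(axis=1)
    # Settled: another pass would leave every label unchanged
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels

print(step, labels)
```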

Task

For the code above that does not use scikit-learn, work on the following tasks.

  1. Improve the code so that the two steps are repeated automatically until the result settles down.
  2. Increase the number of clusters and check the behavior.
  3. Run it using the four leftmost columns, not just the two leftmost columns.
  4. With the above algorithm, even if you set, for example, n_clusters = 10, the final number of clusters may not be 10. Explain why, and suggest how to modify the algorithm to avoid this.
