K-means is a classic clustering method. I've written about it before, but this time I'd like to show how to run it with scikit-learn, and then how to implement it without scikit-learn as a learning exercise.
We'll use the Iris dataset, which is often used as practice data for machine learning.
import urllib.request
url = "https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/iris.txt"
filename = url.split("/")[-1]
urllib.request.urlretrieve(url, filename)
('iris.txt', <http.client.HTTPMessage at 0x7fac1779a470>)
import pandas as pd
df = pd.read_csv(filename, delimiter="\t", index_col=0)
df
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
2 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
3 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
4 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
5 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
... | ... | ... | ... | ... | ... |
146 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
147 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
148 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
149 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |
150 | 5.9 | 3.0 | 5.1 | 1.8 | 2 |
150 rows × 5 columns
For this walkthrough, let's use only the two leftmost columns (sepal length and sepal width).
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], s=50)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
scikit-learn, the well-known machine-learning library, provides a module that performs k-means. Note that in k-means the number of clusters, n_clusters, must be specified in advance.
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters = 3).fit(df.iloc[:, :2])
That's all it takes to fit the model. You can check the clustering result as follows.
kmeans_model.labels_
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1,
2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,
1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1,
1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2], dtype=int32)
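Besides labels_, the fitted model also exposes the final centroids and can assign new points to clusters. Here is a minimal stand-alone sketch, using a tiny made-up dataset rather than the iris frame so the numbers are easy to check:

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points forming two obvious pairs; fit two clusters.
X = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.0], [6.3, 2.5]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(model.cluster_centers_)       # final (x, y) centroid of each cluster
print(model.predict([[5.0, 3.4]]))  # label of the centroid nearest this new point
```

predict() simply returns the label of the nearest centroid, which is exactly the assignment step we will implement by hand below.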
Let's draw the samples belonging to the same cluster in the same color.
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=kmeans_model.labels_, s=50)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
scikit-learn is convenient precisely because you can use it without understanding what it does internally. But if you use it without understanding, you may not notice when you are using it incorrectly, and you risk misinterpreting the results. So let's look at the basics of what's going on under the hood. The calculation proceeds as follows.
The goal is for each label to represent a cluster, but we start by assigning labels at random.
import numpy as np
n_clusters = 3
df['labels'] = [np.random.randint(0, n_clusters) for x in range(len(df))]
df
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | labels |
|---|---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 2 |
2 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 0 |
3 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 2 |
4 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 1 |
5 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 2 |
... | ... | ... | ... | ... | ... | ... |
146 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | 0 |
147 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | 0 |
148 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | 2 |
149 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | 2 |
150 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | 2 |
150 rows × 6 columns
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Next, compute the centroid (center of gravity) for each group of samples sharing a label.
centroids = []
for i in range(n_clusters):
    # Column-wise mean over the samples currently labeled i
    df_mean = df[df['labels'] == i].mean()
    centroids.append([df_mean.iloc[0], df_mean.iloc[1]])
centroids
[[5.7, 2.9851851851851845],
[5.765789473684211, 3.189473684210527],
[5.968749999999999, 3.014583333333333]]
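The same per-cluster means can also be computed in one line with pandas groupby. A small self-contained sketch (using a toy stand-in frame with made-up values, not the iris data):

```python
import pandas as pd

# Toy stand-in for the labeled frame: two features plus cluster labels.
df = pd.DataFrame({
    "x": [5.1, 4.9, 6.7, 6.3],
    "y": [3.5, 3.0, 3.0, 2.5],
    "labels": [0, 0, 1, 1],
})

# One row of column means per label -- equivalent to the explicit loop above.
centroids = df.groupby("labels")[["x", "y"]].mean().values.tolist()
print(centroids)
```
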
Let's draw each centroid as a star.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
For each sample, find the closest centroid and reassign the sample's label to that centroid's label.
def nearest_centroid(centroids, x, y):
    """Return the index of the centroid closest to the point (x, y)."""
    min_dist = None
    min_id = None
    for i, xy in enumerate(centroids):
        dist = (xy[0] - x)**2 + (xy[1] - y)**2  # squared Euclidean distance
        if i == 0 or dist < min_dist:
            min_dist = dist
            min_id = i
    return min_id
df['labels'] = [
    nearest_centroid(centroids, df.iloc[x, 0], df.iloc[x, 1])
    for x in range(len(df))
]
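The per-point loop above can also be vectorized with NumPy broadcasting: compute all point-to-centroid squared distances at once, then take the argmin per point. A self-contained sketch with made-up coordinates:

```python
import numpy as np

points = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.0], [6.3, 2.5]])
centroids = np.array([[5.0, 3.2], [6.5, 2.8]])

# (n_points, 1, 2) - (1, n_clusters, 2) -> (n_points, n_clusters, 2);
# summing over the coordinate axis gives squared distances.
sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = sq_dist.argmin(axis=1)  # nearest-centroid index per point
print(labels)  # -> [0 0 1 1]
```
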
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Recompute the centroid of each cluster from the reassigned labels.
centroids = []
for i in range(n_clusters):
    df_mean = df[df['labels'] == i].mean()
    centroids.append([df_mean.iloc[0], df_mean.iloc[1]])
centroids
[[5.20204081632653, 2.808163265306123],
[5.239393939393939, 3.6272727272727274],
[6.598529411764702, 2.9602941176470594]]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Reassign labels based on the recalculated centroids.
df['labels'] = [
    nearest_centroid(centroids, df.iloc[x, 0], df.iloc[x, 1])
    for x in range(len(df))
]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Recalculate the centroids based on the reassigned labels.
centroids = []
for i in range(n_clusters):
    df_mean = df[df['labels'] == i].mean()
    centroids.append([df_mean.iloc[0], df_mean.iloc[1]])
centroids
[[5.219148936170212, 2.7851063829787237],
[5.16969696969697, 3.6303030303030304],
[6.579999999999997, 2.9700000000000006]]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Reassign labels based on the recalculated centroids.
df['labels'] = [
    nearest_centroid(centroids, df.iloc[x, 0], df.iloc[x, 1])
    for x in range(len(df))
]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Recalculate the centroids based on the reassigned labels.
centroids = []
for i in range(n_clusters):
    df_mean = df[df['labels'] == i].mean()
    centroids.append([df_mean.iloc[0], df_mean.iloc[1]])
centroids
[[5.283333333333333, 2.7357142857142853],
[5.105263157894737, 3.573684210526316],
[6.579999999999997, 2.9700000000000006]]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Repeat this process until the calculation converges, i.e. the labels stop changing.
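Putting the two steps into one loop, the whole procedure can be sketched as follows. This is a NumPy sketch on synthetic points; it initializes centroids from randomly chosen samples rather than random labels, and the empty-cluster guard is just one possible choice, not the only one:

```python
import numpy as np

def kmeans_sketch(points, n_clusters, max_iter=100, seed=0):
    """Plain k-means: alternate label assignment and centroid update
    until the labels stop changing."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen samples.
    centroids = points[rng.choice(len(points), n_clusters, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (squared distance).
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: no label changed in this pass
        labels = new_labels
        # Recompute each centroid, skipping empty clusters to avoid NaNs.
        for i in range(n_clusters):
            if np.any(labels == i):
                centroids[i] = points[labels == i].mean(axis=0)
    return labels, centroids

pts = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
labels, cents = kmeans_sketch(pts, 2)
print(labels)
```
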
As an exercise on the above scikit-learn-free code: if you set n_clusters = 10, the final number of clusters may end up being less than 10. Explain why, and suggest how to modify the algorithm to avoid it.