K-means is a classic clustering method. I've written about it before, but this time I'd like to show how to run it with scikit-learn, and then how to implement it without scikit-learn as a learning exercise.
We'll use the Iris dataset, which is often used as practice data for machine learning.
import urllib.request
url = "https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/iris.txt"
filename = url.split("/")[-1]
urllib.request.urlretrieve(url, filename)
('iris.txt', <http.client.HTTPMessage at 0x7fac1779a470>)
import pandas as pd
df = pd.read_csv(filename, delimiter="\t", index_col=0)
df
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
2 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
3 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
4 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
5 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
... | ... | ... | ... | ... | ... |
146 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
147 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
148 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
149 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |
150 | 5.9 | 3.0 | 5.1 | 1.8 | 2 |
150 rows × 5 columns
For this walkthrough, let's use only the two leftmost columns (sepal length and sepal width).
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], s=50)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
scikit-learn, the well-known machine-learning library, provides a module that performs k-means. Note that in k-means the number of clusters, n_clusters, must be specified in advance.
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters = 3).fit(df.iloc[:, :2])
That's all it takes to fit the model. You can check the clustering result as follows.
kmeans_model.labels_
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1,
2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1,
1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1,
1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2], dtype=int32)
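Besides labels_, the fitted model also exposes the final centroids and can assign new points to clusters. Here is a minimal stand-alone sketch, using a tiny made-up dataset rather than the iris frame so the numbers are easy to check:

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points forming two obvious pairs; fit two clusters.
X = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.0], [6.3, 2.5]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(model.cluster_centers_)       # final (x, y) centroid of each cluster
print(model.predict([[5.0, 3.4]]))  # label of the centroid nearest this new point
```

predict() simply returns the label of the nearest centroid, which is exactly the assignment step we will implement by hand below.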
Let's draw the samples belonging to the same cluster in the same color.
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=kmeans_model.labels_, s=50)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
scikit-learn is convenient precisely because you can use it without understanding what it does internally. But if you use it without understanding, you may not notice when you are using it incorrectly, and you risk misinterpreting the results. So let's look at the basics of what's going on under the hood. The calculation proceeds as follows.
The goal is for each label to represent a cluster, but we start by assigning labels at random.
import numpy as np
n_clusters = 3
df['labels'] = [np.random.randint(0, n_clusters) for x in range(len(df))]
df
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | labels |
|---|---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 2 |
2 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 0 |
3 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 2 |
4 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 1 |
5 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 2 |
... | ... | ... | ... | ... | ... | ... |
146 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | 0 |
147 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | 0 |
148 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | 2 |
149 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | 2 |
150 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | 2 |
150 rows × 6 columns
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Next, compute the centroid (center of gravity) for each group of samples sharing a label.
centroids = []
for i in range(n_clusters):
    # Column-wise mean over the samples currently labeled i
    df_mean = df[df['labels'] == i].mean()
    centroids.append([df_mean.iloc[0], df_mean.iloc[1]])
centroids
[[5.7, 2.9851851851851845],
[5.765789473684211, 3.189473684210527],
[5.968749999999999, 3.014583333333333]]
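The same per-cluster means can also be computed in one line with pandas groupby. A small self-contained sketch (using a toy stand-in frame with made-up values, not the iris data):

```python
import pandas as pd

# Toy stand-in for the labeled frame: two features plus cluster labels.
df = pd.DataFrame({
    "x": [5.1, 4.9, 6.7, 6.3],
    "y": [3.5, 3.0, 3.0, 2.5],
    "labels": [0, 0, 1, 1],
})

# One row of column means per label -- equivalent to the explicit loop above.
centroids = df.groupby("labels")[["x", "y"]].mean().values.tolist()
print(centroids)
```
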
Let's draw each centroid as a star.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
For each sample, find the closest centroid and reassign the sample's label to that centroid's label.
def nearest_centroid(centroids, x, y):
    """Return the index of the centroid closest to the point (x, y)."""
    min_dist = None
    min_id = None
    for i, xy in enumerate(centroids):
        dist = (xy[0] - x)**2 + (xy[1] - y)**2  # squared Euclidean distance
        if i == 0 or dist < min_dist:
            min_dist = dist
            min_id = i
    return min_id
df['labels'] = [
    nearest_centroid(centroids, df.iloc[x, 0], df.iloc[x, 1])
    for x in range(len(df))
]
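The per-point loop above can also be vectorized with NumPy broadcasting: compute all point-to-centroid squared distances at once, then take the argmin per point. A self-contained sketch with made-up coordinates:

```python
import numpy as np

points = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.0], [6.3, 2.5]])
centroids = np.array([[5.0, 3.2], [6.5, 2.8]])

# (n_points, 1, 2) - (1, n_clusters, 2) -> (n_points, n_clusters, 2);
# summing over the coordinate axis gives squared distances.
sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = sq_dist.argmin(axis=1)  # nearest-centroid index per point
print(labels)  # -> [0 0 1 1]
```
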
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Recompute the centroid of each cluster from the reassigned labels.
centroids = []
for i in range(n_clusters):
    df_mean = df[df['labels'] == i].mean()
    centroids.append([df_mean.iloc[0], df_mean.iloc[1]])
centroids
[[5.20204081632653, 2.808163265306123],
[5.239393939393939, 3.6272727272727274],
[6.598529411764702, 2.9602941176470594]]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Reassign labels based on the recalculated centroids.
df['labels'] = [
    nearest_centroid(centroids, df.iloc[x, 0], df.iloc[x, 1])
    for x in range(len(df))
]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Recalculate the centroids based on the reassigned labels.
centroids = []
for i in range(n_clusters):
    df_mean = df[df['labels'] == i].mean()
    centroids.append([df_mean.iloc[0], df_mean.iloc[1]])
centroids
[[5.219148936170212, 2.7851063829787237],
[5.16969696969697, 3.6303030303030304],
[6.579999999999997, 2.9700000000000006]]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Reassign labels based on the recalculated centroids.
df['labels'] = [
    nearest_centroid(centroids, df.iloc[x, 0], df.iloc[x, 1])
    for x in range(len(df))
]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Recalculate the centroids based on the reassigned labels.
centroids = []
for i in range(n_clusters):
    df_mean = df[df['labels'] == i].mean()
    centroids.append([df_mean.iloc[0], df_mean.iloc[1]])
centroids
[[5.283333333333333, 2.7357142857142853],
[5.105263157894737, 3.573684210526316],
[6.579999999999997, 2.9700000000000006]]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(6,6))
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=df['labels'], s=50)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1],
            c=range(n_clusters), marker='*', s=500)
plt.xlabel(df.columns[0])
plt.ylabel(df.columns[1])
plt.show()
Repeat this process until the calculation converges, i.e. the labels stop changing.
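Putting the two steps into one loop, the whole procedure can be sketched as follows. This is a NumPy sketch on synthetic points; it initializes centroids from randomly chosen samples rather than random labels, and the empty-cluster guard is just one possible choice, not the only one:

```python
import numpy as np

def kmeans_sketch(points, n_clusters, max_iter=100, seed=0):
    """Plain k-means: alternate label assignment and centroid update
    until the labels stop changing."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen samples.
    centroids = points[rng.choice(len(points), n_clusters, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (squared distance).
        sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: no label changed in this pass
        labels = new_labels
        # Recompute each centroid, skipping empty clusters to avoid NaNs.
        for i in range(n_clusters):
            if np.any(labels == i):
                centroids[i] = points[labels == i].mean(axis=0)
    return labels, centroids

pts = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
labels, cents = kmeans_sketch(pts, 2)
print(labels)
```
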
As an exercise on the above scikit-learn-free code: if you set n_clusters = 10, the final number of clusters may end up being less than 10. Explain why, and suggest how to modify the algorithm to avoid it.