Hello, this is Motty. This time, I tried clustering in Python.
In statistics and machine learning, clustering means grouping data points that have similar features. Because it is done without labels or criteria given in advance, it is a type of "unsupervised learning."
The K-means method is an algorithm that partitions data into a given number of clusters (k) using cluster means. Each data point is assigned to its nearest cluster centroid, the centroids are then recomputed, and repeating these two steps optimizes the cluster structure.
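Before the scikit-learn example below, here is a minimal NumPy sketch of that assign-and-update loop. The function name and defaults are my own illustration, not the library implementation, and empty clusters are not handled.
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    # Illustrative sketch only, assuming X is an (n, d) array
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # pick k points as initial centroids
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
The scikit-learn example that follows does the same thing through the library.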
KMeans.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs as mb

N = 100  # Number of samples
# Generate sample data consisting of 3 blob-shaped clusters
features, _ = mb(n_samples=N, centers=3)
# Cluster the data into k = 3 clusters
clf = KMeans(n_clusters=3)
pred = clf.fit_predict(features)
# Plot each cluster in a different color
for i in range(3):
    labels = features[pred == i]
    plt.scatter(labels[:, 0], labels[:, 1])
plt.show()
I was able to cluster the data neatly.
Note that this works because the data itself is clean, the value of k is appropriate, and the algorithm choice is appropriate. If these conditions are not met, the data may not separate this neatly. For example, one can add an outlier to the data, or generate more clusters than k:
# Case 1: add an outlier far away from the clusters
NOISE = [25, 25]
features = np.append(features, NOISE).reshape(-1, 2)
# Case 2: generate 4 clusters while still using n_clusters = 3
dataset = mb(centers=4)
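As a self-contained sketch of the mismatched-k case, the snippet below (my own variable names, with a fixed random_state for reproducibility) generates four blobs while still asking K-means for three clusters:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs as mb

# Generate 4 clusters but ask K-means for only 3
features, _ = mb(n_samples=100, centers=4, random_state=0)
pred = KMeans(n_clusters=3).fit_predict(features)
# Plot the 3 predicted clusters; the 4 true blobs cannot each get their own cluster
for i in range(3):
    labels = features[pred == i]
    plt.scatter(labels[:, 0], labels[:, 1])
plt.show()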
K-means also struggles when the clusters are not blob-shaped, for example the two interleaving half-moons produced by make_moons.
makemoons.py
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Generate two interleaving half-moon shaped clusters
X1, y1 = make_moons(noise=0.05, random_state=0)
# Cluster with K-means into k = 2 clusters
clf = KMeans(n_clusters=2)
pred1 = clf.fit_predict(X1)
# Plot each predicted cluster in a different color
for i in range(2):
    labels = X1[pred1 == i]
    plt.scatter(labels[:, 0], labels[:, 1])
plt.show()
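Because K-means assigns each point to its nearest centroid, it tends to cut the two moons with a roughly straight boundary rather than tracing their curved shapes. As a point of comparison only (not part of the original experiment), a density-based method such as scikit-learn's DBSCAN can follow the moon shapes; the eps value below is simply an assumed setting for this noise level:
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X1, y1 = make_moons(noise=0.05, random_state=0)
# eps = 0.3 is an assumed neighborhood radius chosen for this data
pred2 = DBSCAN(eps=0.3).fit_predict(X1)
# Plot each detected cluster (label -1 would mark noise points, if any)
for i in set(pred2):
    labels = X1[pred2 == i]
    plt.scatter(labels[:, 0], labels[:, 1])
plt.show()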
There are various classification and clustering algorithms, and this time I covered one of them, the K-means method. I would like to write about classification with SVM and random forest later.