You will be an engineer in 100 days ――Day 83 ――Programming ――About machine learning 8

Previous posts in this series:

You will become an engineer in 100 days --Day 76 --Programming --About machine learning

You will become an engineer in 100 days-Day 70-Programming-About scraping

You will become an engineer in 100 days --Day 66 --Programming --About natural language processing

You will become an engineer in 100 days --Day 63 --Programming --Probability 1

You will become an engineer in 100 days-Day 59-Programming-Algorithms

You will become an engineer in 100 days --- Day 53 --Git --About Git

You will become an engineer in 100 days --Day 42 --Cloud --About cloud services

You will become an engineer in 100 days --Day 36 --Database --About the database

You will be an engineer in 100 days-Day 24-Python-Basics of Python language 1

You will become an engineer in 100 days --Day 18 --Javascript --JavaScript basics 1

You will become an engineer in 100 days --Day 14 --CSS --CSS Basics 1

You will become an engineer in 100 days --Day 6 --HTML --HTML basics 1

This post continues the series on machine learning.

About clustering

Before getting into clustering, let me go over what machine learning can do. There are basically three things:

・ Regression
・ Classification
・ Clustering

Roughly speaking, all three are prediction; what differs is the thing being predicted.

・ Regression: predicts a numerical value
・ Classification: predicts a category
・ Clustering: splits the data into reasonable groups

Clustering is unsupervised learning: the correct answers are not known, but the data can still be divided into sensible groups.
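As a rough illustration of the three tasks, the sketch below runs one scikit-learn estimator for each on the same toy data. The data and the choice of estimators (LinearRegression, LogisticRegression, KMeans) are assumptions made for this example, not something taken from the article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

# Toy one-feature data, made up purely for illustration.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Regression: the answer to learn is a number (here y = 2 * x).
y_value = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])
reg = LinearRegression().fit(X, y_value)
print(reg.predict([[5.0]]))

# Classification: the answer to learn is a category (0 or 1).
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[1.5], [11.5]]))

# Clustering: no answers are given; only X is passed to fit.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

Note that the cluster numbers printed at the end are arbitrary group IDs. They only say which points were grouped together, not what the groups mean.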

The data used this time is the digits (handwritten numbers) dataset bundled with scikit-learn.

Data reading

First, let's load the digit data. You can load it with load_digits.

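from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
%matplotlib inline

digits = load_digits()
print(digits.data.shape)  # (number of images, pixels per image)

# Show the first image in grayscale
plt.matshow(digits.images[0], cmap='gray')
plt.show()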


(1797, 64)


Drawn as an image, the first sample looks like the digit 0.

It is an 8x8 pixel image.

Now let's display the raw data behind this image.

digits.images[0]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

The data itself is just an array of numbers. Each number is a grayscale level from 0 to 16; with a gray colormap, 0 is drawn as black and the largest value as white.
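The value range and the relationship between the 8x8 images and the 64-column data can be checked directly. This is a small sketch that uses only attributes of the object returned by load_digits (images, data).

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.images.shape)   # one 8x8 array per sample
print(digits.data.shape)     # the same samples, flattened to 64 values each
print(digits.data.min(), digits.data.max())  # smallest and largest pixel value

# digits.data[i] is digits.images[i] flattened row by row,
# so reshaping it to 8x8 gives the image back.
same = np.array_equal(digits.data[0].reshape(8, 8), digits.images[0])
print(same)
```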

Let's split this digit data into reasonable groups. First, load it into a data frame.

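import pandas as pd

digits_df_tgt = pd.DataFrame(digits.target, columns=['target'])  # correct digit for each image

digits_df = pd.DataFrame(digits.data)  # 64 pixel values per image
digits_df.head()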
    0   1   2   3   4   5   6   7   8   9  ...  54  55  56  57  58  59  60  61  62  63
0   0   0   5  13   9   1   0   0   0   0  ...   0   0   0   0   6  13  10   0   0   0
1   0   0   0  12  13   5   0   0   0   0  ...   0   0   0   0   0  11  16  10   0   0
2   0   0   0   4  15  12   0   0   0   0  ...   5   0   0   0   0   3  11  16   9   0
3   0   0   7  15  13   1   0   0   0   8  ...   9   0   0   0   7  13  13   9   0   0
4   0   0   0   1  11   0   0   0   0   0  ...   0   0   0   0   0   2  16   4   0   0

Each row is one image, and its 64 columns hold the value of each pixel.

Perform clustering

Clustering methods can be broadly divided into two kinds.

**Hierarchical clustering** Merges the most similar items first and keeps merging step by step, like a tournament bracket. The process can be drawn as a hierarchy, and the end result is a dendrogram (tree diagram). There are many variants, such as Ward's method, the group average method, and the shortest distance (single linkage) method.

**Non-hierarchical clustering** Forms clusters by gathering items with similar properties out of a mixed group of items with different properties. K-means is a typical method.
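For comparison with K-means, here is a minimal sketch of hierarchical clustering with Ward's method using SciPy's scipy.cluster.hierarchy module. The use of SciPy, the 200-sample subset, and the cut at 10 clusters are assumptions made for this example; the article itself does not run hierarchical clustering.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_digits

digits = load_digits()
X_small = digits.data[:200]   # a subset, only to keep the example quick

# Ward's method: repeatedly merge the two clusters whose merge
# increases the within-cluster variance the least.
Z = linkage(X_small, method="ward")
print(Z.shape)                # one row per merge: (n_samples - 1, 4)

# Cut the tree so that at most 10 clusters remain.
labels = fcluster(Z, t=10, criterion="maxclust")
print(len(set(labels)))
```

Passing Z to scipy.cluster.hierarchy.dendrogram draws the tree diagram mentioned above.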

Here, let's perform non-hierarchical clustering using K-means.

First, import the library.

For K-means, the only thing you need to specify for now is the number of clusters (K). Pass in the data frame to run the clustering.

from sklearn.cluster import KMeans

K = 10  # number of clusters; the digits are 0 to 9
kmeans = KMeans(n_clusters=K).fit(digits_df)
pred_label = kmeans.predict(digits_df)  # cluster number (0 to K-1) for each image
pred_df = pd.DataFrame(pred_label, columns=['pred'])

That is all it takes to get a result. The cluster assigned to each image is stored in pred_df.
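To give an idea of what happens inside fit, here is a simplified NumPy sketch of the basic K-means loop: assign every point to its nearest center, then move each center to the mean of its points. The function name kmeans_sketch and the toy data are hypothetical, and this is not scikit-learn's implementation, which among other things uses k-means++ initialization by default.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Basic K-means loop: assign points, then move the centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k).
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:   # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Two well-separated groups of points, made up for illustration.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans_sketch(X_demo, k=2)
print(labels)
```

The first three points end up in one cluster and the last three in the other. Which of the two gets number 0 depends on the random starting centers.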

Let's see the result

Let's see how it was divided.

calc = {i: {} for i in range(K)}
for pred, target in zip(pred_label, digits_df_tgt['target']):
    # print('Predicted: {0}, actual: {1}'.format(pred, target))
    if target in calc[pred]:
        calc[pred][target] += 1
    else:
        calc[pred][target] = 1
print(calc)

{0: {6: 177, 1: 2, 8: 2, 5: 1},
 1: {3: 154, 9: 6, 2: 13, 1: 1, 8: 2},
 2: {1: 55, 4: 7, 7: 2, 2: 2, 9: 20, 8: 5, 6: 1},
 3: {7: 170, 2: 3, 3: 7, 9: 7, 4: 7, 8: 2},
 4: {1: 99, 2: 8, 8: 100, 9: 1, 6: 2, 4: 4, 7: 2, 3: 7},
 5: {5: 43, 8: 53, 9: 139, 3: 13, 2: 2},
 6: {5: 136, 9: 7, 7: 5, 8: 7, 1: 1, 3: 2},
 7: {0: 177, 6: 1, 2: 1},
 8: {2: 148, 1: 24, 8: 3},
 9: {4: 163, 5: 2, 0: 1}}

K-means groups similar items into K clusters based only on the numerical values. The outer keys in this output are cluster numbers, not digits, and each inner dictionary counts the actual digits that fell into that cluster. Cluster number 0, for example, consists mostly of images whose actual digit is 6 (177 of them).
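The same tally can be produced as a table with pandas.crosstab, which also makes it easy to pick the most common digit in each cluster. This is a sketch under assumptions: random_state and n_init are fixed here only for repeatability, so the cluster numbers will not match the output above, and the "purity" figure is an extra summary that the article does not compute.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
# random_state is an assumption for repeatability; the article's run did not set it.
pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(digits.data)

# Rows: cluster number, columns: actual digit, cells: number of images.
table = pd.crosstab(pd.Series(pred, name="cluster"),
                    pd.Series(digits.target, name="digit"))
print(table)

# Most common digit in each cluster.
print(table.idxmax(axis=1))

# Share of images that belong to the most common digit of their cluster.
purity = table.max(axis=1).sum() / table.values.sum()
print(round(purity, 3))
```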

Let's compare the cluster assignments with the actual data.

digits_df2 = pd.concat([pred_df, digits_df], axis=1)

# Row indices of the images assigned to cluster 0
index = list(digits_df2[digits_df2['pred'] == 0].index)
print(index)

[6, 16, 26, 34, 58, 65, 66, 67, 82, 88, 104, 106, 136, 146, 156, 164, 188, 195, 196, 197, 212, 223, 232, 234, 262, 272, 282, 290, 314, 321, 322, 323, 338, 344, 351, 360, 362, 392, 402, 412, 420, 444, 451, 452, 453, 468, 474, 481, 490, 522, 532, 542, 550, 563, 569, 574, 581, 582, 583, 586, 598, 604, 611, 620, 622, 652, 662, 672, 680, 704, 711, 712, 713, 728, 734, 741, 750, 752, 782, 784, 802, 810, 834, 841, 842, 843, 858, 864, 871, 880, 882, 911, 921, 931, 939, 960, 967, 968, 969, 984, 989, 996, 1005, 1007, 1035, 1045, 1055, 1063, 1085, 1092, 1093, 1094, 1109, 1115, 1122, 1131, 1133, 1163, 1173, 1183, 1191, 1215, 1222, 1223, 1224, 1239, 1245, 1252, 1261, 1263, 1293, 1303, 1313, 1321, 1345, 1352, 1353, 1354, 1361, 1369, 1375, 1382, 1391, 1393, 1421, 1431, 1441, 1449, 1473, 1480, 1481, 1482, 1497, 1503, 1510, 1519, 1521, 1561, 1569, 1577, 1601, 1608, 1609, 1610, 1623, 1629, 1636, 1645, 1647, 1673, 1683, 1693, 1701, 1725, 1732, 1733, 1734, 1749, 1755, 1762, 1771, 1773]

Let's take the indices that belong to cluster number 0 and check which digits those images actually show.

[Images: three samples from cluster 0]

The images in cluster number 0 look like the digit 6. I think the data was split up nicely.
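Images like these can be drawn with matplotlib. The sketch below is one way to do it, not necessarily how the article's figures were made; it reruns K-means with a fixed random_state (an assumption), so cluster 0 here may hold a different digit than in the run above.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(digits.data)

# Positions of the first three images assigned to cluster 0.
sample_idx = [i for i, p in enumerate(pred) if p == 0][:3]

fig, axes = plt.subplots(1, len(sample_idx), figsize=(6, 2))
for ax, i in zip(axes, sample_idx):
    ax.imshow(digits.images[i], cmap="gray")
    ax.set_title("actual: {}".format(digits.target[i]))
    ax.axis("off")
plt.show()
```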

In real use the correct answers are not known in the first place. Clustering is used when, for example, you want to split users into seven groups: it groups similar users together based on the characteristics of the data.
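As a sketch of that kind of use, the example below splits 500 users into seven segments. The user table is randomly generated and entirely hypothetical (the three columns are invented), and scaling the columns with StandardScaler before clustering is a choice made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical user table: monthly visits, average spend, days since signup.
rng = np.random.default_rng(0)
users = np.column_stack([
    rng.poisson(8, size=500),
    rng.gamma(2.0, 30.0, size=500),
    rng.integers(1, 1000, size=500),
])

# Put the columns on a comparable scale so no single one dominates the distance.
scaled = StandardScaler().fit_transform(users)

segments = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(scaled)
print(np.bincount(segments, minlength=7))   # number of users in each segment
```

There are no correct labels to compare against here, so what each segment means has to be judged by looking at the users inside it.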


Today I explained how clustering works. There are many other clustering methods besides the ones covered here.

For now, start by getting a firm grasp of what clustering is.

17 days until you become an engineer

Author information

Otsu py's HP:


