Previous posts in this series:
You will become an engineer in 100 days - Day 76 - Programming - About machine learning
You will become an engineer in 100 days - Day 70 - Programming - About scraping
You will become an engineer in 100 days - Day 66 - Programming - About natural language processing
You will become an engineer in 100 days - Day 63 - Programming - Probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About the database
You will become an engineer in 100 days - Day 24 - Python - Basics of Python language 1
You will become an engineer in 100 days - Day 18 - Javascript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS Basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
This post continues the story about machine learning.
First, an overview of what machine learning can do. There are basically three kinds of task.
・ Regression
・ Classification
・ Clustering
Roughly speaking, all three are forms of prediction; what differs is the thing being predicted.
・ Regression: predict a numerical value
・ Classification: predict a category
・ Clustering: split the data into reasonable groups
Clustering is unsupervised learning: even without knowing the correct answers, it can divide the data into groups of similar items.
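To make the difference between the three tasks concrete, here is a minimal sketch with scikit-learn. The estimators (LinearRegression, LogisticRegression, KMeans) and the tiny toy arrays are my own choices for illustration; any other model of the same kind would do.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

# Toy data made up for illustration: one feature, six samples
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Regression: the target is a number
y_value = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])
reg = LinearRegression().fit(X, y_value)
reg_pred = reg.predict(np.array([[5.0]]))  # a numeric value

# Classification: the target is a category known in advance
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_class)
clf_pred = clf.predict(np.array([[2.5], [11.5]]))  # category labels

# Clustering: no target at all, only the features
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_  # group numbers; which group is called 0 or 1 is arbitrary

print(reg_pred, clf_pred, cluster_ids)
```

Note that only the clustering call receives no answers (`y`), which is exactly what "unsupervised" means here.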
The data used this time is the digits (handwritten numbers) dataset bundled with scikit-learn.
First, let's load the data.
You can load it with load_digits.
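from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
# Jupyter Notebook magic: draw plots inline (remove this line in a plain script)
%matplotlib inline
digits = load_digits()
# 1797 images, each flattened into 64 pixel values
print(digits.data.shape)
# Show the first image in grayscale
plt.gray()
plt.matshow(digits.images[0])
plt.show()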
(1797, 64)
Displayed as an image, it looks like the digit 0.
It is an 8x8 pixel image.
Now let's look at the raw data behind this image.
digits.images[0]
array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])
The data itself is just an array of numbers. Each number is a grayscale intensity from 0 to 16; in the grayscale display above, 0 is drawn as black and 16 as white.
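If you want to check the range of pixel values yourself rather than take it on faith, a small sketch like the following works. The variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Smallest and largest pixel value over all images
pixel_min = digits.data.min()
pixel_max = digits.data.max()

# Distinct gray levels that actually occur in the data
levels = np.unique(digits.data)

print(pixel_min, pixel_max, len(levels))
```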
Let's split this numerical data into groups. First, load it into data frames.
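import pandas as pd
# Correct digit (0-9) for each image
digits_df_tgt = pd.DataFrame(digits.target, columns=['target'])
digits_df_tgt.head()
# Pixel values: one row per image, one column per pixel
digits_df = pd.DataFrame(digits.data)
digits_df.head()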
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 5 | 13 | 9 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 6 | 13 | 10 | 0 | 0 | 0 | 
| 1 | 0 | 0 | 0 | 12 | 13 | 5 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 11 | 16 | 10 | 0 | 0 | 
| 2 | 0 | 0 | 0 | 4 | 15 | 12 | 0 | 0 | 0 | 0 | ... | 5 | 0 | 0 | 0 | 0 | 3 | 11 | 16 | 9 | 0 | 
| 3 | 0 | 0 | 7 | 15 | 13 | 1 | 0 | 0 | 0 | 8 | ... | 9 | 0 | 0 | 0 | 7 | 13 | 13 | 9 | 0 | 0 | 
| 4 | 0 | 0 | 0 | 1 | 11 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 16 | 4 | 0 | 0 | 
You can see that each image has been flattened into one row, with one column for each of the 64 pixels.
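To see how a row of the data frame maps back to the picture, here is a minimal sketch that reshapes row 0 into 8 rows of 8 pixels and compares it with `digits.images[0]`. The names `row` and `image` are just mine.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()
digits_df = pd.DataFrame(digits.data)

# Row 0 of the data frame holds the 64 pixel values of the first image
row = digits_df.iloc[0].to_numpy()

# Reshape the flat row back into 8 rows of 8 pixels
image = row.reshape(8, 8)

# Column number = 8 * (row in the image) + (column in the image)
print(image.shape)
print(np.array_equal(image, digits.images[0]))
```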
Clustering methods fall broadly into two types.
**Hierarchical clustering**
This approach merges the most similar items first and keeps merging step by step, much like a tournament bracket.
The process can be drawn as a hierarchy, and the end result is a dendrogram (tree diagram).
There are many variants, such as Ward's method, the group average method, and the nearest neighbor (single linkage) method.
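This post does not run hierarchical clustering, but as a rough sketch it could look like the following with SciPy. Using `scipy.cluster.hierarchy`, taking only the first 200 images, and cutting the tree into 10 clusters are all my own assumptions to keep the example small.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_digits

digits = load_digits()

# Use only the first 200 images to keep the example quick
X = digits.data[:200]

# Build the merge tree with Ward's method; each row of Z records one merge
Z = linkage(X, method="ward")

# Cut the tree so that at most 10 clusters remain
labels = fcluster(Z, t=10, criterion="maxclust")

print(Z.shape)            # n - 1 merges for n samples
print(np.unique(labels))  # fcluster numbers its clusters from 1
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree diagram, and `method="average"` or `method="single"` switches to the other linkage methods mentioned above.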
**Non-hierarchical clustering**
This approach forms clusters by gathering items with similar properties out of a mixed collection of items with different properties. K-means is a typical example.
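To give a feel for how K-means gathers similar items, here is a toy version written with NumPy only. This is a simplified sketch of the usual two-step loop (assign, then update), not the scikit-learn implementation; the function name `simple_kmeans` and the sample points are made up.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=20, seed=0):
    """Toy K-means: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 1: assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of its samples
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Two well separated groups of made-up points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = simple_kmeans(X, k=2)
print(labels)
print(centers)
```

The real KMeans in scikit-learn adds smarter initialization and a convergence check, but the idea is the same.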
Here, let's run non-hierarchical clustering with K-means.
First, import the library.
For now, the only setting we give K-means is K, the number of clusters to split the data into.
Then we fit it on the data frame.
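from sklearn.cluster import KMeans
# Number of clusters (one per digit, 0-9)
K=10
# Fit K-means on the pixel data; the initial centers are random, so results vary between runs
kmeans = KMeans(n_clusters=K).fit(digits_df)
# Cluster number (0 to K-1) assigned to each image
pred_label = kmeans.predict(digits_df)
pred_df = pd.DataFrame(pred_label,columns=['pred'])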
This gives us the cluster assignments, which are stored in pred_df.
Let's see how the data was divided.
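One thing to keep in mind: K-means starts from random initial centers, so the cluster numbers you get may differ from the ones shown in this post. As a sketch, fixing the seed makes the run repeatable. The values `random_state=0` and `n_init=10` are my own choices, not something the code above sets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
K = 10

# Same seed, same data: the two runs give identical labels
labels_a = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(digits.data)
labels_b = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(digits.data)
print(np.array_equal(labels_a, labels_b))

# How many images ended up in each cluster
print(np.bincount(labels_a, minlength=K))
```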
# For each cluster, count how many images of each actual digit it contains
calc = {i:{} for i in range(K)}
for pred , target in zip(pred_label , digits.target):
    #print('Predicted: {0} , Actual: {1}'.format(pred,target))
    if target in calc[pred]:
        calc[pred][target]+=1
    else:
        calc[pred][target] =1
print(calc)
{0: {6: 177, 1: 2, 8: 2, 5: 1}, 1: {3: 154, 9: 6, 2: 13, 1: 1, 8: 2}, 2: {1: 55, 4: 7, 7: 2, 2: 2, 9: 20, 8: 5, 6: 1}, 3: {7: 170, 2: 3, 3: 7, 9: 7, 4: 7, 8: 2}, 4: {1: 99, 2: 8, 8: 100, 9: 1, 6: 2, 4: 4, 7: 2, 3: 7}, 5: {5: 43, 8: 53, 9: 139, 3: 13, 2: 2}, 6: {5: 136, 9: 7, 7: 5, 8: 7, 1: 1, 3: 2}, 7: {0: 177, 6: 1, 2: 1}, 8: {2: 148, 1: 24, 8: 3}, 9: {4: 163, 5: 2, 0: 1}}
K-means groups the data into K clusters of similar items, based only on the numerical values.
Note that the outer keys printed here are cluster numbers, not digits. For example, most of the images in cluster 0 (177 of them) are actually the digit 6.
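The same count can be laid out as a table, which is easier to read than a nested dictionary. Below is a minimal sketch using `pd.crosstab`. The fixed seed and the "purity" figure (the share of images that belong to the most common digit of their cluster) are my own additions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
K = 10
pred_label = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(digits.data)

# Rows: cluster number, columns: actual digit, cells: number of images
table = pd.crosstab(pd.Series(pred_label, name="cluster"),
                    pd.Series(digits.target, name="digit"))
print(table)

# Most frequent digit in each cluster, and the share of images it accounts for
majority_digit = table.idxmax(axis=1)
purity = table.max(axis=1).sum() / table.values.sum()
print(majority_digit)
print(round(purity, 3))
```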
Let's compare the cluster assignments with the actual digits.
digits_df2 = pd.concat([pred_df,digits_df],axis=1)
digits_df2['pred'].value_counts()
index = list(digits_df[digits_df2['pred']==0].index)
print(index)
[6, 16, 26, 34, 58, 65, 66, 67, 82, 88, 104, 106, 136, 146, 156, 164, 188, 195, 196, 197, 212, 223, 232, 234, 262, 272, 282, 290, 314, 321, 322, 323, 338, 344, 351, 360, 362, 392, 402, 412, 420, 444, 451, 452, 453, 468, 474, 481, 490, 522, 532, 542, 550, 563, 569, 574, 581, 582, 583, 586, 598, 604, 611, 620, 622, 652, 662, 672, 680, 704, 711, 712, 713, 728, 734, 741, 750, 752, 782, 784, 802, 810, 834, 841, 842, 843, 858, 864, 871, 880, 882, 911, 921, 931, 939, 960, 967, 968, 969, 984, 989, 996, 1005, 1007, 1035, 1045, 1055, 1063, 1085, 1092, 1093, 1094, 1109, 1115, 1122, 1131, 1133, 1163, 1173, 1183, 1191, 1215, 1222, 1223, 1224, 1239, 1245, 1252, 1261, 1263, 1293, 1303, 1313, 1321, 1345, 1352, 1353, 1354, 1361, 1369, 1375, 1382, 1391, 1393, 1421, 1431, 1441, 1449, 1473, 1480, 1481, 1482, 1497, 1503, 1510, 1519, 1521, 1561, 1569, 1577, 1601, 1608, 1609, 1610, 1623, 1629, 1636, 1645, 1647, 1673, 1683, 1693, 1701, 1725, 1732, 1733, 1734, 1749, 1755, 1762, 1771, 1773]
Let's take a few of the indexes in cluster 0 and see which digits those images show.
plt.gray() 
plt.matshow(digits.images[6]) 
plt.matshow(digits.images[16]) 
plt.matshow(digits.images[26]) 
plt.show()
The images in cluster 0 do look like 6.
I think the data was divided quite well.
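Instead of checking images one by one, you can also draw the center of each cluster as an 8x8 image. This is only a sketch: the seed, the output file name `cluster_centers.png`, and the `Agg` backend line (there so it runs without a display) are my own assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()
K = 10
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(digits.data)

# Each center has 64 values, so it can be reshaped into an 8x8 image
centers = kmeans.cluster_centers_.reshape(K, 8, 8)

fig, axes = plt.subplots(2, 5, figsize=(8, 4))
for cluster_id, ax in enumerate(axes.ravel()):
    ax.imshow(centers[cluster_id], cmap="gray")
    ax.set_title("cluster {}".format(cluster_id))
    ax.axis("off")
fig.savefig("cluster_centers.png")  # hypothetical file name
```

Each center is the average of the images in its cluster, so a cluster that mostly contains one digit shows a blurry version of that digit.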
In practice, clustering is used when the correct answer is not known in advance, for example when you want to split users into seven groups. It groups similar users together based on the characteristics of their data.
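As a rough sketch of that kind of use, here is what splitting users into seven groups could look like. The user data is entirely made up (random numbers for three hypothetical features), and scaling with StandardScaler before K-means is my own choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up user data: 300 users, 3 hypothetical features
# (visits per month, average spend, days since last login)
rng = np.random.default_rng(0)
users = np.column_stack([
    rng.poisson(8, size=300),
    rng.gamma(2.0, 30.0, size=300),
    rng.integers(0, 90, size=300),
]).astype(float)

# Put the features on the same scale so no single one dominates the distance
scaled = StandardScaler().fit_transform(users)

# Ask for seven groups; there are no correct labels to compare against
segments = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(scaled)
print(np.bincount(segments, minlength=7))
```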
Today I explained how clustering works. There are many other clustering methods as well.
To start with, make sure you understand what clustering is and how to run it.
17 days until you become an engineer
Otsu py's HP: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython