I tried this out on a whim; to be honest, I still haven't figured out how it could actually be applied.
First, import what you need.
import numpy as np
from sklearn import datasets
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
from matplotlib import cm
Load the Boston housing dataset and visualize it with t-SNE. Visually, a few clusters seem to appear.
boston = datasets.load_boston()
model = TSNE(n_components=2)
tsne_result = model.fit_transform(boston.data)
plt.plot(tsne_result[:,0], tsne_result[:,1], ".")
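As a side note (my addition, not part of the original flow): t-SNE is stochastic, so the embedding changes from run to run. If you want the plot and the clusters below to be reproducible, you can fix random_state. A minimal sketch using the same objects as above:
# Reproducibility tweak (optional): fix the seed so the embedding is repeatable
model = TSNE(n_components=2, random_state=0)
tsne_result = model.fit_transform(boston.data)
plt.plot(tsne_result[:, 0], tsne_result[:, 1], ".")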
Let's cluster with k-means first, for comparison.
from sklearn.cluster import MiniBatchKMeans
# The number of clusters `n_clusters` was decided by eyeballing the t-SNE plot
kmeans = MiniBatchKMeans(n_clusters=10, max_iter=300)
kmeans_tsne = kmeans.fit_predict(tsne_result)
# Color it nicely
color = cm.brg(np.linspace(0, 1, np.max(kmeans_tsne) - np.min(kmeans_tsne) + 1))
for i in range(np.min(kmeans_tsne), np.max(kmeans_tsne) + 1):
    plt.plot(tsne_result[kmeans_tsne == i][:, 0],
             tsne_result[kmeans_tsne == i][:, 1],
             ".",
             color=color[i])
    plt.text(tsne_result[kmeans_tsne == i][:, 0][0],
             tsne_result[kmeans_tsne == i][:, 1][0],
             str(i), color="black", size=16)
Clusters (1,5), (2,8), and (4,7,9) are split apart even though they look structurally connected, which is not very desirable (to me).
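Since `n_clusters` was picked by eye, a quick alternative (my addition, not part of the original workflow) would be a rough elbow check over the MiniBatchKMeans inertia; it is only a heuristic, of course.
# Elbow heuristic: plot inertia for a range of k and look for the bend
inertias = []
ks = range(2, 21)
for k in ks:
    km = MiniBatchKMeans(n_clusters=k, max_iter=300)
    km.fit(tsne_result)
    inertias.append(km.inertia_)
plt.plot(list(ks), inertias, "o-")
plt.xlabel("n_clusters")
plt.ylabel("inertia")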
Try clustering with DBSCAN
from sklearn.cluster import DBSCAN
# `eps` was found by trial and error
dbscan = DBSCAN(eps=3)
dbscan_tsne = dbscan.fit_predict(tsne_result)
# Color it nicely (shift the color index by 1 because DBSCAN labels start at -1)
color = cm.brg(np.linspace(0, 1, np.max(dbscan_tsne) - np.min(dbscan_tsne) + 1))
for i in range(np.min(dbscan_tsne), np.max(dbscan_tsne) + 1):
    plt.plot(tsne_result[dbscan_tsne == i][:, 0],
             tsne_result[dbscan_tsne == i][:, 1],
             ".",
             color=color[i + 1])
    plt.text(tsne_result[dbscan_tsne == i][:, 0][0],
             tsne_result[dbscan_tsne == i][:, 1][0],
             str(i), color="black", size=16)
With DBSCAN, the connected islands end up in the same cluster, which is what I want. (Cluster -1 contains the points treated as noise/outliers.)
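To see how many points fall into each cluster (including the -1 noise cluster), a quick check I added, using the labels from above:
# Count points per DBSCAN label; -1 is the noise cluster
labels, counts = np.unique(dbscan_tsne, return_counts=True)
for label, count in zip(labels, counts):
    print(f"cluster {label}: {count} points")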
In addition, fit a decision tree to try to explain what characterizes each cluster.
from sklearn import tree
clf = tree.DecisionTreeClassifier()
# DBSCAN produces a -1 (noise) cluster, so the labels start from -1
clf.fit(boston.data, dbscan_tsne)
# Generate a graphviz dot file
with open("boston_tsne_dt.dot", 'w') as f:
    tree.export_graphviz(
        clf,
        out_file=f,
        feature_names=boston.feature_names,
        filled=True,
        rounded=True,
        special_characters=True,
        impurity=False,
        proportion=False,
        class_names=[str(c) for c in clf.classes_]
    )
Then convert the dot file to a PNG from the shell:
dot -T png boston_tsne_dt.dot > boston_tsne_dt.png
The result is shown in the figure below.
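To get a rough sense of how faithfully the tree describes the DBSCAN labels, one extra check (my addition) is its accuracy on the training data; this says nothing about generalization, only about how well the tree can re-express the clusters.
# Fraction of points whose DBSCAN label the tree reproduces (training accuracy)
print(clf.score(boston.data, dbscan_tsne))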
For reference, plot the distribution of the target (house prices) within each cluster.
plt.boxplot([boston.target[dbscan_tsne == i]
             for i in range(np.min(dbscan_tsne), np.max(dbscan_tsne) + 1)],
            labels=range(np.min(dbscan_tsne), np.max(dbscan_tsne) + 1))
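As a numeric counterpart to the boxplot, one could also print the median price per cluster (again my addition, assuming the same variables as above):
# Median house price per DBSCAN cluster
for i in range(np.min(dbscan_tsne), np.max(dbscan_tsne) + 1):
    print(f"cluster {i}: median price {np.median(boston.target[dbscan_tsne == i]):.1f}")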
To sum up my impressions: it was interesting to try, but when it comes to actually extracting useful information this way, I remain fairly skeptical.
Incidentally, even if you mix boston.target in with the original data, the result comes out much the same.
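I haven't included that run here, but in case it's unclear what I mean by "mixing in" the target, the sketch would be something like appending it as an extra column before embedding:
# Sketch only: append the target as one more feature and re-embed with t-SNE
data_with_target = np.column_stack([boston.data, boston.target])
tsne_with_target = TSNE(n_components=2, random_state=0).fit_transform(data_with_target)
plt.plot(tsne_with_target[:, 0], tsne_with_target[:, 1], ".")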