This article is an introduction to, and a small implementation experiment for, a paper I read recently (accepted at ICML 2020).
The title is "Explainable k-Means and k-Medians Clustering", and the preprint is available on arXiv.
This paper proposes an approach for **obtaining clustering rules with a decision tree** for the clustering methods k-means and k-medians. Clusterings that can be explained by a decision tree have been studied before, but the contribution of this paper is that it is **the first to propose algorithms that approximate k-means / k-medians with a theoretical guarantee**, together with upper and lower bounds. If you want a more detailed summary of the paper than this article gives, please read it from the link below. Summary of the paper
The paper itself contains no experiments, so in this article I implement the proposed algorithms, albeit briefly, and try them out (code: github.com/matsutakk/explainable-kmeans/blob/master/XAI_kmeans.ipynb).
In this paper, k-means / k-medians is approximated by a decision tree as shown in the figure below. When the number of clusters is $k$, the tree is built with $k$ leaves so that one cluster is assigned to each leaf. Such a tree has $k-1$ internal nodes, so a certain degree of explainability is guaranteed (the tree cannot become too deep).
The paper treats two cases, $k = 2$ and $k > 2$. For $k = 2$ there are two leaves and a single internal node (the root), so only one split rule needs to be found. The paper proposes an algorithm that finds the optimal feature dimension $i$ and threshold $\theta$ efficiently using sorting and dynamic programming; for details, please see the PDF of the paper summary. I implemented this in Python; the notebook is available here, and a minimal sketch of such a threshold search appears after the dataset below. I ran it on two datasets, but in this article I will only show one. First, put the dataset in the variable X.
import numpy as np

# First dataset
# Students' Japanese (national language), math, and English scores as an array
X = np.array([
[ 80, 85, 100 ],
[ 96, 100, 100 ],
[ 54, 83, 98 ],
[ 80, 98, 98 ],
[ 90, 92, 91 ],
[ 84, 78, 82 ],
[ 79, 100, 96 ],
[ 88, 92, 92 ],
[ 98, 73, 72 ],
[ 75, 84, 85 ],
[ 92, 100, 96 ],
[ 96, 92, 90 ],
[ 99, 76, 91 ],
[ 75, 82, 88 ],
[ 90, 94, 94 ],
[ 54, 84, 87 ],
[ 92, 89, 62 ],
[ 88, 94, 97 ],
[ 42, 99, 80 ],
[ 70, 98, 70 ],
[ 94, 78, 83 ],
[ 52, 73, 87 ],
[ 94, 88, 72 ],
[ 70, 73, 80 ],
[ 95, 84, 90 ],
[ 95, 88, 84 ],
[ 75, 97, 89 ],
[ 49, 81, 86 ],
[ 83, 72, 80 ],
[ 75, 73, 88 ],
[ 79, 82, 76 ],
[ 100, 77, 89 ],
[ 88, 63, 79 ],
[ 100, 50, 86 ],
[ 55, 96, 84 ],
[ 92, 74, 77 ],
[ 97, 50, 73 ],
])
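The functions optimal_threshold_2means and clustering_2means_by_tree used below are defined in the linked notebook. Since their bodies are not shown in this article, here is a minimal sketch of what such a 2-means threshold search could look like (my own simplified version using prefix sums over each sorted feature, not necessarily the notebook's implementation; the returned tuple format and the 0/1 label orientation are assumptions for illustration):

def optimal_threshold_2means(X):
    """Search every feature and threshold for the axis-aligned cut that
    minimizes the 2-means cost of the induced split (each side measured
    against its own mean). Prefix sums give O(n) cost evaluation per
    feature after an O(n log n) sort."""
    X = X.astype(float)
    n, d = X.shape
    best = None  # (feature index, threshold, cost)
    for i in range(d):
        Xs = X[np.argsort(X[:, i])]
        csum = np.cumsum(Xs, axis=0)      # prefix sums per dimension
        csq = np.cumsum(Xs ** 2, axis=0)  # prefix sums of squares
        for m in range(1, n):             # m points fall on the left side
            if Xs[m - 1, i] == Xs[m, i]:
                continue                  # no threshold separates equal values
            left_cost = np.sum(csq[m - 1] - csum[m - 1] ** 2 / m)
            r_sum, r_sq = csum[-1] - csum[m - 1], csq[-1] - csq[m - 1]
            right_cost = np.sum(r_sq - r_sum ** 2 / (n - m))
            cost = left_cost + right_cost
            if best is None or cost < best[2]:
                best = (i, Xs[m - 1, i], cost)  # rule: feature i <= threshold
    return best

def clustering_2means_by_tree(best_split, X):
    """Label points 0/1 by the single threshold rule (the 0/1 orientation
    may be flipped relative to the k-means labels)."""
    i, theta, _ = best_split
    return (X[:, i] > theta).astype(int)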
Now approximate the k-means result with the proposed method (I did not try k-medians).
from sklearn.cluster import KMeans

# k-means clustering
kmeans_model = KMeans(n_clusters=2, random_state=10).fit(X)
# Cluster labels assigned by k-means
labels = kmeans_model.labels_
# Split obtained by the approximation algorithm based on the proposed method
bests_split = optimal_threshold_2means(X)
approx_labels = clustering_2means_by_tree(bests_split, X)
print(bests_split)
print(approx_labels)
Execution result: the output above means that the decision tree splitting on the 0th feature at threshold 70 is optimal. So how well does clustering by this single decision-tree rule approximate k-means clustering? The calculation code is below.
approx_score(approx_labels, kmeans_model, X)  # cost of approximation algorithm / cost of k-means
Execution result: the output above means that, with two clusters, the rule "is the 0th feature 70 or less?" produces exactly the same clustering as k-means. In other words, when a student in this dataset of Japanese, math, and English scores asks "Why was I put in cluster 0?", we can answer "Because your Japanese score is 70 or less."
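approx_score is also defined in the notebook. A plausible minimal version, assuming it simply returns the cost of the threshold split divided by the k-means cost (inertia_), might look like this:

def approx_score(approx_labels, kmeans_model, X):
    """Ratio of the threshold split's 2-means cost to the k-means cost.
    A value of 1.0 means the split loses nothing compared with k-means."""
    X = np.asarray(X, dtype=float)
    tree_cost = sum(
        np.sum((X[approx_labels == c] - X[approx_labels == c].mean(axis=0)) ** 2)
        for c in np.unique(approx_labels)
    )
    return tree_cost / kmeans_model.inertia_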
One of the strengths of this paper is that it proposes approximation algorithms with theoretical guarantees. The authors prove that for 2-means there always exists a decision tree with a single split (a partition $\widehat{C}$) that satisfies the following inequality. In other words, a single threshold rule achieves a constant-factor approximation, no matter how high the dimension is. The proof makes clever use of Hall's theorem.
\operatorname{cost}(\widehat{C}) \leq 4 \cdot \operatorname{cost}(\mathrm{opt})
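As a quick sanity check of this bound (not an experiment from the paper), one can compare the cost found by the threshold search against the cost reported by scikit-learn's k-means on random two-blob data, using the sketch functions above:

# Empirical check of the 4-approximation bound (uses the sketch functions above).
rng = np.random.default_rng(0)
for trial in range(5):
    # two Gaussian blobs in 5 dimensions
    Z = rng.normal(size=(200, 5))
    Z[100:] += 4.0
    km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(Z)
    tree_cost = optimal_threshold_2means(Z)[2]
    # the theorem says this ratio should never exceed 4
    print(f"trial {trial}: cost ratio = {tree_cost / km.inertia_:.3f}")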
The paper discusses $k = 2$ first and then $k > 2$. For $k > 2$, the threshold tree is built greedily, one cut at a time; a rough sketch follows below. If you are interested in the full flow and experiments, take a look at the summary of the paper and the code (nbviewer.jupyter.org/github/matsutakk/explainable-kmeans/blob/master/XAI_kmeans.ipynb).
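Below is a rough Python sketch of my own simplified reading of the $k > 2$ construction (at each node, pick an axis-aligned cut that separates the centers while sending as few points as possible away from their own reference center, then recurse); the candidate thresholds, data structures, and usage are illustrative assumptions, not the paper's exact algorithm:

def build_threshold_tree(X, centers, labels):
    """Greedy sketch: at each node, choose the axis-aligned cut that keeps
    the most points on the same side as their own reference center, then
    recurse until each leaf contains a single center."""
    node_centers = np.unique(labels)
    if len(node_centers) <= 1:
        # one center left (or none, if every point here was a "mistake")
        return {"leaf": int(node_centers[0]) if len(node_centers) else None}
    best = None
    for i in range(X.shape[1]):
        # candidate thresholds taken from the centers' coordinates
        # (a simplification; each candidate puts >= 1 center on each side)
        for theta in np.unique(centers[node_centers, i])[:-1]:
            point_left = X[:, i] <= theta
            center_left = centers[labels, i] <= theta
            mistakes = int(np.sum(point_left != center_left))
            if best is None or mistakes < best[2]:
                best = (i, theta, mistakes)
    i, theta, _ = best
    point_left = X[:, i] <= theta
    center_left = centers[labels, i] <= theta
    keep_l = point_left & center_left        # points that stay with their center
    keep_r = ~point_left & ~center_left
    return {"feature": i, "threshold": float(theta),
            "left": build_threshold_tree(X[keep_l], centers, labels[keep_l]),
            "right": build_threshold_tree(X[keep_r], centers, labels[keep_r])}

def tree_label(tree, x):
    """Follow the cuts down to a leaf and return its cluster id."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

# Hypothetical usage with k = 3 on the score data:
km3 = KMeans(n_clusters=3, random_state=10).fit(X)
tree = build_threshold_tree(np.asarray(X, float), km3.cluster_centers_, km3.labels_)
print([tree_label(tree, x) for x in np.asarray(X, float)])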
THANK YOU❤️