Do not use `cosinesimilarity` for `distanceFunction` when using Kmeans in deeplearning4j.

deeplearning4j has a Kmeans function. It is used in the following form.

    KMeansClustering kmc = KMeansClustering.setup(num, iter, distanceFunction);
    ClusterSet cs = kmc.applyTo(pointsLst);

Here `num` is the number of clusters, `iter` is the number of iterations (`10` is often used), and `distanceFunction` is the distance function (you can specify `euclidean`, `manhattan`, or `cosinesimilarity`).
Well, here is a land mine. If you specify `cosinesimilarity` for `distanceFunction`, you will not get the results you expected.
Look at the three options again: `euclidean`, `manhattan`, `cosinesimilarity`. As you may have noticed, only `cosinesimilarity` is a similarity. The other two are distances. That is, **with `cosinesimilarity` a larger value means more similar (the maximum is 1), while with the other two a smaller value means more similar**.
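
To make the difference concrete, here is a minimal sketch in plain Java. It does not use deeplearning4j at all; the class `MeasureDemo` and its method names are hypothetical and exist only for this illustration. It computes the three measures for a point against one vector that points in almost the same direction and one that is orthogonal to it.

```java
// Hypothetical illustration only; these helpers are not deeplearning4j APIs.
class MeasureDemo {
    // Euclidean distance: smaller means closer.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Manhattan distance: smaller means closer.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Cosine similarity: LARGER means closer (1 for identical directions).
    // Assumes both vectors are non-zero.
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] p = {1, 0};
        double[] near = {0.9, 0.1}; // almost the same direction as p
        double[] far = {0, 1};      // orthogonal to p

        System.out.println("euclidean: near=" + euclidean(p, near) + " far=" + euclidean(p, far));
        System.out.println("manhattan: near=" + manhattan(p, near) + " far=" + manhattan(p, far));
        System.out.println("cosine:    near=" + cosineSimilarity(p, near) + " far=" + cosineSimilarity(p, far));
    }
}
```

For the two distances the `near` value is the smaller one, while for cosine similarity the `near` value is the larger one. The ordering is reversed, which is exactly the trap.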
The deeplearning4j code looks like this, in `ClusterSet.class`:

    public Pair<Cluster, Double> nearestCluster(Point point) {
        Cluster nearestCluster = null;
        double minDistance = Float.MAX_VALUE;
        double currentDistance;
        for (Cluster cluster : getClusters()) {
            currentDistance = cluster.getDistanceToCenter(point);
            if (currentDistance < minDistance) {
                minDistance = currentDistance;
                nearestCluster = cluster;
            }
        }
        return new Pair<>(nearestCluster, minDistance);
    }

This function calculates the distance between the point and each existing cluster center, and assigns the point to the cluster with the smallest distance. That is a perfectly legitimate process, but it breaks down here because the interpretation of the value is reversed only for `cosinesimilarity`: the smallest similarity wins, so the point is assigned to the least similar cluster.
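
The effect can be reproduced without deeplearning4j. The following is a rough sketch in plain Java, where `NearestDemo`, `Measure`, and `pickSmallest` are hypothetical names that only mimic the "keep the smallest value" loop above; it is not the library's code. The same loop is run once with a distance and once with cosine similarity.

```java
// Hypothetical stand-in for the selection loop; not deeplearning4j code.
class NearestDemo {
    // Any function that scores a (center, point) pair.
    interface Measure {
        double apply(double[] a, double[] b);
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Assumes both vectors are non-zero.
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the index of the center with the SMALLEST measure value,
    // the same selection rule as the nearestCluster loop.
    static int pickSmallest(double[][] centers, double[] point, Measure measure) {
        int best = -1;
        double min = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double current = measure.apply(centers[i], point);
            if (current < min) {
                min = current;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{1, 0}, {0, 1}};
        double[] point = {0.9, 0.1}; // clearly belongs with center 0

        // With a distance, the loop selects center 0.
        System.out.println(pickSmallest(centers, point, NearestDemo::euclidean));
        // With a similarity, the same loop selects center 1, the least similar one.
        System.out.println(pickSmallest(centers, point, NearestDemo::cosineSimilarity));
    }
}
```

The selection rule is unchanged between the two calls; only the meaning of "small" differs, and that is enough to send the point to the wrong cluster.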
If you really want to use `cosinesimilarity`, at present you have no choice but to write your own subclass. I thought about sending a pull request that handles it as a special case, but gave up because the code was messy. I am still wondering what the right remedy would be.
I filed it as an issue: https://github.com/deeplearning4j/deeplearning4j/issues/2361
Apparently the issue is going to be closed as "normal operation". Certainly it is normal operation, but it is not "correct processing". Whether the option ends up disabled or kept usable, I think it would be better to rename `cosinesimilarity` to `cosinedistance` and accept only `s[cosinedistance] = 1 - s[cosinesimilarity]`.
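
As a rough sketch of that idea in plain Java (again not deeplearning4j code; `CosineDistanceDemo` and its methods are hypothetical names for illustration), converting the similarity into a distance makes the "smallest value wins" rule behave as expected:

```java
// Hypothetical sketch of the proposed conversion; not a deeplearning4j API.
class CosineDistanceDemo {
    // Assumes both vectors are non-zero.
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // cosinedistance = 1 - cosinesimilarity, so 0 means "same direction"
    // and larger values mean "less similar".
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    // Smallest-value selection, now applied to a genuine distance.
    static int nearestByCosineDistance(double[][] centers, double[] point) {
        int best = -1;
        double min = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double current = cosineDistance(centers[i], point);
            if (current < min) {
                min = current;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{1, 0}, {0, 1}};
        double[] point = {0.9, 0.1};
        // Selects center 0, the one pointing in nearly the same direction.
        System.out.println(nearestByCosineDistance(centers, point));
    }
}
```

Because the conversion only flips and shifts the scale, the ranking of clusters by similarity is preserved; it is simply expressed in the direction the selection loop expects.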