The result is nothing sophisticated, but I have published the code on GitHub.
- [1] Continued: Easy-to-Understand Pattern Recognition: An Introduction to Unsupervised Learning (a very readable book!)
- [2] D. Lashkari and P. Golland, "Convex clustering with exemplar-based models" (the original paper on the convex clustering method)
The convex clustering method is a clustering method based on parameter estimation of a Gaussian mixture model that does not require the number of classes to be known in advance.
The difference from estimating the parameters of a Gaussian mixture with the usual EM algorithm (hereinafter, soft K-means) lies in which parameters must be given initial values:
Method | Initial values required |
---|---|
soft K-means | Number of normal distributions and the mean of each distribution |
Convex clustering method | Variance of the normal distribution (shared by all distributions) |
The solution therefore has the advantage of depending less on the initial values than soft K-means does.
All of the observed data points are treated as candidate centroids, and the likelihood of the normal distribution formed around each centroid is computed for every data point. This likelihood computation is performed only once; afterwards, only the prior probabilities are updated. As the updates proceed, the priors of centroids that are unlikely to form a distribution approach zero, and the number of centroids whose prior remains above a certain level becomes the final number of classes.
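Written out, with all $N$ samples $x_1, \dots, x_N$ taken as candidate centroids and a fixed variance $\sigma^2$, the update that the code below implements can be expressed as follows (the notation is mine, based on my reading of [2]):

$$
\pi_j^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi_j^{(t)}\, f(x_i \mid x_j, \sigma^2)}{\sum_{k=1}^{N} \pi_k^{(t)}\, f(x_i \mid x_k, \sigma^2)}
$$

where $f(x_i \mid x_j, \sigma^2)$ is the Gaussian density centered at $x_j$ and $\pi_j$ is the prior probability (mixture weight) of centroid $j$. The densities $f$ are computed once; only the $\pi_j$ change.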
I became interested in this method because it was introduced in [1]. What makes it attractive is that it seems easy to implement and appears to depend little on the initial values. As an approach for the toy projects I build as a hobby, I experimented with it expecting it to be easier to use than nonparametric Bayes.
The implementation uses Python + NumPy. Only the part that updates the parameters of the convex clustering method is shown here.
convex_clustering.py
```python
import numpy as np


def calculateLikelihoodCombination(x, sigma, gaussianFunc):
    # Likelihood of every sample under a Gaussian centered at every sample
    # (rows: candidate centroids, columns: samples). Computed only once.
    likelihood = np.empty(0)
    for i1 in x:
        likelihood = np.append(likelihood, gaussianFunc(x, i1, sigma))
    samples = len(x)
    return np.reshape(likelihood, (samples, samples))


def calculatePrior(prePrior, likelihood):
    # One update of the prior probabilities (mixture weights) of the centroids.
    samples = len(prePrior)
    numerator = (prePrior * likelihood.T).T           # pi_j * f(x_i | x_j, sigma^2)
    denominator = np.sum(numerator, axis=0) + 1e-10   # sum over centroids, guarded against zero
    return np.sum(numerator / denominator, axis=1) / samples


def runConvexClustering(x, hyperSigma, gaussian, plotter):
    # Pre-calculation of likelihood (done only once).
    likelihood = calculateLikelihoodCombination(x, hyperSigma, gaussian)

    # Initialize prior probabilities: every sample starts as an equally weighted centroid.
    classNum = len(x)  # The number of samples is used as the number of classes.
    firstPrior = 1.0 / classNum  # Non-zero value.
    prior = np.repeat(firstPrior, classNum)

    # Update parameters: only the priors change, the likelihoods stay fixed.
    for dummy in range(1000):
        prior = calculatePrior(prior, likelihood)
    return prior
```
It is so simple that only a few dozen lines are needed. I think it is even shorter than the soft K-means code I wrote locally in the past.
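For reference, `runConvexClustering` expects a Gaussian density to be passed in as `gaussianFunc`. The original listing does not show that part, so below is only a minimal sketch of what such a helper could look like, assuming 2-D data and an isotropic Gaussian; the name `gaussian2d` is my own.

```python
import numpy as np

def gaussian2d(x, mean, sigma):
    # Isotropic 2-D Gaussian density, evaluated for every row of x
    # with the given mean and the shared standard deviation sigma.
    diff = x - mean
    sqDist = np.sum(diff * diff, axis=1)
    return np.exp(-sqDist / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
```

With such a helper, `prior = runConvexClustering(x, hyperSigma, gaussian2d, plotter)` yields the converged priors, and the centroids whose prior survives a small cut-off give the final classes.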
Factor | Level |
---|---|
Data | Two data sets (scatter plots omitted) |
Number of normal distributions used for sampling | 4 |
Number of samples from one normal distribution | 100 points |
Size of the initial variance (magnification) | Normal (×1), Large (×2), Very large (×10) |
The total number of samples in one experiment is 4 × 100 = 400 points. The magnification attached to the size of the initial variance is relative to the standard deviation of the normal distributions used for sampling.
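As a rough sketch of how such a setup could be reproduced with the code above (reusing the hypothetical `gaussian2d` helper): the cluster centers, random seed, dimensionality, and cut-off threshold below are my own assumptions, not the actual experimental conditions.

```python
import numpy as np

# Hypothetical 2-D setup: 4 normal distributions, 100 samples each (400 points in total).
np.random.seed(0)
samplingStd = 1.0
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 6.0]])
x = np.vstack([np.random.normal(c, samplingStd, size=(100, 2)) for c in centers])

# The initial standard deviation is a multiple of the one used for sampling.
for magnification in (1, 2, 10):
    hyperSigma = samplingStd * magnification
    prior = runConvexClustering(x, hyperSigma, gaussian2d, plotter=None)
    # Count the centroids whose prior survives a small cut-off (threshold chosen arbitrarily).
    survivors = np.sum(prior > 1.0 / (10 * len(x)))
    print(magnification, survivors)
```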
The following table shows the clustering results. The rows of the table correspond to the data sets, and the columns to the size of the initial variance.
(Clustering result plots for each data set at each initial variance — Normal, Large, Very large — are omitted here.)
Blue shows the samples, and green shows the shape of the convex combination of the normal distributions generated from the class centroids obtained by the convex clustering method.
In addition, the number of classes remaining after the cut is shown below. As before, the rows correspond to the data sets and the columns to the size of the initial variance.
Data set | Normal | Large | Very large |
---|---|---|---|
First data set | 46 | 40 | 5 |
Second data set | 23 | 11 | 15 |
When the initial variance is Normal or Large, the data seem to be clustered reasonably well; the clustering actually looks more correct with Large. When applying the method to a real problem, it may be better to use a somewhat larger variance than expected. When the initial variance is Very large, the four distributions are clustered into a single one, but that is a fairly predictable outcome.
For the second data set, the number of centroids is closer to the number of normal distributions used to generate the samples (= 4) than for the first data set. This is also mentioned in [2]: the wider the space to be searched, the weaker the dependence of the convex clustering method on the initial values. However, it bothers me a little that the number of centroids did not get as close to 4 as in [1] and [2]. (I sincerely hope this is not an implementation bug.)
My impression of the convex clustering method from implementing and experimenting with it is that the implementation is very easy, but some tricks seem to be needed to apply it to real problems. The discussions in [1] and [2] cover its advantages and disadvantages, and it may be hard to use depending on the problem at hand. I think it is a cost-effective clustering method, but I want to consider more carefully whether it is the right approach for the problem I want to solve.