While I was looking into dimensionality reduction methods, a junior colleague of mine told me about **Deep Embedded Clustering**, a deep learning method that learns dimensionality reduction and clustering at the same time, so I decided to implement it with Chainer. That is this article. The implemented code is published on GitHub. https://github.com/ymym3412/DeepEmbeddedClustering
Deep Embedded Clustering is a clustering method proposed in the paper "Unsupervised Deep Embedding for Clustering Analysis". Existing approaches to dimensionality reduction and clustering can be summarized as follows.
- k-means, Gaussian mixture models (GMM)
    - Fast and easy to apply to many problems
    - The distance metric is limited to the original data space
    - Tend to be ineffective when the data lie in a high-dimensional space
- Extensions of k-means
    - An EM-style algorithm that alternates between clustering with k-means and maximizing inter-cluster variance
    - The mapping to the lower-dimensional space is limited to linear transformations
- Spectral clustering and its extensions
    - Often perform better than methods such as k-means
    - An extension has been proposed that replaces the eigendecomposition in spectral clustering with an AutoEncoder
    - Good performance, but high memory consumption
    - Computing the graph Laplacian scales quadratically or worse in the number of data points
Compared with the methods above, Deep Embedded Clustering has the following advantages:

- **Robust** to the choice of hyperparameters
- Higher accuracy than the conventional methods
- Computation time is $O(nk)$
    - Linear in the number of data points and the number of centroids
Training is roughly divided into two steps: pretraining the network with a Stacked (Denoising) AutoEncoder, followed by joint optimization of the low-dimensional mapping and the cluster centroids.
For pretraining, the code defines a single layer as a DenoisingAutoEncoder (a Chain) and the stacked model as a ChainList.
```python
import chainer
import chainer.functions as F
import chainer.links as L


class DenoisingAutoEncoder(chainer.Chain):
    def __init__(self, input_size, output_size,
                 encoder_activation=True, decoder_activation=True):
        w = chainer.initializers.Normal(scale=0.01)
        super(DenoisingAutoEncoder, self).__init__()
        with self.init_scope():
            self.encoder = L.Linear(input_size, output_size, initialW=w)
            self.decoder = L.Linear(output_size, input_size, initialW=w)
        # Flags that control whether an activation function is applied
        # at encode / decode time
        self.encoder_activation = encoder_activation
        self.decoder_activation = decoder_activation

    def __call__(self, x):
        if self.encoder_activation:
            h = F.relu(self.encoder(F.dropout(x, 0.2)))
        else:
            h = self.encoder(F.dropout(x, 0.2))

        if self.decoder_activation:
            h = F.relu(self.decoder(F.dropout(h, 0.2)))
        else:
            h = self.decoder(F.dropout(h, 0.2))
        return h

    def encode(self, x):
        if self.encoder_activation:
            h = F.relu(self.encoder(x))
        else:
            h = self.encoder(x)
        return h

    def decode(self, x):
        if self.decoder_activation:
            h = F.relu(self.decoder(x))
        else:
            h = self.decoder(x)
        return h


class StackedDenoisingAutoEncoder(chainer.ChainList):
    def __init__(self, input_dim):
        # Layer dimensions follow the paper: input -> 500 -> 500 -> 2000 -> 10
        super(StackedDenoisingAutoEncoder, self).__init__(
            DenoisingAutoEncoder(input_dim, 500, decoder_activation=False),
            DenoisingAutoEncoder(500, 500),
            DenoisingAutoEncoder(500, 2000),
            DenoisingAutoEncoder(2000, 10, encoder_activation=False)
        )

    def __call__(self, x):
        # Encode through every layer, then decode back in reverse order
        daes = []
        for dae in self.children():
            x = dae.encode(x)
            daes.append(dae)
        for dae in reversed(daes):
            x = dae.decode(x)
        return x
```
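As a quick sanity check, the stacked model can be instantiated and run on an MNIST-sized batch like this (a minimal sketch of mine; the mini-batch here is random data, not the actual training set):

```python
import numpy as np

# 784-dimensional input, as for flattened 28x28 MNIST images
model = StackedDenoisingAutoEncoder(784)
x = np.random.rand(32, 784).astype(np.float32)  # dummy mini-batch
reconstruction = model(x)                       # encode down to 10 dims, decode back
print(reconstruction.shape)                     # (32, 784)
```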
Pretraining itself consists of two steps: layer-wise pretraining, in which each layer is trained as an AutoEncoder, and fine-tuning, in which the whole network is trained end to end.
```python
from chainer import iterators, optimizers, training
from chainer.dataset import convert
from chainer.datasets import mnist, tuple_dataset
from chainer.functions import mean_squared_error
from chainer.training import extensions

# Layer-Wise Pretrain
print("Layer-Wise Pretrain")
# Take each layer out of the stack and train it as an AutoEncoder
for i, dae in enumerate(model.children()):
    print("Layer {}".format(i + 1))
    train_tuple = tuple_dataset.TupleDataset(train, train)
    train_iter = iterators.SerialIterator(train_tuple, batchsize)
    clf = L.Classifier(dae, lossfun=mean_squared_error)
    clf.compute_accuracy = False
    if chainer.cuda.available and args.gpu >= 0:
        clf.to_gpu(gpu_id)
    optimizer = optimizers.MomentumSGD(lr=0.1)
    optimizer.setup(clf)
    updater = training.StandardUpdater(train_iter, optimizer, device=gpu_id)
    trainer = training.Trainer(updater, (50000, "iteration"), out="mnist_result")
    trainer.extend(extensions.LogReport())
    trainer.extend(extensions.PrintReport(['iteration', 'main/loss', 'elapsed_time']))
    # Multiply the learning rate (lr) by 1/10 every 20,000 iterations
    # (ChangeLearningRate is a custom extension defined in the repository)
    trainer.extend(ChangeLearningRate(), trigger=(20000, "iteration"))
    trainer.run()
    # Push the training data through the trained layer,
    # so it flows 784 -> 500 -> 500 -> 2000 as the loop progresses
    train = dae.encode(train).data

# Finetuning
print("fine tuning")
# Disable Dropout during fine-tuning
with chainer.using_config("train", False):
    train, _ = mnist.get_mnist()
    train, _ = convert.concat_examples(train, device=gpu_id)
    train_tuple = tuple_dataset.TupleDataset(train, train)
    train_iter = iterators.SerialIterator(train_tuple, batchsize)
    model = L.Classifier(model, lossfun=mean_squared_error)
    model.compute_accuracy = False
    if chainer.cuda.available and args.gpu >= 0:
        model.to_gpu(gpu_id)
    optimizer = optimizers.MomentumSGD(lr=0.1)
    optimizer.setup(model)
    updater = training.StandardUpdater(train_iter, optimizer, device=gpu_id)
    trainer = training.Trainer(updater, (100000, "iteration"), out="mnist_result")
    trainer.extend(extensions.LogReport())
    trainer.extend(extensions.PrintReport(['iteration', 'main/loss', 'elapsed_time']))
    trainer.extend(ChangeLearningRate(), trigger=(20000, "iteration"))
    trainer.run()
```
The encoder part of the network trained here is reused as-is as the mapping from the data to the low-dimensional space.
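Concretely, mapping data to the low-dimensional space just means running it through the `encode` half of every layer. A minimal sketch (here `sdae` stands for the pretrained StackedDenoisingAutoEncoder; the helper name is mine, not from the repository):

```python
def encode_all(sdae, x):
    # Apply only the encoder half of each layer: 784 -> 500 -> 500 -> 2000 -> 10
    for dae in sdae.children():
        x = dae.encode(x)
    return x

# z has shape (n_samples, 10); k-means on z gives the initial centroids u_j
with chainer.using_config("train", False):
    z = encode_all(sdae, train).data
```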
The distinguishing feature of Deep Embedded Clustering is that it learns the mapping of the data to a lower-dimensional space and the cluster centroids at the same time. A neural network converts the original data $x_i$ to a low-dimensional representation $z_i$, and k-means is run on the $z_i$ to initialize the cluster centroids $u_j$. The network parameters that map $x_i$ to $z_i$ and the centroids $u_j$ are then trained jointly, so as to minimize the KL divergence between the two distributions $Q$ and $P$ defined below. Since parameters cannot be tuned by cross-validation in unsupervised learning, $\alpha$ is fixed to 1.
q_{ij} = \frac{(1 + \|z_i - u_j\|^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j^{'}}(1 + \|z_i - u_{j^{'}}\|^2 / \alpha)^{-\frac{\alpha + 1}{2}}}
p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}
where $f_j = \sum_i q_{ij}$ is the soft cluster frequency.
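As a sanity check, both distributions can be computed in a few lines of NumPy for $\alpha = 1$ (a sketch I wrote for illustration; `z` holds the low-dimensional points and `us` the centroids):

```python
import numpy as np

def soft_assignments(z, us):
    # q_ij: Student's t kernel (alpha = 1) between z_i and u_j, normalized over j
    dist_sq = ((z[:, None, :] - us[None, :, :]) ** 2).sum(axis=2)
    q = 1.0 / (1.0 + dist_sq)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # p_ij: square q and renormalize by the soft cluster frequencies f_j
    f = q.sum(axis=0)            # f_j = sum_i q_ij
    p = q ** 2 / f
    return p / p.sum(axis=1, keepdims=True)
```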
Training continues until the fraction of data points whose cluster assignment changes between iterations falls below 0.1%. As far as I could tell, Chainer does not provide a suitable loss function for this, so I implemented one myself.
```python
from itertools import chain, repeat

from chainer import cuda


class TdistributionKLDivergence(chainer.Function):

    def forward(self, inputs):
        xp = cuda.get_array_module(*inputs)
        z, us = inputs[0], xp.array(inputs[1:], dtype=xp.float32)
        # Matrix whose (i, j) component is the Euclidean distance between z_i and u_j
        dist_matrix = xp.linalg.norm(
            xp.vstack(chain.from_iterable(map(lambda v: repeat(v, us.shape[0]), z)))
            - xp.vstack(repeat(us, z.shape[0])),
            axis=1).reshape(z.shape[0], us.shape[0])
        q_matrix = (self.tdistribution_kernel(dist_matrix).T
                    / self.tdistribution_kernel(dist_matrix).sum(axis=1)).T
        p_matrix = self.compute_pmatrix(q_matrix)
        kl_divergence = (p_matrix * (xp.log(p_matrix) - xp.log(q_matrix))).sum()
        return xp.array(kl_divergence, dtype=xp.float32),

    def backward(self, inputs, grad_outputs):
        xp = cuda.get_array_module(*inputs)
        z, us = inputs[0], xp.array(inputs[1:], dtype=xp.float32)
        gloss, = grad_outputs
        gloss = gloss / z.shape[0]
        # Difference vectors z_i - u_j, stacked row by row
        norms = (xp.vstack(chain.from_iterable(map(lambda v: repeat(v, us.shape[0]), z)))
                 - xp.vstack(repeat(us, z.shape[0])))
        # The (i, j) component is the 10-dimensional vector (z_i - u_j);
        # shape is (n_data, n_centroids, 10)
        z_norm_matrix = norms.reshape(z.shape[0], us.shape[0], z.shape[1])
        dist_matrix = xp.linalg.norm(norms, axis=1).reshape(z.shape[0], us.shape[0])
        q_matrix = (self.tdistribution_kernel(dist_matrix).T
                    / self.tdistribution_kernel(dist_matrix).sum(axis=1)).T
        p_matrix = self.compute_pmatrix(q_matrix)
        # Gradients with respect to z_i and u_j
        gz = 2 * ((((p_matrix - q_matrix) * self.tdistribution_kernel(dist_matrix))
                   * z_norm_matrix.transpose(2, 0, 1)).transpose(1, 2, 0)).sum(axis=1) * gloss
        gus = -2 * ((((p_matrix - q_matrix) * self.tdistribution_kernel(dist_matrix))
                     * z_norm_matrix.transpose(2, 0, 1)).transpose(1, 2, 0)).sum(axis=0) * gloss
        g = [gz]
        g.extend(gus)
        return tuple(g)

    def tdistribution_kernel(self, norm):
        # Student's t kernel with alpha = 1: (1 + ||z_i - u_j||^2)^(-1),
        # matching the formula for q_ij above
        xp = cuda.get_array_module(norm)
        return xp.power(1 + xp.square(norm), -1)

    def compute_pmatrix(self, q_matrix):
        xp = cuda.get_array_module(q_matrix)
        fj = q_matrix.sum(axis=0)
        matrix = xp.power(q_matrix, 2) / fj
        p_matrix = (matrix.T / matrix.sum(axis=1)).T
        return p_matrix


def tdistribution_kl_divergence(z, us):
    return TdistributionKLDivergence()(z, *us)
```
I tried clustering MNIST with the trained model. Each point is colored by its true label, and the cluster centroids are plotted as black triangles. 500 points were randomly sampled and reduced to two dimensions with t-SNE for visualization.
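For reference, the plot was produced along these lines (a sketch assuming scikit-learn and matplotlib; `z`, `labels`, and `us` are the encoded points, true labels, and centroids from above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

idx = np.random.choice(len(z), 500, replace=False)
centroids = np.stack([u.data for u in us])
# Embed the sampled points and the centroids together with t-SNE
embedded = TSNE(n_components=2).fit_transform(np.vstack([z[idx], centroids]))
plt.scatter(embedded[:500, 0], embedded[:500, 1], c=labels[idx], cmap="tab10", s=10)
plt.scatter(embedded[500:, 0], embedded[500:, 1], c="black", marker="^", s=80)
plt.show()
```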
The paper reports an MNIST experiment, but my results did not separate as cleanly. "4", "7", and "9" in particular are quite mixed. My impression is that how cleanly the clusters separate depends on the pretraining and on the initial positions of the centroids.
In this article, I implemented "Deep Embedded Clustering", a deep-learning-based clustering method, with Chainer 2.0. I had not touched Chainer since version 2 came out, so it was good practice. With a GPU, everything from pretraining to the optimization of the mapping and centroids finishes in about 40 minutes, so the computation is not much of a burden. Finally, I would like to thank my junior colleague for introducing me to an interesting technique.
References:

- Unsupervised Deep Embedding for Clustering Analysis
- [ICML Reading Session] Unsupervised Deep Embedding for Clustering Analysis
- Chainer: Tutorial for Beginners Vol.1
- I tried Stacked Auto-Encoder with Chainer
- Chainer Docs: Define your own function