While I was looking into dimensionality reduction methods, a junior colleague of mine told me about **Deep Embedded Clustering**, a deep learning method that learns dimensionality reduction and clustering at the same time, so I decided to implement it with Chainer. That is this article. The implemented code is published on GitHub. https://github.com/ymym3412/DeepEmbeddedClustering
Deep Embedded Clustering is a clustering method proposed in the paper "Unsupervised Deep Embedding for Clustering Analysis". Existing approaches to dimensionality reduction and clustering can be summarized as follows.
- k-means, Gaussian mixture models (GMM)
    - Fast and easy to apply to many problems
    - The distance metric is limited to the original data space
    - Tend to be ineffective when the data lie in a high-dimensional space
- Extensions of k-means
    - An EM-style algorithm that alternates between clustering with k-means and maximizing inter-cluster variance
    - The mapping to the lower-dimensional space is limited to linear transformations
- Spectral clustering and its extensions
    - Often perform better than methods such as k-means
    - An extension has been proposed that replaces the eigendecomposition in spectral clustering with an AutoEncoder
    - Good performance, but high memory consumption
    - Computing the graph Laplacian scales quadratically or worse in the number of data points
Compared with the methods above, Deep Embedded Clustering has the following advantages:

- **Robust** to the choice of hyperparameters
- Higher accuracy than the conventional methods
- Computation time is $O(nk)$
    - Linear in the number of data points and the number of centroids
Training is roughly divided into two steps: pretraining the network with a Stacked (Denoising) AutoEncoder, followed by joint optimization of the low-dimensional mapping and the cluster centroids.
For pretraining, the code defines a single layer as a DenoisingAutoEncoder (a Chain) and the stacked model as a ChainList.
```python
import chainer
import chainer.functions as F
import chainer.links as L


class DenoisingAutoEncoder(chainer.Chain):
    def __init__(self, input_size, output_size,
                 encoder_activation=True, decoder_activation=True):
        w = chainer.initializers.Normal(scale=0.01)
        super(DenoisingAutoEncoder, self).__init__()
        with self.init_scope():
            self.encoder = L.Linear(input_size, output_size, initialW=w)
            self.decoder = L.Linear(output_size, input_size, initialW=w)
        # Flags that control whether an activation function is applied
        # at encode / decode time
        self.encoder_activation = encoder_activation
        self.decoder_activation = decoder_activation

    def __call__(self, x):
        if self.encoder_activation:
            h = F.relu(self.encoder(F.dropout(x, 0.2)))
        else:
            h = self.encoder(F.dropout(x, 0.2))

        if self.decoder_activation:
            h = F.relu(self.decoder(F.dropout(h, 0.2)))
        else:
            h = self.decoder(F.dropout(h, 0.2))
        return h

    def encode(self, x):
        if self.encoder_activation:
            h = F.relu(self.encoder(x))
        else:
            h = self.encoder(x)
        return h

    def decode(self, x):
        if self.decoder_activation:
            h = F.relu(self.decoder(x))
        else:
            h = self.decoder(x)
        return h


class StackedDenoisingAutoEncoder(chainer.ChainList):
    def __init__(self, input_dim):
        # Layer dimensions follow the paper: input -> 500 -> 500 -> 2000 -> 10
        super(StackedDenoisingAutoEncoder, self).__init__(
            DenoisingAutoEncoder(input_dim, 500, decoder_activation=False),
            DenoisingAutoEncoder(500, 500),
            DenoisingAutoEncoder(500, 2000),
            DenoisingAutoEncoder(2000, 10, encoder_activation=False)
        )

    def __call__(self, x):
        # Encode through every layer, then decode back in reverse order
        daes = []
        for dae in self.children():
            x = dae.encode(x)
            daes.append(dae)
        for dae in reversed(daes):
            x = dae.decode(x)
        return x
```
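As a quick sanity check, the stacked model can be instantiated and run on an MNIST-sized batch like this (a minimal sketch of mine; the mini-batch here is random data, not the actual training set):

```python
import numpy as np

# 784-dimensional input, as for flattened 28x28 MNIST images
model = StackedDenoisingAutoEncoder(784)
x = np.random.rand(32, 784).astype(np.float32)  # dummy mini-batch
reconstruction = model(x)                       # encode down to 10 dims, decode back
print(reconstruction.shape)                     # (32, 784)
```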
Pretraining itself consists of two steps: layer-wise pretraining, in which each layer is trained as an AutoEncoder, and fine-tuning, in which the whole network is trained end to end.
```python
from chainer import iterators, optimizers, training
from chainer.dataset import convert
from chainer.datasets import mnist, tuple_dataset
from chainer.functions import mean_squared_error
from chainer.training import extensions

# Layer-Wise Pretrain
print("Layer-Wise Pretrain")
# Take each layer out of the stack and train it as an AutoEncoder
for i, dae in enumerate(model.children()):
    print("Layer {}".format(i + 1))
    train_tuple = tuple_dataset.TupleDataset(train, train)
    train_iter = iterators.SerialIterator(train_tuple, batchsize)
    clf = L.Classifier(dae, lossfun=mean_squared_error)
    clf.compute_accuracy = False
    if chainer.cuda.available and args.gpu >= 0:
        clf.to_gpu(gpu_id)
    optimizer = optimizers.MomentumSGD(lr=0.1)
    optimizer.setup(clf)
    updater = training.StandardUpdater(train_iter, optimizer, device=gpu_id)
    trainer = training.Trainer(updater, (50000, "iteration"), out="mnist_result")
    trainer.extend(extensions.LogReport())
    trainer.extend(extensions.PrintReport(['iteration', 'main/loss', 'elapsed_time']))
    # Multiply the learning rate (lr) by 1/10 every 20,000 iterations
    # (ChangeLearningRate is a custom extension defined in the repository)
    trainer.extend(ChangeLearningRate(), trigger=(20000, "iteration"))
    trainer.run()
    # Push the training data through the trained layer,
    # so it flows 784 -> 500 -> 500 -> 2000 as the loop progresses
    train = dae.encode(train).data

# Finetuning
print("fine tuning")
# Disable Dropout during fine-tuning
with chainer.using_config("train", False):
    train, _ = mnist.get_mnist()
    train, _ = convert.concat_examples(train, device=gpu_id)
    train_tuple = tuple_dataset.TupleDataset(train, train)
    train_iter = iterators.SerialIterator(train_tuple, batchsize)
    model = L.Classifier(model, lossfun=mean_squared_error)
    model.compute_accuracy = False
    if chainer.cuda.available and args.gpu >= 0:
        model.to_gpu(gpu_id)
    optimizer = optimizers.MomentumSGD(lr=0.1)
    optimizer.setup(model)
    updater = training.StandardUpdater(train_iter, optimizer, device=gpu_id)
    trainer = training.Trainer(updater, (100000, "iteration"), out="mnist_result")
    trainer.extend(extensions.LogReport())
    trainer.extend(extensions.PrintReport(['iteration', 'main/loss', 'elapsed_time']))
    trainer.extend(ChangeLearningRate(), trigger=(20000, "iteration"))
    trainer.run()
```
The encoder part of the network trained here is reused as-is as the mapping from the data to the low-dimensional space.
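Concretely, mapping data to the low-dimensional space just means running it through the `encode` half of every layer. A minimal sketch (here `sdae` stands for the pretrained StackedDenoisingAutoEncoder; the helper name is mine, not from the repository):

```python
def encode_all(sdae, x):
    # Apply only the encoder half of each layer: 784 -> 500 -> 500 -> 2000 -> 10
    for dae in sdae.children():
        x = dae.encode(x)
    return x

# z has shape (n_samples, 10); k-means on z gives the initial centroids u_j
with chainer.using_config("train", False):
    z = encode_all(sdae, train).data
```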
The distinguishing feature of Deep Embedded Clustering is that it learns the mapping of the data to a lower-dimensional space and the cluster centroids at the same time. A neural network converts the original data $x_i$ to a low-dimensional representation $z_i$, and k-means is run on the $z_i$ to initialize the cluster centroids $u_j$. The network parameters that map $x_i$ to $z_i$ and the centroids $u_j$ are then trained jointly, so as to minimize the KL divergence between the two distributions $Q$ and $P$ defined below. Since parameters cannot be tuned by cross-validation in unsupervised learning, $\alpha$ is fixed to 1.
q_{ij} = \frac{(1 + \|z_i - u_j\|^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j^{'}}(1 + \|z_i - u_{j^{'}}\|^2 / \alpha)^{-\frac{\alpha + 1}{2}}}
p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}
where $f_j = \sum_i q_{ij}$ is the soft cluster frequency.
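As a sanity check, both distributions can be computed in a few lines of NumPy for $\alpha = 1$ (a sketch I wrote for illustration; `z` holds the low-dimensional points and `us` the centroids):

```python
import numpy as np

def soft_assignments(z, us):
    # q_ij: Student's t kernel (alpha = 1) between z_i and u_j, normalized over j
    dist_sq = ((z[:, None, :] - us[None, :, :]) ** 2).sum(axis=2)
    q = 1.0 / (1.0 + dist_sq)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # p_ij: square q and renormalize by the soft cluster frequencies f_j
    f = q.sum(axis=0)            # f_j = sum_i q_ij
    p = q ** 2 / f
    return p / p.sum(axis=1, keepdims=True)
```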
Training continues until the fraction of data points whose cluster assignment changes between iterations falls below 0.1%. As far as I could tell, Chainer does not provide a suitable loss function for this, so I implemented one myself.
```python
from itertools import chain, repeat

from chainer import cuda


class TdistributionKLDivergence(chainer.Function):

    def forward(self, inputs):
        xp = cuda.get_array_module(*inputs)
        z, us = inputs[0], xp.array(inputs[1:], dtype=xp.float32)
        # Matrix whose (i, j) component is the Euclidean distance between z_i and u_j
        dist_matrix = xp.linalg.norm(
            xp.vstack(chain.from_iterable(map(lambda v: repeat(v, us.shape[0]), z)))
            - xp.vstack(repeat(us, z.shape[0])),
            axis=1).reshape(z.shape[0], us.shape[0])
        q_matrix = (self.tdistribution_kernel(dist_matrix).T
                    / self.tdistribution_kernel(dist_matrix).sum(axis=1)).T
        p_matrix = self.compute_pmatrix(q_matrix)
        kl_divergence = (p_matrix * (xp.log(p_matrix) - xp.log(q_matrix))).sum()
        return xp.array(kl_divergence, dtype=xp.float32),

    def backward(self, inputs, grad_outputs):
        xp = cuda.get_array_module(*inputs)
        z, us = inputs[0], xp.array(inputs[1:], dtype=xp.float32)
        gloss, = grad_outputs
        gloss = gloss / z.shape[0]
        # Difference vectors z_i - u_j, stacked row by row
        norms = (xp.vstack(chain.from_iterable(map(lambda v: repeat(v, us.shape[0]), z)))
                 - xp.vstack(repeat(us, z.shape[0])))
        # The (i, j) component is the 10-dimensional vector (z_i - u_j);
        # shape is (n_data, n_centroids, 10)
        z_norm_matrix = norms.reshape(z.shape[0], us.shape[0], z.shape[1])
        dist_matrix = xp.linalg.norm(norms, axis=1).reshape(z.shape[0], us.shape[0])
        q_matrix = (self.tdistribution_kernel(dist_matrix).T
                    / self.tdistribution_kernel(dist_matrix).sum(axis=1)).T
        p_matrix = self.compute_pmatrix(q_matrix)
        # Gradients with respect to z_i and u_j
        gz = 2 * ((((p_matrix - q_matrix) * self.tdistribution_kernel(dist_matrix))
                   * z_norm_matrix.transpose(2, 0, 1)).transpose(1, 2, 0)).sum(axis=1) * gloss
        gus = -2 * ((((p_matrix - q_matrix) * self.tdistribution_kernel(dist_matrix))
                     * z_norm_matrix.transpose(2, 0, 1)).transpose(1, 2, 0)).sum(axis=0) * gloss
        g = [gz]
        g.extend(gus)
        return tuple(g)

    def tdistribution_kernel(self, norm):
        # Student's t kernel with alpha = 1: (1 + ||z_i - u_j||^2)^(-1),
        # matching the formula for q_ij above
        xp = cuda.get_array_module(norm)
        return xp.power(1 + xp.square(norm), -1)

    def compute_pmatrix(self, q_matrix):
        xp = cuda.get_array_module(q_matrix)
        fj = q_matrix.sum(axis=0)
        matrix = xp.power(q_matrix, 2) / fj
        p_matrix = (matrix.T / matrix.sum(axis=1)).T
        return p_matrix


def tdistribution_kl_divergence(z, us):
    return TdistributionKLDivergence()(z, *us)
```
I tried clustering MNIST with the trained model. Each point is colored by its true label, and the cluster centroids are plotted as black triangles. 500 points were randomly sampled and reduced to two dimensions with t-SNE for visualization.
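For reference, the plot was produced along these lines (a sketch assuming scikit-learn and matplotlib; `z`, `labels`, and `us` are the encoded points, true labels, and centroids from above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

idx = np.random.choice(len(z), 500, replace=False)
centroids = np.stack([u.data for u in us])
# Embed the sampled points and the centroids together with t-SNE
embedded = TSNE(n_components=2).fit_transform(np.vstack([z[idx], centroids]))
plt.scatter(embedded[:500, 0], embedded[:500, 1], c=labels[idx], cmap="tab10", s=10)
plt.scatter(embedded[500:, 0], embedded[500:, 1], c="black", marker="^", s=80)
plt.show()
```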
The paper reports an MNIST experiment, but my results did not separate as cleanly. "4", "7", and "9" in particular are quite mixed. My impression is that how cleanly the clusters separate depends on the pretraining and on the initial positions of the centroids.
In this article, I implemented "Deep Embedded Clustering", a deep-learning-based clustering method, with Chainer 2.0. I had not touched Chainer since version 2 came out, so it was good practice. With a GPU, everything from pretraining to the optimization of the mapping and centroids finishes in about 40 minutes, so the computation is not much of a burden. Finally, I would like to thank my junior colleague for introducing me to an interesting technique.
References:

- Unsupervised Deep Embedding for Clustering Analysis
- [ICML Reading Session] Unsupervised Deep Embedding for Clustering Analysis
- Chainer: Tutorial for Beginners Vol.1
- I tried Stacked Auto-Encoder with Chainer
- Chainer Docs: Define your own function