For a research project, I needed to use 3D vector data as input for machine learning. The data arrive as 3D vectors of varying length every few seconds. Since we care only about the direction of each vector, we normalize its length to 1. The main topic is how to extract features from a large amount of such time-series direction data. In short, the parameters are estimated by maximum likelihood from the von Mises-Fisher distribution (vMFD), and the direction data on the hypersphere are clustered with a mixture of vMFDs (movMF). The fitted parameters serve as a kind of feature of the directional data. As readers working in natural language processing will recognize, this method is closely related to the tf-idf word-frequency index for documents. It can also be applied to physical systems in which the "vector orientation" changes over time.
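As a minimal sketch (with hypothetical sample vectors, not the actual sensor data), discarding length and keeping only direction just means dividing each vector by its norm:

```python
import numpy as np

# Hypothetical sample: 3D vectors of varying length arriving over time
V = np.array([[1.0, 2.0, 2.0],
              [0.0, -3.0, 4.0],
              [2.0, 0.0, 0.0]])

# Keep only the direction: scale each vector to unit length
norms = np.linalg.norm(V, axis=1, keepdims=True)
X = V / norms

print(np.linalg.norm(X, axis=1))  # every row now has length 1
```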
The reference this time is Chapter 7, "Anomaly detection of direction data", in the book "Anomaly Detection and Change Detection". The program used is clara-labs' spherecluster, and I also drew on the detailed explanations in the paper it is based on ("Clustering on the Unit Hypersphere using von Mises-Fisher Distributions"). The goal is to extract features from D-dimensional vector data, but following the spherecluster examples, we will go all the way through clustering after feature extraction. Thanks to this package the work went much faster. Many thanks to the authors.
**This is my first time working with the von Mises-Fisher distribution and directional data, so I would appreciate any comments pointing out extensions or misunderstandings.**
spherecluster is a Python package implementing the algorithms introduced in the paper above. It mainly consists of two algorithms, spherical K-means (spkmeans) and movMF, and movMF comes in two flavors, soft and hard. First, install spherecluster:
pip install spherecluster
That's it. Note that it depends on packages such as scikit-learn and numpy.
Algorithms

Spherical Clustering
A method that extends the idea of KMeans onto the sphere. The computation is the same as KMeans; only the input space is defined as the unit sphere.
vMFD
vMFD is a general way of expressing points distributed on a hypersphere as a probability distribution, analogous to the normal distribution. See Wikipedia.

Looking at the formula, the distribution of an input x is described by two parameters, $\mu$ and $\kappa$: $\mu$ is the mean direction of the distributed points, and $\kappa$ is the degree of concentration around $\mu$. In other words, as the figure below shows, the larger the concentration $\kappa$, the more tightly the points gather around the mean direction $\mu$ (solid line).
(Figure: vMF samples for several values of $\kappa$, from Wikipedia)
Also, as the formula shows, the normalization constant involves the modified Bessel function of the first kind, which takes some time to digest.
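To make the two parameters concrete, here is a small sketch (my own, not spherecluster's implementation) of the vMFD density in d dimensions, using scipy's modified Bessel function of the first kind for the normalization constant:

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def vmf_pdf(x, mu, kappa):
    """von Mises-Fisher density f(x; mu, kappa) for a unit vector x."""
    d = len(mu)
    # normalization constant C_d(kappa) = kappa^(d/2-1) / ((2*pi)^(d/2) * I_{d/2-1}(kappa))
    c = kappa ** (d / 2 - 1) / ((2 * np.pi) ** (d / 2) * iv(d / 2 - 1, kappa))
    return c * np.exp(kappa * np.dot(mu, x))

mu = np.array([0.0, 0.0, 1.0])
kappa = 8.0
print(vmf_pdf(mu, mu, kappa))  # the density peaks at the mean direction
```

For d = 3 the normalization constant reduces to the closed form $\kappa / (4\pi\sinh\kappa)$, which is a handy sanity check.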
The movMF used this time is a mixture distribution of vMFD components. In this article we focus on practical usage rather than on the mathematics.
movMFD The likelihood of the mixture distribution composed of the Mixture Model and vMFD is solved by the EM algorithm.
An approach similar to the common Mixed Gaussian model. In other words, it is clustered as an independent vMFD on the sphere in this way.
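As a rough illustration of the E-step that the EM algorithm repeats, the sketch below computes soft cluster assignments (responsibilities) for a two-component movMF. The parameter values are hand-picked and hypothetical; the actual fitting is what spherecluster does for us:

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def log_vmf(X, mu, kappa):
    """Log vMF density for each row of X (unit vectors)."""
    d = len(mu)
    log_c = ((d / 2 - 1) * np.log(kappa)
             - (d / 2) * np.log(2 * np.pi)
             - np.log(iv(d / 2 - 1, kappa)))
    return log_c + kappa * (X @ mu)

# Hand-picked two-component mixture (hypothetical parameters)
mus = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
kappas = [8.0, 2.0]
weights = np.array([0.5, 0.5])

# Two unit-length sample points
X = np.array([[0.99, 0.141], [-0.8, 0.6]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# E-step: responsibility r[i, k] is proportional to weight_k * f_k(x_i)
log_r = np.stack([np.log(w) + log_vmf(X, m, k)
                  for w, m, k in zip(weights, mus, kappas)], axis=1)
log_r -= log_r.max(axis=1, keepdims=True)  # stabilize before exponentiating
r = np.exp(log_r)
r /= r.sum(axis=1, keepdims=True)
print(r)  # each row sums to 1
```

The M-step would then re-estimate each component's $\mu$, $\kappa$, and weight from these responsibilities, and the two steps alternate until convergence.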
Example1: small-mix

Two-dimensional dummy data are clustered with four methods: KMeans, spherical KMeans, soft-movMF, and hard-movMF. The actual program is here. First, import the libraries.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn import metrics
import sys
sys.path.append('../spherecluster')
from spherecluster import SphericalKMeans
from spherecluster import VonMisesFisherMixture
from spherecluster import sample_vMF
Next, generate data from the following two vMF distributions:
# Generate small-mix dataset
mu_0 = np.array([-0.251, -0.968])
mu_1 = np.array([0.399, 0.917])
mus = [mu_0, mu_1]
kappa_0 = 8 # concentration parameter
kappa_1 = 2 # concentration parameter
kappas = [kappa_0, kappa_1]
num_points_per_class = 100
X_0 = sample_vMF(mu_0, kappa_0, num_points_per_class)
X_1 = sample_vMF(mu_1, kappa_1, num_points_per_class)
X = np.zeros((2 * num_points_per_class, 2))
X[:num_points_per_class, :] = X_0
X[num_points_per_class:, :] = X_1
labels = np.zeros((2 * num_points_per_class, ))
labels[num_points_per_class:] = 1
Then fit KMeans, spherical KMeans, soft-movMF, and hard-movMF.
At this point, the following error may occur:

ValueError: Data l2-norm must be 1

The movMF fit checks the l2-norm of every input sample, and for data generated by the original sample_vMF, the norm can deviate from 1 by more than 1e-4 in roughly 5% of samples. When that happens the program stops with the error above, so here we recompute the norms ourselves and remove the cause of the error. If you run the following program before fitting, the error disappears. (However, the samples that caused the error are also deleted, so the total number of samples decreases.)
import scipy.sparse as sp

dummy = []
n_samples = X.shape[0]
for ee in range(n_samples):
    # compute the l2-norm of each sample, handling sparse input
    if sp.issparse(X):
        n = sp.linalg.norm(X[ee, :])
    else:
        n = np.linalg.norm(X[ee, :])
    if np.abs(n - 1.) > 1e-4:
        dummy.append([ee, n])
if dummy:  # drop samples whose norm is too far from 1
    X = np.delete(X, np.array(dummy)[:, 0].astype(int), 0)
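Alternatively (this is my own workaround, not part of the original program), instead of deleting the offending samples you can project every row back onto the unit sphere, which also satisfies the l2-norm check without losing any samples:

```python
import numpy as np

# Hypothetical data: rows that have drifted slightly off the unit sphere
X = np.array([[0.6, 0.8001],
              [-0.5, 0.8659]])

# Renormalize every sample so its l2-norm is exactly 1
X = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.linalg.norm(X, axis=1))
```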
The execution results are shown below. Since the original data separate cleanly into two groups, there is no noticeable difference between the methods.
Example2: small-mix_3d

Let's cluster 3D data in the same way as Example1. The original code is here. A look at the script shows it is almost identical to Example1; the execution results are shown below.
Incidentally, the estimated parameters look like this. At least on the data generated here, movMF appears to perform better than kmeans and spkmeans.
Now for the data handled this time. id is a value tied to each node, fixed in this case; [x, y, z] is a three-dimensional vector; azimuth and angle are set aside for now; and norm is the length of the composite [x, y, z] vector. Assume there are about 3000 samples of this data.
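The article does not state how azimuth and angle relate to [x, y, z]; assuming an ordinary spherical-coordinate convention (an assumption on my part, not confirmed by the data source), the decomposition would look like this:

```python
import numpy as np

def to_spherical(x, y, z):
    """Decompose a 3D vector into (norm, azimuth, elevation).
    This spherical-coordinate convention is an assumption, not taken from the data source."""
    norm = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)       # angle within the x-y plane
    elevation = np.arcsin(z / norm)  # angle above the x-y plane
    return norm, azimuth, elevation

norm, az, el = to_spherical(1.0, 1.0, np.sqrt(2.0))
print(norm, az, el)  # 2.0, pi/4, pi/4
```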
EDA
For now, simple visualization is enough. The histograms and pairwise correlation plots of each variable are shown below. Each parameter seems to have some character of its own, but it is not obvious how to extract it as a feature.
Here is the result of visualizing the data in 3D, which is even harder to interpret. Referring to the spherecluster example (drawing a sphere graph), let's extract only the direction data from the scatter plot above and plot it on a 3D sphere.
(Figures: spherical plot of all data; spherical plots of the first and second halves)
Now some structure emerges. To check for the influence of time, I plotted the first half of the timestamps (blue) and the second half (red) separately, but there appears to be no particular change between the two halves.
Finally, fit movMF to the test data.
###############################################################################
# Mixture of von Mises Fisher clustering (soft)
vmf_soft = VonMisesFisherMixture(n_clusters=1, posterior_type='soft', n_init=20)
vmf_soft.fit(X)
print("center:{}".format(vmf_soft.cluster_centers_))
print("kappa:{}".format(vmf_soft.concentrations_[0]))
The result is as follows.
center:[[-0.82651971 0.17847126 0.53386626]]
kappa:3.91851479297
We were able to model both the direction and its concentration from the 3D vectors with a probability density distribution. For example, if the dimensionality of X is matched to the number of words appearing in documents, movMF can model the distribution of points on the D-dimensional hypersphere and perform clustering in the same way.
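For reference, the maximum likelihood estimate of $\kappa$ has no closed form because of the Bessel function, but the approximation from Banerjee et al. (the paper behind spherecluster), $\hat{\kappa} \approx \bar{r}(d - \bar{r}^2)/(1 - \bar{r}^2)$ with $\bar{r}$ the mean resultant length, is easy to sketch. The data below are synthetic:

```python
import numpy as np

def estimate_vmf(X):
    """Approximate MLE of (mu, kappa) for unit vectors X,
    using the Banerjee et al. approximation for kappa."""
    s = X.sum(axis=0)
    mu = s / np.linalg.norm(s)              # mean direction
    r_bar = np.linalg.norm(s) / X.shape[0]  # mean resultant length in [0, 1)
    d = X.shape[1]
    kappa = r_bar * (d - r_bar**2) / (1 - r_bar**2)
    return mu, kappa

# Hypothetical, tightly clustered unit vectors around the +z direction
rng = np.random.default_rng(0)
X = rng.normal([0.0, 0.0, 5.0], 0.3, size=(500, 3))
X = X / np.linalg.norm(X, axis=1, keepdims=True)

mu, kappa = estimate_vmf(X)
print(mu, kappa)  # mean direction near [0, 0, 1], large kappa
```

The tighter the points cluster, the closer $\bar{r}$ gets to 1 and the larger $\hat{\kappa}$ becomes, matching the interpretation of $\kappa$ as a concentration parameter.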