This time, I have summarized the dimensionality reduction algorithm **t-SNE** (t-Distributed Stochastic Neighbor Embedding). t-SNE is a **dimensionality reduction algorithm** that converts high-dimensional data into 2D or 3D for visualization, developed by Laurens van der Maaten and Geoffrey Hinton, the latter also known as a father of deep learning. (Professor Hinton is amazing.) In this article, I work through how t-SNE works and use it to improve my visualizations.
To understand t-SNE, I referred to the following:
- Visualizing Data using t-SNE (the original paper)
**t-SNE** is a dimensionality reduction algorithm for projecting high-dimensional data down to 2D or 3D. PCA and MDS are the classic choices for dimensionality reduction, but these linear methods have trouble preserving the local structure of data that lies on a non-linear manifold.
To address this, various **non-linear dimensionality reduction techniques** have been developed that try to keep similar data points close together even in the lower-dimensional space. t-SNE is an algorithm in that lineage.
Below is an image of non-linear data.
Here are the key points of t-SNE. The details of the processing and the reasons behind the advantages and disadvantages are covered in the algorithm explanation later.
**Key points of the processing**

- Convert the data so that the distance distribution in the high-dimensional space matches the distance distribution in the low-dimensional space as closely as possible.
- Assume that the low-dimensional distance distribution follows a Student's t-distribution (SNE assumed a Gaussian distribution; t-SNE improves on this).

**Merits**

- Captures high-dimensional local structure very well.
- Captures the global structure as much as possible.

**Demerits**

- Changing Perplexity (an internal parameter) can produce completely different clusters.
To understand the t-SNE algorithm, we start with its predecessor, SNE. After that, we sort out the problems with SNE and see how t-SNE solves them.
SNE starts by converting the distances between data points into conditional probabilities. The similarity of data point $x_{j}$ to data point $x_{i}$ is expressed as the **conditional probability $p_{j|i}$ that $x_{i}$ would pick $x_{j}$ as its neighbor**. Furthermore, we assume that $x_{j}$ is picked with a probability proportional to a **normal distribution centered on $x_{i}$**.
Then $p_{j|i}$ can be expressed by the following formula.
p_{j|i} = \frac{\exp(-||x_{i} - x_{j}||^2/2\sigma_{i}^2)}{\sum_{k\neq i}\exp(-||x_{i} - x_{k}||^2/2\sigma_{i}^2)}
The above comes from the probability density function of the normal distribution; the constant factor $\frac{1}{\sqrt{2\pi\sigma^2}}$ cancels between numerator and denominator.
\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-(x-\mu)^2/2\sigma^2)
We assume a normal distribution with mean $x_{i}$ and variance $\sigma_{i}^2$; the value of the variance $\sigma_{i}^2$ is adjusted through a parameter described later. Since the assumed distribution changes with the variance $\sigma_{i}^2$, the final output after dimensionality reduction also changes significantly.
Also, since the purpose here is to express the distance between different data points as conditional probabilities, we set:
p_{i|i} = 0
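As a concrete illustration, here is a minimal NumPy sketch of how $p_{j|i}$ could be computed for given values of $\sigma_{i}$ (the function and variable names are my own, not from the paper or any library):

```python
import numpy as np

def conditional_p(X, sigmas):
    """Compute the matrix of p_{j|i} from data X (n x d) and per-point sigma_i."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # ||x_i - x_j||^2
    P = np.exp(-sq_dist / (2.0 * sigmas[:, None] ** 2))              # Gaussian kernel; row i uses sigma_i
    np.fill_diagonal(P, 0.0)                                         # p_{i|i} = 0
    P /= P.sum(axis=1, keepdims=True)                                # normalize each row over k != i
    return P
```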
The similarity between the dimensionally reduced data points $y_{i}$ and $y_{j}$ is likewise expressed as a **conditional probability $q_{j|i}$**. As before, we assume that $y_{j}$ is picked with a probability proportional to a normal distribution centered on $y_{i}$, but this time **the variance is fixed at $\frac{1}{\sqrt{2}}$**. Fixing it lets the variance cancel out of the formula and simplifies it.
$q_{j|i}$ can be expressed by the following formula.
q_{j|i} = \frac{\exp(-||y_{i} - y_{j}||^2)}{\sum_{k\neq i}\exp(-||y_{i} - y_{k}||^2)}
As before, we set:
q_{i|i} = 0
Since the distance relationships between data points in the high-dimensional space and in the low-dimensional space after dimensionality reduction should match as closely as possible, the loss function $C$ is expressed by the following formula.
C = \sum_{i}KL(P_{i}||Q_{i}) = \sum_{i} \sum_{j} p_{j|i}\log\frac{p_{j|i}}{q_{j|i}}
$P_{i}$ represents the set of all conditional probabilities given $x_{i}$, and $Q_{i}$ represents the set of all conditional probabilities given $y_{i}$.
Note that the KL divergence is an asymmetric measure: the penalty for representing points that are close in the high-dimensional space as far apart in the low-dimensional space is much larger than the penalty for the reverse, so SNE focuses on preserving the local structure of the data.
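As a small sketch, the loss could be computed from the matrices of $p_{j|i}$ and $q_{j|i}$ like this (assuming matrices `P` and `Q` with zero diagonals, as in the earlier sketch):

```python
import numpy as np

def sne_loss(P, Q, eps=1e-12):
    """C = sum_i KL(P_i || Q_i); the zero diagonal entries (p_{i|i} = 0) contribute nothing."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))
```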
**4. Perplexity**

The remaining parameter is $\sigma_{i}^2$, the variance of the normal distribution centered on the high-dimensional data point $x_{i}$. It is a very important parameter, because which distribution we assume depends on the variance $\sigma_{i}^2$.
If the data around $x_{i}$ is dense, a small $\sigma_{i}^2$ is appropriate; if the data is sparse, a large $\sigma_{i}^2$ is appropriate. A parameter called Perplexity is used to control how dense we assume the data to be.
SNE performs a binary search for the $\sigma_{i}^2$ that produces a $P_{i}$ with the fixed Perplexity specified by the user. Perplexity is defined as follows.
Perp(P_{i}) = 2^{H(P_{i})}
Here, $H(P_{i})$ is the entropy of $P_{i}$, defined as follows.
H(P_{i}) = -\sum_{j}p_{j|i}\log_{2}p_{j|i}
For example, we fix Perplexity at a value such as $40$ and search for the $\sigma_{i}^2$ that satisfies the equation. The larger the Perplexity, the larger $\sigma_{i}^2$ becomes, which corresponds to assuming the data is sparse; if Perplexity is small, $\sigma_{i}^2$ is also small, corresponding to assuming the data is dense.
Perplexity can also be interpreted as **the number of effective neighbors considered around the data point $x_{i}$**.
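As an illustration of this search, a per-point binary search over $\sigma_{i}$ might look like the sketch below (the bounds, tolerance, and iteration count are my own placeholder choices):

```python
import numpy as np

def sigma_for_perplexity(sq_dist_i, target_perp, n_iter=50, tol=1e-5):
    """Binary-search sigma_i so that Perp(P_i) = 2^{H(P_i)} matches target_perp.
    sq_dist_i: squared distances from x_i to all other points (x_i itself excluded)."""
    lo, hi = 1e-10, 1e5
    sigma = 1.0
    for _ in range(n_iter):
        sigma = (lo + hi) / 2.0
        p = np.exp(-sq_dist_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # H(P_i)
        perp = 2.0 ** entropy                      # Perp(P_i)
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:
            hi = sigma   # distribution too flat -> shrink sigma
        else:
            lo = sigma   # distribution too peaked -> enlarge sigma
    return sigma
```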
We minimize the loss function using gradient descent. The gradient is the derivative of the loss function with respect to $y_{i}$:
\frac{\delta C}{\delta y_{i}} = 2\sum_{j}(p_{j|i}-q_{j|i}+p_{i|j}-q_{i|j})(y_{i}-y_{j})
We gradually move $y_{i}$ using this gradient; the update formula is as follows.
Y^{(t)} = Y^{(t-1)} + \eta \frac{\delta C}{\delta Y} + \alpha(t)(Y^{(t-1)} - Y^{(t-2)})
The idea is that $Y^{(t-1)}$ is moved in the direction of the gradient $\frac{\delta C}{\delta Y}$ by the learning rate $\eta$, with a momentum term $\alpha(t)(Y^{(t-1)} - Y^{(t-2)})$ that reuses the previous update to speed up and stabilize convergence.
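A minimal sketch of this kind of momentum update, written as an explicit descent step (the learning rate, iteration count, and momentum schedule are illustrative assumptions, and `grad_fn` is a placeholder for the gradient computation):

```python
import numpy as np

def optimize_embedding(Y_init, grad_fn, n_steps=1000, eta=100.0):
    """Gradient descent with momentum on the embedding Y (n x 2)."""
    Y = Y_init.copy()
    Y_prev = Y_init.copy()
    for t in range(n_steps):
        alpha = 0.5 if t < 250 else 0.8                # example momentum schedule
        grad = grad_fn(Y)                              # dC/dY
        Y_new = Y - eta * grad + alpha * (Y - Y_prev)  # step downhill plus momentum
        Y_prev, Y = Y, Y_new
    return Y
```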
This is the flow of SNE.
The above is the flow of SNE, but SNE has the following two problems.
- The loss function is difficult to minimize
- The crowding problem
Because the loss function is built from conditional probabilities and an asymmetric KL divergence, its gradient has a relatively complicated form and the optimization is hard to carry out efficiently. t-SNE addresses this by symmetrizing the loss function, as described later.
**Crowding problem**

When a high-dimensional object is mapped into a lower dimension, the number of points that can be mutually equidistant decreases, so points become **crowded together in the low-dimensional map in the attempt to keep them equidistant**.
For example, in 3 dimensions, 4 points (the number of dimensions + 1) can be mutually equidistant, but when the data is dropped into 2 dimensions only 3 points (again, dimensions + 1) can be mutually equidistant. Gaps that exist in the original space therefore get crushed when the dimensionality is reduced. That is the crowding problem.
t-SNE was designed to solve these problems. To address the problems of SNE, t-SNE adds the following features:

- Symmetrize the loss function
- Assume a Student's t-distribution when modeling the distances between low-dimensional data points
As the symmetrization of the loss function, the closeness of points $x_{i}$ and $x_{j}$ is represented by the joint probability $p_{ij}$. (Note that after symmetrization this is a joint probability rather than a conditional probability.)
p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}
p_{j|i} = \frac{\exp(-||x_{i} - x_{j}||^2/2\sigma_{i}^2)}{\sum_{k\neq i}\exp(-||x_{i} - x_{k}||^2/2\sigma_{i}^2)}
The closeness of points $y_{i}$ and $y_{j}$ in the low-dimensional space is expressed by the joint probability $q_{ij}$ as follows.
q_{ij} = \frac{(1+||y_{i} - y_{j}||^2)^{-1}}{\sum_{k\neq l}(1+||y_{k} - y_{l}||^2)^{-1}}
Here, when modeling the distances between points in the low-dimensional space, we assume a Student's t-distribution with $1$ degree of freedom. **The t-distribution has heavier tails than the normal distribution.** By using this heavy-tailed distribution, moderately distant data points in the high-dimensional space can be placed even farther apart in the low-dimensional map, which **alleviates the crowding problem for moderately distant data points.**
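A sketch of both of these definitions in NumPy (assuming a matrix `P_cond` of conditional probabilities $p_{j|i}$ such as the one from the earlier sketch; the names are mine):

```python
import numpy as np

def joint_p(P_cond):
    """Symmetrize: p_ij = (p_{i|j} + p_{j|i}) / (2n)."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

def joint_q(Y):
    """Low-dimensional similarities using a Student's t kernel with 1 degree of freedom."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + sq_dist)     # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(num, 0.0)      # q_{ii} = 0
    return num / num.sum()          # normalize over all pairs k != l
```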
t-SNE uses these $p_{ij}$ and $q_{ij}$ to minimize the loss function, which is expressed as follows.
C = KL(P||Q) = \sum_{i} \sum_{j} p_{ij}\log\frac{p_{ij}}{q_{ij}}
This is also optimized using gradient descent, as in SNE. The gradient is:
\frac{\delta C}{\delta y_{i}} = 4\sum_{j}(p_{ij}-q_{ij})(y_{i}-y_{j})(1+||y_{i} - y_{j}||^2)^{-1}
The update formula is the same as SNE.
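The gradient above could be computed as in the following sketch (reusing the unnormalized Student-t kernel from `joint_q`; the function name and arguments are my own):

```python
import numpy as np

def tsne_grad(P, Q, Y, num):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^{-1}.
    `num` is the unnormalized kernel (1 + ||y_i - y_j||^2)^{-1} with a zero diagonal."""
    grad = np.zeros_like(Y)
    PQ = (P - Q) * num                                   # elementwise (p_ij - q_ij) * kernel
    for i in range(Y.shape[0]):
        grad[i] = 4.0 * np.sum(PQ[i][:, None] * (Y[i] - Y), axis=0)
    return grad
```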
This time, I visualize the distribution of text data. Text data tends to become very high-dimensional once it is vectorized, so dimensionality reduction algorithms are very effective here.
scikit-learn 0.21.3
For the dataset, I use the "livedoor news corpus" to visualize the data distribution. For details of the dataset and the morphological analysis, please refer to the article I posted previously.
For Japanese text, preprocessing that decomposes sentences into morphemes is required in advance, so after decomposing all sentences into morphemes, the result is stored in the following data frame.
After vectorizing the text data with TF-IDF, it is reduced to two dimensions using t-SNE.
import pickle
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# Assumes the dataframe after morphological analysis has already been pickled
with open('df_wakati.pickle', 'rb') as f:
df = pickle.load(f)
# Vectorize with TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df[3])
# Dimensionality reduction with t-SNE
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state = 0, perplexity = 30, n_iter = 1000)
X_embedded = tsne.fit_transform(X)
ddf = pd.concat([df, pd.DataFrame(X_embedded, columns = ['col1', 'col2'])], axis = 1)
article_list = ddf[1].unique()
colors = ["r", "g", "b", "c", "m", "y", "k", "orange","pink"]
plt.figure(figsize = (30, 30))
for i, v in enumerate(article_list):
    tmp_df = ddf[ddf[1] == v]
    plt.scatter(tmp_df['col1'],
                tmp_df['col2'],
                label=v,
                color=colors[i])
plt.legend(fontsize = 30)
The result is shown here. You can see that clusters form for each news source.
Next, I would like to summarize other dimensionality reduction algorithms. Thank you for reading to the end.