Do you know UMAP? I know

If you've heard of it but don't know it, this article will help you understand it.

What is UMAP

** It is a method of dimension reduction and visualization faster and higher performance than t-SNE **. Let's compare it with the commonly used t-SNE. The figure below is a visualization of Fashion MNIST.

(From Understanding UMAP Understanding UMAP)

Compared to t-SNE, UMAP shows that the clusters are clearly separated. Also, similar categories are located close to each other, and dissimilar categories are located far away from each other. (There are a lot of visualization examples in the explanation of Understanding UMAP and Understanding UMAP, so please see that for details. You can see the 3D figure above.)

** UMAP has almost constant execution time regardless of the number of embedded dimensions **. Unlike t-SNE, the execution time does not increase exponentially even if the embedded dimension increases, so it can be used not only for visualization but also for dimension reduction. Also, UMAP written in Python is faster than FIt-SNE, a fast implementation of t-SNE in C ++.

And ** supports embedding new samples **. In the case of t-SNE, it was difficult to add new data to the low-dimensional space, and the whole thing had to be executed again. On the other hand, with UMAP, only the difference is calculated and it can be embedded naturally.

** If you just want to use it as a tool, it's very easy. ** (Code is from [Official Document] howToUseUMAP)

!pip install umap-learn

import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import umap

#Try with Digits
digits = load_digits()

#Reduced to 2D with umap
reducer = umap.UMAP(random_state=42)
reducer.fit(digits.data)
embedding = reducer.transform(digits.data)

# plot
plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('UMAP projection of the Digits dataset', fontsize=24);

It can also execute very high-dimensional and large amounts of data in a realistic amount of time. The figure below is a visualization of a binary vector obtained by factoring 30 million integers into prime factors. (I don't know if it's a meaningful visualization, but it's cool)

If you are satisfied with this, this is the end. Please use it conveniently as a tool.

Existing method

Before we get into the explanation of UMAP, let's take a quick look at existing methods.

So far, many dimensionality reduction methods have been proposed. It is required to be able to retain the global structure and local structure of data. And recently, a certain amount of calculation speed is also required.

Linear method (PCA, etc.)

Works well with artificial data, but has poor performance with high-dimensional real data. Swiss-roll doesn't work.

Non-linear method

UMAP belongs here.

Isomap
LLE
Laplacian Eignmap
t-SNE
LargeVis

Among them, ** t-SNE is often used as a method that can maintain both global structure and local structure well **. However, there was a problem with execution time, and LargeVis and others were proposed to solve it. In this article, I will try to explain UMAP in an engineering manner, but I think it will be easier to understand if you have some knowledge of Isomap and t-SNE. The papers listed in the existing methods are summarized in References.

Character definition

Here are some definitions for the future.

--Number of data $ N $ -$ n $ Dimension High Dimensional Data Set $ X $ - X = \\{x_1, x_2, ..., x_N\\}, \; x_i \in \mathbb{R}^n -$ m $ Dimension Low Dimensional Data Set $ Y $ ($ m << n $) - Y = \\{y_1, y_2, ..., y_N\\}, \; y_i \in \mathbb{R}^m

From now on, these characters will be used without notice.

Engineering understanding of UMAP

First, let's reconfirm the purpose of UMAP.

** Converting high-dimensional data $ X $ to low-dimensional data $ Y $. However, the local structure and global structure of $ X $ are preserved and converted. ** **

UMAP papers cannot be truly understood without knowledge of mathematics. However, the dissertation contains explanations for those who do not have knowledge of mathematics. UMAP is based on advanced mathematics, but in fact ** UMAP can be regarded as graph creation and operation on the graph **. This is because UMAP is based on an idea very similar to Isomap, t-SNE, Large Vis, etc.

The flow of the method is the following two steps.

Create a weighted k-nearest neighbor graph
Place the graph in a lower dimension

The figure is from [Large Vis Paper] [Large Vis Paper]

Step 1 focuses on the local structure of the data. Then, the graph constructed based on the local structure is arranged so as to represent the global structure in step 2 (image). Since the station distance can be locally regarded as the Euclidean distance in the manifold, the neighborhood bluff can also be imaged using the Euclidean distance. (Actually, it is not limited to distance, but it can also be used for things like dissimilarity.)

Let's look at them in order.

1. Create a weighted k-nearest neighbor graph

① Create a k neighborhood graph

First, let's make a neighborhood graph without weighting. Define distances (metrics) $ d $ to find the neighborhood (Imagine an Euclidean distance, etc.).

Then, the k neighborhoods in the metric $ d $ of $ x_i $ are written as follows. \\{x_{i_1}, ..., x_{i_k}\\}

The normal k-nearest neighbor method may be used for the neighborhood search, but the approximation method is used in UMAP. The normal k-nearest neighbor method costs $ O (N ^ 2) $, but the approximation method speeds up to $ O (N ^ {1.14}) $. For details, see the reference [Frontline of approximate nearest neighbor search] Approximate knn.

After the neighborhood search is completed, create a k-nearest neighbor graph by squeezing the neighborhoods along the edges. That is, the edge set $ E $ and the vertex set $ V $ are defined as follows.

E = \\{(x_i, x_{i_j}) \; | \; 0\leq i \leq N, \, 0\leq j \leq k \\} $ V $ corresponds to each point of the data.

In this way, a directed k-nearest neighbor graph $ G_1 $ was created without weights. $ G_1 = (V, E) $

To summarize a little ** In "(1) Creating a k-nearest neighbor graph", k neighborhoods were searched for each $ x_i $ of high-dimensional data $ X $, and an unweighted directed k-nearest neighbor graph $ G_1 $ was created. ** **

➁ Create a weighted k-nearest neighbor graph

To give weights, we first define a weighting function $ w $. The right-hand side is normalized by $ \ rho_i $ and $ \ sigma_i $.

w((x_i, x_{i_j})) = \exp(\dfrac{-\max(0, \, d(x_i, x_{i_j})-\rho_i)}{\sigma_i})

Where $ \ rho_i $ is the distance to the nearest neighbor. \rho_i = \min\\{d(x_i, x_{i_j}) \; |\; 1\leq j \leq k \\}

Also, $ \ sigma_i $ is obtained by a two-part search so as to satisfy the following equation. \Sigma_j w((x_i, x_{i_j})) = \log_2 (k) $ \ Log_2 (k) $ on the right side is empirically obtained.

(This $ log_2 {k} $ is something, but I think it has a perplexity-like meaning of t-SNE because the variance is calculated by a two-part search.)

Create a directed weighted graph $ G_2 $ using the weighting function above. G_2 = (V, E, w)

However, this $ G_2 $ is not used as it is. Create an undirected graph $ G $ by the following transformation. $ A $: Adjacency matrix of $ G_2 $ $ B = A + A ^ T-A \ circ A ^ T, \ circ: Hadamard product $

If you look at the elements of $ B $, you can see that they are targeting. B_{i,l} = w((x_i, x_l)) + w((x_l, x_i)) - w((x_i, x_l))*w((x_l, x_i)) \\ = B_{l. i}

(I think this is the same function as symmetrizing the conditional probability of SNE with t-SNE. The cost function becomes easier to handle.)

The graph $ G $ is created using the $ B $ created in this way as an adjacency matrix. Place this graph $ G $ in lower dimensions in step ➂.

To summarize a little ** In "➁ Create a weighted k-nearest neighbor graph", a weighted and undirected k-nearest neighbor graph $ G $ was created **.

2. Place in a lower dimension

Now that we have an adjacency matrix, we want to place it in a low-dimensional space. However, the method of drawing a graph from an adjacency matrix is not unique. Look at the two figures below.

Both are visualizations of the same graph (images are from [here] [mechanical model explanation]). Use a ** mechanical model ** for neat placement. Since the explanation will be long, I wrote the explanation of the dynamic model in [Qiita separate article (graph drawing algorithm and behind Networkx)] Qiita mechanical model. In the mechanical model, the vertices are arranged in a good way by the following algorithm.

Calculate the attractive / repulsive force acting between each vertex
Move the vertices based on the force calculated in 1. The magnitude of the moving displacement is less than or equal to the temperature parameter $ t $
Lower the temperature parameter
Repeat steps 1 to 3 a certain number of times

Since it is necessary to define attractive force and repulsive force in the mechanical model, it is defined as follows. I'll explain why this formula is a little later, so I'll accept it and move on.

f_a(y_i, y_j) = \dfrac{-2ab\|y_i - y_j\|^{2(b-1)}} {1+\|y_i - y_j\|^2} w((x_i, x_j))(y_i - y_j)

f_r(y_i, y_j) = \dfrac{b} {(\epsilon + \|y_i - y_j\|^2)(1 + \|y_i - y_j\|^2)} (1 - w((x_i, x_j))) (y_i - y_j)

$ a, b $: Hyperparameters $ \ Epsilon $: Small value to avoid division by zero

However, UMAP has streamlined the calculation of repulsive force. Considering a vertex $ x_i $,

--Attractive force acts on k nearby --Repulsive force works on N-k pieces other than the vicinity

Generally, $ k << N $, so the repulsive force calculation becomes a bottleneck. UMAP uses ** negative sampling ** to reduce the amount of calculation. Vertices that are not connected by an edge are considered to be connected by a negative edge.

Then, sample this negative edge in a good way and update it provisionally. UMAP uses the following vertex sampling distribution $ P $. This can be approximated to a uniform distribution in a sufficiently large dataset. (I don't know about this, so please tell me.)

P(x_i) = \dfrac{ \sum_{\{a \in A | d_0(a) = x_i\}}{1 - \mu(a)} }{ \sum_{\{b\in A | d_0(b) \neq x_i \}}{1 - \mu(b)} }

$ \ mu $: Membership function (degree of belonging to $ \ mu: A \ rightarrow [0, 1] $, $ A $) $ A $: Fuzzy topological expression (likely)

When updating the placement, instead of using all the negative edges, sample some and update. In other words, the algorithm is as follows.

Calculate the attractive force acting between each vertex. ** Repulsion is calculated by sampling with the distribution $ P $ **.
Move the vertices based on the force calculated in 1. The magnitude of the moving displacement is less than or equal to the temperature parameter $ t $
Lower the temperature parameter
Repeat steps 1 to 3 a certain number of times

I don't know if convergence is guaranteed, but empirically it seems to converge properly.

Is this understanding completely different from the original flow? ??

"UMAP's dissertation is originally insanely mathematical, but is there an explanation now?"

"How do you relate to Fuzzy topology and Riemannian manifolds?"

For those who think, write the relevance to the original flow. To do so, we first need to know the original mathematical flow ...

Original UMAP flow

Those who have knowledge of mathematics should understand this flow.

Formulation
Derivation of cost function
Minimize the cost function and embed it in lower dimensions

Let's take a quick look at each.

1. Formulation

To get an overview, I will quote a part from [Qiita reading UMAP papers] and [Qiita reading UMAP papers].

UMAP introduces the following concepts. (Hey)

fuzzy set --Fuzzy set

simplicial set --Simplicial complex

fuzzy simplicial set --Combination of 1 and 2

FinReal --Convert a simple substance to EPMet

FinSing --Convert EPMet to fuzzy simplicial set

And define the operation to transform the point sequence into Fuzzy topological expression.

First, Chapter 1 described how to estimate the Riemannian manifold from the data sequence. However, I did not really estimate the shape of the manifold itself, but estimated the distance over a finite space (FinEPMet).

In Chapter 3, we defined a fuzzy simplicial set that looks like a fuzzy complex, and defined an operation called FinSing to convert a FinEPMet to a fuzzy simplicial set.

I don't know if I just quote it, but I think it's probably because the mathematician says it (abandoned understanding). If you want to know about this area, the following article will be helpful. If you are comfortable with English, reading the dissertation is accurate.

-[Reading UMAP papers Qiita] [Reading UMAP papers Qiita] -[UMAP commentary article] Article to follow UMAP formula

2. Derivation of cost function

The cross entropy of the two Fuzzy topological expressions is defined as follows.

C((A, \mu), (A, \nu)) \rightarrow \sum_{a \in A}{( \mu(a)\log(\nu(a)) + (1 - \mu(a)) \log(1-\nu(a)) )}

$ \ mu, \ nu $: Membership function (degree of belonging to $ \ mu: A \ rightarrow [0, 1] $, $ A $) A：Fuzzy topological expression

The equal sign is not used because the right-hand side is the formula after omitting the parts that are not related to the minimization. (See the paper for details.)

3. Minimize the cost function and embed it in lower dimensions

--Convert high-dimensional data to fuzzy topological expression $ X $ --Convert low-dimensional data (embed destination) to fuzzy topological expression $ Y $ as well

The low-dimensional data $ Y $ is changed according to the gradient method so that the cross entropy between the Fuzzy topological expressions is minimized. The $ Y $ created in this way is the visualization result or the dimension reduction result.

To summarize so roughly,

** There is a cost function derived by a mathematician. ** ** ** Minimize this for math-based low-dimensional embedding. ** **

Connection with the engineering understanding up to now

"What was the explanation of the k-nearest neighbor graph and the dynamic model?" I'm sorry to have kept you waiting. ** The definition of attractive / repulsive force of the dynamic model is linked to the cost function **.

The cost function could be written as follows.

C((A, \mu), (A, \nu)) \rightarrow \sum_{a \in A}{( \mu(a)\log(\nu(a)) + (1 - \mu(a)) \log(1-\nu(a)) )}

$ \ mu, \ nu $: Membership function (degree of belonging to $ \ mu: A \ rightarrow [0, 1] $, $ A $) A：Fuzzy topological expression

If you look closely at the equation, you can see the relationship with the mechanical model. Since $ \ mu $ is a membership function for higher dimensional data, $ \ mu (a) $ belongs to $ A $ and $ (1- \ mu (a)) $ belongs to $ A $. Degree of not. That is, the first term of $ \ sum $ can be regarded as a term that acts on the neighborhood, and the second term can be regarded as a term that acts on other than the neighborhood.

The ** attractive force and repulsive force mentioned earlier are calculated back from the cost function ** (although not specified in the paper, after deriving the cost function even with the existing method LargeVis, the term corresponding to the repulsive force is efficiently sampled by negative sampling. Is calculated).

After all, knowledge of mathematics is required to get to the bottom of the definition of attractive / repulsive force. However, I feel that the understanding of "creating a graph and embedding it according to the force" is one step ahead of the understanding of "minimizing the cost function that is not well understood".

Finally

It's interesting that the mathematically difficult UMAP can be formulated with a simple (albeit somewhat technical) mechanical model. This is the maximum understanding for someone without a math background (I). There are some parts that I don't understand, so it would be helpful if you could point out the mistakes by hitting the paper without taking this article for granted.

References

[UMAP: UniformManifold ApproximationandProjectionfor DimensionReduction, arXiv] UMAP Paper : Here is everything

-[Reading UMAP papers Qiita] [Reading UMAP papers Qiita] -[UMAP commentary article] Article to follow UMAP formula : These two articles are trying to understand mathematical formulas

Understanding UMAP : There are abundant figures and it is easy to understand. It's interesting just to look at it, so please take a look.

-[UMAP document] [UMAP document]

HowToUseUMAP : If you want to use it as a tool, look here

[Dimensionality reduction with t-SNE (Rtsne) and UMAP (uwot) using R packages.] From tSNE to UMAP (tried system) : Get an overview of t-SNE and UMAP. Practical> Theory

[Graph drawing algorithm and behind Networkx] Qiita mechanical model : Explanation of typical mechanical model algorithms. (My article)

[Forefront of approximate nearest neighbor search] Approximate knn : Space division by tree, hash, hypothesis that "neighborhood is also neighborhood" is interesting

You can deepen your understanding by reading the following papers.

-[Barns-Hut-SNE] Barns-Hut-SNE paper -[(t-SNE) Visualizing Data using t-SNE] t-SNE paper -[(LargeVis) Visualizing Large-scale and High-dimensional Data] LargeVis Paper

(fruchterman-reingold algorithm) Graph drawing by force‐directed placement

I want to understand (engineering) UMAP stronger than t-SNE