Let's try using PHATE (Moon, K.R., van Dijk, D., Wang, Z. et al. Nature Biotechnology 37, 1482–1492 (2019)). There are many dimensionality reduction methods used in biology papers, typically PCA, t-SNE, UMAP, and the like, but each has its advantages and disadvantages. PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) was developed to address those drawbacks. There is also M-PHATE, an extension of PHATE that provides a more general embedding of tensors: a low-dimensional visualization of data that tracks the time evolution of the same set of entities (such as the weights of a neural network at each training epoch).
Installation is easy: just `pip install phate`. See the PHATE repository for the R and MATLAB versions.
Let's experiment with the 100-dimensional tree-structured data (plus noise) provided by PHATE. It consists of 20 "branches" and 2,000 data points in total. Structurally, the remaining 19 branches sprout at random from various points along the first branch, and each branch points in a different direction in 100-dimensional space.
import phate
tree_data, tree_clusters = phate.tree.gen_dla(seed=108)
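Before running PHATE, it may help to confirm the shape of the generated data (a quick check I added; the expected values follow from the description above):

import numpy as np

# Quick sanity check of the generated data (expected from the description above)
print(tree_data.shape)          # (2000, 100): 2,000 points in 100 dimensions
print(len(set(tree_clusters)))  # 20 branches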
The usage is the same as a scikit-learn model: set the parameters, create an instance of the model, and transform the data with `fit_transform`. There are three main parameters: `knn`, `decay`, and `t`. `t` can also be determined automatically from the data; details are given later.
phate_operator = phate.PHATE(knn=15, decay=40, t=100)
tree_phated = phate_operator.fit_transform(tree_data)
Calculating PHATE...
Running PHATE on 2000 cells and 100 genes.
Calculating graph and diffusion operator...
Calculating KNN search...
Calculated KNN search in 0.50 seconds.
Calculating affinities...
Calculated affinities in 0.05 seconds.
Calculated graph and diffusion operator in 0.56 seconds.
Calculating diffusion potential...
Calculated diffusion potential in 0.41 seconds.
Calculating metric MDS...
Calculated metric MDS in 10.43 seconds.
Calculated PHATE in 11.41 seconds.
The data has been compressed to two dimensions. Let's plot the result.
import matplotlib.pyplot as plt
import seaborn as sns

def plot_2d(coords, clusters, ax):
    for c, color in zip(sorted(list(set(clusters))), plt.cm.tab20.colors):
        samples = clusters == c
        ax.scatter(coords[samples, 0], coords[samples, 1],
                   label=c, color=color)
    ax.legend(loc='upper right', bbox_to_anchor=(1.1, 1))
    sns.despine()

fig, ax = plt.subplots(figsize=(12, 9))
plot_2d(tree_phated, tree_clusters, ax=ax)
plt.show()
As you can see, the structure of the data is preserved and each branch is clearly separated in the visualization. Let's compare PCA, t-SNE, UMAP, and PHATE on the same data.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP
def compare_vis(data, clusters, pca_init_dim=100):
    print('Running PCA...')
    pca_op = PCA(n_components=2)
    data_pca = pca_op.fit_transform(data)
    print('Running t-SNE...')
    pca_init_op = PCA(n_components=pca_init_dim)
    tsne_op = TSNE(n_components=2)
    data_tsne = tsne_op.fit_transform(pca_init_op.fit_transform(data))
    print('Running UMAP...')
    pca_init_op = PCA(n_components=pca_init_dim)
    umap_op = UMAP(n_components=2)
    data_umap = umap_op.fit_transform(pca_init_op.fit_transform(data))
    print('Running PHATE...')
    phate_op = phate.PHATE(n_components=2)
    data_phated = phate_op.fit_transform(data)

    fig, axes = plt.subplots(figsize=(12, 36), nrows=4)
    plot_2d(data_pca, clusters, ax=axes[0])
    plot_2d(data_tsne, clusters, ax=axes[1])
    plot_2d(data_umap, clusters, ax=axes[2])
    plot_2d(data_phated, clusters, ax=axes[3])
    axes[0].set_title('PCA'); axes[1].set_title('t-SNE')
    axes[2].set_title('UMAP'); axes[3].set_title('PHATE')
    plt.show()
compare_vis(tree_data, tree_clusters)
With PCA the tree structure is hard to make out (the noise dominates), while t-SNE and UMAP split the data into well-separated clusters but the branch structure connecting them is less apparent.
Let's also look at a 3D embedding. As described later, in the PHATE algorithm the output dimensionality only matters for the final MDS step, so there is no need to recompute everything: update the `n_components` parameter and call `transform()`, and only the final step is re-executed.
%matplotlib notebook
from mpl_toolkits.mplot3d import axes3d

def plot_3d(coords, clusters, fig):
    ax = fig.add_subplot(projection='3d')
    for c, color in zip(sorted(list(set(clusters))), plt.cm.tab20.colors):
        samples = clusters == c
        ax.scatter(coords[samples, 0], coords[samples, 1], coords[samples, 2],
                   label=c, color=color)
    ax.legend()

phate_operator.set_params(n_components=3)
tree_phated = phate_operator.transform()
fig = plt.figure(figsize=(12, 9))
plot_3d(tree_phated, tree_clusters, fig)
Let's try it on the embryoid body scRNA-seq data used for the demonstration in the paper. Because the full data set is rather large, and also to check how stable the result is under downsampling, 10% of the cells were randomly sampled and standard preprocessing was applied in advance. For details on scRNA-seq preprocessing and on algorithms such as PCA, t-SNE, and UMAP, see the course video.
import numpy as np
import pandas as pd
import scipy.io

cells = np.genfromtxt('./data/cell_names.tsv', dtype='str', delimiter='\t')
genes = np.genfromtxt('./data/gene_names.tsv', dtype='str', delimiter='\t')
arr = scipy.io.mmread('./data/matrix.mtx')
EBData = pd.DataFrame(arr.todense(), index=cells, columns=genes)
EBData.head()
This is a gene expression table of 1,679 cells and 12,993 genes. Each cell is labeled as shown below; since samples were taken every three days up to day 27, the time course of differentiation should be observable.
sample_labels = EBData.index.map(lambda x:x.split('_')[1]).values
print(sample_labels)
array(['Day 00-03', 'Day 00-03', 'Day 00-03', ..., 'Day 24-27',
'Day 24-27', 'Day 24-27'], dtype=object)
phate_op = phate.PHATE()
EBData_phated = phate_op.fit_transform(EBData)
%matplotlib inline
fig, ax = plt.subplots(figsize=(12, 9))
plot_2d(EBData_phated, sample_labels, ax=ax)
plt.show()
Dimensionality reduction and visualization with PHATE clearly shows the differentiation process, in particular the expansion of diversity from day 18 to day 27. This is what PHATE does well: trajectories are shown as trajectories, clusters as clusters, and the spread of each cluster is drawn so that they can be compared with one another. Moreover, even though only 10% of the data was used here, the plot is almost the same as the figure in the paper computed from tens of thousands of cells (Fig. 1c), so the result also appears reasonably robust to sample size. Let's try 3D as well.
phate_op.set_params(n_components=3)
EBData_phated = phate_op.transform()
fig = plt.figure(figsize=(9, 9))
plot_3d(EBData_phated, sample_labels, fig)
plt.show()
Let's also compare the same scRNA-seq data using PCA, t-SNE, and UMAP.
compare_vis(EBData.values, sample_labels)
As before, t-SNE and UMAP emphasize the cluster structure, and the differentiation trajectory is hard to follow.
Now let's implement PHATE following the method described in the paper. Since the algorithm is a combination of several techniques, it is not obvious which part contributes most to the improved visualization, so we will implement it step by step. To be honest, I am not entirely sure about the importance or necessity of each step, nor whether this implementation is exactly right, but the algorithm consists of roughly four steps. If you know diffusion maps it should be easy to follow, since everything up to step 2 is almost the same.
First, the Euclidean distance matrix over the whole data set is computed. Then, for each data point, an adaptive kernel similar to those used in t-SNE and UMAP is applied so that nearby points receive a large weight (affinity) and distant points a small one, producing a weighted graph structure.
This is where the parameters `knn` and `alpha` (called `decay` in the original PHATE implementation) come in; they have a significant effect on the overall result. The kernel itself is an ordinary Gaussian kernel, but its bandwidth is adapted to the local density of the data, much like perplexity in t-SNE or `n_neighbors` in UMAP.
In the case of PHATE, this adaptive bandwidth is simply set to $\epsilon_k$, the distance from each data point to its `knn`-th nearest neighbor.
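As a small illustration of how $\epsilon_k$ is obtained (a toy example I added, not from the original text): each row of the distance matrix contains the self-distance 0, so sorting a row and taking the element at index `knn` gives the distance to that point's `knn`-th nearest neighbor.

import numpy as np

# Toy 4-point distance matrix; each row contains the self-distance 0.
toy_dist = np.array([[0., 1., 4., 9.],
                     [1., 0., 3., 8.],
                     [4., 3., 0., 5.],
                     [9., 8., 5., 0.]])
knn = 2
# Sorting each row puts the self-distance first, so index `knn`
# is the distance to the knn-th nearest neighbor.
epsilons = np.sort(toy_dist, axis=1)[:, knn]
print(epsilons)  # [4. 3. 4. 8.]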
Another device is the $\alpha$-decaying kernel. If only the bandwidth is changed, a wide bandwidth leaves non-negligible affinity even to far-away points (a heavy tail), so points sampled from low-density regions end up with almost the same affinity to every other point. To prevent this, the decay of the affinity is controlled with the parameter $\alpha$, using a kernel of the form implemented below:

$$K(x, y) = \exp\!\left(-\left(\frac{\lVert x - y \rVert}{\epsilon_k(x)}\right)^{\alpha}\right)$$

For $\alpha = 2$ this is the ordinary Gaussian kernel, while larger $\alpha$ makes the affinity fall off more sharply.
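To get a feel for how $\alpha$ controls the tail, here is a small sketch (my addition) that plots the kernel value $\exp(-(d/\epsilon)^{\alpha})$ against distance for a fixed bandwidth $\epsilon = 1$ and several values of $\alpha$:

import numpy as np
import matplotlib.pyplot as plt

d = np.linspace(0, 3, 200)   # distance, in units where the bandwidth epsilon = 1
fig, ax = plt.subplots()
for alpha in [2, 10, 40]:
    # alpha = 2 is the ordinary Gaussian kernel; larger alpha decays more sharply
    ax.plot(d, np.exp(-np.power(d, alpha)), label=f'alpha = {alpha}')
ax.set_xlabel('distance / epsilon')
ax.set_ylabel('kernel value')
ax.legend()
plt.show()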
Both `knn` and $\alpha$ affect how the graph is connected and should be chosen carefully. If `knn` is too small, or $\alpha$ too large, the graph becomes too sparse and the diffusion process does not proceed well. Conversely, if `knn` is too large or $\alpha$ too small, everything is connected with nearly equal weights and the structure is not captured. PHATE is claimed to be fairly robust to these parameters, but it is probably better to fix one and experiment with the other.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diffusion_operator(data, knn=5, alpha=40):
    # Euclidean distance matrix
    dist = squareform(pdist(data))
    # Adaptive bandwidth epsilon: sort each row and take the knn-th distance
    epsilons = np.sort(dist, axis=1)[:, knn]
    # alpha-decaying kernel
    kernel_mtx = np.exp(-1. * np.power(dist / epsilons[:, np.newaxis], alpha))
    # L1-normalize each row so that it becomes a probability distribution
    l1norm_kernel_mtx = kernel_mtx / np.abs(kernel_mtx).sum(axis=1)[:, np.newaxis]
    # Symmetrize
    transition_mtx = 0.5 * l1norm_kernel_mtx + 0.5 * l1norm_kernel_mtx.T
    return transition_mtx
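A minimal sanity check of the operator built by this function (my addition, assuming `tree_data` from earlier is in scope):

# The operator should be square, symmetric by construction, and non-negative.
diffop_check = diffusion_operator(tree_data, knn=15, alpha=40)
print(diffop_check.shape)                         # (2000, 2000)
print(np.allclose(diffop_check, diffop_check.T))  # True: symmetrized
print((diffop_check >= 0).all())                  # True: affinities are non-negative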
The last important parameter is `t`, which sets the time scale of the diffusion: how many steps the affinity-based random walk proceeds, i.e. how many times the diffusion operator is multiplied by itself. The diffusion process also denoises the data. Points connected by short, high-affinity paths visit each other often during the random walk, whereas outliers connected only through low-affinity paths are rarely visited, so the former are emphasized as the diffusion proceeds. The diffusion time scale can therefore be seen as the amount of denoising: too small and a lot of noise remains, too large and important signal may be removed along with the noise.

PHATE determines t automatically by examining how much "information" is contained in the t-th power of the diffusion operator for various t, and picking the point where the noise has been removed but the structure of the data has not yet been crushed. Concretely, the eigenvalues of (the symmetric conjugate of) the t-th power of the diffusion operator are used to compute the von Neumann entropy (VNE). As t increases, the VNE decreases: the initial drop corresponds to denoising, with non-essential eigenvalues quickly going to zero, while the eigenvalues that reflect the structure of the data decay more slowly. Choosing t at the point where the rate of decrease switches therefore gives a value that removes noise without collapsing the structure. Here we compute the VNE over a range of t and detect the knee point of the curve by fitting two linear regressions, one to the left and one to the right of each candidate t, and picking the point with the smallest total error.
from tqdm.auto import tqdm

def von_Neumann_entropy(P):
    # Symmetric conjugate of the diffusion operator (diffusion affinity matrix)
    norm_factor = np.sqrt(P.sum(axis=1))
    A = P / norm_factor[:, np.newaxis] / norm_factor
    # Eigenvalue decomposition
    eig, _ = np.linalg.eigh(A)
    eig = eig[eig > 0]
    eig /= eig.sum()
    # VNE is the Shannon entropy of the normalized eigenvalues
    VNE = -1.0 * (eig * np.log(eig)).sum()
    return VNE
def find_knee_point(vals):
    y = vals
    x = np.arange(len(y))
    candidates = x[2:len(y) - 2]
    errors = np.zeros(len(candidates))
    errors[:] = np.nan
    # For each candidate t, fit a linear regression to the left side and to the
    # right side and measure the total error.
    # There is probably a smarter way, but here we simply loop over the candidates.
    for i, pnt in enumerate(candidates):
        # Left-side linear regression (y = ax + b) by ordinary least squares
        x_mean = x[:pnt].mean()
        y_mean = y[:pnt].mean()
        a = ((x[:pnt] - x_mean) * (y[:pnt] - y_mean)).sum() / np.power(x[:pnt] - x_mean, 2).sum()
        b = y_mean - a * x_mean
        left_err = (a * x[:pnt] + b) - y[:pnt]
        # Right-side linear regression
        x_mean = x[pnt:].mean()
        y_mean = y[pnt:].mean()
        a = ((x[pnt:] - x_mean) * (y[pnt:] - y_mean)).sum() / np.power(x[pnt:] - x_mean, 2).sum()
        b = y_mean - a * x_mean
        right_err = (a * x[pnt:] + b) - y[pnt:]
        # Total absolute error
        errors[i] = np.abs(left_err).sum() + np.abs(right_err).sum()
    # The candidate with the smallest error is the knee point
    knee_ind = candidates[np.nanargmin(errors)]
    return knee_ind
def find_optimal_timescale(P, max_t=100):
    VNEs = np.zeros(max_t)
    for t in tqdm(range(1, max_t + 1)):
        # Compute the t-th power of the diffusion operator and its VNE for each t
        Pt = np.linalg.matrix_power(P, t)
        VNEs[t-1] = von_Neumann_entropy(Pt)
    knee_point_ind = find_knee_point(VNEs)
    optimal_t = knee_point_ind + 1
    fig, ax = plt.subplots()
    ax.plot(range(len(VNEs)), VNEs)
    ax.scatter(knee_point_ind, VNEs[knee_point_ind], marker='o', color='r')
    plt.show()
    return optimal_t
In the case of the Diffusion map, eigendecomposing the diffusion operator directly yields coordinates that reflect the diffusion distance (the squared distance between the probability distributions obtained after t diffusion steps from each point). However, this distance is not very sensitive to small probability values, so it poorly reflects distances between far-apart points, i.e. the global structure of the data. PHATE therefore log-transforms the diffusion operator after t steps of diffusion and then computes the Euclidean distance; this is called the potential distance. The paper explains in detail what this newly introduced distance means physically (though I did not entirely follow it).
def potential_distances(P, t):
    # Raise the diffusion operator to the t-th power
    Pt = np.linalg.matrix_power(P, t)
    # Euclidean distances between the rows of -log(P^t)
    return squareform(pdist(-1.0 * np.log(Pt + 1e-7)))
Finally, instead of eigendecomposition as in the Diffusion map, the potential distance computed in step 3 is embedded into a low-dimensional space suitable for visualization by MDS (multidimensional scaling). First, low-dimensional coordinates are computed with classical MDS (= PCoA), and these are used as the initial values for metric MDS, which is run with scikit-learn's SMACOF algorithm.
from sklearn.manifold import smacof

def mds(dist, n_components=2):
    # Classical multidimensional scaling (= PCoA)
    n = len(dist)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -J.dot(dist**2).dot(J) / 2
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1]
    evals = evals[idx]
    evecs = evecs[:, idx]
    pos = np.where(evals > 0)[0]
    initial_coords = evecs[:, pos].dot(np.diag(np.sqrt(evals[pos])))[:, :n_components]
    # Metric multidimensional scaling (SMACOF), initialized with the PCoA coordinates
    coords = smacof(dist, n_components=n_components, init=initial_coords, n_init=1)[0]
    return coords
Now let's actually run the above steps on the tree-structured data.
diffop = diffusion_operator(tree_data, knn=15, alpha=40)
optimal_t = find_optimal_timescale(diffop)
print(optimal_t)
potential_dist = potential_distances(diffop, optimal_t)
coords = mds(potential_dist)
fig, ax = plt.subplots(figsize=(12, 9))
plot_2d(coords, tree_clusters, ax=ax)
plt.show()
We obtain a result similar to the original implementation. The actual PHATE package does not compute the algorithm as naively as above: each step includes devices to improve numerical stability, to save memory, and to speed things up with approximate computations. Still, the overall outline of the algorithm is as shown here.
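As a quick visual check of that claim, here is a small sketch (my addition, assuming `tree_data`, `tree_clusters`, `coords`, and `plot_2d` from above are still in scope) that plots the from-scratch embedding next to a fresh run of the official `phate` package:

# Side-by-side comparison of the from-scratch embedding and the official package
reference = phate.PHATE(knn=15, decay=40, n_components=2).fit_transform(tree_data)

fig, axes = plt.subplots(figsize=(12, 18), nrows=2)
plot_2d(coords, tree_clusters, ax=axes[0])      # our implementation
plot_2d(reference, tree_clusters, ax=axes[1])   # official phate
axes[0].set_title('From-scratch PHATE')
axes[1].set_title('phate package')
plt.show()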