Let's try using PHATE (Moon, K.R., van Dijk, D., Wang, Z. et al. Nature Biotechnology 37, 1482–1492 (2019)). There are many dimensionality reduction methods used in biology papers, typically PCA, t-SNE, UMAP, and the like, but each has its advantages and disadvantages. PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) was developed to address those drawbacks. There is also M-PHATE, an extension of PHATE that provides a more general embedding of tensors: a low-dimensional visualization of data that tracks the time evolution of the same set of entities (such as the weights of a neural network at each training epoch).
Installation is easy: just `pip install phate`. See the PHATE repository for the R and MATLAB versions.
Let's experiment with the 100-dimensional tree-structured data (plus noise) provided by PHATE. It consists of 20 "branches" and 2,000 data points in total. Structurally, the remaining 19 branches sprout at random from various points along the first branch, and each branch points in a different direction in 100-dimensional space.
import phate
tree_data, tree_clusters = phate.tree.gen_dla(seed=108)
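Before running PHATE, it may help to confirm the shape of the generated data (a quick check I added; the expected values follow from the description above):

import numpy as np

# Quick sanity check of the generated data (expected from the description above)
print(tree_data.shape)          # (2000, 100): 2,000 points in 100 dimensions
print(len(set(tree_clusters)))  # 20 branches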
The usage is the same as a scikit-learn model: set the parameters, create an instance of the model, and transform the data with `fit_transform`. There are three main parameters: `knn`, `decay`, and `t`. `t` can also be determined automatically from the data; details are given later.
phate_operator = phate.PHATE(knn=15, decay=40, t=100)
tree_phated = phate_operator.fit_transform(tree_data)
Calculating PHATE...
Running PHATE on 2000 cells and 100 genes.
Calculating graph and diffusion operator...
Calculating KNN search...
Calculated KNN search in 0.50 seconds.
Calculating affinities...
Calculated affinities in 0.05 seconds.
Calculated graph and diffusion operator in 0.56 seconds.
Calculating diffusion potential...
Calculated diffusion potential in 0.41 seconds.
Calculating metric MDS...
Calculated metric MDS in 10.43 seconds.
Calculated PHATE in 11.41 seconds.
The data has been compressed to two dimensions. Let's plot the result.
import matplotlib.pyplot as plt
import seaborn as sns

def plot_2d(coords, clusters, ax):
    for c, color in zip(sorted(list(set(clusters))), plt.cm.tab20.colors):
        samples = clusters == c
        ax.scatter(coords[samples, 0], coords[samples, 1],
                   label=c, color=color)
    ax.legend(loc='upper right', bbox_to_anchor=(1.1, 1))
    sns.despine()

fig, ax = plt.subplots(figsize=(12, 9))
plot_2d(tree_phated, tree_clusters, ax=ax)
plt.show()
As you can see, the structure of the data is preserved and each branch is clearly separated in the visualization. Let's compare PCA, t-SNE, UMAP, and PHATE on the same data.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP
def compare_vis(data, clusters, pca_init_dim=100):
    print('Running PCA...')
    pca_op = PCA(n_components=2)
    data_pca = pca_op.fit_transform(data)
    print('Running t-SNE...')
    pca_init_op = PCA(n_components=pca_init_dim)
    tsne_op = TSNE(n_components=2)
    data_tsne = tsne_op.fit_transform(pca_init_op.fit_transform(data))
    print('Running UMAP...')
    pca_init_op = PCA(n_components=pca_init_dim)
    umap_op = UMAP(n_components=2)
    data_umap = umap_op.fit_transform(pca_init_op.fit_transform(data))
    print('Running PHATE...')
    phate_op = phate.PHATE(n_components=2)
    data_phated = phate_op.fit_transform(data)

    fig, axes = plt.subplots(figsize=(12, 36), nrows=4)
    plot_2d(data_pca, clusters, ax=axes[0])
    plot_2d(data_tsne, clusters, ax=axes[1])
    plot_2d(data_umap, clusters, ax=axes[2])
    plot_2d(data_phated, clusters, ax=axes[3])
    axes[0].set_title('PCA'); axes[1].set_title('t-SNE')
    axes[2].set_title('UMAP'); axes[3].set_title('PHATE')
    plt.show()
compare_vis(tree_data, tree_clusters)
With PCA the tree structure is hard to make out (the noise dominates), while t-SNE and UMAP split the data into well-separated clusters but the branch structure connecting them is less apparent.
Let's also look at a 3D embedding. As described later, in the PHATE algorithm the output dimensionality only matters for the final MDS step, so there is no need to recompute everything: update the `n_components` parameter and call `transform()`, and only the final step is re-executed.
%matplotlib notebook
from mpl_toolkits.mplot3d import axes3d

def plot_3d(coords, clusters, fig):
    ax = fig.add_subplot(projection='3d')
    for c, color in zip(sorted(list(set(clusters))), plt.cm.tab20.colors):
        samples = clusters == c
        ax.scatter(coords[samples, 0], coords[samples, 1], coords[samples, 2],
                   label=c, color=color)
    ax.legend()

phate_operator.set_params(n_components=3)
tree_phated = phate_operator.transform()
fig = plt.figure(figsize=(12, 9))
plot_3d(tree_phated, tree_clusters, fig)
Let's try it on the embryoid body scRNA-seq data used for the demonstration in the paper. Because the full data set is rather large, and also to check how stable the result is under downsampling, 10% of the cells were randomly sampled and standard preprocessing was applied in advance. For details on scRNA-seq preprocessing and on algorithms such as PCA, t-SNE, and UMAP, see the course video.
import numpy as np
import pandas as pd
import scipy.io

cells = np.genfromtxt('./data/cell_names.tsv', dtype='str', delimiter='\t')
genes = np.genfromtxt('./data/gene_names.tsv', dtype='str', delimiter='\t')
arr = scipy.io.mmread('./data/matrix.mtx')
EBData = pd.DataFrame(arr.todense(), index=cells, columns=genes)
EBData.head()
This is a gene expression table of 1,679 cells and 12,993 genes. Each cell is labeled as shown below; since samples were taken every three days up to day 27, the time course of differentiation should be observable.
sample_labels = EBData.index.map(lambda x:x.split('_')[1]).values
print(sample_labels)
array(['Day 00-03', 'Day 00-03', 'Day 00-03', ..., 'Day 24-27',
'Day 24-27', 'Day 24-27'], dtype=object)
phate_op = phate.PHATE()
EBData_phated = phate_op.fit_transform(EBData)
%matplotlib inline
fig, ax = plt.subplots(figsize=(12, 9))
plot_2d(EBData_phated, sample_labels, ax=ax)
plt.show()
Dimensionality reduction and visualization with PHATE clearly shows the differentiation process, in particular the expansion of diversity from day 18 to day 27. This is what PHATE does well: trajectories are shown as trajectories, clusters as clusters, and the spread of each cluster is drawn so that they can be compared with one another. Moreover, even though only 10% of the data was used here, the plot is almost the same as the figure in the paper computed from tens of thousands of cells (Fig. 1c), so the result also appears reasonably robust to sample size. Let's try 3D as well.
phate_op.set_params(n_components=3)
EBData_phated = phate_op.transform()
fig = plt.figure(figsize=(9, 9))
plot_3d(EBData_phated, sample_labels, fig)
plt.show()
Let's also compare the same scRNA-seq data using PCA, t-SNE, and UMAP.
compare_vis(EBData.values, sample_labels)
As before, t-SNE and UMAP emphasize the cluster structure, and the differentiation trajectory is hard to follow.
Now let's implement PHATE following the method described in the paper. Since the algorithm is a combination of several techniques, it is not obvious which part contributes most to the improved visualization, so we will implement it step by step. To be honest, I am not entirely sure about the importance or necessity of each step, nor whether this implementation is exactly right, but the algorithm consists of roughly four steps. If you know diffusion maps it should be easy to follow, since everything up to step 2 is almost the same.
First, the Euclidean distance matrix over the whole data set is computed. Then, for each data point, an adaptive kernel similar to those used in t-SNE and UMAP is applied so that nearby points receive a large weight (affinity) and distant points a small one, producing a weighted graph structure.
This is where the parameters `knn` and `alpha` (called `decay` in the original PHATE implementation) come in; they have a significant effect on the overall result. The kernel itself is an ordinary Gaussian kernel, but its bandwidth is adapted to the local density of the data, much like perplexity in t-SNE or `n_neighbors` in UMAP.
In the case of PHATE, this adaptive bandwidth is simply set to $\epsilon_k$, the distance from each data point to its `knn`-th nearest neighbor.
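As a small illustration of how $\epsilon_k$ is obtained (a toy example I added, not from the original text): each row of the distance matrix contains the self-distance 0, so sorting a row and taking the element at index `knn` gives the distance to that point's `knn`-th nearest neighbor.

import numpy as np

# Toy 4-point distance matrix; each row contains the self-distance 0.
toy_dist = np.array([[0., 1., 4., 9.],
                     [1., 0., 3., 8.],
                     [4., 3., 0., 5.],
                     [9., 8., 5., 0.]])
knn = 2
# Sorting each row puts the self-distance first, so index `knn`
# is the distance to the knn-th nearest neighbor.
epsilons = np.sort(toy_dist, axis=1)[:, knn]
print(epsilons)  # [4. 3. 4. 8.]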
Another device is the $\alpha$-decaying kernel. If only the bandwidth is changed, a wide bandwidth leaves non-negligible affinity even to far-away points (a heavy tail), so points sampled from low-density regions end up with almost the same affinity to every other point. To prevent this, the decay of the affinity is controlled with the parameter $\alpha$, using a kernel of the form implemented below:

$$K(x, y) = \exp\!\left(-\left(\frac{\lVert x - y \rVert}{\epsilon_k(x)}\right)^{\alpha}\right)$$

For $\alpha = 2$ this is the ordinary Gaussian kernel, while larger $\alpha$ makes the affinity fall off more sharply.
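To get a feel for how $\alpha$ controls the tail, here is a small sketch (my addition) that plots the kernel value $\exp(-(d/\epsilon)^{\alpha})$ against distance for a fixed bandwidth $\epsilon = 1$ and several values of $\alpha$:

import numpy as np
import matplotlib.pyplot as plt

d = np.linspace(0, 3, 200)   # distance, in units where the bandwidth epsilon = 1
fig, ax = plt.subplots()
for alpha in [2, 10, 40]:
    # alpha = 2 is the ordinary Gaussian kernel; larger alpha decays more sharply
    ax.plot(d, np.exp(-np.power(d, alpha)), label=f'alpha = {alpha}')
ax.set_xlabel('distance / epsilon')
ax.set_ylabel('kernel value')
ax.legend()
plt.show()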
Both `knn` and $\alpha$ affect how the graph is connected and should be chosen carefully. If `knn` is too small, or $\alpha$ too large, the graph becomes too sparse and the diffusion process does not proceed well. Conversely, if `knn` is too large or $\alpha$ too small, everything is connected with nearly equal weights and the structure is not captured. PHATE is claimed to be fairly robust to these parameters, but it is probably better to fix one and experiment with the other.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diffusion_operator(data, knn=5, alpha=40):
    # Euclidean distance matrix
    dist = squareform(pdist(data))
    # Adaptive bandwidth epsilon: sort each row and take the knn-th distance
    epsilons = np.sort(dist, axis=1)[:, knn]
    # alpha-decaying kernel
    kernel_mtx = np.exp(-1. * np.power(dist / epsilons[:, np.newaxis], alpha))
    # L1-normalize each row so that it becomes a probability distribution
    l1norm_kernel_mtx = kernel_mtx / np.abs(kernel_mtx).sum(axis=1)[:, np.newaxis]
    # Symmetrize
    transition_mtx = 0.5 * l1norm_kernel_mtx + 0.5 * l1norm_kernel_mtx.T
    return transition_mtx
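A minimal sanity check of the operator built by this function (my addition, assuming `tree_data` from earlier is in scope):

# The operator should be square, symmetric by construction, and non-negative.
diffop_check = diffusion_operator(tree_data, knn=15, alpha=40)
print(diffop_check.shape)                         # (2000, 2000)
print(np.allclose(diffop_check, diffop_check.T))  # True: symmetrized
print((diffop_check >= 0).all())                  # True: affinities are non-negative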
The last important parameter is `t`, which sets the time scale of the diffusion: how many steps the affinity-based random walk proceeds, i.e. how many times the diffusion operator is multiplied by itself. The diffusion process also denoises the data. Points connected by short, high-affinity paths visit each other often during the random walk, whereas outliers connected only through low-affinity paths are rarely visited, so the former are emphasized as the diffusion proceeds. The diffusion time scale can therefore be seen as the amount of denoising: too small and a lot of noise remains, too large and important signal may be removed along with the noise.

PHATE determines t automatically by examining how much "information" is contained in the t-th power of the diffusion operator for various t, and picking the point where the noise has been removed but the structure of the data has not yet been crushed. Concretely, the eigenvalues of (the symmetric conjugate of) the t-th power of the diffusion operator are used to compute the von Neumann entropy (VNE). As t increases, the VNE decreases: the initial drop corresponds to denoising, with non-essential eigenvalues quickly going to zero, while the eigenvalues that reflect the structure of the data decay more slowly. Choosing t at the point where the rate of decrease switches therefore gives a value that removes noise without collapsing the structure. Here we compute the VNE over a range of t and detect the knee point of the curve by fitting two linear regressions, one to the left and one to the right of each candidate t, and picking the point with the smallest total error.
from tqdm.auto import tqdm

def von_Neumann_entropy(P):
    # Symmetric conjugate of the diffusion operator (diffusion affinity matrix)
    norm_factor = np.sqrt(P.sum(axis=1))
    A = P / norm_factor[:, np.newaxis] / norm_factor
    # Eigenvalue decomposition
    eig, _ = np.linalg.eigh(A)
    eig = eig[eig > 0]
    eig /= eig.sum()
    # VNE is the Shannon entropy of the normalized eigenvalues
    VNE = -1.0 * (eig * np.log(eig)).sum()
    return VNE
def find_knee_point(vals):
    y = vals
    x = np.arange(len(y))
    candidates = x[2:len(y) - 2]
    errors = np.zeros(len(candidates))
    errors[:] = np.nan
    # For each candidate t, fit a linear regression to the left side and to the
    # right side and measure the total error.
    # There is probably a smarter way, but here we simply loop over the candidates.
    for i, pnt in enumerate(candidates):
        # Left-side linear regression (y = ax + b) by ordinary least squares
        x_mean = x[:pnt].mean()
        y_mean = y[:pnt].mean()
        a = ((x[:pnt] - x_mean) * (y[:pnt] - y_mean)).sum() / np.power(x[:pnt] - x_mean, 2).sum()
        b = y_mean - a * x_mean
        left_err = (a * x[:pnt] + b) - y[:pnt]
        # Right-side linear regression
        x_mean = x[pnt:].mean()
        y_mean = y[pnt:].mean()
        a = ((x[pnt:] - x_mean) * (y[pnt:] - y_mean)).sum() / np.power(x[pnt:] - x_mean, 2).sum()
        b = y_mean - a * x_mean
        right_err = (a * x[pnt:] + b) - y[pnt:]
        # Total absolute error
        errors[i] = np.abs(left_err).sum() + np.abs(right_err).sum()
    # The candidate with the smallest error is the knee point
    knee_ind = candidates[np.nanargmin(errors)]
    return knee_ind
def find_optimal_timescale(P, max_t=100):
    VNEs = np.zeros(max_t)
    for t in tqdm(range(1, max_t + 1)):
        # Compute the t-th power of the diffusion operator and its VNE for each t
        Pt = np.linalg.matrix_power(P, t)
        VNEs[t-1] = von_Neumann_entropy(Pt)
    knee_point_ind = find_knee_point(VNEs)
    optimal_t = knee_point_ind + 1
    fig, ax = plt.subplots()
    ax.plot(range(len(VNEs)), VNEs)
    ax.scatter(knee_point_ind, VNEs[knee_point_ind], marker='o', color='r')
    plt.show()
    return optimal_t
In the case of the Diffusion map, eigendecomposing the diffusion operator directly yields coordinates that reflect the diffusion distance (the squared distance between the probability distributions obtained after t diffusion steps from each point). However, this distance is not very sensitive to small probability values, so it poorly reflects distances between far-apart points, i.e. the global structure of the data. PHATE therefore log-transforms the diffusion operator after t steps of diffusion and then computes the Euclidean distance; this is called the potential distance. The paper explains in detail what this newly introduced distance means physically (though I did not entirely follow it).
def potential_distances(P, t):
    # Raise the diffusion operator to the t-th power
    Pt = np.linalg.matrix_power(P, t)
    # Euclidean distances between the rows of -log(P^t)
    return squareform(pdist(-1.0 * np.log(Pt + 1e-7)))
Finally, instead of eigendecomposition as in the Diffusion map, the potential distance computed in step 3 is embedded into a low-dimensional space suitable for visualization by MDS (multidimensional scaling). First, low-dimensional coordinates are computed with classical MDS (= PCoA), and these are used as the initial values for metric MDS, which is run with scikit-learn's SMACOF algorithm.
from sklearn.manifold import smacof

def mds(dist, n_components=2):
    # Classical multidimensional scaling (= PCoA)
    n = len(dist)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -J.dot(dist**2).dot(J) / 2
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1]
    evals = evals[idx]
    evecs = evecs[:, idx]
    pos = np.where(evals > 0)[0]
    initial_coords = evecs[:, pos].dot(np.diag(np.sqrt(evals[pos])))[:, :n_components]
    # Metric multidimensional scaling (SMACOF), initialized with the PCoA coordinates
    coords = smacof(dist, n_components=n_components, init=initial_coords, n_init=1)[0]
    return coords
Now let's actually run the above steps on the tree-structured data.
diffop = diffusion_operator(tree_data, knn=15, alpha=40)
optimal_t = find_optimal_timescale(diffop)
print(optimal_t)
potential_dist = potential_distances(diffop, optimal_t)
coords = mds(potential_dist)
fig, ax = plt.subplots(figsize=(12, 9))
plot_2d(coords, tree_clusters, ax=ax)
plt.show()
We obtain a result similar to the original implementation. The actual PHATE package does not compute the algorithm as naively as above: each step includes devices to improve numerical stability, to save memory, and to speed things up with approximate computations. Still, the overall outline of the algorithm is as shown here.
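As a quick visual check of that claim, here is a small sketch (my addition, assuming `tree_data`, `tree_clusters`, `coords`, and `plot_2d` from above are still in scope) that plots the from-scratch embedding next to a fresh run of the official `phate` package:

# Side-by-side comparison of the from-scratch embedding and the official package
reference = phate.PHATE(knn=15, decay=40, n_components=2).fit_transform(tree_data)

fig, axes = plt.subplots(figsize=(12, 18), nrows=2)
plot_2d(coords, tree_clusters, ax=axes[0])      # our implementation
plot_2d(reference, tree_clusters, ax=axes[1])   # official phate
axes[0].set_title('From-scratch PHATE')
axes[1].set_title('phate package')
plt.show()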