MultiVelo Fig2

We will use the embryonic E18 mouse brain from 10X Multiome as an example.

CellRanger output files can be downloaded from 10X website. Crucially, the filtered feature barcode matrix folder, ATAC peak annotations TSV, and the feature linkage BEDPE file in the secondary analysis outputs folder will be needed in this demo.

Quantified unspliced and spliced counts from Velocyto can be downloaded from MultiVelo GitHub page.

We provide the cell annotations for this dataset in “cell_annotations.tsv” on the GitHub page. (To download from GitHub, click on the file, then click “Raw” on the top right corner. If it opens in your browser, you can download the page as a text file.)

Weighted nearest neighbors from Seurat can be downloaded from GitHub folder “seurat_wnn”, which contains a zip file of three files: “nn_cells.txt”, “nn_dist.txt”, and “nn_idx.txt”. Please unzip the archive after downloading. The R script used to generate such files can also be found in the same folder.

.
|-- MultiVelo_Fig2.ipynb
|-- cell_annotations.tsv
|-- outs
|   |-- analysis
|   |   `-- feature_linkage
|   |       `-- feature_linkage.bedpe
|   |-- filtered_feature_bc_matrix
|   |   |-- barcodes.tsv.gz
|   |   |-- features.tsv.gz
|   |   `-- matrix.mtx.gz
|   `-- peak_annotation.tsv
|-- seurat_wnn
|   |-- nn_cells.txt
|   |-- nn_dist.txt
|   `-- nn_idx.txt
`-- velocyto
    `-- 10X_multiome_mouse_brain.loom

Important Note:

This notebook is designed to run on the current commit of Multivelo on GitHub. It utilizes features that are not yet available on the current pip release. As such, it is recommended to download Multivelo from GitHub and use that instead of the version available on PyPI.

[1]:

%load_ext autoreload
%autoreload 2

import os
import scipy
import numpy as np
import pandas as pd
import sys
import multivelo as mv
import scanpy as sc
import scvelo as scv
import matplotlib.pyplot as plt

sys.path.append("/..")

[2]:

import time

mv.settings.LOG_FILENAME = "Fig2_" + str(time.time()) + ".txt"

[3]:

scv.settings.verbosity = 3
scv.settings.presenter_view = True
scv.set_figure_params('scvelo')
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 200)
np.set_printoptions(suppress=True)

Reading in unspliced and spliced counts

[4]:

adata_rna = scv.read("velocyto/10X_multiome_mouse_brain.loom", cache=True)
adata_rna.obs_names = [x.split(':')[1][:-1] + '-1' for x in adata_rna.obs_names]
adata_rna.var_names_make_unique()

[5]:

sc.pp.filter_cells(adata_rna, min_counts=1000)
sc.pp.filter_cells(adata_rna, max_counts=20000)

[6]:

# Top 1000 variable genes are used for downstream analyses.
scv.pp.filter_and_normalize(adata_rna, min_shared_counts=10, n_top_genes=1000)

Filtered out 21580 genes that are detected 10 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 1000 highly variable genes.
Logarithmized X.

[7]:

# Load cell annotations
cell_annot = pd.read_csv('cell_annotations.tsv', sep='\t', index_col=0)

[8]:

adata_rna = adata_rna[cell_annot.index,:]
adata_rna.obs['celltype'] = cell_annot['celltype']

[9]:

# We subset for lineages towards neurons and astrocytes.
adata_rna = adata_rna[adata_rna.obs['celltype'].isin(['RG, Astro, OPC',
                                                      'IPC',
                                                      'V-SVZ',
                                                      'Upper Layer',
                                                      'Deeper Layer',
                                                      'Ependymal cells',
                                                      'Subplate'])]
adata_rna

[9]:

View of AnnData object with n_obs × n_vars = 3653 × 1000
    obs: 'n_counts', 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'celltype'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'gene_count_corr', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'

Preprocessing the ATAC counts

[10]:

adata_atac = sc.read_10x_mtx('outs/filtered_feature_bc_matrix/', var_names='gene_symbols', cache=True, gex_only=False)
adata_atac = adata_atac[:,adata_atac.var['feature_types'] == "Peaks"]

[11]:

# We aggregate peaks around each gene as well as those that have high correlations with promoter peak or gene expression.
# Peak annotation contains the metadata for all peaks.
# Feature linkage contains pairs of correlated genomic features.
adata_atac = mv.aggregate_peaks_10x(adata_atac,
                                    'outs/peak_annotation.tsv',
                                    'outs/analysis/feature_linkage/feature_linkage.bedpe')

CellRanger ARC identified as 1.0.0

Found 19006 genes with promoter peaks

[12]:

# Let's examine the total count distribution and remove outliers.
# plt.hist(adata_atac.X.sum(1), bins=100, range=(0, 100000));

[13]:

sc.pp.filter_cells(adata_atac, min_counts=2000)
sc.pp.filter_cells(adata_atac, max_counts=60000)

[14]:

# We normalize aggregated peaks with TF-IDF.
mv.tfidf_norm(adata_atac)

Finding shared barcodes and features between RNA and ATAC

[15]:

shared_cells = pd.Index(np.intersect1d(adata_rna.obs_names, adata_atac.obs_names))
shared_genes = pd.Index(np.intersect1d(adata_rna.var_names, adata_atac.var_names))
len(shared_cells), len(shared_genes)

[15]:

(3365, 936)

[16]:

# We reload in the raw data and continue with a subset of cells.
adata_rna = scv.read("velocyto/10X_multiome_mouse_brain.loom", cache=True)
adata_rna.obs_names = [x.split(':')[1][:-1] + '-1' for x in adata_rna.obs_names]
adata_rna.var_names_make_unique()

[17]:

adata_rna = adata_rna[shared_cells, shared_genes]
adata_atac = adata_atac[shared_cells, shared_genes]

[18]:

scv.pp.normalize_per_cell(adata_rna)
scv.pp.log1p(adata_rna)
scv.pp.moments(adata_rna, n_pcs=30, n_neighbors=50)

Normalized count data: X, spliced, unspliced.
computing neighbors
    finished (0:00:02) --> added
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:00) --> added
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)

[19]:

adata_rna.obs['celltype'] = cell_annot.loc[adata_rna.obs_names, 'celltype']
adata_rna.obs['celltype'] = adata_rna.obs['celltype'].astype('category')

[20]:

# Reorder the categories for color consistency with the manuscript.
all_clusters = ['Upper Layer',
                'Deeper Layer',
                'V-SVZ',
                'RG, Astro, OPC',
                'Ependymal cells',
                'IPC',
                'Subplate']
adata_rna.obs['celltype'] = adata_rna.obs['celltype'].cat.reorder_categories(all_clusters)

[21]:

scv.tl.umap(adata_rna)
scv.pl.umap(adata_rna, color='celltype')

Smoothing gene aggregagted peaks by neighbors

[22]:

# Write out filtered cells and prepare to run Seurat WNN --> R script can be found on Github.
adata_rna.obs_names.to_frame().to_csv('filtered_cells.txt', header=False, index=False)

[23]:

# Read in Seurat WNN neighbors.
nn_idx = np.loadtxt("seurat_wnn/nn_idx.txt", delimiter=',')
nn_dist = np.loadtxt("seurat_wnn/nn_dist.txt", delimiter=',')
nn_cells = pd.Index(pd.read_csv("seurat_wnn/nn_cells.txt", header=None)[0])

# Make sure cell names match.
np.all(nn_cells == adata_atac.obs_names)

[23]:

True

[24]:

mv.knn_smooth_chrom(adata_atac, nn_idx, nn_dist)

Running multi-omic dynamical model

MultiVelo incorporates chromatin accessibility information into RNA velocity and achieves better lineage predictions.

The detailed argument list can be shown with “help(mv.recover_dynamics_chrom)”.

[25]:

adata_rna_scv = adata_rna.copy()

[26]:

scv.tl.recover_dynamics(adata_rna_scv)
scv.tl.velocity(adata_rna_scv, mode="dynamical")
scv.tl.velocity_graph(adata_rna_scv, n_jobs=1)
scv.tl.latent_time(adata_rna_scv)
scv.pl.velocity_embedding_stream(adata_rna_scv, basis='umap', color='celltype')

recovering dynamics (using 1/36 cores)

    finished (0:05:54) --> added
    'fit_pars', fitted parameters for splicing dynamics (adata.var)
computing velocities
    finished (0:00:02) --> added
    'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph (using 1/36 cores)

    finished (0:00:09) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
computing terminal states
    identified 1 region of root cells and 1 region of end points .
    finished (0:00:00) --> added
    'root_cells', root cells of Markov diffusion process (adata.obs)
    'end_points', end points of Markov diffusion process (adata.obs)
computing latent time using root_cells as prior
    finished (0:00:00) --> added
    'latent_time', shared time (adata.obs)
computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)

[27]:

scv.tl.recover_dynamics(adata_rna)
scv.tl.velocity(adata_rna, mode="dynamical")
scv.tl.velocity_graph(adata_rna, n_jobs=1)

recovering dynamics (using 1/36 cores)

    finished (0:05:51) --> added
    'fit_pars', fitted parameters for splicing dynamics (adata.var)
computing velocities
    finished (0:00:02) --> added
    'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph (using 1/36 cores)

    finished (0:00:08) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)

[28]:

# This will take a while. Parallelization is high recommended.

mv.settings.VERBOSITY = 0

adata_result = mv.recover_dynamics_chrom(adata_rna,
                                        adata_atac,
                                        max_iter=5,
                                        init_mode="invert",
                                        parallel=True,
                                        n_jobs = 10,
                                        save_plot=False,
                                        rna_only=False,
                                        fit=True,
                                        n_anchors=500,
                                        extra_color_key='celltype'
                                        )

[29]:

# Save the result for use later on
adata_result.write("multivelo_result_fig2.h5ad")

[30]:

adata_result = sc.read_h5ad("multivelo_result_fig2.h5ad")

Computing velocity stream and latent time

[31]:

mv.velocity_graph(adata_result)
mv.latent_time(adata_result)

computing velocity graph (using 1/36 cores)

    finished (0:00:06) --> added
    'velo_s_norm_graph', sparse matrix with cosine correlations (adata.uns)
computing terminal states
    identified 1 region of root cells and 1 region of end points .
    finished (0:00:00) --> added
    'root_cells', root cells of Markov diffusion process (adata.obs)
    'end_points', end points of Markov diffusion process (adata.obs)
computing latent time using root_cells as prior
    finished (0:00:00) --> added
    'latent_time', shared time (adata.obs)

Fig 2a

[32]:

mv.velocity_embedding_stream(adata_result, basis='umap', color='celltype', show=True)

computing velocity embedding
    finished (0:00:00) --> added
    'velo_s_norm_umap', embedded velocity vectors (adata.obsm)

[33]:

scv.pl.scatter(adata_result, color='latent_time', color_map='gnuplot', size=80)

Fig 2b

[34]:

scv.pl.velocity_embedding_stream(adata_result, basis='umap', color='celltype')

computing velocity embedding
    finished (0:00:00) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)

Fig 2c

[35]:

gcc_result = scv.tl.score_genes_cell_cycle(adata_result, copy=True)
scv.pl.scatter(gcc_result, color='G2M_score - S_score', size=80, frameon=True, title="Cell cycle score")

calculating cell cycle phase
-->     'S_score' and 'G2M_score', scores of cell cycle phases (adata.obs)

Fig 2d

[36]:

gene_list = ["Eomes", "Tle4"]
mv.scatter_plot(adata_result, by='us', show_anchors=False, color_by='celltype', genes=gene_list)
mv.scatter_plot(adata_result, by='cu', show_anchors=False, color_by='celltype', genes=gene_list)

Fig 2e

[37]:

gene_list = ["Satb2", "Gria2"]
mv.scatter_plot(adata_result, color_by='c', genes=gene_list)

Fig 2f

[38]:

gene_list = ["Satb2", "Gria2"]
mv.scatter_plot(adata_result, by='us', color_by='celltype', genes=gene_list)
mv.scatter_plot(adata_result, by='cu', color_by='celltype', genes=gene_list)

Fig 2g

[39]:

# get the model 1 genes
adata1 = adata_result[:, adata_result.var['fit_model'].values == 1]

# get the model 2 genes
adata2 = adata_result[:, adata_result.var['fit_model'].values == 2]

# heatmaps!
scv.pl.heatmap(adata1, var_names=adata1.var_names, col_color='celltype', figsize=(8, 5))
scv.pl.heatmap(adata2, var_names=adata2.var_names, col_color='celltype', figsize=(8, 5))

Fig 2h

[40]:

mv.pie_summary(adata_result)

Fig 2i

[41]:

gene_list = ['Bmper', 'Robo2', 'Meis2', 'Grin2b', "Tle4", "Igfbpl1", "Tnc", "Epha5"]
mv.scatter_plot(adata_result, by='cus', n_cols=4, color_by='celltype', downsample=3, show_anchors=False, velocity_arrows=True, genes=gene_list)