Statistics

Leveraging multiomics to discover biology

In brief: I develop methods and analyses for multiomic data, with a particular focus on understanding its benefits over standard scRNA-seq.

Over the past decade, the increasing affordability of DNA sequencing has made RNA quantification more practical and spurred improvements in sequencing. Yet transcriptomics is only part of the sought-after comprehensive picture of single-cell biology. To hone understanding of biological regulation, it is necessary to quantify more modalities involved in information transfer: DNA and protein molecules, as well as transient and alternative RNA isoforms.

Simultaneously with advances in RNA sequencing, two factors have contributed to multimodal -omics: the development of analogous assays for chromatin accessibility and protein abundance, and the release of software infrastructure to quantify non-coding RNA molecules. If we run a scRNA-seq experiment, we get information about splicing for free. If we develop a slightly more sophisticated methodology, we can collect information about other modalities, either closer to gene regulation or to its effects.

Yet the correct way to jointly analyze such data is far from clear. Many data integration methods exist, but they tend to be descriptive and phenomenological, without encoding the causal relationships of the central dogma. For example, a single-nucleus RNA sequencing analysis will simply add spliced and unspliced RNA counts without accounting for their meaningful differences. One of the key goals of my Ph.D. work has been embracing multimodality: if we have interesting readouts, we should exploit them as much as possible.

My theoretical work shows that the whole is greater than the sum of its parts. For instance, having both spliced and unspliced molecule counts allows us to distinguish between transcriptional mechanisms and regulatory strategies that would be indistinguishable using a single modality.
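As a rough illustration of this point (a sketch under stated assumptions, not the derivation used in the papers below), consider two textbook steady-state models of nascent (unspliced) and mature (spliced) RNA. In the first, transcription occurs in geometric bursts (intrinsic noise); in the second, transcription is constitutive but its rate varies between cells according to a gamma distribution (extrinsic noise). The function names and parameter values are hypothetical, and the moment formulas are the standard results for these simple models with first-order splicing (rate `beta`) and degradation (rate `gamma`). The extrinsic parameters are chosen so that nascent counts match in mean and Fano factor, yet the joint statistics differ.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    mean_nascent: float
    fano_nascent: float
    mean_mature: float
    fano_mature: float
    covariance: float  # Cov(nascent, mature)


def bursty_moments(k, b, beta, gamma):
    """Intrinsic noise: bursts arrive at rate k with geometric size of mean b."""
    return Moments(
        mean_nascent=k * b / beta,
        fano_nascent=1 + b,
        mean_mature=k * b / gamma,
        fano_mature=1 + b * beta / (beta + gamma),
        covariance=k * b**2 / (beta + gamma),
    )


def extrinsic_moments(alpha, theta, beta, gamma):
    """Extrinsic noise: constitutive transcription at a per-cell rate
    K ~ Gamma(shape=alpha, scale=theta); counts are Poisson given K."""
    return Moments(
        mean_nascent=alpha * theta / beta,
        fano_nascent=1 + theta / beta,
        mean_mature=alpha * theta / gamma,
        fano_mature=1 + theta / gamma,
        covariance=alpha * theta**2 / (beta * gamma),
    )


# Hypothetical parameter values, chosen only for illustration.
k, b, beta, gamma = 2.0, 10.0, 1.0, 0.5

bursty = bursty_moments(k, b, beta, gamma)
# Match the nascent mean and Fano factor: alpha = k / beta, theta = b * beta.
extrinsic = extrinsic_moments(alpha=k / beta, theta=b * beta, beta=beta, gamma=gamma)

print("nascent mean:", bursty.mean_nascent, extrinsic.mean_nascent)
print("nascent Fano:", bursty.fano_nascent, extrinsic.fano_nascent)
print("mature Fano: ", bursty.fano_mature, extrinsic.fano_mature)
print("covariance:  ", bursty.covariance, extrinsic.covariance)
```

With nascent counts alone, the two parameterizations agree in mean and Fano factor; adding the mature modality separates them through the mature Fano factor and the nascent-mature covariance.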

I am particularly interested in the interplay and connections between mechanistic and descriptive models. One of my recent projects, biVI, integrates spliced and unspliced data by endowing a machine learning framework with a biologically meaningful representation of the relationship between these molecules. In other words, the mechanistic worldview can naturally represent well-understood parts of the biophysics, whereas neural networks can represent currently uncharacterized “black box” parts.

Figure panels: Multimodal data allows summary of regulatory strategies; Multimodal data allows disambiguation of models; Multimodal data can be integrated using a physically founded ML method.

Figure caption: Multimodal data provides advantages in inference and model identification, and can be "integrated" using stochastic modeling.

References

2024

Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data

Maria Carilli*, Gennady Gorin*, Yongin Choi, Tara Chari, and Lior Pachter

Nature Methods, Jul 2024

Here we present biVI, which combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics of RNA molecules. We demonstrate on simulated and experimental single-cell RNA sequencing data that biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.

Biophysically interpretable inference of cell types from multimodal sequencing data

Tara Chari, Gennady Gorin, and Lior Pachter

Nature Computational Science, Sep 2024

Multimodal, single-cell genomics technologies enable simultaneous measurement of multiple facets of DNA and RNA processing in the cell. This creates opportunities for transcriptome-wide, mechanistic studies of cellular processing in heterogeneous cell populations, such as regulation of cell fate by transcriptional stochasticity or tumor proliferation through aberrant splicing dynamics. However, current methods for determining cell types or ‘clusters’ in multimodal data often rely on ad hoc approaches to balance or integrate measurements, and assumptions ignoring inherent properties of the data. To enable interpretable and consistent cell cluster determination, we present meK-means (mechanistic K-means) which integrates modalities through a unifying model of transcription to learn underlying, shared biophysical states. With meK-means we can cluster cells with nascent and mature mRNA measurements, utilizing the causal, physical relationships between these modalities. This identifies shared transcription dynamics across cells, which induce the observed molecule counts, and provides an alternative definition for ‘clusters’ through the governing parameters of cellular processes.

2023

Length biases in single-cell RNA sequencing of pre-mRNA

Gennady Gorin and Lior Pachter

Biophysical Reports, Mar 2023

Single-cell RNA sequencing data can be modeled using Markov chains to yield genome-wide insights into transcriptional physics. However, quantitative inference with such data requires careful assessment of noise sources. We find that long pre-mRNA transcripts are over-represented in sequencing data. To explain this trend, we propose a length-based model of capture bias, which may produce false-positive observations. We solve this model and use it to find concordant parameter trends as well as systematic, mechanistically interpretable technical and biological differences in paired data sets.

Studying stochastic systems biology of the cell with single-cell genomics data

Gennady Gorin, John J. Vastola, and Lior Pachter

Cell Systems, Oct 2023

Recent experimental developments in genome-wide RNA quantification hold considerable promise for systems biology. However, rigorously probing the biology of living cells requires a unified mathematical framework that accounts for single-molecule biological stochasticity in the context of technical variation associated with genomics assays. We review models for a variety of RNA transcription processes, as well as the encapsulation and library construction steps of microfluidics-based single-cell RNA sequencing, and present a framework to integrate these phenomena by the manipulation of generating functions. Finally, we use simulated scenarios and biological data to illustrate the implications and applications of the approach.

2022

Interpretable and tractable models of transcriptional noise for the rational design of single-molecule quantification experiments

Gennady Gorin*, John J. Vastola*, Meichen Fang, and Lior Pachter

Nature Communications, Dec 2022

The question of how cell-to-cell differences in transcription rate affect RNA count distributions is fundamental for understanding biological processes underlying transcription. Answering this question requires quantitative models that are both interpretable (describing concrete biophysical phenomena) and tractable (amenable to mathematical analysis). This enables the identification of experiments which best discriminate between competing hypotheses. As a proof of principle, we introduce a simple but flexible class of models involving a continuous stochastic transcription rate driving a discrete RNA transcription and splicing process, and compare and contrast two biologically plausible hypotheses about transcription rate variation. One assumes variation is due to DNA experiencing mechanical strain, while the other assumes it is due to regulator number fluctuations. We introduce a framework for numerically and analytically studying such models, and apply Bayesian model selection to identify candidate genes that show signatures of each model in single-cell transcriptomic data from mouse glutamatergic neurons.

2020

Protein velocity and acceleration from single-cell multiomics experiments

Gennady Gorin, Valentine Svensson, and Lior Pachter

Genome Biology, Feb 2020

The simultaneous quantification of protein and RNA makes possible the inference of past, present, and future cell states from single experimental snapshots. To enable such temporal analysis from multimodal single-cell experiments, we introduce an extension of the RNA velocity method that leverages estimates of unprocessed transcript and protein abundances to extrapolate cell states. We apply the model to six datasets and demonstrate consistency among cell landscapes and phase portraits. The analysis software is available as the protaccel Python package.

Intrinsic and extrinsic noise are distinguishable in a synthesis – export – degradation model of mRNA production

Gennady Gorin and Lior Pachter

bioRxiv, Sep 2020

Intrinsic and extrinsic noise sources in gene expression, originating respectively from transcriptional stochasticity and from differences between cells, complicate the determination of transcriptional models. In particularly degenerate cases, the two noise sources are altogether impossible to distinguish. However, the incorporation of downstream processing, such as the mRNA splicing and export implicated in gene expression buffering, recovers the ability to identify the relevant source of noise. We report analytical copy-number distributions, discuss the noise sources’ qualitative effects on lower moments, and provide simulation routines for both models.