publications
(*) denotes equal contribution.
2024
-
New and notable: Revisiting the “two cultures” through extrinsic noise
Gennady Gorin and Lior Pachter
Biophysical Journal, Jan 2024
A recent article by Grima and Esmenjaud draws attention to the unexpectedly complex effects of extrinsic noise on inference of transcriptional kinetics. We contrast the authors’ mechanistic approach with the descriptive, data-scientific methods used in single-cell RNA sequencing, and discuss broader philosophical connections to Leo Breiman’s “two cultures” of statistics.
-
Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data
Nature Methods, Jul 2024
Here we present biVI, which combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics of RNA molecules. We demonstrate on simulated and experimental single-cell RNA sequencing data that biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.
-
Spectral neural approximations for models of transcriptional dynamics
Biophysical Journal, Sep 2024
The advent of high-throughput transcriptomics provides an opportunity to advance mechanistic understanding of transcriptional processes and their connections to cellular function at an unprecedented, genome-wide scale. These transcriptional systems, which involve discrete stochastic events, are naturally modeled using chemical master equations (CMEs), which can be solved for probability distributions to fit biophysical rates that govern system dynamics. While CME models have been used as standards in fluorescence transcriptomics for decades to analyze single-species RNA distributions, there are often no closed-form solutions to CMEs that model multiple species, such as nascent and mature RNA transcript counts. This has prevented the application of standard likelihood-based statistical methods for analyzing high-throughput, multi-species transcriptomic datasets using biophysical models. Inspired by recent work in machine learning to learn solutions to complex dynamical systems, we leverage neural networks and statistical understanding of system distributions to produce accurate approximations to a steady-state bivariate distribution for a model of the RNA life cycle that includes nascent and mature molecules. The steady-state distribution to this simple model has no closed-form solution and requires intensive numerical solving techniques: our approach reduces likelihood evaluation time by several orders of magnitude. We demonstrate two approaches, whereby solutions are approximated by 1) learning the weights of kernel distributions with constrained parameters or 2) learning both weights and scaling factors for parameters of kernel distributions. We show that our strategies, denoted by kernel weight regression and parameter-scaled kernel weight regression, respectively, enable broad exploration of parameter space and can be used in existing likelihood frameworks to infer transcriptional burst sizes, RNA splicing rates, and mRNA degradation rates from experimental transcriptomic data.
-
Biophysically interpretable inference of cell types from multimodal sequencing data
Tara Chari, Gennady Gorin, and Lior Pachter
Nature Computational Science, Sep 2024
Multimodal, single-cell genomics technologies enable simultaneous measurement of multiple facets of DNA and RNA processing in the cell. This creates opportunities for transcriptome-wide, mechanistic studies of cellular processing in heterogeneous cell populations, such as regulation of cell fate by transcriptional stochasticity or tumor proliferation through aberrant splicing dynamics. However, current methods for determining cell types or ‘clusters’ in multimodal data often rely on ad hoc approaches to balance or integrate measurements, and assumptions ignoring inherent properties of the data. To enable interpretable and consistent cell cluster determination, we present meK-means (mechanistic K-means) which integrates modalities through a unifying model of transcription to learn underlying, shared biophysical states. With meK-means we can cluster cells with nascent and mature mRNA measurements, utilizing the causal, physical relationships between these modalities. This identifies shared transcription dynamics across cells, which induce the observed molecule counts, and provides an alternative definition for ‘clusters’ through the governing parameters of cellular processes.
2023
-
The telegraph process is not a subordinator
Gennady Gorin and Lior Pachter
bioRxiv, Jan 2023
Investigations of transcriptional models by Amrhein et al. outline a strategy for connecting steady-state distributions to process dynamics. We clarify its limitations: the strategy holds for a very narrow class of processes, which excludes an example given by the authors.
-
Length biases in single-cell RNA sequencing of pre-mRNA
Gennady Gorin and Lior Pachter
Biophysical Reports, Mar 2023
Single-cell RNA sequencing data can be modeled using Markov chains to yield genome-wide insights into transcriptional physics. However, quantitative inference with such data requires careful assessment of noise sources. We find that long pre-mRNA transcripts are over-represented in sequencing data. To explain this trend, we propose a length-based model of capture bias, which may produce false-positive observations. We solve this model and use it to find concordant parameter trends as well as systematic, mechanistically interpretable technical and biological differences in paired data sets.
-
Distinguishing biophysical stochasticity from technical noise in single-cell RNA sequencing using Monod
Gennady Gorin and Lior Pachter
bioRxiv, Apr 2023
We present the Python package Monod for the analysis of single-cell RNA sequencing count data through biophysical modeling. Monod naturally “integrates” unspliced and spliced count matrices, and provides a route to identifying and studying differential expression patterns that do not cause changes in average gene expression. The Monod framework is open-source and modular, and may be extended to more sophisticated models of variation and further experimental observables. The Monod package can be installed from the command line using pip install monod. The source code is available and maintained at https://github.com/pachterlab/monod. A separate repository, which contains sample data and Python notebooks for analysis with Monod, is accessible at https://github.com/pachterlab/monod_examples/. Structured documentation and tutorials are hosted at https://monod-examples.readthedocs.io/.
-
Assessing Markovian and Delay Models for Single-Nucleus RNA Sequencing
Gennady Gorin, Shawn Yoshida, and Lior Pachter
Bulletin of Mathematical Biology, Oct 2023
The serial nature of reactions involved in the RNA life-cycle motivates the incorporation of delays in models of transcriptional dynamics. The models couple a transcriptional process to a fairly general set of delayed monomolecular reactions with no feedback. We provide numerical strategies for calculating the RNA copy number distributions induced by these models, and solve several systems with splicing, degradation, and catalysis. An analysis of single-cell and single-nucleus RNA sequencing data using these models reveals that the kinetics of nuclear export do not appear to require invocation of a non-Markovian waiting time.
-
Studying stochastic systems biology of the cell with single-cell genomics data
Gennady Gorin, John J. Vastola, and Lior Pachter
Cell Systems, Oct 2023
Recent experimental developments in genome-wide RNA quantification hold considerable promise for systems biology. However, rigorously probing the biology of living cells requires a unified mathematical framework that accounts for single-molecule biological stochasticity in the context of technical variation associated with genomics assays. We review models for a variety of RNA transcription processes, as well as the encapsulation and library construction steps of microfluidics-based single-cell RNA sequencing, and present a framework to integrate these phenomena by the manipulation of generating functions. Finally, we use simulated scenarios and biological data to illustrate the implications and applications of the approach.
2022
-
Modeling bursty transcription and splicing with the chemical master equation
Gennady Gorin and Lior Pachter
Biophysical Journal, Feb 2022
Splicing cascades that alter gene products post-transcriptionally also affect expression dynamics. We study a class of processes and associated distributions that emerge from models of bursty promoters coupled to directed acyclic graphs of splicing. These solutions provide full time-dependent joint distributions for an arbitrary number of species with general noise behaviors and transient phenomena, offering qualitative and quantitative insights about how splicing can regulate expression dynamics. Finally, we derive a set of quantitative constraints on the minimum complexity necessary to reproduce gene coexpression patterns using synchronized burst models. We validate these findings by analyzing long-read sequencing data, where we find evidence of expression patterns largely consistent with these constraints.
-
RNA velocity unraveled
PLOS Computational Biology, Sep 2022
We perform a thorough analysis of RNA velocity methods, with a view towards understanding the suitability of the various assumptions underlying popular implementations. In addition to providing a self-contained exposition of the underlying mathematics, we undertake simulations and perform controlled experiments on biological datasets to assess workflow sensitivity to parameter choices and underlying biology. Finally, we argue for a more rigorous approach to RNA velocity, and present a framework for Markovian analysis that points to directions for improvement and mitigation of current problems.
-
Interpretable and tractable models of transcriptional noise for the rational design of single-molecule quantification experiments
Nature Communications, Dec 2022
The question of how cell-to-cell differences in transcription rate affect RNA count distributions is fundamental for understanding biological processes underlying transcription. Answering this question requires quantitative models that are both interpretable (describing concrete biophysical phenomena) and tractable (amenable to mathematical analysis). This enables the identification of experiments which best discriminate between competing hypotheses. As a proof of principle, we introduce a simple but flexible class of models involving a continuous stochastic transcription rate driving a discrete RNA transcription and splicing process, and compare and contrast two biologically plausible hypotheses about transcription rate variation. One assumes variation is due to DNA experiencing mechanical strain, while the other assumes it is due to regulator number fluctuations. We introduce a framework for numerically and analytically studying such models, and apply Bayesian model selection to identify candidate genes that show signatures of each model in single-cell transcriptomic data from mouse glutamatergic neurons.
2021
-
Analytic solution of chemical master equations involving gene switching. I: Representation theory and diagrammatic approach to exact solution
arXiv, Mar 2021
The chemical master equation (CME), which describes the discrete and stochastic molecule number dynamics associated with biological processes like transcription, is difficult to solve analytically. It is particularly hard to solve for models involving bursting/gene switching, a biological feature that tends to produce heavy-tailed single cell RNA counts distributions. In this paper, we present a novel method for computing exact and analytic solutions to the CME in such cases, and use these results to explore approximate solutions valid in different parameter regimes, and to compute observables of interest. Our method leverages tools inspired by quantum mechanics, including ladder operators and Feynman-like diagrams, and establishes close formal parallels between the dynamics of bursty transcription, and the dynamics of bosons interacting with a single fermion. We focus on two problems: (i) the chemical birth-death process coupled to a switching gene/the telegraph model, and (ii) a model of transcription and multistep splicing involving a switching gene and an arbitrary number of downstream splicing steps. We work out many special cases, and exhaustively explore the special functionology associated with these problems. This is Part I in a two-part series of papers; in Part II, we explore an alternative solution approach that is more useful for numerically solving these problems, and apply it to parameter inference on simulated RNA counts data.
2020
-
Protein velocity and acceleration from single-cell multiomics experiments
Gennady Gorin, Valentine Svensson, and Lior Pachter
Genome Biology, Feb 2020
The simultaneous quantification of protein and RNA makes possible the inference of past, present, and future cell states from single experimental snapshots. To enable such temporal analysis from multimodal single-cell experiments, we introduce an extension of the RNA velocity method that leverages estimates of unprocessed transcript and protein abundances to extrapolate cell states. We apply the model to six datasets and demonstrate consistency among cell landscapes and phase portraits. The analysis software is available as the protaccel Python package.
-
Stochastic simulation and statistical inference platform for visualization and estimation of transcriptional kinetics
PLOS ONE, Mar 2020
Recent advances in single-molecule fluorescent imaging have enabled quantitative measurements of transcription at a single gene copy, yet an accurate understanding of transcriptional kinetics is still lacking due to the difficulty of solving detailed biophysical models. Here we introduce a stochastic simulation and statistical inference platform for modeling detailed transcriptional kinetics in prokaryotic systems, which has not been solved analytically. The model includes stochastic two-state gene activation, mRNA synthesis initiation and stepwise elongation, release to the cytoplasm, and stepwise co-transcriptional degradation. Using the Gillespie algorithm, the platform simulates nascent and mature mRNA kinetics of a single gene copy and predicts fluorescent signals measurable by time-lapse single-cell mRNA imaging, for different experimental conditions. To approach the inverse problem of estimating the kinetic parameters of the model from experimental data, we develop a heuristic optimization method based on the genetic algorithm and the empirical distribution of mRNA generated by simulation. As a demonstration, we show that the optimization algorithm can successfully recover the transcriptional kinetics of simulated and experimental gene expression data. The platform is available as a MATLAB software package at https://data.caltech.edu/records/1287.
-
Special function methods for bursty models of transcription
Gennady Gorin and Lior Pachter
Physical Review E, Aug 2020
We explore a Markov model used in the analysis of gene expression, involving the bursty production of pre-mRNA, its conversion to mature mRNA, and its consequent degradation. We demonstrate that the integration used to compute the solution of the stochastic system can be approximated by the evaluation of special functions. Furthermore, the form of the special function solution generalizes to a broader class of burst distributions. In light of the broader goal of biophysical parameter inference from transcriptomics data, we apply the method to simulated data, demonstrating effective control of precision and runtime. Finally, we propose and validate a non-Bayesian approach for parameter estimation based on the characteristic function of the target joint distribution of pre-mRNA and mRNA.
-
Intrinsic and extrinsic noise are distinguishable in a synthesis – export – degradation model of mRNA production
Gennady Gorin and Lior Pachter
bioRxiv, Sep 2020
Intrinsic and extrinsic noise sources in gene expression, originating respectively from transcriptional stochasticity and from differences between cells, complicate the determination of transcriptional models. In particularly degenerate cases, the two noise sources are altogether impossible to distinguish. However, the incorporation of downstream processing, such as the mRNA splicing and export implicated in gene expression buffering, recovers the ability to identify the relevant source of noise. We report analytical copy-number distributions, discuss the noise sources’ qualitative effects on lower moments, and provide simulation routines for both models.