In brief: I design and implement novel computational methods to efficiently and accurately fit sequencing datasets.
Despite the advantages of stochastic modeling, many challenges remain. Practically all interesting biophysical systems are intractable, and require approximation to solve. In other words, to do statistics, we need to compute likelihoods; these likelihoods cannot typically be written down in a simple form. To fit data, it is necessary to create new algorithms.
A considerable portion of my Ph.D. work entailed the development of solvers for biological stochastic process distributions. The Monod numerical package is designed to fit and analyze distributions for thousands of genes, using a generating function-based method tailored to multimodal data.
However, this approach requires numerically intensive integration and Fourier transformation, which limits its applicability to relatively simple systems and questions. To bypass these problems, I designed several alternative approaches, approximating spliced and unspliced RNA distributions by special functions and combinations of basis distributions.
Fitting multimodal data requires accurate solvers, based on a variety of principles.
References
2024
Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data
Here we present biVI, which combines the variational autoencoder framework of scVI with biophysical models describing the transcription and splicing kinetics of RNA molecules. We demonstrate on simulated and experimental single-cell RNA sequencing data that biVI retains the variational autoencoder’s ability to capture cell type structure in a low-dimensional space while further enabling genome-wide exploration of the biophysical mechanisms, such as system burst sizes and degradation rates, that underlie observations.
Spectral neural approximations for models of transcriptional dynamics
The advent of high-throughput transcriptomics provides an opportunity to advance mechanistic understanding of transcriptional processes and their connections to cellular function at an unprecedented, genome-wide scale. These transcriptional systems, which involve discrete stochastic events, are naturally modeled using chemical master equations (CMEs), which can be solved for probability distributions to fit biophysical rates that govern system dynamics. While CME models have been used as standards in fluorescence transcriptomics for decades to analyze single-species RNA distributions, there are often no closed-form solutions to CMEs that model multiple species, such as nascent and mature RNA transcript counts. This has prevented the application of standard likelihood-based statistical methods for analyzing high-throughput, multi-species transcriptomic datasets using biophysical models. Inspired by recent work in machine learning to learn solutions to complex dynamical systems, we leverage neural networks and statistical understanding of system distributions to produce accurate approximations to a steady-state bivariate distribution for a model of the RNA life cycle that includes nascent and mature molecules. The steady-state distribution to this simple model has no closed-form solution and requires intensive numerical solving techniques: our approach reduces likelihood evaluation time by several orders of magnitude. We demonstrate two approaches, whereby solutions are approximated by 1) learning the weights of kernel distributions with constrained parameters or 2) learning both weights and scaling factors for parameters of kernel distributions. We show that our strategies, denoted by kernel weight regression and parameter-scaled kernel weight regression, respectively, enable broad exploration of parameter space and can be used in existing likelihood frameworks to infer transcriptional burst sizes, RNA splicing rates, and mRNA degradation rates from experimental transcriptomic data.
Biophysically interpretable inference of cell types from multimodal sequencing data
Multimodal, single-cell genomics technologies enable simultaneous measurement of multiple facets of DNA and RNA processing in the cell. This creates opportunities for transcriptome-wide, mechanistic studies of cellular processing in heterogeneous cell populations, such as regulation of cell fate by transcriptional stochasticity or tumor proliferation through aberrant splicing dynamics. However, current methods for determining cell types or ‘clusters’ in multimodal data often rely on ad hoc approaches to balance or integrate measurements, and assumptions ignoring inherent properties of the data. To enable interpretable and consistent cell cluster determination, we present meK-means (mechanistic K-means) which integrates modalities through a unifying model of transcription to learn underlying, shared biophysical states. With meK-means we can cluster cells with nascent and mature mRNA measurements, utilizing the causal, physical relationships between these modalities. This identifies shared transcription dynamics across cells, which induce the observed molecule counts, and provides an alternative definition for ‘clusters’ through the governing parameters of cellular processes.
2023
Distinguishing biophysical stochasticity from technical noise in single-cell RNA sequencing using Monod
We present the Python package Monod for the analysis of single-cell RNA sequencing count data through biophysical modeling. Monod naturally “integrates” unspliced and spliced count matrices, and provides a route to identifying and studying differential expression patterns that do not cause changes in average gene expression. The Monod framework is open-source and modular, and may be extended to more sophisticated models of variation and further experimental observables. The Monod package can be installed from the command line using pip install monod. The source code is available and maintained at https://github.com/pachterlab/monod. A separate repository, which contains sample data and Python notebooks for analysis with Monod, is accessible at https://github.com/pachterlab/monod_examples/. Structured documentation and tutorials are hosted at https://monod-examples.readthedocs.io/.
2021
Analytic solution of chemical master equations involving gene switching. I: Representation theory and diagrammatic approach to exact solution
The chemical master equation (CME), which describes the discrete and stochastic molecule number dynamics associated with biological processes like transcription, is difficult to solve analytically. It is particularly hard to solve for models involving bursting/gene switching, a biological feature that tends to produce heavy-tailed single cell RNA counts distributions. In this paper, we present a novel method for computing exact and analytic solutions to the CME in such cases, and use these results to explore approximate solutions valid in different parameter regimes, and to compute observables of interest. Our method leverages tools inspired by quantum mechanics, including ladder operators and Feynman-like diagrams, and establishes close formal parallels between the dynamics of bursty transcription, and the dynamics of bosons interacting with a single fermion. We focus on two problems: (i) the chemical birth-death process coupled to a switching gene/the telegraph model, and (ii) a model of transcription and multistep splicing involving a switching gene and an arbitrary number of downstream splicing steps. We work out many special cases, and exhaustively explore the special functionology associated with these problems. This is Part I in a two-part series of papers; in Part II, we explore an alternative solution approach that is more useful for numerically solving these problems, and apply it to parameter inference on simulated RNA counts data.
2020
Special function methods for bursty models of transcription
We explore a Markov model used in the analysis of gene expression, involving the bursty production of pre-mRNA, its conversion to mature mRNA, and its consequent degradation. We demonstrate that the integration used to compute the solution of the stochastic system can be approximated by the evaluation of special functions. Furthermore, the form of the special function solution generalizes to a broader class of burst distributions. In light of the broader goal of biophysical parameter inference from transcriptomics data, we apply the method to simulated data, demonstrating effective control of precision and runtime. Finally, we propose and validate a non-Bayesian approach for parameter estimation based on the characteristic function of the target joint distribution of pre-mRNA and mRNA.