In brief: I design and implement novel computational methods to efficiently and accurately fit sequencing datasets.
Despite the advantages of stochastic modeling, many challenges remain. Practically all interesting biophysical systems are intractable, and require approximation to solve. In other words, to do statistics, we need to compute likelihoods; these likelihoods cannot typically be written down in a simple form. To fit data, it is necessary to create new algorithms.
A considerable portion of my Ph.D. work entailed the development of solvers for biological stochastic process distributions. The Monod numerical package is designed to fit and analyze distributions for thousands of genes, using a generating function-based method tailored to multimodal data.
However, this approach requires numerically intensive integration and Fourier transformation, which limits its applicability to relatively simple systems and questions. To bypass these problems, I designed several alternative approaches, approximating spliced and unspliced RNA distributions by special functions and combinations of basis distributions.
Fitting multimodal data requires accurate solvers, based on a variety of principles.
References
2023
Distinguishing biophysical stochasticity from technical noise in single-cell RNA sequencing using Monod
We present the Python package Monod for the analysis of single-cell RNA sequencing count data through biophysical modeling. Monod naturally “integrates” unspliced and spliced count matrices, and provides a route to identifying and studying differential expression patterns that do not cause changes in average gene expression. The Monod framework is open-source and modular, and may be extended to more sophisticated models of variation and further experimental observables. The Monod package can be installed from the command line using pip install monod. The source code is available and maintained at https://github.com/pachterlab/monod. A separate repository, which contains sample data and Python notebooks for analysis with Monod, is accessible at https://github.com/pachterlab/monod_examples/. Structured documentation and tutorials are hosted at https://monod-examples.readthedocs.io/.
Biophysical modeling with variational autoencoders for bimodal, single-cell RNA sequencing data
We motivate and present biVI, which combines the variational autoencoder framework of scVI with biophysically motivated, bivariate models for nascent and mature RNA distributions. While previous approaches to integrate bimodal data via the variational autoencoder framework ignore the causal relationship between measurements, biVI models the biophysical processes that give rise to observations. We demonstrate through simulated benchmarking that biVI captures cell type structure in a low-dimensional space and accurately recapitulates parameter values and copy number distributions. On biological data, biVI provides a scalable route for identifying the biophysical mechanisms underlying gene expression. This analytical approach outlines a generalizable strategy for treating multimodal datasets generated by high-throughput, single-cell genomic assays.
Biophysically Interpretable Inference of Cell Types from Multimodal Sequencing Data
Multimodal, single-cell genomics technologies enable simultaneous capture of multiple facets of DNA and RNA processing in the cell. This creates opportunities for transcriptome-wide, mechanistic studies of cellular processing in heterogeneous cell types, with applications ranging from inferring kinetic differences between cells, to the role of stochasticity in driving heterogeneity. However, current methods for determining cell types or ‘clusters’ present in multimodal data often rely on ad hoc or independent treatment of modalities, and assumptions ignoring inherent properties of the count data. To enable interpretable and consistent cell cluster determination from multimodal data, we present meK-Means (mechanistic K-Means) which integrates modalities and learns underlying, shared biophysical states through a unifying model of transcription. In particular, we demonstrate how meK-Means can be used to cluster cells from unspliced and spliced mRNA count modalities. By utilizing the causal, physical relationships underlying these modalities, we identify shared transcriptional kinetics across cells, which induce the observed gene expression profiles, and provide an alternative definition for ’clusters’ through the governing parameters of cellular processes.
2022
Spectral neural approximations for models of transcriptional dynamics
Transcriptional systems involving discrete, stochastic events are naturally modeled using Chemical Master Equations (CMEs). These can be solved for microstate probabilities over time and state space for a better understanding of biological rates and system dynamics. However, closed form solutions to CMEs are available in only the simplest cases. Probing systems of higher complexity is challenging due to the computational cost of finding solutions and often compromises accuracy by treating infinite systems as finite. We use statistical understanding of system behavior and the generalizability of neural networks to approximate steady-state joint distribution solutions for a two-species model of the life cycle of RNA. We define a set of kernel functions using moments of the system and learn optimal weights for kernel functions with a neural network trained to minimize statistical distance between approximated and numerically calculated distributions. We show that this method of kernel weight regression (KWR) approximation is as accurate as lower-order generating-function solutions to the system, but faster; KWR approximation reduces the time for likelihood evaluation by several orders of magnitude. KWR also generalizes to produce probability predictions for system rates outside of training sets, thereby enabling efficient transcriptional parameter exploration and system analysis.
2021
Analytic solution of chemical master equations involving gene switching. I: Representation theory and diagrammatic approach to exact solution
The chemical master equation (CME), which describes the discrete and stochastic molecule number dynamics associated with biological processes like transcription, is difficult to solve analytically. It is particularly hard to solve for models involving bursting/gene switching, a biological feature that tends to produce heavy-tailed single cell RNA counts distributions. In this paper, we present a novel method for computing exact and analytic solutions to the CME in such cases, and use these results to explore approximate solutions valid in different parameter regimes, and to compute observables of interest. Our method leverages tools inspired by quantum mechanics, including ladder operators and Feynman-like diagrams, and establishes close formal parallels between the dynamics of bursty transcription, and the dynamics of bosons interacting with a single fermion. We focus on two problems: (i) the chemical birth-death process coupled to a switching gene/the telegraph model, and (ii) a model of transcription and multistep splicing involving a switching gene and an arbitrary number of downstream splicing steps. We work out many special cases, and exhaustively explore the special functionology associated with these problems. This is Part I in a two-part series of papers; in Part II, we explore an alternative solution approach that is more useful for numerically solving these problems, and apply it to parameter inference on simulated RNA counts data.
2020
Special function methods for bursty models of transcription
We explore a Markov model used in the analysis of gene expression, involving the bursty production of pre-mRNA, its conversion to mature mRNA, and its consequent degradation. We demonstrate that the integration used to compute the solution of the stochastic system can be approximated by the evaluation of special functions. Furthermore, the form of the special function solution generalizes to a broader class of burst distributions. In light of the broader goal of biophysical parameter inference from transcriptomics data, we apply the method to simulated data, demonstrating effective control of precision and runtime. Finally, we propose and validate a non-Bayesian approach for parameter estimation based on the characteristic function of the target joint distribution of pre-mRNA and mRNA.