STADyUM: Simulating and Analyzing Transcription Dynamics

Rebecca Hassett

2026-06-11

Introduction

STADyUM is an R package for simulating and analyzing transcription dynamics. It provides tools for:

STADyUM seamlessly integrates with the Bioconductor ecosystem by leveraging standard Bioconductor data structures such as GRanges for genomic interval handling and S4Vectors for custom Bioconductor-style data structures. While existing Bioconductor packages like INSPEcT focuses primarily on modeling RNA synthesis, processing, and degradation using both nascent and mature RNA-seq data through systems of differential equations, STADyUM takes a different modeling approach. STADyUM provides a probabilistic framework tailored specifically for nascent RNA data alone (such as PRO-seq, GRO-seq, 4sU-DRB-seq) including functionality for modeling steric hindrance between RNA polymerases and modeling pause-site kinetics explicitly.

The focus of this tutorial will be on using our simulator, SimPol, for simulating RNA Pol II dynamics on a DNA template. SimPol tracks the independent movement of RNAPs along the DNA templates of a large number of cells. It accepts several key user-specified parameters, including the initiation rate, pause-escape rate, a constant or variable elongation rate, the mean and variance of pause sites across cells, as well as the center-to-center spacing constraint between RNAPs, the number of cells being simulated, the gene length, and the total time of transcription. After running for the specified time, SimPol outputs a SimulatePolymerase object that stores matrices at different time points recording the RNAP positions. This is stored in slots positionMatrices and finalPositionMatrix. It also stores an estimated vector of read counts per nucleotide in slot readCounts. More information on the simulator can be found in the sections A simple probabilistic model for transcription initiation, promoter-proximal pausing, and elongation, SimPol Simulator, Generation of synthetic NRS data of Zhao et al. (2023)

Installation

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("STADyUM")

Basic Usage

Simulate Polymerase

(A) Conceptual illustration of model focusing on the kinetic model for RNAP movement on the DNA template. (B) Graphical model representation with unobserved continuous-time Markov chain (Z) and observed read counts (X). Read counts at each site are conditionally independent and Poisson distributed. (C) Design of SimPol simulator tracking the movement of in silico RNAPs across N-bp DNA templates in C cells, then samples the synthetic read counts based on RNAP positions. (D) Example of synthetic nascent RNA sequencing data from SimPol alongside matched real PRO-seq data from the DNAJA1 gene on chromosome 9 of the human genome
(A) Conceptual illustration of model focusing on the kinetic model for RNAP movement on the DNA template. (B) Graphical model representation with unobserved continuous-time Markov chain (Z) and observed read counts (X). Read counts at each site are conditionally independent and Poisson distributed. (C) Design of SimPol simulator tracking the movement of in silico RNAPs across N-bp DNA templates in C cells, then samples the synthetic read counts based on RNAP positions. (D) Example of synthetic nascent RNA sequencing data from SimPol alongside matched real PRO-seq data from the DNAJA1 gene on chromosome 9 of the human genome
simpol <- simulatePolymerase(k=50, ksd=25, kMin=17, kMax=200, geneLen=1950,
alpha=2, beta=0.5, zeta=2000, zetaSd=1000, zetaMin=1500, zetaMax=2500,
cellNum=100, polSize=33, addSpace=17, time=39, timesToRecord=NULL)
#> Total time used for the simulation is 0 mins.

show(simpol)
#> A SimulatePolymerase object with:
#>   - gene length = 1950 
#>   - number of cells = 100 
#>   - pause sites mean = 55.22 
#>   - pause sites std dev = 22.38 
#>   - top 10 most occupied sites across all cells:
#>     position    0: 100 polymerases
#>     position   42: 5 polymerases
#>     position   30: 4 polymerases
#>     position   51: 4 polymerases
#>     position   59: 4 polymerases
#>     position    1: 3 polymerases
#>     position    6: 3 polymerases
#>     position    9: 3 polymerases
#>     position   21: 3 polymerases
#>     position   28: 3 polymerases
#>   - gene body average read counts = 0.05 
#>   - pause region average read counts = 2.33 
#>   - percent of nucleotides with zero reads = 94.5 %
#> 
#> To access the full simulation parameters, use: parameters(object)
#> 
#> To access the all sampled read counts, use: readCounts(object)

readCounts <- readCounts(simpol)

plotCombinedCells(simpol, file="combinedCellsLollipopPlot.png")

Estimates of transcription rates, such as chi and beta, are calculated given Expectation Maximization with likelihood formulas derived in Stationary Distribution and Inference at Steady-State section of Siepel (2022) and stated in Inference at Steady State section of Zhao et al. (2023). Note that chi is the estimator for the average read depth, excluding the pause peak and beta is the estimator for the ratio of chi to the read depth in the pause peak region. The estimator for beta is also given for a model adapted for varying pause sites across cells and is denoted by variable betaAdp whereas the fixed pause site estimate is denoted by betaOrg, refer to Allowing for variation in pause site section of Zhao et al. (2023).

Estimate Transcription Rates from Simulated Data

simRates <- estimateTranscriptionRates(simpol, name="simRatesb0.5k50",
stericHindrance=TRUE)
#> Starting EM algorithm...

show(simRates)
#> A SimulationTranscriptionRates object with:
#>   - Steric hindrance: TRUE 
#>   - Number of Trials with 5000 cells: 50 
#> 
#> Summary statistics for rate estimates:
#>   - chi (gene body RNAP density): 0.28 RNAPs/bp
#>   - betaOrg (ratio of gene body RNAP density to pause region
#>         RNAP density, fixed sites): 0.4874
#>   - betaAdp (ratio of gene body RNAP density to pause region
#>         RNAP density, adapted model): 0.0036
#>   - TSS Read Counts Averaged Across Trials: ~ 114 read
#>         counts
#>   - Gene Body Read Counts Averaged Across Trials: ~ 482 read
#>         counts
#>   - Landing Pad Read Counts Averaged Across Trials: ~ 3747
#>         read counts
#>   - phi (landing pad occupancy): 0.62
#> 
#> EM Algorithm Convergence:
#>   - Trials converged normally: 50/50 (100.0%)
#>   - Trials converged to single site: 0/50 (0.0%)
#>   - Trials reached max iterations without convergence: 0/50
#>     (0.0%)

Likelihood Ratio Tests for Simulated Data

Likelihood ratio tests can also be applied to simulated data. In this example, the simulated rates for the first simpol object is compared to the simulated rates from the second simpol object. As expected given the simpol parameters, the plots show that the second simulated rates have higher beta and higher mean pause sites.


simpol2 <- simulatePolymerase(k=100, ksd=25, kMin=75, kMax=200, geneLen=1950,
alpha=2, beta=10, zeta=2000, zetaSd=1000, zetaMin=1500, zetaMax=2500,
cellNum=100, polSize=33, addSpace=17, time=30, timesToRecord=NULL)
#> Total time used for the simulation is 0 mins.

simRates2 <- estimateTranscriptionRates(simpol2, name="simRatesb10k100",
stericHindrance=TRUE)
#> Starting EM algorithm...

simLRT <- likelihoodRatioTest(simRates, simRates2)
#> Warning: Unknown or uninitialised column: `geneId`.
#> Unknown or uninitialised column: `geneId`.

plotPauseSiteContourMapTwoConditions(simLRT, file="simPauseSitesContourMap.png",
                                      width=7, height=5, dpi=150)

BetaViolinPlot(simLRT, file="simBetaViolinPlot.png", width=5, height=4.5, dpi=150)

Session Info

sessionInfo()
#> R version 4.6.0 Patched (2026-05-01 r89994)
#> Platform: aarch64-apple-darwin23
#> Running under: macOS Tahoe 26.3.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] BiocFileCache_3.3.0  dbplyr_2.5.2         pracma_2.4.6        
#>  [4] ggplot2_4.0.3        plyranges_1.33.0     dplyr_1.2.1         
#>  [7] GenomicRanges_1.65.0 Seqinfo_1.3.0        IRanges_2.47.2      
#> [10] S4Vectors_0.51.3     BiocGenerics_0.59.7  generics_0.1.4      
#> [13] STADyUM_1.3.1       
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.3.0                   bitops_1.0-9               
#>   [3] httr2_1.2.2                 rlang_1.2.0                
#>   [5] magrittr_2.0.5              otel_0.2.0                 
#>   [7] matrixStats_1.5.0           compiler_4.6.0             
#>   [9] RSQLite_3.53.1              systemfonts_1.3.2          
#>  [11] vctrs_0.7.3                 pkgconfig_2.0.3            
#>  [13] crayon_1.5.3                fastmap_1.2.0              
#>  [15] backports_1.5.1             XVector_0.53.0             
#>  [17] labeling_0.4.3              utf8_1.2.6                 
#>  [19] Rsamtools_2.29.0            rmarkdown_2.31             
#>  [21] UCSC.utils_1.9.0            ragg_1.5.2                 
#>  [23] purrr_1.2.2                 bit_4.6.0                  
#>  [25] xfun_0.58                   cachem_1.1.0               
#>  [27] cigarillo_1.3.0             GenomeInfoDb_1.49.1        
#>  [29] jsonlite_2.0.0              progress_1.2.3             
#>  [31] blob_1.3.0                  DelayedArray_0.39.3        
#>  [33] BiocParallel_1.47.0         prettyunits_1.2.0          
#>  [35] broom_1.0.13                parallel_4.6.0             
#>  [37] R6_2.6.1                    bslib_0.11.0               
#>  [39] RColorBrewer_1.1-3          rtracklayer_1.73.0         
#>  [41] car_3.1-5                   jquerylib_0.1.4            
#>  [43] Rcpp_1.1.1-1.1              SummarizedExperiment_1.43.0
#>  [45] knitr_1.51                  BiocBaseUtils_1.15.1       
#>  [47] Matrix_1.7-5                tidyselect_1.2.1           
#>  [49] dichromat_2.0-0.1           abind_1.4-8                
#>  [51] yaml_2.3.12                 codetools_0.2-20           
#>  [53] curl_7.1.0                  lattice_0.22-9             
#>  [55] tibble_3.3.1                Biobase_2.73.1             
#>  [57] withr_3.0.2                 S7_0.2.2                   
#>  [59] evaluate_1.0.5              isoband_0.3.0              
#>  [61] Biostrings_2.81.3           pillar_1.11.1              
#>  [63] ggpubr_0.6.3                filelock_1.0.3             
#>  [65] MatrixGenerics_1.25.0       carData_3.0-6              
#>  [67] ggpointdensity_0.2.1        RCurl_1.98-1.19            
#>  [69] hms_1.1.4                   scales_1.4.0               
#>  [71] glue_1.8.1                  tools_4.6.0                
#>  [73] BiocIO_1.23.3               data.table_1.18.4          
#>  [75] ggsignif_0.6.4              GenomicAlignments_1.49.0   
#>  [77] XML_3.99-0.23               cowplot_1.2.0              
#>  [79] grid_4.6.0                  tidyr_1.3.2                
#>  [81] restfulr_0.0.17             Formula_1.2-5              
#>  [83] cli_3.6.6                   rappdirs_0.3.4             
#>  [85] textshaping_1.0.5           S4Arrays_1.13.0            
#>  [87] viridisLite_0.4.3           gtable_0.3.6               
#>  [89] rstatix_0.7.3               sass_0.4.10                
#>  [91] digest_0.6.39               SparseArray_1.13.2         
#>  [93] rjson_0.2.23                farver_2.1.2               
#>  [95] memoise_2.0.1               htmltools_0.5.9            
#>  [97] lifecycle_1.0.5             httr_1.4.8                 
#>  [99] bit64_4.8.2                 MASS_7.3-65
Siepel, Adam. 2022. “A Unified Probabilistic Modeling Framework for Eukaryotic Transcription Based on Nascent RNA Sequencing Data.” bioRxiv, ahead of print. https://doi.org/10.1101/2021.01.12.426408.
Zhao, Yixin, Lingjie Liu, Rebecca Hassett, and Adam Siepel. 2023. “Model-Based Characterization of the Equilibrium Dynamics of Transcription Initiation and Promoter-Proximal Pausing in Human Cells.” Nucleic Acids Research 51 (21): e106–6. https://doi.org/10.1093/nar/gkad843.