This tutorial demonstrates how to use the STADyUM package to estimate transcription rates from PRO-seq data. STADyUM (Statistical Analysis of Dynamics Under Multiple Measurements) models RNA polymerase dynamics by analyzing reads at promoter-proximal pause sites and gene bodies. Refer to Zhao et al. (2023) and Siepel (2022) for more details on modeling the transcription rates from real PRO-seq data.
In this workflow, we will:
This example uses human PRO-seq data from the human CD4 cell type to demonstrate transcription rate estimation
knitr::opts_chunk$set(dpi = 72, fig.width = 5, fig.height = 4)
library(STADyUM) # For transcription rate estimation
library(GenomicRanges) # For genomic data manipulation
library(dplyr)
library(plyranges)
#library(tidyverse) # For data manipulation and visualization
library(ggplot2) # For plotting
library(pracma)
library(BiocFileCache)# Required by Bioconductor to check for connection
if (!requireNamespace("BiocFileCache", quietly = TRUE) ||
!curl::has_internet()) {
knitr::opts_chunk$set(eval = FALSE)
message("No internet connection; vignette code not evaluated.")
}
bfc <- BiocFileCache(ask = FALSE)
zip_path <- bfcrpath(bfc, "https://zenodo.org/records/20618059/files/rate_estimation_vignette_data.zip")## adding rname 'https://zenodo.org/records/20618059/files/rate_estimation_vignette_data.zip'
unzip(zip_path, exdir = tempdir())
data_dir <- file.path(tempdir(), "vignettes", "lrt_test_data")
# TSN (Transcription Start Site Network) files
# These are GRanges objects containing TSS annotations for each sample
cd4_tsn_file <- file.path(data_dir, 'PROseq-HUMAN-CD4_tsn.RDS')
# Transcript annotation file
# Contains gene models with exon structure for all genes
human_tq_file <- file.path(data_dir, 'HUMAN.RDS')
# PRO-seq signal files (stranded bigWig format)
cd4_bw_plus <- file.path(data_dir, 'PROseq-HUMAN-CD4_plus.bw')
cd4_bw_minus <- file.path(data_dir, 'PROseq-HUMAN-CD4_minus.bw')We begin by loading transcription start site (TSS) annotations from the CD4 cell type. These annotations may contain multiple TSSs per gene (alternative promoters). For this analysis, we select a single representative TSS per gene.
# Load TSN data for both cell types
cd4_tsn <- readRDS(cd4_tsn_file)
# Display structure of TSN data
head(cd4_tsn)## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | score ensembl_gene_id type
## <Rle> <IRanges> <Rle> | <numeric> <character> <character>
## [1] 1 960633 + | 2 ENSG00000187961 multiple
## [2] 1 960649 + | 2 ENSG00000187961 multiple
## [3] 1 966792 + | 4 ENSG00000187583 single
## [4] 1 1000905 + | 3 ENSG00000187608 single
## [5] 1 1020120 + | 3 ENSG00000188157 single
## [6] 1 1232279 + | 40 ENSG00000176022 single
## -------
## seqinfo: 25 sequences from an unspecified genome
When multiple TSSs exist for a single gene, we retain only the most upstream TSS. This ensures consistent TSS usage across the analysis. The “most upstream” is defined as the TSS with the smallest genomic coordinate on the plus strand and the largest coordinate on the minus strand.
# Process CD4 data: select most upstream TSS per gene
cd4_tsn <- STADyUM:::keep_upstream_tss(cd4_tsn)
cat("CD4 TSS sites:", length(cd4_tsn), "\n")## CD4 TSS sites: 9818
We load the full transcript annotation to define gene bodies. From these, we extract the transcription termination sites (TTS) to define the 3’ boundary of gene bodies.
# Load transcript model for all genes
human_tq <- readRDS(human_tq_file)
# Extract gene ranges and reduce overlapping exons
# This creates a single range per gene (union of all exons)
gngrng <- human_tq@transcripts %>%
group_by(ensembl_gene_id) %>%
plyranges::reduce_ranges_directed() %>%
sort()
cat("Number of genes:", length(gngrng), "\n")## Number of genes: 18816
For STADyUM analysis, we define two counting regions: - Pause region: typically near the TSS (promoter-proximal pause site) - Gene body region: the elongating polymerase region
Here, we anchor at the transcription termination site (3’ end) and define regions relative to those anchors.
# Anchor at 3' end (transcription termination site) and set width to 1
# This creates a point coordinate at the TTS for each gene
bw_tts <- gngrng %>%
plyranges::anchor_3p() %>%
mutate(width = 1)
# Build read-count regions for CD4
# Returns list with 'pause' and 'gene_body' GRanges
cd4_count_region <- STADyUM:::build_readcount_regions(cd4_tsn, bw_tts)## Genes kept after length filter: 7868
## CD4 pause regions: 7868
## CD4 gene body regions: 7868
The estimateTranscriptionRates() function applies
STADyUM to count reads in pause and gene body regions. It estimates key
kinetic parameters:
# Estimate transcription rates for CD4
# Input: plus and minus strand bigWig files, pause/gene body regions, sample label
cd4_rate <- estimateTranscriptionRates(
cd4_bw_plus, cd4_bw_minus,
cd4_count_region$pause, cd4_count_region$gene_body,
"CD4"
)##
## Importing bigwig files...
##
## Processing plus and minus strands bigwig...
##
## Summarizing pause and gene body regions...
##
## Generating read counts table...
## estimating rates...
## CD4 rate estimation complete
## CD14 rate estimation complete
This plot shows the relationship between the pause release rate (beta) and transcription activity (chi). Each point represents one gene. Positively correlated genes may indicate coupled regulation of pause release and transcription initiation.
## CD4 genes with estimated rates: 7026
p1 <- plotScatterDensity(cd4_rate, "betaAdp", "chi",
log_x = TRUE, log_y = TRUE,
xlab = expression(log[10] ~ beta),
ylab = expression(log[10] ~ chi),
xlim = c(-6, -1), ylim = c(-4, 1))
if (requireNamespace("ggpubr", quietly = TRUE)) {
p1 <- p1 + scale_color_viridis_c() +
ggpubr::stat_cor(method = "spearman", label.x = -3.5, label.y = -3.8,
size = 5, aes(label = after_stat(r.label)))
}
print(p1)These plots show relationships between pause release rate (beta) and two measures of elongation variability. The first shows beta vs. mean elongation time; the second shows beta vs. elongation time standard deviation.
p1 <- plotScatterDensity(cd4_rate, "betaAdp", "fkMean",
log_x = TRUE,
xlab = expression(log[10] ~ beta),
ylab = expression(mu))
if (requireNamespace("ggpubr", quietly = TRUE)) {
p1 <- p1 + scale_color_viridis_c() +
ggpubr::stat_cor(method = "spearman", label.x = -2.2, label.y = 160,
size = 5, aes(label = after_stat(r.label)))
}
p2 <- plotScatterDensity(cd4_rate, "betaAdp", "fkSD",
log_x = TRUE,
xlab = expression(log[10] ~ beta),
ylab = expression(sigma),
color_var = "sdGroup",
color_values = c(Broad = "#1b9e77", Sharp = "#762a83"),
color_lab = expression(sigma ~ group))
if (requireNamespace("ggpubr", quietly = TRUE)) {
p2 <- p2 +
ggpubr::stat_cor(method = "spearman", label.x = -2.5, label.y = 75,
size = 3, aes(label = after_stat(r.label), group = 1),
color = "black")
}
p2_hist <- ggplot(rates(cd4_rate), aes(x = .data$fkSD)) +
geom_histogram(aes(y = after_stat(density)), bins = 50) +
geom_vline(xintercept = 38.5, linetype = "dashed", color = "black",
linewidth = 0.6) +
coord_flip() +
theme(axis.title = element_blank(), axis.text = element_blank(),
axis.ticks = element_blank())
print(p1)grid::grid.newpage()
grid::pushViewport(grid::viewport(layout = grid::grid.layout(1, 2, widths = grid::unit(c(4, 1), "null"))))
print(p2, vp = grid::viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2_hist, vp = grid::viewport(layout.pos.row = 1, layout.pos.col = 2))p3 <- plotScatterDensity(cd4_rate, "fkMean", "fkSD",
xlab = expression(mu),
ylab = expression(sigma))
if (requireNamespace("ggpubr", quietly = TRUE)) {
p3 <- p3 + scale_color_viridis_c() +
ggpubr::stat_cor(method = "spearman", label.x = 112, label.y = 1,
size = 5, aes(label = after_stat(r.label)))
}
print(p3)This tutorial demonstrated a complete workflow for estimating transcription rates from PRO-seq data using STADyUM:
TSS Identification: We identified a single transcription start site using annotations.
Region Definition: We defined pause (promoter-proximal) and gene body regions as input for rate estimation.
Rate Estimation: STADyUM estimated three key transcription parameters (pause release rate, elongation characteristics, and activity) from the read distribution in these regions.
Visualization: We explored relationships between parameters, revealing how pause release and elongation dynamics are coupled in CD4 cells.
The estimated rates are now ready for downstream comparative analysis (e.g., identifying genes with cell-type-specific transcription dynamics) and integration with other genomic data.
## R version 4.6.0 Patched (2026-05-01 r89994)
## Platform: aarch64-apple-darwin23
## Running under: macOS Tahoe 26.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BiocFileCache_3.3.0 dbplyr_2.5.2 pracma_2.4.6
## [4] ggplot2_4.0.3 plyranges_1.33.0 dplyr_1.2.1
## [7] GenomicRanges_1.65.0 Seqinfo_1.3.0 IRanges_2.47.2
## [10] S4Vectors_0.51.3 BiocGenerics_0.59.7 generics_0.1.4
## [13] STADyUM_1.3.1
##
## loaded via a namespace (and not attached):
## [1] DBI_1.3.0 bitops_1.0-9
## [3] httr2_1.2.2 rlang_1.2.0
## [5] magrittr_2.0.5 otel_0.2.0
## [7] matrixStats_1.5.0 compiler_4.6.0
## [9] RSQLite_3.53.1 vctrs_0.7.3
## [11] pkgconfig_2.0.3 crayon_1.5.3
## [13] fastmap_1.2.0 backports_1.5.1
## [15] XVector_0.53.0 labeling_0.4.3
## [17] utf8_1.2.6 Rsamtools_2.29.0
## [19] rmarkdown_2.31 UCSC.utils_1.9.0
## [21] purrr_1.2.2 bit_4.6.0
## [23] xfun_0.58 cachem_1.1.0
## [25] cigarillo_1.3.0 GenomeInfoDb_1.49.1
## [27] jsonlite_2.0.0 progress_1.2.3
## [29] blob_1.3.0 DelayedArray_0.39.3
## [31] BiocParallel_1.47.0 prettyunits_1.2.0
## [33] broom_1.0.13 parallel_4.6.0
## [35] R6_2.6.1 bslib_0.11.0
## [37] RColorBrewer_1.1-3 rtracklayer_1.73.0
## [39] car_3.1-5 jquerylib_0.1.4
## [41] Rcpp_1.1.1-1.1 SummarizedExperiment_1.43.0
## [43] knitr_1.51 BiocBaseUtils_1.15.1
## [45] Matrix_1.7-5 tidyselect_1.2.1
## [47] dichromat_2.0-0.1 abind_1.4-8
## [49] yaml_2.3.12 codetools_0.2-20
## [51] curl_7.1.0 lattice_0.22-9
## [53] tibble_3.3.1 Biobase_2.73.1
## [55] withr_3.0.2 S7_0.2.2
## [57] evaluate_1.0.5 isoband_0.3.0
## [59] Biostrings_2.81.3 pillar_1.11.1
## [61] ggpubr_0.6.3 filelock_1.0.3
## [63] MatrixGenerics_1.25.0 carData_3.0-6
## [65] ggpointdensity_0.2.1 RCurl_1.98-1.19
## [67] hms_1.1.4 scales_1.4.0
## [69] glue_1.8.1 tools_4.6.0
## [71] BiocIO_1.23.3 data.table_1.18.4
## [73] ggsignif_0.6.4 GenomicAlignments_1.49.0
## [75] XML_3.99-0.23 cowplot_1.2.0
## [77] grid_4.6.0 tidyr_1.3.2
## [79] restfulr_0.0.17 Formula_1.2-5
## [81] cli_3.6.6 rappdirs_0.3.4
## [83] S4Arrays_1.13.0 viridisLite_0.4.3
## [85] gtable_0.3.6 rstatix_0.7.3
## [87] sass_0.4.10 digest_0.6.39
## [89] SparseArray_1.13.2 rjson_0.2.23
## [91] farver_2.1.2 memoise_2.0.1
## [93] htmltools_0.5.9 lifecycle_1.0.5
## [95] httr_1.4.8 bit64_4.8.2
## [97] MASS_7.3-65