SpaceTrooper is an R/Bioconductor package
for Quality Control (QC) of imaging-based spatial transcriptomics and
proteomics data. It provides multi-platform data harmonization,
cell-level QC, and visualization utilities. The package leverages SpatialExperiment
objects to support data from CosMx,
Xenium, and MERFISH technologies.
In this section, we show how to run the standard
SpaceTrooper workflow on Spatial Transcriptomics data, to
perform QC by computing the Quality Score (QS).
For demonstration purposes, we showcase the
CosMx_DBKero_Tiny dataset as the main tutorial example, a
small subset of the CosMx 1k breast cancer dataset analyzed also in the
paper.
SpaceTrooper provides three different reading functions,
one per each technology, that load data into a
SpatialExperiment object. For more details on how to
correctly specify the inputs, please refer to the SpaceTrooper
utilities vignette.
library(SpaceTrooper)
library(ggplot2)
cosmxFolder <- system.file(file.path("extdata", "CosMx_DBKero_Tiny"),
package="SpaceTrooper")
spe <- readCosmxSPE(cosmxFolder, sampleName="DBKero_CosMx")
spe## class: SpatialExperiment
## dim: 1010 905
## metadata(4): fov_positions fov_dim polygons technology
## assays(1): counts
## rownames(1010): RAMP2 CD83 ... NegPrb09 NegPrb10
## rowData names(0):
## colnames(905): f16_c1 f16_c10 ... f16_c98 f16_c99
## colData names(20): fov cellID ... sample_id cell_id
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : CenterX_global_px CenterY_global_px
## imgData names(1): sample_id
In addition, we provide the code for loading Xenium, and MERFISH example data. Please, note that for these technologies the parameters are configured to automatically load polygons and compute missing metrics, so these steps don’t need to be run again. To proceed with these technologies, skip directly to Add QC metrics.
xeniumFolder <- system.file( "extdata", "Xenium_small", package="SpaceTrooper")
xen <- readXeniumSPE(xeniumFolder, computeMissingMetrics=TRUE,
keepPolygons=TRUE)
merfishFolder <- system.file("extdata", "Merfish_Tiny", package="SpaceTrooper")
mer <- readMerfishSPE(merfishFolder, boundariesType="parquet",
computeMissingMetrics=TRUE, keepPolygons=TRUE)In this section, we show how to plot each cell position onto the map of the Fields of View (FOVs), whose coordinates are provided only for CosMx technology.
IMPORTANT: we have noticed that according to CosMx version, there could be a misalignment of FOVs and cell centroids, that can be easily corrected with a single line of code. Therefore, it is crucial to generate this plot to check for any spatial shift, as such misalignment directly affects the QS computation.
The dataset is a subset with just a single FoV, whose number is displayed at the center. Cell centroids, shown in dark red, are all contained by FoV boundaries, thus no shift correction is needed here.
When an experiment has multiple FoVs, you can see the map and the topological organization of the FoVs, together with their numbers.
If any misalignment is observed, it can be corrected by adjusting the FoV coordinates directly. For instance, if the FoVs are shifted upward by one FoV height (which, for CosMx technology, corresponds to 4,256 pixels), you can correct this by subtracting that value from the FoVs y coordinates:
In this section we show how to load polygons after
SpatialExperiment creation, only for visualization
purposes. They are stored as an sf object within
colData.
This step is not mandatory for CosMx, because the pipeline can be executed even without them.
Please pay attention to any warnings. Cell polygons with fewer than
four vertices cannot be handled by the geometry packages used by
SpaceTrooper. Therefore, the corresponding cells are
discarded from the SpatialExperiment object.
The following sections will work the same way for all types of data.
The spatialPerCellQC function computes additional
metrics per each cell, that are saved inside the
SpatialExperimentand accessible through
colData(spe). It is mandatory to run this function before
computing QS.
The negProbList parameter is, by default, a vector
containing all control probe patterns encountered so far across the
supported technologies. Because these patterns continue to evolve, some
may not yet be included. If you find that your control probe patterns
are missing from the default list, you can define a custom vector, as
shown below.
By default, the function automatically removes 0 count cells, but
this can be handled with the rmZeros parameter.
## Removing 1 cells with 0 counts!
## [1] "fov" "cellID"
## [3] "Area" "AspectRatio"
## [5] "CenterX_local_px" "CenterY_local_px"
## [7] "Width" "Height"
## [9] "Mean.PanCK" "Max.PanCK"
## [11] "Mean.CD68" "Max.CD68"
## [13] "Mean.MembraneStain_B2M" "Max.MembraneStain_B2M"
## [15] "Mean.CD45" "Max.CD45"
## [17] "Mean.DAPI" "Max.DAPI"
## [19] "sample_id" "cell_id"
## [21] "polygons" "sum"
## [23] "detected" "subsets_NegPrb_sum"
## [25] "subsets_NegPrb_detected" "subsets_NegPrb_percent"
## [27] "total" "control_sum"
## [29] "control_detected" "target_sum"
## [31] "target_detected" "CenterX_global_px"
## [33] "CenterY_global_px" "ctrl_total_ratio"
## [35] "log2Ctrl_total_ratio" "CenterX_global_um"
## [37] "CenterY_global_um" "Area_um"
## [39] "dist_border_x" "dist_border_y"
## [41] "dist_border" "log2AspectRatio"
## [43] "SignalDensity" "log2SignalDensity"
Several metrics are added, both derived from expression assay and cell morphology. Some of them are directly used to compute QS:
For a better detailed explanation of the other metrics, please refer to the SpaceTrooper utilities vignette.
computeQCScore function calculates QS per each cell. QS
combines several metrics into a formula:
log2SignalDensity, corresponding to signal density,
Area_um or volume, i.e. cell size,
log2Ctrl_total_ratio, namely background signal and
log2AspectRatio combined with
dist_border, which jointly correspond to the border
effect (only for CosMx technology).
glmnet package is used to estimate the coefficients of
the formula, resulting in a robust score that captures low-quality
cells. During model training, the selected cells are used as good or bad
examples to learn how each term in the formula contributes to cell
quality.
IMPORTANT: please, pay attention to any warning. If a term has too few bad examples (fewer than 0.1% of the total number of cells), it is excluded from the formula and therefore not used in the QS computation.
The QS (stored as QC_score in coldata)
ranges from 0 to 1, with 0 meaning low-quality and 1 high-quality.
We are setting a seed to ensure reproducibility in this tutorial, because there are stochastic processes underlying QS computation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003429 0.786016 0.883948 0.798579 0.936674 0.988138
In this case, all the terms were used as no warning appeared.
Subsequently, it is possible also to assess which cells have a QS
lower than a certain threshold (default is 0.5) with the following
function. It creates a new column called low_qcscore inside
of colData.
##
## FALSE TRUE
## 807 97
Using this threshold, 97 cells are flagged as low-quality. We do not suggest a fixed default threshold, but it is advisable to check the QS distribution before setting any threshold.
SpaceTrooper comes with several functions to view cells and metrics.
To view the distribution of whatever quantitative metric,
plotMetricHist comes in handy.
The QS distribution exhibits a left tail starting around 0.75.
Cell visualization can be obtained by using either centroids (recommended when the dataset has a large number of cells) or polygons.
plotCentroids plots cell centroids that can be colored
by a certain metric contained in the colData slot, by using
the colour_by parameter.
Additionally, if you have a palette column in colData,
containing colors for each cell, it can be given to palette
parameter, so that it automatically matches the column passed in
colour_by. As an example, we are using the cell types
obtained as described in the paper
with their own color palette.
labf <- system.file(file.path("extdata", "CosMx_DBKero_Tiny",
"labels_tiny.tsv"), package="SpaceTrooper")
labs <- read.table(file=labf, sep="\t", header=TRUE)
spe$labels <- as.factor(labs[match(spe$cell_id, labs$cell_id),]$label)
spe$labels_colors <- as.factor(labs[match(spe$cell_id, labs$cell_id),]$lab_color)
plotCentroids(spe, colourBy="labels", size=3, palette="labels_colors")Cell centroids colored by cell types allow to view the spatial distribution of cell populations in the sample.
When possible, polygon visualization gives a better overview of the
cells’ spatial organization and morphological characteristics. Polygons
are stored as an sf object within colData, so
they can be viewed using standard ggplot2 functions. For
CosMx technology, the polygons must be explicitly loaded as described in
Load polygons. SpaceTrooper provides
plotPolygons function that works just like
plotCentroids but takes polygons instead of centroids.
We can see in yellow and darkviolet that
there are few cells with extreme values of either
log2SignalDensity, Area_um,
log2Ctrl_total_ratio and log2AspectRatio.
Since all plotting functions are based on ggplot2, you
can easily customize the graphical outputs by adding standard ggplot2
components.
We can see that the QS is able to detect both the aspects highlighted
by log2SignalDensity,Area_um,
log2Ctrl_total_ratio or log2AspectRatio. Cells
that showed either lower signal density, bigger size, higher background
signal or border effect also display low QS (darker color).
It’s up to the user to choose an appropriate threshold to flag cells according to the observed QS distribution.
plotPolygons(spe, colourBy="low_qcscore") +
scale_fill_manual(values=c("TRUE"="red", "FALSE" = "#c0c8cf"))You can reruncomputeQCScoreFlags to check how many cells
would be flagged using another threshold.
##
## FALSE TRUE
## 711 193
Using this threshold, 193 cells are flagged as low-quality.
plotPolygons(spe, colourBy="low_qcscore") +
scale_fill_manual(values=c("TRUE"="red", "FALSE" = "#c0c8cf"))The threshold is more stringent and more cells are flagged compared to the previous one. It’s up to you to choose a threshold that better suits your analysis needs.
In this vignette, we explored the main functionalities of the
SpaceTrooper package for imaging-based spatial omics data
QC.
For further insights on SpaceTrooper package usage, please refer to the SpaceTrooper utilities vignette.
Main steps shown are:
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_4.0.2 SpaceTrooper_1.1.6
## [3] SpatialExperiment_1.21.0 SingleCellExperiment_1.33.0
## [5] SummarizedExperiment_1.41.1 Biobase_2.71.0
## [7] GenomicRanges_1.63.1 Seqinfo_1.1.0
## [9] IRanges_2.45.0 S4Vectors_0.49.0
## [11] BiocGenerics_0.57.0 generics_0.1.4
## [13] MatrixGenerics_1.23.0 matrixStats_1.5.0
## [15] BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 gridExtra_2.3
## [3] rlang_1.1.7 magrittr_2.0.4
## [5] scater_1.39.2 e1071_1.7-17
## [7] compiler_4.5.2 DelayedMatrixStats_1.33.0
## [9] SpatialExperimentIO_1.3.0 sfheaders_0.4.5
## [11] vctrs_0.7.1 shape_1.4.6.1
## [13] pkgconfig_2.0.3 fastmap_1.2.0
## [15] backports_1.5.0 magick_2.9.0
## [17] XVector_0.51.0 labeling_0.4.3
## [19] scuttle_1.21.0 rmarkdown_2.30
## [21] ggbeeswarm_0.7.3 purrr_1.2.1
## [23] bit_4.6.0 xfun_0.56
## [25] glmnet_4.1-10 cachem_1.1.0
## [27] beachmat_2.27.2 jsonlite_2.0.0
## [29] rhdf5filters_1.23.3 DelayedArray_0.37.0
## [31] Rhdf5lib_1.33.0 BiocParallel_1.45.0
## [33] irlba_2.3.7 broom_1.0.12
## [35] parallel_4.5.2 R6_2.6.1
## [37] bslib_0.10.0 RColorBrewer_1.1-3
## [39] limma_3.67.0 car_3.1-5
## [41] jquerylib_0.1.4 iterators_1.0.14
## [43] Rcpp_1.1.1 assertthat_0.2.1
## [45] knitr_1.51 R.utils_2.13.0
## [47] splines_4.5.2 Matrix_1.7-4
## [49] tidyselect_1.2.1 viridis_0.6.5
## [51] abind_1.4-8 yaml_2.3.12
## [53] codetools_0.2-20 lattice_0.22-7
## [55] tibble_3.3.1 withr_3.0.2
## [57] S7_0.2.1 evaluate_1.0.5
## [59] sf_1.0-24 survival_3.8-6
## [61] units_1.0-0 proxy_0.4-29
## [63] pillar_1.11.1 BiocManager_1.30.27
## [65] ggpubr_0.6.2 carData_3.0-6
## [67] KernSmooth_2.23-26 foreach_1.5.2
## [69] sparseMatrixStats_1.23.0 scales_1.4.0
## [71] class_7.3-23 glue_1.8.0
## [73] maketools_1.3.2 tools_4.5.2
## [75] BiocNeighbors_2.5.3 robustbase_0.99-7
## [77] sys_3.4.3 data.table_1.18.2.1
## [79] ScaledMatrix_1.19.0 locfit_1.5-9.12
## [81] ggsignif_0.6.4 buildtools_1.0.0
## [83] cowplot_1.2.0 rhdf5_2.55.13
## [85] grid_4.5.2 tidyr_1.3.2
## [87] DropletUtils_1.31.0 edgeR_4.9.2
## [89] beeswarm_0.4.0 BiocSingular_1.27.1
## [91] HDF5Array_1.39.0 vipor_0.4.7
## [93] rsvd_1.0.5 Formula_1.2-5
## [95] cli_3.6.5 viridisLite_0.4.3
## [97] S4Arrays_1.11.1 arrow_23.0.0.1
## [99] dplyr_1.2.0 DEoptimR_1.1-4
## [101] gtable_0.3.6 R.methodsS3_1.8.2
## [103] rstatix_0.7.3 sass_0.4.10
## [105] digest_0.6.39 classInt_0.4-11
## [107] ggrepel_0.9.6 SparseArray_1.11.10
## [109] dqrng_0.4.1 rjson_0.2.23
## [111] farver_2.1.2 htmltools_0.5.9
## [113] R.oo_1.27.1 lifecycle_1.0.5
## [115] h5mread_1.3.1 statmod_1.5.1
## [117] bit64_4.6.0-1