CatsCradle SingleCellExperiment Quick Start

Clustering and annotation of genes

A typical analysis of scRNA-seq data involves dimensionality reduction (PCA, UMAP, tSNE) and the clustering of cells using the Louvain algorithm. All of this is done based on the similarities and differences in the genes cells express. Here we see a UMAP where the points are cells, coloured by their clusters.

library(SingleCellExperiment,quietly=TRUE)
library(CatsCradle,quietly=TRUE)
library(ggplot2,quietly=TRUE)
getExample = make.getExample()
S_sce = getExample('S_sce')
#> Warning: Layer 'counts' is empty
#> Warning: Layer 'scale.data' is empty
df = as.data.frame(reducedDim(S_sce,'UMAP'))
df$seurat_clusters = S_sce$seurat_clusters
g = ggplot(df,aes(x=UMAP_1,y=UMAP_2,color=seurat_clusters)) +
    geom_point() +
    ggtitle('SingleCellExperiment of cells colored by seurat_cluster')
print(g)

By transposing the expression matrix, we can use the same technology to produce an object where the samples are the genes and the features are cells. Genes now have their own UMAP and their own clusters. This is all done on the basis of the similarities and differences in the cells they are expressed in.

STranspose_sce = getExample('STranspose_sce')
print(STranspose_sce)
#> class: SingleCellExperiment 
#> dim: 540 2000 
#> metadata(0):
#> assays(3): counts logcounts scaledata
#> rownames(540): AAAGAACTCATCGCTC-1-1 AACAAAGTCGTAGGGA-1-1 ...
#>   TTCTTGATCTGAACGT-1-8 TTTCAGTAGCGTCTCG-1-8
#> rowData names(0):
#> colnames(2000): Ctsb H2-Ab1 ... Phkg1 Gm9530
#> colData names(6): orig.ident nCount_RNA ... seurat_clusters ident
#> reducedDimNames(2): PCA UMAP
#> mainExpName: RNA
#> altExpNames(0):
df = as.data.frame(reducedDim(STranspose_sce,'UMAP'))
df$seurat_clusters = STranspose_sce$seurat_clusters
h = ggplot(df,aes(x=umap_1,y=umap_2,color=seurat_clusters)) +
    geom_point() +
    ggtitle('SingleCellExperiment of genes colored by cluster')
print(h)

Gene clusters vs. cell clusters

There are a number of ways of investigating the relationship between gene clusters and cell clusters. One is by computing the average expression of each gene cluster in each cell cluster; this can be displayed as a heat map or a Sankey graph - the cat’s cradle of our CatsCradle. (See getAverageExpressionMatrix() and sankeyFromMatrix() in the CatsCradle vignette CatsCradle.)

Spatial co-location of genes on gene UMAP

Gene sets such as Hallmark tend to localise on gene UMAP, though they are not necessarily confined to specific gene clusters. Here we show HALLMARK_OXIDATIVE_PHOSPHORYLATION (subset to the genes in our SingleCellExperiment object) superimposed in green and black on the gene cluster UMAP.

hallmark = getExample('hallmark')
h = 'HALLMARK_OXIDATIVE_PHOSPHORYLATION'

idx = colnames(STranspose_sce) %in% hallmark[[h]]

k = ggplot(data=df,aes(x=umap_1,y=umap_2,color=seurat_clusters)) +
    geom_point() + 
    geom_point(data=df[idx,],aes(x=umap_1,y=umap_2),color='black',size=2.7) +
    geom_point(data=df[idx,],aes(x=umap_1,y=umap_2),color='green') +
    ggtitle(paste(h,'\non gene clusters'))

print(k)

pValue = getObjectSubsetClusteringPValue(STranspose_sce,idx)
pValue
#> [1] 0.001

The p-value here is limited by the default number of randomisation trials in getSubsetClusteringPValue(). Of the 50 Hallmark gene sets, 31 have clustering p-values less than 0.05.

The computation of these p-values is ultimately carried out by medianComplementPValue(). We wish to determine whether a subset X of a set of points S is randomly scattered across S or is somehow localised. To do this we consider the complement C of X in S. If X is “spread out” randomly across S, every point of C will be close to some point in X. Accordingly, the median distance from points in C to their nearest point in X will be small. If on the contrary, X is localised, many points in C will be distant from X, thereby producing a larger median complement distance. We produce a p-value by comparing the median complement distance for X to those for randomised sets of the same size as X.

Gene annotation

This allows one to predict functions of a given gene by looking at annotations of its neighbouring genes. If a gene has its own annotation and also has neighbours which have annotations, we can compare the actual annotation to the predicted annotation. These predictions perform well above chance as described in the section Predicting gene function of the CatsCradle vignetteCatsCradle .

Analysis of spatial transcriptomic data

CatsCradle provides extensive tools for the analysis of spatial transcriptomics data. Here we see cell type plotted on the cell centroids of our example data.

library(SpatialExperiment)
X_sce = getExample('X_sce')
cellCoords = as.data.frame(spatialCoords(X_sce))
cellCoords$cluster = X_sce$cluster
ell = ggplot(cellCoords,aes(x=x,y=y,color=cluster)) +
      geom_point() +
      ggtitle('Spatial coordinates colored by cell type')
print(ell)

With Moran’s I, we can see spatial autocorrelation of gene expression and compute p-values for this. See CatsCradle spatial vignette CatsCradleSpatial.

Neighbourhoods

The key concept in our analysis of this kind of data is the neighbourhood. In general, a neighbourhood is a contiguous set of cells. In all of our use cases, each neighbourhood is set of cells around a given cell. Accordingly, a neighbourhood can be referred to by the name of the cell at its centre.

The simplest type of neighbourhood arises from Delaunay triangulation where each neighbourhood consists of a cell and its immediate neighbours. This is an undirected graph in which each cell has an edge connecting it to each of these neighbours.

delaunayNeighbours = getExample('delaunayNeighbours')
head(delaunayNeighbours,10)
#>    nodeA nodeB
#> 1  16307 16316
#> 2  16314 16317
#> 3  16296 16299
#> 4  16299 16307
#> 5  16309 16316
#> 6  16309 16314
#> 7  12028 12032
#> 8  12032 16914
#> 9  12032 16257
#> 10  2975  3339

We can also make extended neighbourhoods, e.g., by finding the neighbours of each cell, their neighbours, and their neighbours. getExtendedNBHDs() produces a list characterising these neighbours by their combinatorial radius and collapseExtendedNBHDs() collapses these so that all the extended neighbours of each cell are now treated as neighbours.

In addition, neighbourhoods can be found on the basis of Euclidean distance.

For each cell type A, we can ask “What types of cells are found around cells of type A?” We can display the answer in a directed graph. Here we are using an extended neighbourhood. A fat arrow from type A to type B indicates that neighbourhoods around cells of type A have a high proportion of cells of type B. Here we only display an arrow from A to B when cells of type B compose at least 5 percent of the cells in neighbourhoods around type A.

NBHDByCTMatrixExtended = getExample('NBHDByCTMatrixExtended')
#> radius 2
#> radius 3
#> radius 4
clusters = getExample('clusters')
colours = getExample('colours')
cellTypesPerCellTypeMatrixExtended =
      computeCellTypesPerCellTypeMatrix(NBHDByCTMatrixExtended,clusters)

cellTypesPerCellTypeGraphFromCellMatrix(cellTypesPerCellTypeMatrixExtended,
                                        minWeight = 0.05, colours = colours)

#> IGRAPH ab39f20 DNW- 16 62 -- 
#> + attr: coords (g/n), name (v/c), color (v/c), weight (e/n), width
#> | (e/n)
#> + edges from ab39f20 (vertex names):
#>  [1] 0_Oligo->3_Endo       0_Oligo->10_Astro     0_Oligo->11_Oligo    
#>  [4] 0_Oligo->14_Ependymal 3_Endo ->0_Oligo      3_Endo ->5_Astro     
#>  [7] 3_Endo ->6_VLMC       3_Endo ->10_Astro     3_Endo ->11_Oligo    
#> [10] 3_Endo ->14_Ependymal 4_Astro->0_Oligo      4_Astro->5_Astro     
#> [13] 4_Astro->6_VLMC       4_Astro->7_Granule    4_Astro->10_Astro    
#> [16] 4_Astro->11_Oligo     4_Astro->14_Ependymal 4_Astro->20_VLMC     
#> [19] 5_Astro->6_VLMC       5_Astro->11_Oligo     6_VLMC ->10_Astro    
#> + ... omitted several edges

We can also test for statistical significance for the enrichment of a given cell type in the neighbourhoods around another cell type (see CatsCradleSpatial).

Neighourhood objects.

Each neighbourhood is the neighbourhood of a cell. Accordingly, neighbourhoods naturally want to be encoded into single cell objects. There are two ways to do this.

Neighbourhoods and cell types

Just as cells express genes, with poetic license we can say that neighbourhoods “express” cell types. A cell types by neighbourhoods matrix gives the counts of cells of each type in each neighbourhood. This can be taken as the counts matrix for a SingleCellExperiment object. We can then use the Louvain algorithm to cluster neighbourhoods into neighbourhood types. If we like, we can display these on a neighbourhood UMAP.

Here we attach these neighbourhood_clusters from a neighbourhood object (created by computeNBHDVsCTObject()) to our original spatial object and visualize the neighbourhood clusters on the tissue coordinates. (Again, we are using extended neighbourhoods.)

NBHDByCTSingleCellExtended_sce = getExample('NBHDByCTSingleCellExtended_sce')
#> Warning: Feature names cannot have underscores ('_'), replacing with dashes
#> ('-')
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
#> Warning in svd.function(A = t(x = object), nv = npcs, ...): You're computing
#> too large a percentage of total singular values, use a standard svd instead.
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#> 
#> Number of nodes: 4261
#> Number of edges: 105168
#> 
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.9694
#> Number of communities: 8
#> Elapsed time: 0 seconds
cellCoords$neighbourhood_clusters =
  NBHDByCTSingleCellExtended_sce$neighbourhood_clusters
nbhdPlot = ggplot(cellCoords,aes(x=x,y=y,color=neighbourhood_clusters)) +
           geom_point() +
       ggtitle('Cell coordinates colored by neighbourhood_clusters')
print(nbhdPlot)

Aggregate gene expression

As well as expressing cell types, neighbourhoods also express genes. We can take the gene expression profile of a neighbourhood to be the total aggregate expression of each gene across all cells in the neighbourhood. This produces a genes by neighbourhoods count matrix. The function aggregateGeneExpression() takes as arguments an underlying Seurat object and a set of neighbourhoods and produces a Seurat object where the rows are the genes of the underlying Seurat object and the columns are the neighbourhoods. Since each neighbourhood corresponds to a cell in our usage, this is formally indistinguishable from a normal Seurat object. However, to avoid confusion, we have labelled the clusters as aggregation_clusters rather than the standard seurat_clusters.

This enables all the usual analyses: dimension reduction and Louvain clustering (here giving another clustering of the neighbourhoods). In addition, we can find the marker genes which characterise the transcriptomic differences between the neighbourhood types.

extendedNeighbours = getExample('extendedNeighbours')
smallXenium = getExample('smallXenium')
agg_sce = aggregateGeneExpression(smallXenium,extendedNeighbours,
                                    verbose=FALSE,
                    returnType='sce')
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
cellCoords$aggregation_clusters = agg_sce$aggregation_clusters
aggPlot = ggplot(cellCoords,aes(x=x,y=y,color=aggregation_clusters)) +
           geom_point() +
       ggtitle('Cell coordinates colored by aggregation_clusters')
print(aggPlot)

Notice that since our neighbourhoods are all indexed by cells, we can compare the different methods of clustering neighbourhoods. Here we give a heatmap comparing the clustering based on cell types in our extended neighbourhoods to that based on aggregate gene expression. We see, for example, that aggregation clusters 7, 13, 14, 16 and 17 are subsumed into cell type neighbourhood cluster 0. More generally, we see that clustering based on aggregate gene expression tends to subdivide that based on the cell types in each neighbourhood.

library(pheatmap)
tab = table(cellCoords[,c('aggregation_clusters','neighbourhood_clusters')])
                             
M = matrix(as.numeric(tab),nrow=nrow(tab))
rownames(M) = paste('aggregation_cluster',rownames(tab))
colnames(M) = paste('nbhd_cluster',colnames(tab))
for(i in 1:nrow(M)) M[i,] = M[i,] / sum(M[i,])
pheatmap(M)

Another way of comparing clusterings is via the adjusted Rand index. This compares two classifications of the same set of objects. It can vary between -1 and 1. It takes the value 1 when these classifications denote the same subsets and 0 when they agree at a level no better than chance.

library(fossil,quietly=TRUE)
adjustedRandIndex = adj.rand.index(agg_sce$aggregation_clusters,
                       NBHDByCTSingleCellExtended_sce$neighbourhood_clusters)
adjustedRandIndex
#> [1] 0.2198536

Ligand-receptor analysis

The Delaunay triangulation gives us an opportunity to look at the ligand-receptor interactions between neighbouring cells. We use the Nichenetr ligand-receptor pairs to discover the pairs in our gene panel. Nichnetr lists over eleven thousand ligand-receptor pairs. Of these 28 appear in our example panel.

The main function for carrying out ligand-receptor analysis is performLigandReceptorAnalysis(). This produces a list with multiple entries. These include a data frame interactionsOnEdges. Here the first two columns are nodeA and nodeB which give the edge in question and next two columns give their corresponding clusters. The remaining columns are logicals, each corresponding to a particular ligand-receptor pair and indicating whether that pair is present on that edge. Notice that while the neighbour-neighbour relationship in the Delaunay triangulation is undirected, the ligand-receptor relationship is directed. Accordingly, in interactionsOnEdges, a given edge appears as both the pair cell A - cell B and as the pair cell B - cell A.

ligandReceptorResults = getExample('ligandReceptorResults')
ligandReceptorResults$interactionsOnEdges[1:10,1:10]
#>        clusterA     clusterB nodeA nodeB Pecam1_Pecam1 Cdh4_Cdh4 Col1a1_Cd44
#> 1        6_VLMC     11_Oligo 16307 16316          TRUE     FALSE       FALSE
#> 2       5_Astro      5_Astro 16314 16317         FALSE     FALSE       FALSE
#> 3        6_VLMC       6_VLMC 16296 16299         FALSE     FALSE       FALSE
#> 4        6_VLMC       6_VLMC 16299 16307          TRUE     FALSE       FALSE
#> 5       5_Astro     11_Oligo 16309 16316         FALSE      TRUE       FALSE
#> 6       5_Astro      5_Astro 16309 16314         FALSE     FALSE       FALSE
#> 7  18_Pyramidal 18_Pyramidal 12028 12032         FALSE     FALSE       FALSE
#> 8  18_Pyramidal    7_Granule 12032 16914         FALSE     FALSE       FALSE
#> 9  18_Pyramidal    7_Granule 12032 16257         FALSE     FALSE       FALSE
#> 10      4_Astro    7_Granule  2975  3339         FALSE     FALSE       FALSE
#>    Fn1_Cd44 Spp1_Cd44 Cort_Npy2r
#> 1     FALSE     FALSE      FALSE
#> 2     FALSE     FALSE      FALSE
#> 3     FALSE     FALSE      FALSE
#> 4     FALSE     FALSE      FALSE
#> 5     FALSE     FALSE      FALSE
#> 6     FALSE     FALSE      FALSE
#> 7     FALSE     FALSE      FALSE
#> 8     FALSE     FALSE      FALSE
#> 9     FALSE     FALSE      FALSE
#> 10    FALSE     FALSE      FALSE

Other fields are downstream of this data frame. For example, pValue tells whether a given ligand-receptor pair is statistically significantly enriched in the interactions between two specific cell types. For further details see the section Ligand receptor analysis in the Cats Cradle Spatial vignette CatsCradleSpatial.

CatsCradle, Seurat objects, SingleCellExperiment objects and

SpatialExperiment objects

CatsCradle provides rudimentary support for the Bioconductor classes SingleCellExperiment and its derived class SpatialExperiment.

Every function which accepts a Seurat object as input can also accept a SingleCellExperiment in its place. Any function which returns a Seurat object can also return a SingleCellExperiment, in one case, a SpatialExperiment, through the use of the parameter returnType. In each case, the internals of the functions are manipulating Seurat objects. Conversion to and from these is performed upon entry to and return from the function and these coversions are rather lightweight in nature. They have slightly more functionality than as.Seurat and as.SingleCellExperiment. In addition to invoking these functions, they carry over the nearest neighbour graphs, and in one case carry over the tissue coordinates from a spatial Seurat object to a SpatialExperiment object. Because of the shallow nature of these conversions, other information may be lost.

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SpatialExperiment_1.16.0    SingleCellExperiment_1.28.1
#>  [3] SummarizedExperiment_1.36.0 Biobase_2.66.0             
#>  [5] GenomicRanges_1.58.0        GenomeInfoDb_1.42.3        
#>  [7] IRanges_2.40.1              S4Vectors_0.44.0           
#>  [9] BiocGenerics_0.52.0         MatrixGenerics_1.18.1      
#> [11] matrixStats_1.5.0           fossil_0.4.0               
#> [13] shapefiles_0.7.2            foreign_0.8-88             
#> [15] maps_3.4.2.1                tictoc_1.2.1               
#> [17] pheatmap_1.0.12             ggplot2_3.5.1              
#> [19] Seurat_5.2.1                SeuratObject_5.0.2         
#> [21] sp_2.2-0                    CatsCradle_1.0.1           
#> 
#> loaded via a namespace (and not attached):
#>   [1] RcppAnnoy_0.0.22        splines_4.4.2           later_1.4.1            
#>   [4] bitops_1.0-9            tibble_3.2.1            polyclip_1.10-7        
#>   [7] fastDummies_1.7.5       lifecycle_1.0.4         globals_0.16.3         
#>  [10] lattice_0.22-6          MASS_7.3-64             magrittr_2.0.3         
#>  [13] plotly_4.10.4           sass_0.4.9              rmarkdown_2.29         
#>  [16] jquerylib_0.1.4         yaml_2.3.10             httpuv_1.6.15          
#>  [19] sctransform_0.4.1       spam_2.11-1             spatstat.sparse_3.1-0  
#>  [22] reticulate_1.40.0       cowplot_1.1.3           pbapply_1.7-2          
#>  [25] RColorBrewer_1.1-3      abind_1.4-8             zlibbioc_1.52.0        
#>  [28] Rtsne_0.17              purrr_1.0.4             msigdbr_7.5.1          
#>  [31] RCurl_1.98-1.16         pracma_2.4.4            GenomeInfoDbData_1.2.13
#>  [34] ggrepel_0.9.6           irlba_2.3.5.1           listenv_0.9.1          
#>  [37] spatstat.utils_3.1-2    BiocStyle_2.34.0        goftest_1.2-3          
#>  [40] RSpectra_0.16-2         spatstat.random_3.3-2   fitdistrplus_1.2-2     
#>  [43] parallelly_1.42.0       codetools_0.2-20        DelayedArray_0.32.0    
#>  [46] tidyselect_1.2.1        UCSC.utils_1.2.0        farver_2.1.2           
#>  [49] spatstat.explore_3.3-4  jsonlite_1.8.9          progressr_0.15.1       
#>  [52] ggridges_0.5.6          survival_3.8-3          tools_4.4.2            
#>  [55] ica_1.0-3               Rcpp_1.0.14             glue_1.8.0             
#>  [58] gridExtra_2.3           SparseArray_1.6.1       xfun_0.50              
#>  [61] EBImage_4.48.0          dplyr_1.1.4             withr_3.0.2            
#>  [64] BiocManager_1.30.25     fastmap_1.2.0           digest_0.6.37          
#>  [67] R6_2.6.0                mime_0.12               colorspace_2.1-1       
#>  [70] networkD3_0.4           scattermore_1.2         tensor_1.5             
#>  [73] jpeg_0.1-10             spatstat.data_3.1-4     tidyr_1.3.1            
#>  [76] generics_0.1.3          data.table_1.16.4       httr_1.4.7             
#>  [79] htmlwidgets_1.6.4       S4Arrays_1.6.0          uwot_0.2.2             
#>  [82] pkgconfig_2.0.3         gtable_0.3.6            rdist_0.0.5            
#>  [85] lmtest_0.9-40           XVector_0.46.0          htmltools_0.5.8.1      
#>  [88] dotCall64_1.2           fftwtools_0.9-11        scales_1.3.0           
#>  [91] png_0.1-8               spatstat.univar_3.1-1   geometry_0.5.2         
#>  [94] knitr_1.49              reshape2_1.4.4          rjson_0.2.23           
#>  [97] nlme_3.1-167            magic_1.6-1             cachem_1.1.0           
#> [100] zoo_1.8-12              stringr_1.5.1           KernSmooth_2.23-26     
#> [103] parallel_4.4.2          miniUI_0.1.1.1          RcppZiggurat_0.1.6     
#> [106] pillar_1.10.1           grid_4.4.2              vctrs_0.6.5            
#> [109] RANN_2.6.2              promises_1.3.2          xtable_1.8-4           
#> [112] cluster_2.1.8           evaluate_1.0.3          magick_2.8.5           
#> [115] cli_3.6.4               locfit_1.5-9.11         compiler_4.4.2         
#> [118] rlang_1.1.5             crayon_1.5.3            future.apply_1.11.3    
#> [121] labeling_0.4.3          plyr_1.8.9              stringi_1.8.4          
#> [124] viridisLite_0.4.2       deldir_2.0-4            babelgene_22.9         
#> [127] munsell_0.5.1           lazyeval_0.2.2          tiff_0.1-12            
#> [130] spatstat.geom_3.3-5     Matrix_1.7-2            RcppHNSW_0.6.0         
#> [133] patchwork_1.3.0         future_1.34.0           shiny_1.10.0           
#> [136] ROCR_1.0-11             Rfast_2.1.4             igraph_2.1.4           
#> [139] RcppParallel_5.1.10     bslib_0.9.0