This section gives an overview of the operations that can be
performed on a given set of metadata obtained particularly from
data-rich objects such as those obtained from
curatedTCGAData. There are several operations that work
with microRNA, methylation, mutation, and assays that have gene symbol
annotations.
qreduceTCGA
The qreduceTCGA function converts
RaggedExperiment mutation data objects to
RangedSummarizedExperiment using org.Hs.eg.db
and the qreduceTCGA utility function from
RaggedExperiment to summarize ‘silent’ and ‘non-silent’
mutations based on a ‘Variant_Classification’ metadata column in the
original object.
It uses ‘hg19’ transcript database (‘TxDb’) package internally to
summarize regions using qreduceAssay. The current genome
build (‘hg18’) in the data must be translated to ‘hg19’.
In this example, we first set the appropriate build name in the
mutation dataset COAD_Mutation-20160128 according to the NCBI
website and we then use seqlevelsStyle to match the
UCSC style in the chain.
rag <- "COAD_Mutation-20160128"
# add the appropriate genome annotation
genome(coad[[rag]]) <- "NCBI36"
# change the style to UCSC
seqlevelsStyle(rowRanges(coad[[rag]])) <- "UCSC"
# inspect changes
seqlevels(rowRanges(coad[[rag]]))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
## [19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"
## chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11
## "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18"
## chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
## "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18"
## chrX chrY
## "hg18" "hg18"
Now we use liftOver from rtracklayer to
translate ‘hg18’ builds to ‘hg19’ using the chain file obtained via
AnnotationHub. We use a query to find the file. You can
also query with terms such as
c("Homo sapiens", "chain", "hg18", "hg19") in the query
function. We are specifically looking for the chain file
“hg18ToHg19.over.chain.gz”.
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, "hg18ToHg19.over.chain.gz")
## AnnotationHub with 1 record
## # snapshotDate(): 2025-10-29
## # names(): AH14220
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: ChainFile
## # $rdatadateadded: 2014-12-15
## # $title: hg18ToHg19.over.chain.gz
## # $description: UCSC liftOver chain file from hg18 to hg19
## # $taxonomyid: 9606
## # $genome: hg18
## # $sourcetype: Chain
## # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/hg18/liftOver/hg18To...
## # $sourcesize: NA
## # $tags: c("liftOver", "chain", "UCSC", "genome", "homology")
## # retrieve record with 'object[["AH14220"]]'
## downloading 1 resources
## retrieving 1 resource
## loading from cache
ranges19 <- rtracklayer::liftOver(rowRanges(coad[[rag]]), chain)
Note. The same can be done to convert
hg19 to hg38 (the same build that the Genomic
Data Commons uses) with the corresponding chain file as obtained
above.
This will give us a list of ranges, each element corresponding to a
single row in the RaggedExperiment. We remove rows that had
no matches in the liftOver process and replace the ranges
in the original RaggedExperiment with the replacement
method. Finally, we put the RaggedExperiment object back
into the MultiAssayExperiment.
re19 <- coad[[rag]][as.logical(lengths(ranges19))]
ranges19 <- unlist(ranges19)
genome(ranges19) <- "hg19"
rowRanges(re19) <- ranges19
# replacement
coad[["COAD_Mutation-20160128"]] <- re19
rowRanges(re19)
## GRanges object with 62523 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr20 1552407-1552408 +
## [2] chr1 161736152-161736153 +
## [3] chr7 100685895 +
## [4] chr7 103824453 +
## [5] chr7 104783644 +
## ... ... ... ...
## [62519] chr9 36369716 +
## [62520] chr9 37692640 +
## [62521] chr9 6007456 +
## [62522] chrX 123785782 +
## [62523] chrX 51487184 +
## -------
## seqinfo: 24 sequences from hg19 genome; no seqlengths
Now that we have matching builds, we can finally run the
qreduceTCGA function.
coad <- qreduceTCGA(coad, keep.assay = TRUE)
## 24 genes were dropped because they have exons located on both strands of the
## same reference sequence or on more than one reference sequence, so cannot be
## represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
## object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
##
## 'select()' returned 1:1 mapping between keys and columns
## Warning in .normarg_seqlevelsStyle(value): more than one seqlevels style
## supplied, using the 1st one only
## Warning in .normarg_seqlevelsStyle(value): cannot switch some hg19's seqlevels
## from UCSC to NCBI style
symbolsToRanges
In the cases where row annotations indicate gene symbols, the
symbolsToRanges utility function converts genes to genomic
ranges and replaces existing assays with
RangedSummarizedExperiment objects. Gene annotations are
given as ‘hg19’ genomic regions.
## 24 genes were dropped because they have exons located on both strands of the
## same reference sequence or on more than one reference sequence, so cannot be
## represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
## object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## 24 genes were dropped because they have exons located on both strands of the
## same reference sequence or on more than one reference sequence, so cannot be
## represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
## object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 363 sampleMap rows not in names(experiments)
## A MultiAssayExperiment object of 11 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 11:
## [1] COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 136 columns
## [2] COAD_miRNASeqGene-20160128: SummarizedExperiment with 705 rows and 221 columns
## [3] COAD_Mutation-20160128: RaggedExperiment with 62523 rows and 154 columns
## [4] COAD_Methylation_methyl27-20160128: SummarizedExperiment with 27578 rows and 202 columns
## [5] COAD_Methylation_methyl450-20160128: SummarizedExperiment with 485577 rows and 333 columns
## [6] COAD_Mutation-20160128_simplified: RangedSummarizedExperiment with 28618 rows and 154 columns
## [7] COAD_CNASeq-20160128_simplified: RangedSummarizedExperiment with 28618 rows and 136 columns
## [8] COAD_mRNAArray-20160128_ranged: RangedSummarizedExperiment with 14271 rows and 172 columns
## [9] COAD_mRNAArray-20160128_unranged: SummarizedExperiment with 3543 rows and 172 columns
## [10] COAD_RNASeq2GeneNorm-20160128_ranged: RangedSummarizedExperiment with 17112 rows and 191 columns
## [11] COAD_RNASeq2GeneNorm-20160128_unranged: SummarizedExperiment with 3389 rows and 191 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files