scater 1.6.2
This document provides details on the import of single-cell data into a SingleCellExperiment
object for use with the scater
package (frequent use case) and how to quantify single-cell gene expression from within an R session using the popular RNA-seq quantification tools Salmon
and kallisto
.
tximport
Lior Pachter’s group at Berkeley has recently released a new software tool called kallisto for rapid quantification of transcript abundance from RNA-seq data. Similarly, Rob Patro at Stony Brook University and colleagues have released Salmon, a similarly fast and accurate RNA-seq quantification tool. In scater
, a couple of wrapper functions to kallisto
and Salmon
enable easy and extremely fast quantification of transcript abundance from within R.
For simplicity, examples are shown below demonstrating the use of kallisto
, but the same could be done with Salmon
using almost identical interfaces (replacing runKallisto
with runSalmon
, readKallistoResults
with readSalmonResults
and so on).
################################################################################
### Tests and Examples
# Example if in the kallisto/test directory
setwd("/home/davis/kallisto/test")
kallisto_log <- runKallisto("targets.txt", "transcripts.idx", single_end=FALSE,
output_prefix="output", verbose=TRUE, n_bootstrap_samples=10)
sce_test <- readKallistoResults(kallisto_log, read_h5=TRUE)
sce_test
An example using real data from a project investigating cell cycle. Note that this analysis is not ‘live’ (the raw data are not included in the package), but it demonstrates what can be done with scater
and kallisto
.
setwd("/home/davis/021_Cell_Cycle/data/fastq")
system("wc -l targets.txt")
ave_frag_len <- mean(c(855, 860, 810, 760, 600, 690, 770, 690))
kallisto_test <- runKallisto("targets.txt",
"Mus_musculus.GRCm38.rel79.cdna.all.ERCC.idx",
output_prefix="kallisto_output_Mmus", n_cores=12,
fragment_length=ave_frag_len, verbose=TRUE)
sce_kall_mmus <- readKallistoResults(kallisto_test, read_h5=TRUE)
sce_kall_mmus
sce_kall_mmus <- readKallistoResults(kallisto_test)
sce_kall_mmus <- getBMFeatureAnnos(sce_kall_mmus)
head(fData(sce_kall_mmus))
head(pData(sce_kall_mmus))
sce_kall_mmus[["start_time"]]
counts(sce_kall_mmus)[sample(nrow(sce_kall_mmus), size=15), 1:6]
## Summarise expression at the gene level
sce_kall_mmus_gene <- summariseExprsAcrossFeatures(
sce_kall_mmus, exprs_values="tpm", summarise_by="feature_id")
datatable(fData(sce_kall_mmus_gene))
sce_kall_mmus_gene <- getBMFeatureAnnos(
sce_kall_mmus_gene, filters="ensembl_gene_id",
attributes=c("ensembl_gene_id", "mgi_symbol", "chromosome_name",
"gene_biotype", "start_position", "end_position",
"percentage_gc_content", "description"),
feature_symbol="mgi_symbol", feature_id="ensembl_gene_id",
biomart="ensembl", dataset="mmusculus_gene_ensembl")
datatable(fData(sce_kall_mmus_gene))
## Add gene symbols to featureNames to make them more intuitive
new_feature_names <- featureNames(sce_kall_mmus_gene)
notna_mgi_symb <- !is.na(fData(sce_kall_mmus_gene)$mgi_symbol)
new_feature_names[notna_mgi_symb] <- fData(sce_kall_mmus_gene)$mgi_symbol[notna_mgi_symb]
notna_ens_gid <- !is.na(fData(sce_kall_mmus_gene)$ensembl_gene_id)
new_feature_names[notna_ens_gid] <- paste(new_feature_names[notna_ens_gid],
fData(sce_kall_mmus_gene)$ensembl_gene_id[notna_ens_gid], sep="_")
sum(duplicated(new_feature_names))
featureNames(sce_kall_mmus_gene) <- new_feature_names
head(featureNames(sce_kall_mmus_gene))
tail(featureNames(sce_kall_mmus_gene))
sum(duplicated(fData(sce_kall_mmus_gene)$mgi_symbol))