ATACseqQC 1.2.2
Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) is an alternative or complementary technique to MNase-seq, DNase-seq, and FAIRE-seq for chromatin accessibility analysis. The results obtained from ATAC-seq are similar to those from DNase-seq and FAIRE-seq. ATAC-seq is gaining popularity because, compared to other techniques, it does not require cross-linking, has a higher signal-to-noise ratio, requires a much smaller amount of biological material, and is faster and easier to perform [1].
To help researchers quickly assess the quality of ATAC-seq data, we have developed the ATACseqQC package for easily making diagnostic plots following the published guidelines [1]. In addition, it has functions to preprocess ATAC-seq data for subsequent peak calling.
Here is an example of using ATACseqQC with a subset of published ATAC-seq data [1]. Currently, only the bam input file format is supported.
## prepare the packages
library(BiocInstaller)
biocLite(c("ATACseqQC", "ChIPpeakAnno", "MotifDb",
"BSgenome.Hsapiens.UCSC.hg19", "TxDb.Hsapiens.UCSC.hg19.knownGene",
"phastCons100way.UCSC.hg19"))
The packages imported in this vignette are designed for human data. For a different species or a different assembly, the BSgenome, TxDb and phastCons packages should be changed accordingly. For example, for mouse data they should be BSgenome.Mmusculus.UCSC.mm10, TxDb.Mmusculus.UCSC.mm10.knownGene and phastCons60way.UCSC.mm10 (for how to obtain phastCons60way.UCSC.mm10, please refer to the vignettes of GenomicScores). If no conservation score is available, simply omit the corresponding argument.
## load the library
library(ATACseqQC)
## input is bamFile
bamfile <- system.file("extdata", "GL1.bam",
package="ATACseqQC", mustWork=TRUE)
## strip only a trailing ".bam" (an unescaped "." would match any character)
bamfile.labels <- sub("\\.bam$", "", basename(bamfile))
bamQC(bamfile, outPath=NULL)
## $totalQNAMEs
## [1] 44357
##
## $duplicateRate
## [1] 0.03002908
##
## $mitochondriaRate
## [1] 0
##
## $idxstats
## seqnames seqlength mapped unmapped
## 1 chr1 249250621 88714 0
## 2 chr2 243199373 0 0
## 3 chr3 198022430 0 0
## 4 chr4 191154276 0 0
## 5 chr5 180915260 0 0
## 6 chr6 171115067 0 0
## 7 chr7 159138663 0 0
## 8 chr8 146364022 0 0
## 9 chr9 141213431 0 0
## 10 chr10 135534747 0 0
## 11 chr11 135006516 0 0
## 12 chr12 133851895 0 0
## 13 chr13 115169878 0 0
## 14 chr14 107349540 0 0
## 15 chr15 102531392 0 0
## 16 chr16 90354753 0 0
## 17 chr17 81195210 0 0
## 18 chr18 78077248 0 0
## 19 chr19 59128983 0 0
## 20 chr20 63025520 0 0
## 21 chr21 48129895 0 0
## 22 chr22 51304566 0 0
## 23 chrX 155270560 0 0
## 24 chrY 59373566 0 0
## 25 chrM 16571 0 0
First, there should be a large proportion of fragments shorter than 100 bp, which represent the nucleosome-free regions. Second, the fragment size distribution should show a clear periodicity, evident in the inset figure, that is indicative of nucleosome occupancy (present in integer multiples).
## generate fragment size distribution
fragSize <- fragSizeDist(bamfile, bamfile.labels)
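As a rough numeric companion to the plot, the share of short fragments can be computed directly from a vector of fragment lengths. The sketch below uses made-up lengths (not values from GL1.bam). The names frag_len and short_fraction are illustrative only and are not part of the ATACseqQC API.

```r
## Toy fragment lengths in bp (hypothetical values, for illustration only)
frag_len <- c(38, 52, 67, 75, 88, 96, 150, 190, 205, 210, 380, 400)

## Fraction of fragments shorter than 100 bp, i.e. the putative
## nucleosome-free population described above
short_fraction <- mean(frag_len < 100)
short_fraction
```

A library with a clear nucleosome-free population would be expected to give a large value here. What counts as large enough is a judgment call that this sketch does not settle.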
Tn5 transposase has been shown to bind as a dimer and insert two adaptors separated by 9 bp [2].
Therefore, for downstream analysis, such as peak calling and footprinting, all reads in the input bamfile need to be shifted. The function shiftGAlignmentsList can be used to shift the reads. By default, all reads aligning to the positive strand are offset by +4 bp, and all reads aligning to the negative strand are offset by -5 bp [1].
The adjusted reads will be written into a new bamfile for peak calling or footprinting.
## bamfile tags
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
## files will be output into outPath
outPath <- "splited"
dir.create(outPath)
## shift the bam file by the 5'ends
library(BSgenome.Hsapiens.UCSC.hg19)
seqlev <- "chr1" ## subsample data for quick run
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")
gal <- readBamFile(bamfile, tag=tags, which=which, asMates=TRUE)
gal1 <- shiftGAlignmentsList(gal)
shiftedBamfile <- file.path(outPath, "shifted.bam")
export(gal1, shiftedBamfile)
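The offsets can be illustrated on a toy table of read coordinates. This is a simplified sketch in base R, assuming the shift is applied to the 5' end of each read (the start of a positive-strand read, the end of a negative-strand read). The function shift_toy_reads and the data frame layout are hypothetical and do not reproduce how shiftGAlignmentsList handles CIGAR strings or mate pairs.

```r
## Hypothetical helper: move the 5' end of each read by the default offsets
shift_toy_reads <- function(reads, positive = 4L, negative = 5L) {
  plus <- reads$strand == "+"
  ## positive strand: the 5' end is the start coordinate, offset by +4 bp
  reads$start[plus] <- reads$start[plus] + positive
  ## negative strand: the 5' end is the end coordinate, offset by -5 bp
  reads$end[!plus] <- reads$end[!plus] - negative
  reads
}

toy <- data.frame(start = c(100L, 200L),
                  end = c(150L, 250L),
                  strand = c("+", "-"),
                  stringsAsFactors = FALSE)
shifted <- shift_toy_reads(toy)
shifted
```

The two offsets differ by 9 bp, which matches the 9 bp separation of the two adaptors inserted by the Tn5 dimer.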
The shifted reads will be split into different bins, namely nucleosome-free, mononucleosome, dinucleosome, and trinucleosome. Shifted reads that do not fit into any of the above bins will be discarded. Splitting reads is a time-consuming step because we use a random forest to classify the fragments based on fragment length, GC content and conservation scores [3].
By default, we assign the top 10% of short reads (reads below 100 bp) as nucleosome-free and the top 10% of intermediate-length reads (reads between 180 and 247 bp) as mononucleosome. These serve as the training set to classify the rest of the fragments using a random forest. The number of trees is set to twice the square root of the size of the training set.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txs <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(phastCons100way.UCSC.hg19)
## run program for chromosome 1 only
txs <- txs[seqnames(txs) %in% "chr1"]
genome <- Hsapiens
## split the reads into NucleosomeFree, mononucleosome,
## dinucleosome and trinucleosome.
objs <- splitGAlignmentsByCut(gal1, txs=txs, genome=genome,
conservation=phastCons100way.UCSC.hg19)
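The length ranges and the tree-count rule described above can be written out in a few lines of base R. This is only a sketch: the fragment lengths are made up, the interval endpoints are assumed to be inclusive for the 180 to 247 bp range, rounding up is an assumption, and the step that keeps the "top 10%" of each pool is left out because the ranking criterion is not stated here. The names nf_pool, mono_pool and n_trees are hypothetical.

```r
## Toy fragment lengths in bp (hypothetical values)
frag_len <- c(45, 60, 80, 99, 120, 185, 200, 230, 247, 300)

## Candidate pools from which the training examples would be drawn
nf_pool <- frag_len[frag_len < 100]                        # nucleosome-free
mono_pool <- frag_len[frag_len >= 180 & frag_len <= 247]   # mononucleosome

## Number of trees: twice the square root of the training set size
## (rounded up here, which is an assumption of this sketch)
n_trees <- function(n_train) ceiling(2 * sqrt(n_train))
n_trees(400)
```

With a training set of 400 fragments this rule would give 40 trees, so the forest grows slowly relative to the amount of training data.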
Save the binned alignments into bam files.
null <- writeListOfGAlignments(objs, outPath)
## list the files generated by splitBam.
dir(outPath)
## [1] "NucleosomeFree.bam" "NucleosomeFree.bam.bai"
## [3] "dinucleosome.bam" "dinucleosome.bam.bai"
## [5] "inter1.bam" "inter1.bam.bai"
## [7] "inter2.bam" "inter2.bam.bai"
## [9] "inter3.bam" "inter3.bam.bai"
## [11] "mononucleosome.bam" "mononucleosome.bam.bai"
## [13] "others.bam" "others.bam.bai"
## [15] "shifted.bam" "shifted.bam.bai"
## [17] "trinucleosome.bam" "trinucleosome.bam.bai"
You can also shift, split and save the bam files in one step by calling splitBam.
objs <- splitBam(bamfile, tags=tags, outPath=outPath,
txs=txs, genome=genome,
conservation=phastCons100way.UCSC.hg19)
If you do not have conservation scores, you can try splitting the bam files by fragment length alone, without providing the conservation argument.
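Splitting by fragment length alone amounts to binning lengths into consecutive intervals. The base R sketch below illustrates the idea with cut(). The bin names follow the bam files listed above, but the break points beyond 100, 180 and 247 bp are assumptions chosen for illustration and should be checked against the package documentation before being relied on.

```r
## Assumed break points in bp; only 100, 180 and 247 appear in the text above
breaks <- c(0, 100, 180, 247, 315, 473, 558, 615, Inf)
bin_names <- c("NucleosomeFree", "inter1", "mononucleosome", "inter2",
               "dinucleosome", "inter3", "trinucleosome", "others")

## Toy fragment lengths in bp (hypothetical values)
frag_len <- c(50, 150, 200, 280, 400, 500, 600, 700)

## Assign each fragment to a bin by its length
bins <- cut(frag_len, breaks = breaks, labels = bin_names)
data.frame(frag_len, bin = as.character(bins))
```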
By averaging the signal across all active TSSs, we should observe that nucleosome-free fragments are enriched at the TSS, whereas the nucleosome-bound fragments should be enriched both upstream and downstream of the active TSS and display characteristic phasing of upstream and downstream nucleosomes. Because ATAC-seq reads are concentrated at regions of open chromatin, users should see a strong nucleosome signal at the +1 nucleosome that decreases at the +2, +3 and +4 nucleosomes.
library(ChIPpeakAnno)
bamfiles <- file.path(outPath,
c("NucleosomeFree.bam",
"mononucleosome.bam",
"dinucleosome.bam",
"trinucleosome.bam"))
## Plot the cumulative percentage tag allocation in NucleosomeFree
## and mononucleosome bams.
cumulativePercentage(bamfiles[1:2], as(seqinfo(Hsapiens)["chr1"], "GRanges"))
TSS <- promoters(txs, upstream=0, downstream=1)
TSS <- unique(TSS)
## estimate the library size for normalization
(librarySize <- estLibSize(bamfiles))
## splited/NucleosomeFree.bam splited/mononucleosome.bam
## 34158 2103
## splited/dinucleosome.bam splited/trinucleosome.bam
## 1888 385
## calculate the signals around TSSs.
NTILE <- 101
dws <- ups <- 1010
sigs <- enrichedFragments(gal=objs[c("NucleosomeFree",
"mononucleosome",
"dinucleosome",
"trinucleosome")],
TSS=TSS,
librarySize=librarySize,
seqlev=seqlev,
TSS.filter=0.5,
n.tile = NTILE,
upstream = ups,
downstream = dws)
## log2 transformed signals
sigs.log2 <- lapply(sigs, function(.ele) log2(.ele+1))
## plot heatmap
featureAlignedHeatmap(sigs.log2, reCenterPeaks(TSS, width=ups+dws),
zeroAt=.5, n.tile=NTILE)
## get signals normalized for nucleosome-free and nucleosome-bound regions.
out <- featureAlignedDistribution(sigs,
reCenterPeaks(TSS, width=ups+dws),
zeroAt=.5, n.tile=NTILE, type="l")
## rescale the nucleosome-free and nucleosome signals to 0~1
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
out <- apply(out, 2, range01)
matplot(out, type="l", xaxt="n",
xlab="Position (bp)",
ylab="Fraction of signal")
axis(1, at=seq(0, 100, by=10)+1,
labels=c("-1K", seq(-800, 800, by=200), "1K"), las=3)
abline(v=seq(0, 100, by=10)+1, lty=2, col="gray")
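The axis labels in the plot above follow from the tiling of the window around each TSS. The arithmetic below is a sketch that assumes the 2020 bp window is cut into 101 tiles of equal width with the TSS in the middle tile. The names tile_width and tile_center are illustrative only.

```r
ups <- 1010
dws <- 1010
NTILE <- 101

## Width of one tile in bp
tile_width <- (ups + dws) / NTILE

## Centre of each tile relative to the TSS (negative values are upstream)
tile_center <- -ups + (seq_len(NTILE) - 0.5) * tile_width

## Tiles at which the axis ticks are drawn in the plot above
tick_tiles <- seq(0, 100, by = 10) + 1
tile_center[tick_tiles]
```

Under these assumptions each tile spans 20 bp, tile 51 is centred on the TSS, and the ticked tiles are centred at -1000, -800, ..., 800, 1000 bp, which is why the axis runs from "-1K" to "1K" in steps of 200.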
ATAC-seq footprints infer factor occupancy genome-wide. The factorFootprints function uses matchPWM to predict the binding sites using the input position weight matrix (PWM). Then it calculates and plots the accumulated coverage for those binding sites to show the status of the occupancy genome-wide. Unlike CENTIPEDE [4], the footprints generated here do not take conservation (PhyloP) into consideration. The factorFootprints function can also accept the possible binding sites as a GRanges object.
## footprints
library(MotifDb)
CTCF <- query(MotifDb, c("CTCF"))
CTCF <- as.list(CTCF)
print(CTCF[[1]], digits=2)
## 1 2 3 4 5 6 7 8 9 10 11 12
## A 0.17 0.23 0.29 0.10 0.33 0.06 0.052 0.037 0.023 0.00099 0.245 0.00099
## C 0.42 0.28 0.30 0.32 0.11 0.33 0.562 0.005 0.960 0.99702 0.670 0.68901
## G 0.25 0.23 0.26 0.27 0.42 0.55 0.052 0.827 0.013 0.00099 0.027 0.00099
## T 0.16 0.27 0.15 0.31 0.14 0.06 0.334 0.131 0.004 0.00099 0.058 0.30900
## 13 14 15 16 17 18 19 20
## A 0.00099 0.050 0.253 0.004 0.172 0.00099 0.019 0.19
## C 0.99702 0.043 0.073 0.418 0.150 0.00099 0.063 0.43
## G 0.00099 0.017 0.525 0.546 0.055 0.99702 0.865 0.15
## T 0.00099 0.890 0.149 0.032 0.623 0.00099 0.053 0.23
sigs <- factorFootprints(shiftedBamfile, pfm=CTCF[[1]],
genome=genome,
min.score="90%", seqlev=seqlev,
upstream=100, downstream=100)
featureAlignedHeatmap(sigs$signal,
feature.gr=reCenterPeaks(sigs$bindingSites,
width=200+width(sigs$bindingSites[1])),
annoMcols="score",
sortBy="score",
n.tile=ncol(sigs$signal[[1]]))
sigs$spearman.correlation
## $`+`
##
## Spearman's rank correlation rho
##
## data: predictedBindingSiteScore and highest.sig.windows
## S = 6179300, p-value = 1.241e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2499378
##
##
## $`-`
##
## Spearman's rank correlation rho
##
## data: predictedBindingSiteScore and highest.sig.windows
## S = 6521300, p-value = 5.731e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2084312
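To make the binding-site prediction step more concrete, the sketch below scans a short sequence with a toy matrix and keeps windows scoring at least 90% of the highest possible score, in the spirit of min.score="90%". It is a simplified base R illustration, not the Biostrings implementation of matchPWM: the 3-column matrix, the sequence and the helper score_windows are all made up, and only one strand is scanned.

```r
## Toy 3-column weight matrix (hypothetical values) favouring "ACG"
pwm <- matrix(c(0.7, 0.1, 0.1, 0.1,
                0.1, 0.7, 0.1, 0.1,
                0.1, 0.1, 0.7, 0.1),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

## Hypothetical helper: score every window of width ncol(pwm) along seq
score_windows <- function(seq, pwm) {
  bases <- strsplit(seq, "")[[1]]
  w <- ncol(pwm)
  n <- length(bases) - w + 1
  vapply(seq_len(n), function(i) {
    rows <- match(bases[i:(i + w - 1)], rownames(pwm))
    sum(pwm[cbind(rows, seq_len(w))])
  }, numeric(1))
}

scores <- score_windows("TTACGTT", pwm)
max_score <- sum(apply(pwm, 2, max))   # highest possible score
hits <- which(scores >= 0.9 * max_score)
hits
```

Only the window starting at position 3 ("ACG") passes the threshold in this toy case. Lowering the percentage would admit weaker matches and hence more candidate sites.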
Here are the CTCF footprints for the full dataset.
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] MotifDb_1.20.0
## [2] phastCons100way.UCSC.hg19_3.6.0
## [3] GenomicScores_1.2.0
## [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [5] GenomicFeatures_1.30.0
## [6] AnnotationDbi_1.40.0
## [7] Biobase_2.38.0
## [8] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [9] BSgenome_1.46.0
## [10] rtracklayer_1.38.1
## [11] ChIPpeakAnno_3.12.4
## [12] VennDiagram_1.6.18
## [13] futile.logger_1.4.3
## [14] GenomicRanges_1.30.0
## [15] GenomeInfoDb_1.14.0
## [16] Biostrings_2.46.0
## [17] XVector_0.18.0
## [18] IRanges_2.12.0
## [19] ATACseqQC_1.2.2
## [20] S4Vectors_0.16.0
## [21] BiocGenerics_0.24.0
## [22] BiocStyle_2.6.1
##
## loaded via a namespace (and not attached):
## [1] ProtGenerics_1.10.0 bitops_1.0-6
## [3] matrixStats_0.52.2 bit64_0.9-7
## [5] progress_1.1.2 httr_1.3.1
## [7] rprojroot_1.2 tools_3.4.3
## [9] backports_1.1.1 rGADEM_2.26.0
## [11] R6_2.2.2 splitstackshape_1.4.2
## [13] seqLogo_1.44.0 DBI_0.7
## [15] lazyeval_0.2.1 colorspace_1.3-2
## [17] ade4_1.7-8 motifStack_1.22.0
## [19] prettyunits_1.0.2 RMySQL_0.10.13
## [21] bit_1.1-12 curl_3.0
## [23] compiler_3.4.3 graph_1.56.0
## [25] grImport_0.9-0 DelayedArray_0.4.1
## [27] bookdown_0.5 scales_0.5.0
## [29] randomForest_4.6-12 RBGL_1.54.0
## [31] stringr_1.2.0 digest_0.6.12
## [33] Rsamtools_1.30.0 rmarkdown_1.8
## [35] pkgconfig_2.0.1 htmltools_0.3.6
## [37] ensembldb_2.2.0 limma_3.34.3
## [39] regioneR_1.10.0 htmlwidgets_0.9
## [41] rlang_0.1.4 RSQLite_2.0
## [43] BiocInstaller_1.28.0 shiny_1.0.5
## [45] BiocParallel_1.12.0 RCurl_1.95-4.8
## [47] magrittr_1.5 GO.db_3.5.0
## [49] GenomeInfoDbData_0.99.1 Matrix_1.2-12
## [51] Rcpp_0.12.14 munsell_0.4.3
## [53] stringi_1.1.6 yaml_2.1.15
## [55] MASS_7.3-47 SummarizedExperiment_1.8.0
## [57] zlibbioc_1.24.0 plyr_1.8.4
## [59] AnnotationHub_2.10.1 blob_1.1.0
## [61] lattice_0.20-35 splines_3.4.3
## [63] multtest_2.34.0 knitr_1.17
## [65] MotIV_1.34.0 seqinr_3.4-5
## [67] biomaRt_2.34.0 futile.options_1.0.0
## [69] XML_3.98-1.9 evaluate_0.10.1
## [71] data.table_1.10.4-3 lambda.r_1.2
## [73] idr_1.2 httpuv_1.3.5
## [75] assertthat_0.2.0 mime_0.5
## [77] xtable_1.8-2 AnnotationFilter_1.2.0
## [79] survival_2.41-3 tibble_1.3.4
## [81] GenomicAlignments_1.14.1 memoise_1.1.0
## [83] interactiveDisplayBase_1.16.0
1. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods 10, 1213–1218 (2013).
2. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biology 11, R119 (2010).
3. Chen, K. et al. DANPOS: Dynamic analysis of nucleosome position and occupancy by sequencing. Genome Research 23, 341–351 (2013).
4. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Research 21, 447–455 (2011).