Abstract
This Quick-Start is a runnable example showing the functionalities of the SpliceWiz workflow. Version 1.0.4
SpliceWiz is a graphical interface for differential alternative splicing and visualization in R. It differs from other alternative splicing tools as it is designed for users with basic bioinformatic skills to analyze datasets containing up to hundreds of samples! SpliceWiz contains a number of innovations including:
This vignette is a runnable working example of the SpliceWiz workflow. The purpose is to quickly demonstrate the basic functionalities of SpliceWiz.
We provide here a brief outline of the workflow for users to get started as quickly as possible. However, we also provide more details for those wishing to know more. Many sections will contain extra information that can be displayed when clicked on, such as these:
Note: for all runnable examples, first load the SpliceWiz library:
library(SpliceWiz)
To reduce false positives in novel splicing detection, SpliceWiz provides several filters to reduce the number of novel junctions fed into the analysis:
novelSplicing_minSamples
parameternovelSplicing_countThreshold
) in a smaller number of samples (set using novelSplicing_minSamplesAboveThreshold
)novelSplicing_requireOneAnnotatedSJ = TRUE
)Novel ASE detection is integrated into the SpliceWiz pipeline at the collation step. After compilation and processing of novel junctions / TJ’s, the novel transcripts are appended to the transcript annotation, which is then used to re-construct the SpliceWiz reference. This reference is contained in the “Reference” subfolder of the output folder of collateData()
function.
TL/DR - how to enable novel ASE mode
novelSplicing = TRUE
when running collateData()
. For example:
# Usual pipeline:
file.path(tempdir(), "Reference")
ref_path <-buildRef(
reference_path = ref_path,
fasta = chrZ_genome(),
gtf = chrZ_gtf()
)
file.path(tempdir(), "pb_output")
pb_path <-processBAM(
bamfiles = bams$path,
sample_names = bams$sample,
reference_path = ref_path,
output_path = pb_path
)
# Modified pipeline - collateData with novel ASE discovery:
file.path(tempdir(), "NxtSE_output")
nxtse_path <-collateData(
Experiment = expr,
reference_path = ref_path,
output_path = nxtse_path,
## NEW ##
novelSplicing = TRUE,
# switches on novel splice detection
novelSplicing_requireOneAnnotatedSJ = TRUE,
# novel junctions must share one annotated splice site
novelSplicing_minSamples = 3,
# retain junctions observed in 3+ samples (of any non-zero expression)
novelSplicing_minSamplesAboveThreshold = 1,
# only 1 sample required if its junction count exceeds a set threshold
novelSplicing_countThreshold = 10
# threshold for previous parameter
)
For individual sample coverage plots (i.e. when condition
is not set), junction counts for each sample are plotted. Samples with low junction counts (less than 0.01x of the track height) are omitted to reduce clutter.
For group-normalized coverage plots (where coverage of multiple samples in a condition group are combined), junctions are instead labeled by their “provisional PSIs”. These PSIs are calculated per junction (instead of per ASE). This is done by determining the ratio of junction counts as a proportion of all junction reads that share a common exon cluster as the junction being assessed.
TL/DR - how to enable junction plotting
plotJunctions = TRUE
from within plotCoverage()
# Retrieve example NxtSE object
SpliceWiz_example_NxtSE()
se <-
# Assign annotation of the experimental conditions
colData(se)$treatment <- rep(c("A", "B"), each = 3)
# Return a list of ggplot and plotly objects, also plotting junction counts
plotCoverage(
p <-se = se,
Event = "SE:SRSF3-203-exon4;SRSF3-202-int3",
tracks = colnames(se)[1:4],
## NEW ##
plotJunctions = TRUE
)#> Warning in geom_line(data = dfJn, aes_string(x = "x", y = "yarc", group = "junction", : Ignoring unknown aesthetics: label
#> Ignoring unknown aesthetics: label
#> Ignoring unknown aesthetics: label
#> Ignoring unknown aesthetics: label
if(interactive()) {
# Display as plotly object
$final_plot
pelse {
} # Display as ggplot
as_ggplot_cov(p)
}
# Plot by condition "treatment", including provisional PSIs
plotCoverage(
p <-se = se,
Event = "SE:SRSF3-203-exon4;SRSF3-202-int3",
tracks = c("A", "B"), condition = "treatment",
## NEW ##
plotJunctions = TRUE
)#> Warning in geom_line(data = dfJn, aes_string(x = "x", y = "yarc", group = "junction", : Ignoring unknown aesthetics: label
#> Ignoring unknown aesthetics: label
if(interactive()) {
# Display as plotly object
$final_plot
pelse {
} # Display as ggplot
as_ggplot_cov(p)
}
The basic steps of SpliceWiz are as follows:
To install SpliceWiz, start R (version “4.2.1”) and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install("SpliceWiz") BiocManager
libomp
via brew:
brew install libomp
To install all of these packages:
install.packages("DoubleExpSeq")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install(c("DESeq2", "limma", "satuRn")) BiocManager
library(SpliceWiz)
NxtIRFdata
data package. This data package contains the example “chrZ” genome / annotations and 6 example BAM files that are used in this working example. Also, NxtIRFdata provides pre-generated mappability exclusion annotations for building human and mouse SpliceWiz references
SpliceWiz offers a graphical user interface (GUI) for interactive users, e.g. in the RStudio environment. To start using SpliceWiz GUI:
if(interactive()) {
spliceWiz(demo = TRUE)
}
Note that the GUI demo mode is not supported on Bioconductor 3.13 or below.
Using the example FASTA and GTF files, use the buildRef()
function to build the SpliceWiz reference:
file.path(tempdir(), "Reference")
ref_path <-buildRef(
reference_path = ref_path,
fasta = chrZ_genome(),
gtf = chrZ_gtf()
)
(NEW in version >= 0.99.3) The SpliceWiz reference can be viewed as data frames using various getter functions. For example, to view the annotated alternative splicing events (ASE):
viewASE(ref_path) df <-
See ?View-Reference-methods
for a comprehensive list of getter functions
Reference
tab from the menu side bar. The following interface will be shown:
Reference GUI
The first step to building a SpliceWiz reference is to select a directory in which to create the reference.
SpliceWiz provides an interface to retrieve the genome sequence (FASTA) and transcriptome annotation (GTF) files from the Ensembl FTP server, by first selecting the “Release” and then “Species” from the drop-down boxes.
Alternatively, users can provide their own FASTA and GTF files.
Human (hg38, hg19) and mouse genomes (mm10, mm9) have the option of further refining IR analysis using built-in mappability exclusion annotations, allowing SpliceWiz to ignore intronic regions of low mappability.
Load Demo FASTA/GTF
(5), and then click Build Reference
(6)
chrZ_genome()
and chrZ_gtf()
returns the paths to the example genome (FASTA) and transcriptome (GTF) file included with the NxtIRFdata
package that contains the working example used by SpliceWiz:
# Provides the path to the example genome:
chrZ_genome()
#> [1] "/home/biocbuild/bbs-3.16-bioc/R/site-library/NxtIRFdata/extdata/genome.fa"
# Provides the path to the example gene annotation:
chrZ_gtf()
#> [1] "/home/biocbuild/bbs-3.16-bioc/R/site-library/NxtIRFdata/extdata/transcripts.gtf"
For intron retention, accurate assessment of intron depth is important. However, introns contain many repetitive regions that are difficult to map. We refer to these regions as “mappability exclusions”.
We adopt IRFinder’s algorithm to identify these mappability exclusions. This is determined empirically by generating synthetic reads systematically from the genome, then aligning these reads back to the same genome. Regions that contain less than the expected coverage depth of reads define “mappability exclusions”.
See the vignette: SpliceWiz cookbook for details on how to generate “mappability exclusions” for any genome.NxtIRFdata
package.
Simply specify the genome in the parameter genome_type
in the buildRef()
function (which accepts hg38
, hg19
, mm10
and mm9
).
Additionally, a reference for non-polyadenylated transcripts is used. This has a minor role in QC of samples (to assess the adequacy of polyA capture).
For example, assuming your genome file "genome.fa"
and a transcript annotation "transcripts.gtf"
are in the working directory, a SpliceWiz reference can be built using the built-in hg38
low mappability regions and non-polyadenylated transcripts as follows:
"./Reference"
ref_path_hg38 <-buildRef(
reference_path = ref_path_hg38,
fasta = "genome.fa",
gtf = "transcripts.gtf",
genome_type = "hg38"
)
The function SpliceWiz_example_bams()
retrieves 6 example BAM files from ExperimentHub and places a copy of these in the temporary directory.
SpliceWiz_example_bams() bams <-
level = 0
* The BAM files have generic names but are contained inside parent directories labeled by their sample names, e.g. “sample1/Unsorted.bam”, “sample2/Unsorted.bam”. In this case, level = 1
# as BAM file names denote their sample names
findBAMS(tempdir(), level = 0)
bams <-
# In the case where BAM files are labelled using sample names as parent
# directory names (which oftens happens with the STAR aligner), use level = 1
Process these BAM files using SpliceWiz:
file.path(tempdir(), "pb_output")
pb_path <-processBAM(
bamfiles = bams$path,
sample_names = bams$sample,
reference_path = ref_path,
output_path = pb_path
)
Experiment
tab from the menu side bar. The following interface will be shown:
Experiment GUI
Define Project Folders
to bring up the following drop-down box:
Define Project Folders
We need to define the folders that contain our reference, BAM files, as well as processBAM output files and the final compiled experiment (that generates the NxtSE object).
Choose Folder (Reference)
and select the Reference
directory (where the SpliceWiz reference was generated by the previous step. Then,Choose Folder (BAM files)
and select the bams
directory (where the demo BAM files have been generated).Choose Folder (processBAM output)
and select the pb_output
directory (which should currently be empty).Choose Folder (NxtSE files)
and select the NxtSE
directory(which should currently be empty).After our folders have been defined, on the right hand side, an interactive table should be displayed that looks like the following:
Running processBAM
Process BAM files
to open the dropdown menu, and then click Run processBAM()
. A prompt should then be displayed asking whether you wish to proceed. Click OK
to start running processBAM.
processBAM()
functionprocessBAM()
function can process one or more BAM files. This function is ultra-fast, relying on an internal native C++ function that uses OpenMP multi-threading (via the ompBAM
C++ API).
Input BAM files can be either read-name sorted or coordinate sorted (although SpliceWiz prefers the former). Indexing of coordinate-sorted BAMs are not necessary.
processBAM()
loads the SpliceWiz reference. Then, it reads each BAM file in their entirety, and quantifies the following:
processBAM()
generates two output files. The first is a gzipped text file containing all the quantitation data. The second is a COV
file which contains the per-nucleotide RNA-seq coverage of the sample.processBAM()
functionprocessBAM()
requires four parameters:
bamfiles
: The paths of the BAM filessample_names
: The sample names corresponding to the given BAM filesreference_path
: The directory containing the SpliceWiz referenceoutput_path
: The directory where the output of processBAM()
should go file.path(tempdir(), "pb_output")
pb_path <-processBAM(
bamfiles = bams$path,
sample_names = bams$sample,
reference_path = ref_path,
output_path = pb_path
)
processBAM()
also takes several optional, but useful, parameters:
n_threads
: The number of threads for multi-threadingoverwrite
: Whether existing files in the output directory should be overwrittenrun_featureCounts
: (Requires the Rsubread package) runs featureCounts to obtain gene counts (which outputs results as an RDS file)For example, to run processBAM()
using 2 threads, disallow overwrite of existing processBAM()
outputs, and run featureCounts afterwards, one would run the following:
# NOT RUN
# Re-run IRFinder without overwrite, and run featureCounts
require(Rsubread)
processBAM(
bamfiles = bams$path,
sample_names = bams$sample,
reference_path = ref_path,
output_path = pb_path,
n_threads = 2,
overwrite = FALSE,
run_featureCounts = TRUE
)
# Load gene counts
readRDS(file.path(pb_path, "main.FC.Rds"))
gene_counts <-
# Access gene counts:
$counts gene_counts
The helper function findSpliceWizOutput()
organises the output files of SpliceWiz’s processBAM()
function. It identifies matching "txt.gz"
and "cov"
files for each sample, and organises these file paths conveniently into a 3-column data frame:
findSpliceWizOutput(pb_path) expr <-
Using this data frame, collate the experiment using collateData()
. We name the output directory as NxtSE_output
as this folder will contain the data needed to import the NxtSE object:
file.path(tempdir(), "NxtSE_output")
nxtse_path <-collateData(
Experiment = expr,
reference_path = ref_path,
output_path = nxtse_path
)
collateData()
functioncollateData()
combines the processBAM()
output files of multiple samples and builds a single database. collateData()
creates a number of files in the chosen output directory. These outputs can then be imported into the R session as a NxtSE
data object for downstream analysis.
At minimum, collateData()
takes the following parameters:
Experiment
: The 2- or 3- column data frame. The first column should contain (unique) sample names. The second and (optional) third columns contain the "txt.gz"
and "cov"
file pathsreference_path
: The directory containing the SpliceWiz referenceoutput_path
: The directory where the output of processBAM()
should gocollateData()
can take some optional parameters:
IRMode
: Whether to use SpliceWiz’s SpliceOver
method, or IRFinder’s SpliceMax
method, to determine total spliced transcript abundance. Briefly, SpliceMax
considers junction reads that have either flanking splice site coordinate. SpliceOver
considers additional junction reads that splices across exon clusters in common. Exon clusters are groups of mutually-overlapping exons. SpliceOver
is the default option.overwrite
: Whether files in the output directory should be overwrittenn_threads
: Use multi-threaded operations where possiblelowMemoryMode
: Minimise memory usage where possible. Note that most of the collateData pipeline will be single-threaded if this is set to TRUE
.collateData()
is a memory-intensive operation when run using multiple threads. We estimate it can use up to 6-7 Gb per thread. lowMemoryMode
will minimise RAM usage to ~ 8 Gb, but will be slower and run on a single thread.novelSplicing = TRUE
from within the collateData()
function:
# Modified pipeline - collateData with novel ASE discovery:
file.path(tempdir(), "NxtSE_output_novel")
nxtse_path <-collateData(
Experiment = expr,
reference_path = ref_path,
output_path = nxtse_path,
novelSplicing = TRUE ## NEW ##
)
collateData()
uses split reads that are not annotated introns to help construct hypothetical minimal transcripts. These are then injected into the original transcriptome annotation (GTF) file, whereby the SpliceWiz reference is rebuilt. The new SpliceWiz reference (which contains these novel transcripts) is then used to collate the samples.
To reduce false positives in novel splicing detection, SpliceWiz provides several filters to reduce the number of novel junctions fed into the analysis:
novelSplicing_minSamples
parameternovelSplicing_countThreshold
) in a smaller number of samples (set using novelSplicing_minSamplesAboveThreshold
)novelSplicing_requireOneAnnotatedSJ = TRUE
).For example, if one wished to retain novel reads seen in 3 or more samples, or novel spliced reads with 10 or more counts in at least 1 sample, and requiring at least one end of a novel junction being an annotated splice site:
file.path(tempdir(), "NxtSE_output_novel")
nxtse_path <-collateData(
Experiment = expr,
reference_path = ref_path,
output_path = nxtse_path,
## NEW ##
novelSplicing = TRUE,
# switches on novel splice detection
novelSplicing_requireOneAnnotatedSJ = TRUE,
# novel junctions must share one annotated splice site
novelSplicing_minSamples = 3,
# retain junctions observed in 3+ samples (of any non-zero expression)
novelSplicing_minSamplesAboveThreshold = 1,
# only 1 sample required if its junction count exceeds a set threshold
novelSplicing_countThreshold = 10
# threshold for previous parameter
)
Construct Experiment
. Your interface should now look like this:
collateData GUI
This drop-down dialog box contains several parameters related to novel splicing detection:
To collate the experiment, click on (6) Run collateData()
. An alert pop-up should be displayed when this process is complete.
Before differential analysis can be performed, the collated experiment must be imported into the R session as an NxtSE
data object.
After running collateData()
, import the experiment using the makeSE()
function:
makeSE(nxtse_path) se <-
Analysis
and then click the Load Experiment
on the menu bar. The display should look like this:
Loading the NxtSE GUI
First, click the Open Folder containing NxtSE
to open the dropdown, then click Choose Folder (NxtSE)
. Select the NxtSE
directory. The interface should now look like this:
Loading the NxtSE GUI
To load the NxtSE object, click the Load NxtSE object
to open the dropdown, then click Load NxtSE object
to load the NxtSE object into the current session.
makeSE()
functionmakeSE()
function imports the compiled data generated by the collateData()
function. Data is imported as an NxtSE
object. Downstream analysis, including differential analysis and visualization, is performed using the NxtSE
object.makeSE()
functionmakeSE()
uses delayed operations to avoid consuming memory until the data is actually needed. This is advantageous in analysis of hundreds of samples on a computer with limited resources. However, it will be slower. To load all the data into memory, we need to “realize” the NxtSE object, as follows:
realize_NxtSE(se) se <-
Alternatively, makeSE()
can realize the NxtSE object at construction:
makeSE(nxtse_path, realize = TRUE) se <-
By default, makeSE()
constructs the NxtSE object using all the samples in the collated data. It is possible (and particularly useful in large data sets) to read only a subset of samples. In this case, construct a data frame object with the first column containing the desired sample names and parse this into the colData
parameter as shown:
colnames(se)[1:4]
subset_samples <- data.frame(sample = subset_samples)
df <- makeSE(nxtse_path, colData = df, RemoveOverlapping = TRUE) se_small <-
RemoveOverlapping = FALSE
.
colData(se)$condition <- rep(c("A", "B"), each = 3)
colData(se)$batch <- rep(c("K", "L", "M"), 2)
Review Annotations
. Then, click Edit Interactively
. Your interface should now look like this:
NxtSE Annotations GUI
Now, lets add an extra column. In the Annotation Columns
box which has just appeared, type in a name condition
, then click Add
. Then, click batch
and again click Add
. Then, click on each empty cell and fill in the fields as shown:
NxtSE Annotations GUI
Finally, the annotated NxtSE needs to be reloaded into memory. Click Load NxtSE object
and then Load NxtSE object
to reload the NxtSE object with the new annotations.
NB: Annotations can be assigned prior to loading the NxtSE object, but can only be done after the NxtSE folder is selected.
Alternatively, users can provide a file that contains the annotations in tabular format (e.g. a comma-separated values (CSV) or tab-separated values (TSV) file). Click onReview Annotations
and then Import Data Frame from File
. Then, select the demo_annotations.csv
file that has been pre-generated by SpliceWiz’s demo mode.
Save NxtSE to/from RDS file
, then click Save NxtSE as RDS
. Choose a file name and location and press OK
. This RDS file can be loaded as an NxtSE object in a later GUI session by clicking the Load NxtSE from RDS
button.
NxtSE
objectNxtSE
is a data object which contains all the required data for downstream analysis after all the BAM alignment files have been process and the experiment is collated.
se#> class: NxtSE
#> dim: 168 6
#> metadata(11): Up_Inc Down_Inc ... ref BuildVersion
#> assays(5): Included Excluded Depth Coverage minDepth
#> rownames(168): TRA2B/ENST00000453386_Intron8/clean
#> TRA2B/ENST00000453386_Intron7/clean ...
#> RI:SRSF2-203-exon2;SRSF2-202-intron2
#> RI:SRSF2-203-exon2;SRSF2-206-intron1
#> rowData names(18): EventName EventType ... is_annotated_IR
#> NMD_direction
#> colnames(6): 02H003 02H025 ... 02H043 02H046
#> colData names(2): condition batch
The NxtSE
object inherits the SummarizedExperiment
object. This means that the functions for SummarizedExperiment can be used on the NxtSE object. These include row and column annotations using the rowData()
and colData()
accessors.
Rows in the NxtSE
object contain information about each alternate splicing event. For example:
head(rowData(se))
#> DataFrame with 6 rows and 18 columns
#> EventName EventType
#> <character> <character>
#> TRA2B/ENST00000453386_Intron8/clean TRA2B/ENST0000045338.. IR
#> TRA2B/ENST00000453386_Intron7/clean TRA2B/ENST0000045338.. IR
#> TRA2B/ENST00000453386_Intron6/clean TRA2B/ENST0000045338.. IR
#> TRA2B/ENST00000453386_Intron5/clean TRA2B/ENST0000045338.. IR
#> TRA2B/ENST00000453386_Intron4/clean TRA2B/ENST0000045338.. IR
#> TRA2B/ENST00000453386_Intron3/clean TRA2B/ENST0000045338.. IR
#> EventRegion intron_id
#> <character> <character>
#> TRA2B/ENST00000453386_Intron8/clean chrZ:1921-2559/- ENST00000453386_Intr..
#> TRA2B/ENST00000453386_Intron7/clean chrZ:2634-3631/- ENST00000453386_Intr..
#> TRA2B/ENST00000453386_Intron6/clean chrZ:3692-5298/- ENST00000453386_Intr..
#> TRA2B/ENST00000453386_Intron5/clean chrZ:5383-6205/- ENST00000453386_Intr..
#> TRA2B/ENST00000453386_Intron4/clean chrZ:6322-7990/- ENST00000453386_Intr..
#> TRA2B/ENST00000453386_Intron3/clean chrZ:8180-9658/- ENST00000453386_Intr..
#> Inc_Is_Protein_Coding Exc_Is_Protein_Coding
#> <logical> <logical>
#> TRA2B/ENST00000453386_Intron8/clean TRUE TRUE
#> TRA2B/ENST00000453386_Intron7/clean TRUE TRUE
#> TRA2B/ENST00000453386_Intron6/clean TRUE TRUE
#> TRA2B/ENST00000453386_Intron5/clean TRUE TRUE
#> TRA2B/ENST00000453386_Intron4/clean TRUE TRUE
#> TRA2B/ENST00000453386_Intron3/clean TRUE TRUE
#> Exc_Is_NMD Inc_Is_NMD Inc_TSL
#> <logical> <logical> <character>
#> TRA2B/ENST00000453386_Intron8/clean FALSE TRUE 1
#> TRA2B/ENST00000453386_Intron7/clean FALSE TRUE 1
#> TRA2B/ENST00000453386_Intron6/clean FALSE TRUE 1
#> TRA2B/ENST00000453386_Intron5/clean FALSE TRUE 1
#> TRA2B/ENST00000453386_Intron4/clean FALSE TRUE 1
#> TRA2B/ENST00000453386_Intron3/clean FALSE TRUE 1
#> Exc_TSL Event1a Event2a
#> <character> <character> <character>
#> TRA2B/ENST00000453386_Intron8/clean 1 chrZ:1921-2559/- NA
#> TRA2B/ENST00000453386_Intron7/clean 1 chrZ:2634-3631/- NA
#> TRA2B/ENST00000453386_Intron6/clean 1 chrZ:3692-5298/- NA
#> TRA2B/ENST00000453386_Intron5/clean 1 chrZ:5383-6205/- NA
#> TRA2B/ENST00000453386_Intron4/clean 1 chrZ:6322-7990/- NA
#> TRA2B/ENST00000453386_Intron3/clean 1 chrZ:8180-9658/- NA
#> Event1b Event2b
#> <character> <character>
#> TRA2B/ENST00000453386_Intron8/clean NA NA
#> TRA2B/ENST00000453386_Intron7/clean NA NA
#> TRA2B/ENST00000453386_Intron6/clean NA NA
#> TRA2B/ENST00000453386_Intron5/clean NA NA
#> TRA2B/ENST00000453386_Intron4/clean NA NA
#> TRA2B/ENST00000453386_Intron3/clean NA NA
#> is_always_first_intron
#> <logical>
#> TRA2B/ENST00000453386_Intron8/clean FALSE
#> TRA2B/ENST00000453386_Intron7/clean FALSE
#> TRA2B/ENST00000453386_Intron6/clean FALSE
#> TRA2B/ENST00000453386_Intron5/clean FALSE
#> TRA2B/ENST00000453386_Intron4/clean FALSE
#> TRA2B/ENST00000453386_Intron3/clean FALSE
#> is_always_last_intron is_annotated_IR
#> <logical> <logical>
#> TRA2B/ENST00000453386_Intron8/clean NA FALSE
#> TRA2B/ENST00000453386_Intron7/clean FALSE FALSE
#> TRA2B/ENST00000453386_Intron6/clean FALSE FALSE
#> TRA2B/ENST00000453386_Intron5/clean FALSE FALSE
#> TRA2B/ENST00000453386_Intron4/clean FALSE TRUE
#> TRA2B/ENST00000453386_Intron3/clean FALSE FALSE
#> NMD_direction
#> <numeric>
#> TRA2B/ENST00000453386_Intron8/clean 1
#> TRA2B/ENST00000453386_Intron7/clean 1
#> TRA2B/ENST00000453386_Intron6/clean 1
#> TRA2B/ENST00000453386_Intron5/clean 1
#> TRA2B/ENST00000453386_Intron4/clean 1
#> TRA2B/ENST00000453386_Intron3/clean 1
Columns contain information about each sample. By default, no annotations are assigned to each sample. These can be assigned as shown above.
Also, NxtSE
objects can be subsetted by rows (ASEs) or columns (samples). This is useful if one wishes to perform analysis on a subset of the dataset, or only on a subset of ASEs (say for example, only skipped exon events). Subsetting is performed just like for SummarizedExperiment
objects:
# Subset by columns: select the first 2 samples
se[,1:2]
se_sample_subset <-
# Subset by rows: select the first 10 ASE events
se[1:10,] se_ASE_subset <-
SpliceWiz offers default filters to identify and remove low confidence alternative splice events (ASEs). Run the default filter using the following:
se[applyFilters(se),]
se.filtered <-#> Running Depth filter
#> Running Participation filter
#> Running Participation filter
#> Running Consistency filter
#> Running Terminus filter
#> Running ExclusiveMXE filter
Analysis
and then Filters
from the menu bar. It should look like this:
Filters - GUI
To load SpliceWiz’s default filters, click the top right button Load Default Filters
. Then to apply these filters to the NxtSE, click Apply Filters
. After the filters have been run, your session should now look like this:
SpliceWiz default filters- GUI
?ASEFilters
Using the limma wrapper ASE_limma()
, perform differential ASE analysis between conditions “A” and “B”:
# Requires limma to be installed:
require("limma")
ASE_limma(
res_limma <-se = se.filtered,
test_factor = "condition",
test_nom = "B",
test_denom = "A",
)
Analysis
and then Differential Expression Analysis
on the menu side bar. It should look something like this (but with empty fields):
Differential Expression GUI
To perform limma-based differential analysis, first ensure Method
is set to limma
. Using the Variable
drop-down box, select condition
. Then, select the Nominator
and Denominator
fields to A
and B
, respectively. Leave the batch factor fields as (none)
. Then, click Perform DE
.
Once differential expression analysis has finished, your session should look like below. The output is a DT-based data table equivalent to the ASE_limma()
function.
Differential Expression (limma) GUI
NB: The interface allows users to choose to sort the results either by nominal or (multiple-testing) adjusted P values
NB2: There are 3 different ways Intron Retention events can be quantified and analysed - see “What are the different ways intron retention is measured?” below for further details.
ASE_limma
uses limma
to model isoform counts as log-normal distributions. Limma is probably the fastest method and is ideal for large datasetsASE_DESeq
uses DESeq2
to model isoform counts as negative binomial distribution. This method is the most computationally expensive, but gives robust results. Time series analysis is also available for this modeASE_DoubleExpSeq
uses the lesser-known CRAN package DoubleExpSeq
. This package uses the beta-binomial distribution to model isoform counts. The method is at least as fast as limma
, but for now it is restricted to analysis between two groups (i.e. batch correction is not implemented)ASE_satuRn
uses the new satuRn
package which models counts using the quasi-binomial distribution.# Requires limma to be installed:
require("limma")
ASE_limma(
res_limma <-se = se.filtered,
test_factor = "condition",
test_nom = "B",
test_denom = "A"
)
# Requires DESeq2 to be installed:
require("DESeq2")
ASE_DESeq(
res_deseq <-se = se.filtered,
test_factor = "condition",
test_nom = "B",
test_denom = "A",
n_threads = 1
)
# Requires DoubleExpSeq to be installed:
require("DoubleExpSeq")
ASE_DoubleExpSeq(
res_DES <-se = se.filtered,
test_factor = "condition",
test_nom = "B",
test_denom = "A"
)
# Requires satuRn to be installed:
require("satuRn")
ASE_satuRn(
res_saturn <-se = se.filtered,
test_factor = "condition",
test_nom = "B",
test_denom = "A",
n_threads = 1
)
The first (and preferred) approach is using IR-ratio. We presume that every intron is potentially retained (thus ignoring annotation). Given this results in many overlapping introns, SpliceWiz adjusts for this via the following:
makeSE()
step.SpliceOver
method in SpliceWiz). Alternately, users can choose to use IRFinder’s SpliceMax
method, summing junction reads that share either splice junction with the intron of interest. This choice is also made by the user at the makeSE()
step.At the differential analysis step, users can choose the following:
IRmode = "all"
- all introns are potentially retained, use IR-ratio to quantify IR (EventType = "IR"
)IRmode = "annotated"
- only annotated retained intron events are considered, but use IR-ratio to quantify IR (EventType = "IR"
)IRmode = "annotated_binary"
- only annotated retained intron events are considered, use PSI to quantify IR - which considers the IR-transcript and the transcript isoform with the exactly-spliced intron as binary alternatives. Splicing of overlapping introns are not considered in PSI quantitation. ASE_limma(
res_limma_allIntrons <-se = se.filtered,
test_factor = "condition",
test_nom = "B",
test_denom = "A",
IRmode = "all"
)
ASE_limma(
res_limma_annotatedIR <-se = se.filtered,
test_factor = "condition",
test_nom = "B",
test_denom = "A",
IRmode = "annotated"
)
ASE_limma(
res_limma_annotated_binaryIR <-se = se.filtered,
test_factor = "condition",
test_nom = "B",
test_denom = "A",
IRmode = "annotated_binary"
)
<>
ASE_limma
, ASE_DESeq
and ASE_satuRn
can accept up to 2 categories of batches from which to normalize. For example, to normalize the analysis by the batch
category, one would run:
require("limma")
ASE_limma(
res_limma_batchnorm <-se = se.filtered,
test_factor = "condition",
test_nom = "B",
test_denom = "A",
batch1 = "batch"
)
test_nom
and test_denom
parameters. As long as the test_factor
contains numeric values, ASE_DESeq
will treat it as a continuous variable. See the following example:
colData(se.filtered)$timevar <- rep(c(0,1,2), 2)
require("DESeq2")
ASE_DESeq(
res_deseq_cont <-se = se.filtered,
test_factor = "timevar"
)
<>
Volcano plots show changes in PSI levels (log fold change, x axis) against statistical significance (-log10 p values, y axis):
library(ggplot2)
ggplot(res_limma,
aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point() +
labs(title = "Differential analysis - B vs A",
x = "Log2-fold change", y = "BH-adjusted P values (-log10)")
ggplot2
’s facet_wrap
function to separately plot volcanos for each modality of ASE. The type of ASE is contained in the EventType
column of the differential results data frame.
ggplot(res_limma,
aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point() + facet_wrap(vars(EventType)) +
labs(title = "Differential analysis - B vs A",
x = "Log2-fold change", y = "BH-adjusted P values (-log10)")
Display
and then Volcano Plot
. The default volcano plot (top 1000 ASEs) will be automatically generated as a plotly object:
Volcano Plot - GUI
You can customize this volcano plot using the following controls:
Sort by Adjusted P Values
in the differential analysis interface.Splice Type
s. The volcano plot will be filtered by the selected ASE modalitiesON
.NMD Mode
is ON
, the horizontal axis represents whether splicing is shifted towards (positive values) or away from (negative values) a NMD substrate transcript
Scatter plots are useful for showing splicing levels (percent-spliced-in, PSI) between two conditions. The results from differential analysis contains these values and can be plotted:
library(ggplot2)
ggplot(res_limma, aes(x = 100 * AvgPSI_B, y = 100 * AvgPSI_A)) +
geom_point() + xlim(0, 100) + ylim(0, 100) +
labs(title = "PSI values across conditions",
x = "PSI of condition B", y = "PSI of condition A")
Display
and then Scatter Plot
.
You can customize the scatter plot using the following controls:
Sort by Adjusted P Values
in the differential analysis interface.Splice Type
s. The scatter plot will be filtered by the selected ASE modalitiesVariable
refers to the annotation categoryX-axis
and Y-axis
dropdowns refers to the experimental conditions for which to contrast the average PSI values.NMD Mode
is ON
, the PSI values are altered such that they represent the inclusion values of the NMD substrate (instead of the “included” isoform)For now, change the Variable
to condition
, X-axis condition
to A
and Y-axis condition
to B
. A scatter plot should be automatically generated as follows:
Scatter plot - GUI
Volcano plot - selecting ASEs of interest via the GUI
These ASEs of interest will then be highlighted in other plots, for example scatter plot:
Scatter plot - highlighted ASE events
makeMeanPSI()
function that can generate mean PSI values for each condition of a condition category. For example, the below code will calculate the mean PSIs of each “batch” of this example experiment:
makeMeanPSI(
meanPSIs <-se = se,
condition = "batch",
conditionList = list("K", "L", "M")
)
Heatmaps are useful for visualizing differential expression of individual samples, as well as potential patterns of expression.
First, obtain a matrix of PSI values:
# Create a matrix of values of the top 10 differentially expressed events:
makeMatrix(
mat <-
se.filtered,event_list = res_limma$EventName[1:10],
method = "PSI"
)
makeMatrix()
work?makeMatrix()
provides a matrix of PSI values from the given NxtSE
object. The parameters event_list
and sample_list
allows subsetting for ASEs and/or samples, respectively.
The parameter method
accepts 3 options:
"PSI"
: outputs raw PSI values"logit"
: outputs logit PSI values"Z-score"
: outputs Z-score transformed PSI valuesAlso, makeMatrix()
facilitates exclusion of low confidence PSI values. These can occur when counts of both isoforms are too low. Setting the depth_threshold
(default 10
) will set samples with total isoform count below this value to be converted to NA
.
NA
values are filtered out. Setting the parameter na.percent.max
(default 0.1
) means any ASE with the proportion of NA
above this threshold will be removed from the final matrix.
Plot this matrix of values in a heatmap:
library(pheatmap)
as.data.frame(colData(se.filtered))
anno_col_df <- anno_col_df[, 1, drop=FALSE]
anno_col_df <-
pheatmap(mat, annotation_col = anno_col_df)
Display
, then Heatmap
in the menu side bar. A heatmap will be automatically generated:
Heatmap - GUI
The heatmap can be customized as follows:
Top All Results
- unfiltered results, Top Filtered Results
- results filtered using the Differential Expression interactive DT data table, or Top Selected Results
- results selected using the interactive box or lasso select tools in the interactive volcano and/or scatter plots.
processBAM()
. This data is saved in “COV” files, which is a BGZF compressed and indexed file. COV files show compression and performance gains over BigWig files.
Additionally, SpliceWiz visualizes plots group-averaged coverages, based on user-defined experimental conditions. This is a powerful tool to illustrate group-specific differential splicing or IR. SpliceWiz does this by normalising the coverage depths of each sample based on transcript depth at the splice junction / intron of interest. By doing so, the coverage depths of constitutively expressed flanking exons are normalised to unity. As a result, the intron depths reflect the fraction of transcripts with retained introns and can be compared across samples.
First, lets obtain a list of differential events with delta PSI > 5%:
subset(res_limma, abs(deltaPSI) > 0.05) res_limma.filtered <-
Plot up to 4 individual samples using plotCoverage()
:
plotCoverage(
p <-se = se,
Event = res_limma.filtered$EventName[1],
tracks = colnames(se)[c(1,2,4,5)],
)
if(interactive()) {
# Display as plotly object
$final_plot
pelse {
} # Display as ggplot
as_ggplot_cov(p)
}
Sashimi plots, first popularised by MISO, are coverage plots with arcs that denote split reads. They are each labelled by a number denoting the number of split reads spanning the corresponding junctions.
In SpliceWiz (>=0.99.4) junctions can be plotted by setting plotJunctions = TRUE
from within plotCoverage()
plotCoverage(
p <-se = se,
Event = res_limma.filtered$EventName[1],
tracks = colnames(se)[c(1,2,4,5)],
plotJunctions = TRUE
)#> Warning in geom_line(data = dfJn, aes_string(x = "x", y = "yarc", group = "junction", : Ignoring unknown aesthetics: label
#> Ignoring unknown aesthetics: label
#> Ignoring unknown aesthetics: label
#> Ignoring unknown aesthetics: label
if(interactive()) {
# Display as plotly object
$final_plot
pelse {
} # Display as ggplot
as_ggplot_cov(p)
}
selected_transcripts
parameter of the plotCoverage()
function to only display the specified transcripts. Selected transcripts can be supplied by name or ID (transcript_name
and transcript_id
field in the GTF file, respectively)
plotCoverage(
p <-se = se,
Event = res_limma.filtered$EventName[1],
tracks = colnames(se)[c(1,2,4,5)],
selected_transcripts = c("NSUN5-201", "NSUN5-203", "NSUN5-206")
)
if(interactive()) {
# Display as plotly object
$final_plot
pelse {
} # Display as ggplot
as_ggplot_cov(p)
}
We plot average coverages of groups “A” and “B” of the “condition” category, as follows:
plotCoverage(
p <-se = se,
Event = res_limma.filtered$EventName[1],
tracks = c("A", "B"),
condition = "condition",
plotJunctions = TRUE,
)#> Warning in geom_line(data = dfJn, aes_string(x = "x", y = "yarc", group = "junction", : Ignoring unknown aesthetics: label
#> Ignoring unknown aesthetics: label
if(interactive()) {
# Display as plotly object
$final_plot
pelse {
} # Display as ggplot
as_ggplot_cov(p)
}
Note that junctions are labelled not by their counts (which are different between replicates of a group, and are dependent on gene expression and sequencing depth). Instead, SpliceWiz labels junction arcs by the mean +/- sd of “provisional PSI”. These are PSI estimates calculated for each sample, based on the ratio of the junction read count as a proportion of total spliced read counts that share either exon cluster with that of the assessed junction.
Differences in group-normalized coverage are best visualized when the traces are stacked together in the same track. To do this, set stack_tracks = TRUE
from within plotCoverage()
. Note this is only enabled when plotting group-normalized coverage plots (i.e., when condition
is set).
plotCoverage(
p <-se = se,
Event = res_limma.filtered$EventName[1],
tracks = c("A", "B"),
condition = "condition",
stack_tracks = TRUE,
)
if(interactive()) {
# Display as plotly object
$final_plot
pelse {
} # Display as ggplot
as_ggplot_cov(p)
}
The shaded ribbons represent the uncertainty of the coverage. This is denoted by the ribbon_mode
parameter of the plotCoverage() command. The default is standard deviation sd
. Alternatively, users can choose the 95% confidence interval ci
, or the standard error of the mean sem
.
This feature is important for two reasons:
Click on Display
and then Coverage plot
from the menu side bar. It should look like this:
Coverage Plots - GUI
Coverage plots can be customized as follows:
View
by condition
OFF
or collapsed combined transcripts ON
is displayedsessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.6 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] pheatmap_1.0.12 ggplot2_3.4.1
#> [3] satuRn_1.6.0 DoubleExpSeq_1.1
#> [5] DESeq2_1.38.3 SummarizedExperiment_1.28.0
#> [7] Biobase_2.58.0 MatrixGenerics_1.10.0
#> [9] matrixStats_0.63.0 GenomicRanges_1.50.2
#> [11] GenomeInfoDb_1.34.9 IRanges_2.32.0
#> [13] S4Vectors_0.36.2 limma_3.54.2
#> [15] fstcore_0.9.14 AnnotationHub_3.6.0
#> [17] BiocFileCache_2.6.1 dbplyr_2.3.2
#> [19] BiocGenerics_0.44.0 SpliceWiz_1.0.4
#> [21] NxtIRFdata_1.4.0
#>
#> loaded via a namespace (and not attached):
#> [1] lazyeval_0.2.2 shinydashboard_0.7.2
#> [3] splines_4.2.3 crosstalk_1.2.0
#> [5] BiocParallel_1.32.6 digest_0.6.31
#> [7] foreach_1.5.2 ca_0.71.1
#> [9] htmltools_0.5.5 viridis_0.6.2
#> [11] fansi_1.0.4 magrittr_2.0.3
#> [13] memoise_2.0.1 BSgenome_1.66.3
#> [15] shinyFiles_0.9.3 Biostrings_2.66.0
#> [17] annotate_1.76.0 R.utils_2.12.2
#> [19] prettyunits_1.1.1 colorspace_2.1-0
#> [21] blob_1.2.4 rappdirs_0.3.3
#> [23] xfun_0.38 dplyr_1.1.1
#> [25] crayon_1.5.2 RCurl_1.98-1.12
#> [27] jsonlite_1.8.4 genefilter_1.80.3
#> [29] survival_3.5-5 iterators_1.0.14
#> [31] glue_1.6.2 registry_0.5-1
#> [33] gtable_0.3.3 zlibbioc_1.44.0
#> [35] XVector_0.38.0 webshot_0.5.4
#> [37] DelayedArray_0.24.0 Rhdf5lib_1.20.0
#> [39] HDF5Array_1.26.0 scales_1.2.1
#> [41] DBI_1.1.3 Rcpp_1.0.10
#> [43] progress_1.2.2 viridisLite_0.4.1
#> [45] xtable_1.8-4 bit_4.0.5
#> [47] DT_0.27 htmlwidgets_1.6.2
#> [49] httr_1.4.5 RColorBrewer_1.1-3
#> [51] ellipsis_0.3.2 farver_2.1.1
#> [53] pkgconfig_2.0.3 XML_3.99-0.14
#> [55] R.methodsS3_1.8.2 sass_0.4.5
#> [57] locfit_1.5-9.7 utf8_1.2.3
#> [59] labeling_0.4.2 tidyselect_1.2.0
#> [61] rlang_1.1.0 later_1.3.0
#> [63] AnnotationDbi_1.60.2 munsell_0.5.0
#> [65] BiocVersion_3.16.0 tools_4.2.3
#> [67] cachem_1.0.7 cli_3.6.1
#> [69] generics_0.1.3 RSQLite_2.3.0
#> [71] evaluate_0.20 fastmap_1.1.1
#> [73] heatmaply_1.4.2 yaml_2.3.7
#> [75] ompBAM_1.2.0 fs_1.6.1
#> [77] knitr_1.42 bit64_4.0.5
#> [79] purrr_1.0.1 KEGGREST_1.38.0
#> [81] dendextend_1.17.1 egg_0.4.5
#> [83] pbapply_1.7-0 sparseMatrixStats_1.10.0
#> [85] mime_0.12 rhandsontable_0.3.8
#> [87] R.oo_1.25.0 compiler_4.2.3
#> [89] plotly_4.10.1 filelock_1.0.2
#> [91] curl_5.0.0 png_0.1-8
#> [93] interactiveDisplayBase_1.36.0 geneplotter_1.76.0
#> [95] tibble_3.2.1 bslib_0.4.2
#> [97] highr_0.10 lattice_0.20-45
#> [99] Matrix_1.5-3 vctrs_0.6.1
#> [101] pillar_1.9.0 lifecycle_1.0.3
#> [103] rhdf5filters_1.10.1 BiocManager_1.30.20
#> [105] jquerylib_0.1.4 data.table_1.14.8
#> [107] bitops_1.0-7 seriation_1.4.2
#> [109] httpuv_1.6.9 rtracklayer_1.58.0
#> [111] R6_2.5.1 BiocIO_1.8.0
#> [113] promises_1.2.0.1 TSP_1.2-3
#> [115] gridExtra_2.3 codetools_0.2-19
#> [117] boot_1.3-28.1 assertthat_0.2.1
#> [119] rhdf5_2.42.0 rjson_0.2.21
#> [121] withr_2.5.0 shinyWidgets_0.7.6
#> [123] GenomicAlignments_1.34.1 Rsamtools_2.14.0
#> [125] locfdr_1.1-8 GenomeInfoDbData_1.2.9
#> [127] hms_1.1.3 parallel_4.2.3
#> [129] fst_0.9.8 grid_4.2.3
#> [131] tidyr_1.3.0 rmarkdown_2.21
#> [133] DelayedMatrixStats_1.20.0 numDeriv_2016.8-1.1
#> [135] shiny_1.7.4 restfulr_0.0.15