library(sangeranalyseR)
sangeranalyseR is an R package that provides fast, flexible, and reproducible workflows for assembling Sanger sequencing data into contigs. It is a free, open-source alternative to Geneious, CodonCode Aligner, and Phred-Phrap-Consed. The full reference manual is on the ReadTheDocs site; this vignette focuses on the recipes most users actually need: how to call the constructors, what each parameter does, and how to interpret the output.
The package is built around three S4 classes that form a containment hierarchy:
SangerAlignment            ← a set of contigs aligned to each other
└── SangerContig           ← one assembled contig (forward + reverse reads)
    └── SangerRead         ← one ABIF or FASTA read
Every recipe uses the bundled Allolobophora chlorotica ABIF fixture (8 reads
arranged into 4 forward+reverse pairs). The system.file() call below works
from any installed copy of the package.
ab1_dir <- system.file("extdata", "Allolobophora_chlorotica", "ACHLO",
package = "sangeranalyseR")
list.files(ab1_dir, pattern = "\\.ab1$")
## [1] "Achl_ACHLO006-09_1_F.ab1" "Achl_ACHLO006-09_2_R.ab1"
## [3] "Achl_ACHLO007-09_1_F.ab1" "Achl_ACHLO007-09_2_R.ab1"
## [5] "Achl_ACHLO040-09_1_F.ab1" "Achl_ACHLO040-09_2_R.ab1"
## [7] "Achl_ACHLO041-09_1_F.ab1" "Achl_ACHLO041-09_2_R.ab1"
sc <- SangerContig(
inputSource = "ABIF",
processMethod = "REGEX",
ABIF_Directory = ab1_dir,
contigName = "Achl_ACHLO006-09",
REGEX_SuffixForward = "_[0-9]*_F\\.ab1$",
REGEX_SuffixReverse = "_[0-9]*_R\\.ab1$",
TrimmingMethod = "M1",
M1TrimmingCutoff = 0.0001
)
sc@objectResults@creationResult # TRUE
length(sc@forwardReadList) # 1 forward read
length(sc@reverseReadList) # 1 reverse read
as.character(sc@contigSeq) # the consensus sequence
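To see how the two suffix patterns partition a directory, the matching can be sketched in base R. This only illustrates the regex logic and is not the package's internal code; the variable names (`files`, `is_fwd`, `is_rev`, `contig_of`) are made up for the example, and the rule that the contig label is "whatever precedes the suffix" is an assumption.

```r
# File names from the bundled fixture, hard-coded so the sketch runs
# without the package installed.
files <- c("Achl_ACHLO006-09_1_F.ab1", "Achl_ACHLO006-09_2_R.ab1",
           "Achl_ACHLO007-09_1_F.ab1", "Achl_ACHLO007-09_2_R.ab1",
           "Achl_ACHLO040-09_1_F.ab1", "Achl_ACHLO040-09_2_R.ab1",
           "Achl_ACHLO041-09_1_F.ab1", "Achl_ACHLO041-09_2_R.ab1")

fwd_suffix <- "_[0-9]*_F\\.ab1$"
rev_suffix <- "_[0-9]*_R\\.ab1$"

# Which files does each pattern claim?
is_fwd <- grepl(fwd_suffix, files)
is_rev <- grepl(rev_suffix, files)

# Assumption: the contig label is the file name with the suffix removed.
contig_of <- sub(fwd_suffix, "", sub(rev_suffix, "", files))

# One forward and one reverse read per contig label.
table(contig_of, direction = ifelse(is_fwd, "F", "R"))
```

Running a check like this on your own file names before calling the constructor is a cheap way to catch a pattern that matches nothing or matches both directions.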
sa <- SangerAlignment(
inputSource = "ABIF",
processMethod = "REGEX",
ABIF_Directory = ab1_dir,
REGEX_SuffixForward = "_[0-9]*_F\\.ab1$",
REGEX_SuffixReverse = "_[0-9]*_R\\.ab1$",
TrimmingMethod = "M1"
)
length(sa@contigList) # 4 contigs
length(sa@contigsConsensus) # cross-contig consensus length
When your filenames don’t follow a clean _F.ab1 / _R.ab1 convention, supply
a reads,direction,contig CSV that explicitly maps every read:
csv_path <- system.file("extdata", "ab1", "SangerAlignment",
"names_conversion.csv", package = "sangeranalyseR")
sa_csv <- SangerAlignment(
inputSource = "ABIF",
processMethod = "CSV",
ABIF_Directory = ab1_dir,
CSV_NamesConversion = csv_path
)
Phase-15 fix: contig labels in the CSV no longer have to appear as substrings
of filenames. The CSV’s reads column drives the lookup directly.
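A mapping file of this shape can be written with base R. The sketch below is illustrative: the contig labels (`sampleA`, `sampleB`) and the output file name are hypothetical, and the exact header the package expects should be confirmed against the bundled names_conversion.csv.

```r
# Hypothetical mapping for two contigs. The labels deliberately do not
# appear in the file names, which the CSV route is meant to allow.
mapping <- data.frame(
  reads     = c("Achl_ACHLO006-09_1_F.ab1", "Achl_ACHLO006-09_2_R.ab1",
                "Achl_ACHLO007-09_1_F.ab1", "Achl_ACHLO007-09_2_R.ab1"),
  direction = c("F", "R", "F", "R"),
  contig    = c("sampleA", "sampleA", "sampleB", "sampleB"),
  stringsAsFactors = FALSE
)

my_csv <- file.path(tempdir(), "names_conversion_example.csv")
write.csv(mapping, my_csv, row.names = FALSE)

# Read it back and run two sanity checks: valid directions, no read listed twice.
check <- read.csv(my_csv, stringsAsFactors = FALSE)
stopifnot(all(check$direction %in% c("F", "R")),
          !anyDuplicated(check$reads))
```

The resulting path could then be passed as `CSV_NamesConversion`.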
Forward-only (or reverse-only) datasets are common in 16S barcoding and short-read survey pipelines. Pass NULL (or
NA_character_) for the missing-direction suffix and set minReadsNum = 1 so
each surviving read can become its own contig:
sa_fwd <- SangerAlignment(
inputSource = "ABIF",
processMethod = "REGEX",
ABIF_Directory = ab1_dir,
REGEX_SuffixForward = "_[0-9]*_F\\.ab1$",
REGEX_SuffixReverse = NULL, # explicit forward-only (Phase-15)
minReadsNum = 1
)
Two trimming algorithms are available, selected by TrimmingMethod:
| Method | Algorithm | Parameters |
|---|---|---|
| "M1" | Modified Mott’s (Phred/Phrap-style cumulative) | M1TrimmingCutoff (probability; default 0.0001) |
| "M2" | Sliding-window mean Phred (Trimmomatic-style) | M2CutoffQualityScore, M2SlidingWindowSize |
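The idea behind Mott-style cumulative trimming can be sketched in a few lines of base R. This is a minimal illustration of the general technique (score each base as cutoff minus its error probability, then keep the highest-scoring run), not the package's implementation; `mott_trim` is a hypothetical helper, and the example uses a cutoff of 0.01 rather than the 0.0001 default so that the toy numbers are easy to follow.

```r
# Hypothetical helper: returns the 1-based start/end of the retained region.
mott_trim <- function(phred, cutoff = 0.0001) {
  err   <- 10^(-phred / 10)        # Phred score -> error probability
  score <- cutoff - err            # positive only for bases better than cutoff
  running <- numeric(length(score))
  acc <- 0
  for (i in seq_along(score)) {
    acc <- max(0, acc + score[i])  # cumulative score, reset when it goes negative
    running[i] <- acc
  }
  if (all(running == 0)) {
    return(c(start = NA_integer_, end = NA_integer_))  # nothing survives
  }
  end   <- which.max(running)                          # peak of the running sum
  zeros <- which(running[seq_len(end)] == 0)           # last reset before the peak
  start <- if (length(zeros)) max(zeros) + 1L else 1L
  c(start = start, end = end)
}

# Two poor bases, four good ones, two poor ones.
phred <- c(5, 5, 40, 40, 40, 40, 5, 5)
res <- mott_trim(phred, cutoff = 0.01)
res
```

Lowering the cutoff makes more bases score negative, so the retained region shrinks; that is the sense in which a tighter cutoff trims more aggressively.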
Tighter trimming for noisy data:
sa_strict <- SangerAlignment(
inputSource = "ABIF",
processMethod = "REGEX",
ABIF_Directory = ab1_dir,
REGEX_SuffixForward = "_[0-9]*_F\\.ab1$",
REGEX_SuffixReverse = "_[0-9]*_R\\.ab1$",
TrimmingMethod = "M2",
M2CutoffQualityScore = 30,
M2SlidingWindowSize = 15,
minReadLength = 50 # post-trim length floor
)
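For intuition about what the two M2 parameters do, here is a base-R sketch of a sliding-window mean-Phred trim. It is an assumption-laden illustration, not the package's code: `sliding_window_trim` is a hypothetical helper, and the rule used here (start at the first window whose mean reaches the cutoff, stop at the end of the last passing window in that run) is one reasonable reading of "Trimmomatic-style".

```r
# Hypothetical helper: returns the 1-based start/end of the retained region.
sliding_window_trim <- function(phred, cutoff = 20, window = 10) {
  n <- length(phred)
  none <- c(start = NA_integer_, end = NA_integer_)
  if (n < window) return(none)

  # Mean Phred of every window, computed from a cumulative sum.
  csum     <- c(0, cumsum(phred))
  starts   <- seq_len(n - window + 1L)
  win_mean <- (csum[starts + window] - csum[starts]) / window

  ok <- win_mean >= cutoff
  if (!any(ok)) return(none)

  first <- which(ok)[1]                    # first window that passes
  fail  <- which(!ok & starts > first)     # failing windows after it
  end   <- if (length(fail)) fail[1] + window - 2 else n
  c(start = as.integer(first), end = as.integer(end))
}

# Five poor bases, twenty good ones, five poor ones.
phred <- c(rep(5, 5), rep(40, 20), rep(5, 5))
res <- sliding_window_trim(phred, cutoff = 30, window = 5)
res
```

Because the decision is made on window means, a low-quality base can survive at either edge of the retained region; a larger window smooths more and a higher cutoff trims more.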
Phase-16 added a defensive width filter: any read trimmed to < 2 bp is
dropped before alignment with a MIN_READ_LENGTH_DEFENSIVE_DROP warning,
so DECIPHER::AlignSeqs never crashes on degenerate inputs.
Forward and reverse reads with poor overlap can silently produce a consensus riddled with IUPAC ambiguity codes. Phase-16 adds an opt-in alignment-quality check:
sa_overlap <- SangerAlignment(
inputSource = "ABIF",
processMethod = "REGEX",
ABIF_Directory = ab1_dir,
REGEX_SuffixForward = "_[0-9]*_F\\.ab1$",
REGEX_SuffixReverse = "_[0-9]*_R\\.ab1$",
minOverlapBases = 50L, # warn if any pairwise overlap < 50 bp
minOverlapFraction = 0.05 # or < 5% of the shorter read
)
When triggered, you’ll see LOW_OVERLAP_WARN in the log alongside the offending
read pair.
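The two thresholds can be made concrete with a small base-R sketch that counts non-gap overlap between two aligned reads. `pairwise_overlap` is a hypothetical helper written for this illustration; how the package computes overlap internally may differ.

```r
# Hypothetical helper: a and b are rows of an alignment ("-" marks a gap).
pairwise_overlap <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))
  both    <- sum(x != "-" & y != "-")          # columns where both reads have a base
  shorter <- min(sum(x != "-"), sum(y != "-")) # ungapped length of the shorter read
  c(bases = both, fraction = both / shorter)
}

fwd_row <- "ACGTACGTAC------"
rev_row <- "------GTACGGTTAA"
ov <- pairwise_overlap(fwd_row, rev_row)
ov

# Same decision rule as the comments in the call above: flag the pair if
# either threshold is missed.
low_overlap <- ov[["bases"]] < 50 || ov[["fraction"]] < 0.05
low_overlap
```

Here the pair overlaps in only 4 columns, so the 50 bp floor flags it even though the fractional overlap is comfortable.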
Three modes are exposed via consensusMethod:
| Mode | Behaviour | Per-position quality |
|---|---|---|
| "strict" (default) | DECIPHER’s ConsensusSequence with IUPAC ambiguity codes for disagreements. | not provided |
| "majority" | Per-column plurality vote; ties break alphabetically. No IUPAC codes ever appear in the output. | synthetic Phred |
| "quality_weighted" | Same as majority but votes are weighted by source-read Phred. Alias: qualityAware = TRUE. | mean Phred of agreers |
sc_majority <- SangerContig(
inputSource = "ABIF",
processMethod = "REGEX",
ABIF_Directory = ab1_dir,
contigName = "Achl_ACHLO006-09",
REGEX_SuffixForward = "_[0-9]*_F\\.ab1$",
REGEX_SuffixReverse = "_[0-9]*_R\\.ab1$",
consensusMethod = "majority"
)
as.character(sc_majority@contigSeq) # plain ACGT, no IUPAC codes
attr(sc_majority@contigSeq, "qualityScores") # per-position synthetic Phred
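The difference between a plain plurality vote and a Phred-weighted one is easy to show on toy data. The base-R sketch below follows the behaviour described in the table (per-column vote, alphabetical tie-break, optional weighting by Phred); `vote_consensus` is a hypothetical helper, not a function exported by the package.

```r
# Hypothetical helper. `aligned` holds equal-length alignment rows;
# `phred` is an optional reads-by-columns matrix of quality scores.
vote_consensus <- function(aligned, phred = NULL) {
  mat <- do.call(rbind, strsplit(aligned, ""))
  w <- if (is.null(phred)) matrix(1, nrow(mat), ncol(mat)) else phred
  calls <- vapply(seq_len(ncol(mat)), function(j) {
    keep <- mat[, j] != "-"                         # gaps do not vote
    if (!any(keep)) return("-")
    tally <- tapply(w[keep, j], mat[keep, j], sum)  # votes (or summed Phred) per base
    sort(names(tally)[tally == max(tally)])[1]      # ties break alphabetically
  }, character(1))
  paste(calls, collapse = "")
}

# Plurality: column 2 is C,C,T and column 4 is T,A,A.
vote_consensus(c("ACGT", "ACGA", "ATGA"))

# A 1:1 tie at column 2 resolves to the alphabetically first base.
tied <- c("ACGT", "ATGT")
vote_consensus(tied)

# With weights, the confident T (Phred 40) beats the weak C (Phred 10).
q <- rbind(c(40, 10, 40, 40),
           c(40, 40, 40, 40))
vote_consensus(tied, phred = q)
```

Neither variant can emit an IUPAC code, which is the practical reason to prefer them when a strict consensus comes out heavily ambiguous.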
ABIF files store per-base trace amplitudes for the four channels (A, C, G, T)
and per-base Phred quality scores in the PCON.2 data block. sangeranalyseR
re-runs the base-calling step on those raw traces using the
MakeBaseCallsInside helper:
- […] getpeaks).
- […] signalRatioCutoff (default 0.33; secondary peaks below that fraction of the primary peak are dropped).
- […] abifRawData@data$PCON.2 (one entry per detected peak).

Visualize either as a static PDF (chromatogram_overwrite()) or an interactive
WebGL widget (chromatogram_plotly(), added in Phase-8):
sr <- sa@contigList[[1]]@forwardReadList[[1]]
chromatogram_plotly(sr, max_points = 8000, showtrim = TRUE)
For very long traces (> 50 k points) the widget downsamples by uniform stride
to keep the browser responsive; the original / rendered point counts are
reported via attr(p, "downsample_info").
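Uniform-stride downsampling itself is simple; a base-R sketch under stated assumptions follows. `downsample_stride` is a hypothetical helper, and the shape of the attached `downsample_info` attribute here is a guess at a convenient layout rather than the widget's actual structure.

```r
# Hypothetical helper: keep every k-th point so that at most
# `max_points` remain. Assumes length(y) >= 1.
downsample_stride <- function(y, max_points = 8000L) {
  n      <- length(y)
  stride <- max(1, ceiling(n / max_points))
  idx    <- seq(1, n, by = stride)
  structure(y[idx],
            downsample_info = c(original = n, rendered = length(idx)))
}

trace <- sin(seq(0, 50, length.out = 100000))  # stand-in for one trace channel
small <- downsample_stride(trace)
attr(small, "downsample_info")
```

A fixed stride preserves the x-axis spacing but can clip narrow peaks, so a downsampled trace suits navigation better than judging individual base calls.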
Each SangerRead exposes:
- @primarySeq: the strongest base at each position (DNAString).
- @secondarySeq: the second-strongest base at each position (DNAString).
- @signalRatioCutoff (in @ChromatogramParam): the threshold below which a secondary peak is dropped.

To inspect secondary peaks within a contig alignment, look at
sc@secondaryPeakDF, which since Phase-3 reports one row per ambiguous column:
head(sc@secondaryPeakDF)
Re-run base-calling with a tighter (or looser) cutoff:
sr_re <- MakeBaseCalls(sr, signalRatioCutoff = 0.22)
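The effect of the cutoff on the secondary call can be shown on a toy amplitude matrix. This base-R sketch assumes one row per detected peak and one column per channel, and assumes the secondary call falls back to the primary base when the second peak is dropped; `call_two_peaks` is a hypothetical helper, not the package's base-caller.

```r
# Hypothetical helper: `amps` is a peaks-by-channels matrix with columns A, C, G, T.
call_two_peaks <- function(amps, cutoff = 0.33) {
  bases <- colnames(amps)
  rows  <- seq_len(nrow(amps))
  ord   <- t(apply(amps, 1, order, decreasing = TRUE))  # channel ranks per peak
  top    <- amps[cbind(rows, ord[, 1])]                 # strongest amplitude
  second <- amps[cbind(rows, ord[, 2])]                 # runner-up amplitude
  ratio  <- ifelse(top > 0, second / top, 0)
  primary   <- bases[ord[, 1]]
  # Assumption: a dropped secondary peak is reported as the primary base.
  secondary <- ifelse(ratio >= cutoff, bases[ord[, 2]], primary)
  c(primary   = paste(primary, collapse = ""),
    secondary = paste(secondary, collapse = ""))
}

amps <- rbind(c(100,  10,  5,  0),   # clean A
              c( 20, 100, 50,  0),   # C with a G shoulder at 50%
              c(  0,  30,  0, 90))   # T with a C shoulder at 33.3%
colnames(amps) <- c("A", "C", "G", "T")

loose  <- call_two_peaks(amps)                 # cutoff 0.33
strict <- call_two_peaks(amps, cutoff = 0.6)   # both shoulders are dropped
loose
strict
```

Positions where the two strings differ are the candidate heterozygous or mixed-template sites; raising the cutoff removes the weaker ones first.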
launchApp(sa) # works on SangerAlignment or SangerContig
The app provides per-read trimming sliders, a contig overview, an alignment browser, and FASTA / HTML report export. Phase-8 also added a lightweight gadget for batch trimming across a whole alignment:
sa2 <- globalTrimApp(sa) # opens a Shiny dialog; returns the re-trimmed SA
out_dir <- tempdir()
writeFasta(sa, outputDir = out_dir) # SR / SC / SA dispatcher
generateReport(sa, outputDir = out_dir) # HTML report (requires pandoc)
Phase-8 fix: reports now correctly populate the per-frame AA tables under the
default lazyAA = TRUE constructor mode (previously the tables silently
rendered empty).
Every recipe above is built on three S4 constructors. The full parameter list
is in ?SangerAlignment / ?SangerContig / ?SangerRead; the most-asked-about
groups are summarised below.
| Parameter | What it controls |
|---|---|
| inputSource | "ABIF" (raw chromatograms) or "FASTA" (pre-called sequences). |
| processMethod | "REGEX" (group reads by filename suffix) or "CSV" (explicit reads,direction,contig mapping). |
| ABIF_Directory | Path to the directory of .ab1 files. Required for inputSource = "ABIF". |
| FASTA_File | Path to a single FASTA file. Required for inputSource = "FASTA". |
| REGEX_SuffixForward | A regex matched against forward-read filenames, e.g. "_F\\.ab1$". Pass NULL for reverse-only. |
| REGEX_SuffixReverse | A regex matched against reverse-read filenames, e.g. "_R\\.ab1$". Pass NULL for forward-only. |
| contigName | (SangerContig only) The label / prefix shared by reads in this contig. |
| CSV_NamesConversion | Path to a CSV with three columns: reads, direction (F/R), contig. Required for processMethod = "CSV". |
| Parameter | When used | Default | Notes |
|---|---|---|---|
| TrimmingMethod | always | "M1" | "M1" (modified Mott) or "M2" (sliding window). |
| M1TrimmingCutoff | TrimmingMethod = "M1" | 0.0001 | Cumulative probability cutoff. Tighter = more aggressive trim. |
| M2CutoffQualityScore | TrimmingMethod = "M2" | 20 | Mean Phred threshold within the sliding window. |
| M2SlidingWindowSize | TrimmingMethod = "M2" | 10 | Width of the sliding window in bp. |
| minReadLength | always | 20L | Reads trimmed to less than this are dropped from the contig. |
| signalRatioCutoff | inputSource = "ABIF" | 0.33 | Secondary peaks below this fraction of the primary peak are dropped. |
| Parameter | Default | Notes |
|---|---|---|
| consensusMethod | "strict" | "strict" (DECIPHER+IUPAC), "majority" (plurality vote, no IUPAC), "quality_weighted" (Phred-weighted). |
| qualityAware | FALSE | Shorthand for consensusMethod = "quality_weighted". |
| minFractionCall | 0.5 | DECIPHER minInformation for "strict" mode. |
| maxFractionLost | 0.5 | DECIPHER threshold for "strict" mode. |
| minOverlapBases | 0L | If > 0, log LOW_OVERLAP_WARN when smallest pairwise non-gap overlap < this. |
| minOverlapFraction | 0.0 | Same in fractional terms (overlap as a fraction of the shorter read). |
| alignSeqsParams | list() | Extra named args forwarded to DECIPHER::AlignSeqs (e.g. list(iterations = 1L, refinements = 1L)). |
| Parameter | Default | Notes |
|---|---|---|
| processorsNum | 1 | Legacy integer worker count. Honoured for backwards compatibility. |
| BPPARAM | NULL | Any BiocParallelParam. Auto-derived from processorsNum if NULL (SerialParam for 1, Multicore/Snow for ≥2). |
| lazyAA | TRUE | Skip eager 3-frame AA translation (Phase-6 default; ~35% wall-time saving). Use primaryAASeqS{1,2,3}() accessors. |
| Symptom | Cause / fix |
|---|---|
| 'qualityPhredScores' length cannot be zero | ABIF has empty PCON.2 quality block (older 3500/Beckman firmware). Phase-15 fix: synthesises Phred 30 with MISSING_QUALITY_SCORES_WARN. Update to the devel branch. |
| 'REGEX_SuffixReverse' must be character type on forward-only data | Phase-15 fix: pass REGEX_SuffixReverse = NULL (or NA_character_) for forward-only datasets, plus minReadsNum = 1. |
| CONTIG_NUMBER_ZERO_ERROR even though each SangerContig() works individually | Phase-15 fix: the CSV+ABIF aggregator no longer requires contig labels to be substrings of filenames; the reads column drives the lookup. |
| 'x' must be an XStringSet object from writeFasta on a single-read contig | Phase-15 fix: writeFastaSC detects empty alignment and writes a single-record FASTA from @contigSeq. |
| Consensus is full of IUPAC ambiguity codes | Phase-17: try consensusMethod = "majority" or "quality_weighted". Also check pairwise overlap with minOverlapBases = 50L to detect spurious merges. |
| Reports render with empty AA tables | Phase-8 fix: the RMD templates were reading @primaryAASeqS* slots directly under lazyAA = TRUE. Fixed in the devel branch; use primaryAASeqS1/S2/S3() accessors if you customise the templates. |
Please cite the package via:
Kuan-Hao Chao, Kirston Barton, Sarah Palmer, Robert Lanfear (2021). sangeranalyseR: simple and interactive processing of Sanger sequencing data in R. Genome Biology and Evolution. doi:10.1093/gbe/evab028.
sessionInfo()
## R version 4.6.0 Patched (2026-05-01 r89994)
## Platform: aarch64-apple-darwin23
## Running under: macOS Tahoe 26.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] sangeranalyseR_1.23.0 sangerseqR_1.49.0 stringr_1.6.0
## [4] pwalign_1.9.0 DECIPHER_3.9.0 Biostrings_2.81.1
## [7] Seqinfo_1.3.0 XVector_0.53.0 IRanges_2.47.1
## [10] S4Vectors_0.51.2 BiocGenerics_0.59.2 generics_0.1.4
## [13] BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] ade4_1.7-24 tidyselect_1.2.1 viridisLite_0.4.3
## [4] dplyr_1.2.1 farver_2.1.2 S7_0.2.2
## [7] fastmap_1.2.0 lazyeval_0.2.3 promises_1.5.0
## [10] shinyjs_2.1.1 digest_0.6.39 mime_0.13
## [13] lifecycle_1.0.5 magrittr_2.0.5 compiler_4.6.0
## [16] rlang_1.2.0 sass_0.4.10 tools_4.6.0
## [19] yaml_2.3.12 data.table_1.18.4 excelR_0.4.0
## [22] knitr_1.51 htmlwidgets_1.6.4 RColorBrewer_1.1-3
## [25] BiocParallel_1.47.0 purrr_1.2.2 shinyWidgets_0.9.1
## [28] grid_4.6.0 xtable_1.8-8 ggplot2_4.0.3
## [31] scales_1.4.0 MASS_7.3-65 dichromat_2.0-0.1
## [34] cli_3.6.6 rmarkdown_2.31 crayon_1.5.3
## [37] otel_0.2.0 httr_1.4.8 DBI_1.3.0
## [40] ape_5.8-1 cachem_1.1.0 parallel_4.6.0
## [43] BiocManager_1.30.27 vctrs_0.7.3 jsonlite_2.0.0
## [46] bookdown_0.46 seqinr_4.2-44 plotly_4.12.0
## [49] jquerylib_0.1.4 tidyr_1.3.2 ggdendro_0.2.0
## [52] glue_1.8.1 codetools_0.2-20 DT_0.34.0
## [55] stringi_1.8.7 gtable_0.3.6 later_1.4.8
## [58] shinycssloaders_1.1.0 shinydashboard_0.7.3 tibble_3.3.1
## [61] logger_0.4.2 pillar_1.11.1 htmltools_0.5.9
## [64] R6_2.6.1 evaluate_1.0.5 shiny_1.13.0
## [67] lattice_0.22-9 openxlsx_4.2.8.1 httpuv_1.6.17
## [70] bslib_0.11.0 Rcpp_1.1.1-1.1 zip_2.3.3
## [73] gridExtra_2.3 nlme_3.1-169 xfun_0.57
## [76] pkgconfig_2.0.3