1 Introduction

Most proteomics experiments require protein (or peptide) separation and cleavage before the molecules can be analyzed or identified by mass spectrometry or other analytical tools.

cleaver performs in-silico cleavage of polypeptide sequences, for example to create theoretical mass spectrometry data.

The cleavage rules are taken from the ExPASy PeptideCutter tool (Gasteiger et al. 2005).

2 Simple Usage

Loading the cleaver package:

library("cleaver")

Getting help and listing all available cleavage rules:

help("cleave")

Cleaving gastric juice peptide 1 (P01358) with trypsin:

## cleave it
cleave("LAAGKVEDSD", enzym="trypsin")
## $LAAGKVEDSD
## [1] "LAAGK" "VEDSD"
## get the cleavage ranges
cleavageRanges("LAAGKVEDSD", enzym="trypsin")
## $LAAGKVEDSD
##      start end
## [1,]     1   5
## [2,]     6  10
## get only cleavage sites
cleavageSites("LAAGKVEDSD", enzym="trypsin")
## $LAAGKVEDSD
## [1] 5
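
The relationship between cleavage sites, ranges and peptides can be reproduced in base R. The following is a minimal sketch, not cleaver's implementation: it assumes a simplified trypsin rule (cut after K or R unless the next residue is P) written as a Perl-style regular expression, whereas the PeptideCutter rules contain further exceptions. The helper names simpleSites, simpleRanges and simplePeptides are hypothetical.

```r
## simplified trypsin rule (assumption): K or R that is not followed by P
trypsinRule <- "[KR](?=[^P])"

## positions after which the sequence is cut
simpleSites <- function(x, rule=trypsinRule) {
  m <- gregexpr(rule, x, perl=TRUE)[[1]]
  if (m[1] == -1) integer(0) else as.integer(m)
}

## start/end of each fragment, derived from the cleavage sites
simpleRanges <- function(x, rule=trypsinRule) {
  s <- simpleSites(x, rule)
  cbind(start=c(1L, s + 1L), end=c(s, nchar(x)))
}

## fragments, extracted by their ranges
simplePeptides <- function(x, rule=trypsinRule) {
  r <- simpleRanges(x, rule)
  substring(x, r[, "start"], r[, "end"])
}

simpleSites("LAAGKVEDSD")
## [1] 5
simplePeptides("LAAGKVEDSD")
## [1] "LAAGK" "VEDSD"
```

Each fragment starts one position after the previous cleavage site and ends at the next site, so the sites alone are enough to derive the ranges and the peptides.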

Sometimes cleavage is incomplete and the enzyme misses some cleavage positions:

## miss exactly one cleavage position
cleave("LAAGKVEDSD", enzym="trypsin", missedCleavages=1)
## $LAAGKVEDSD
## [1] "LAAGKVEDSD"
cleavageRanges("LAAGKVEDSD", enzym="trypsin", missedCleavages=1)
## $LAAGKVEDSD
##      start end
## [1,]     1  10
## miss zero or one cleavage positions
cleave("LAAGKVEDSD", enzym="trypsin", missedCleavages=0:1)
## $LAAGKVEDSD
## [1] "LAAGK"      "VEDSD"      "LAAGKVEDSD"
cleavageRanges("LAAGKVEDSD", enzym="trypsin", missedCleavages=0:1)
## $LAAGKVEDSD
##      start end
## [1,]     1   5
## [2,]     6  10
## [3,]     1  10
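
How missed cleavages turn into longer fragments can be sketched in base R: a fragment with m missed cleavages spans m + 1 adjacent fully cleaved fragments. This is only an illustration under that assumption, not cleaver's internal code; missedRanges is a hypothetical helper that takes the cleavage sites and the sequence length as input.

```r
## ranges of all fragments with the requested numbers of missed cleavages
## (hypothetical helper; `sites` are cleavage sites, `n` the sequence length)
missedRanges <- function(sites, n, missed=0L) {
  start <- c(1L, sites + 1L)
  end <- c(sites, n)
  k <- length(start)
  do.call(rbind, lapply(missed, function(m) {
    ## more missed cleavages than available sites: nothing to return
    if (m >= k) return(NULL)
    i <- seq_len(k - m)
    ## join fragment i with its m successors
    cbind(start=start[i], end=end[i + m])
  }))
}

## "LAAGKVEDSD": one tryptic site after position 5, sequence length 10
r <- missedRanges(sites=5L, n=10L, missed=0:1)
substring("LAAGKVEDSD", r[, "start"], r[, "end"])
## [1] "LAAGK"      "VEDSD"      "LAAGKVEDSD"
```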

Combining cleaver with Biostrings (Pages et al., n.d.):

## create AAStringSet object
p <- AAStringSet(c(gaju="LAAGKVEDSD", pnm="AGEPKLDAGV"))

## cleave it
cleave(p, enzym="trypsin")
## AAStringSetList of length 2
## [["gaju"]] LAAGK VEDSD
## [["pnm"]] AGEPK LDAGV
cleavageRanges(p, enzym="trypsin")
## IRangesList object of length 2:
## $gaju
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         5         5
##   [2]         6        10         5
## 
## $pnm
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         5         5
##   [2]         6        10         5
cleavageSites(p, enzym="trypsin")
## $gaju
## [1] 5
## 
## $pnm
## [1] 5

3 Insulin & Somatostatin Example

Downloading the Insulin (P01308) and Somatostatin (P61278) sequences from the UniProt database (The UniProt Consortium 2012) using UniProt.ws (Carlson, n.d.):

## load UniProt.ws library
library("UniProt.ws")

## select species Homo sapiens
up <- UniProt.ws(taxId=9606)

## download sequences of Insulin/Somatostatin
s <- select(up,
    keys=c("P01308", "P61278"),
    columns=c("sequence"),
    keytype="UniProtKB"
)

## fetch only sequences
sequences <- setNames(s$Sequence, s$Entry)

## remove whitespace
sequences <- gsub(pattern="[[:space:]]", replacement="", x=sequences)

Cleaving with pepsin:

cleave(sequences, enzym="pepsin")
## $P01308
##  [1] "MA"              "L"               "W"               "MRLLP"          
##  [5] "LL"              "A"               "WGPDPAAA"        "F"              
##  [9] "VNQH"            "CGSH"            "VEA"             "Y"              
## [13] "VCGERG"          "FF"              "YTPKTRREAED"     "QVGQVE"         
## [17] "GGGPGAGS"        "LQP"             "LA"              "EGS"            
## [21] "QKRGIVEQCCTSICS" "Q"               "EN"              "CN"             
## 
## $P61278
##  [1] "ML"                    "SCRL"                  "QCA"                  
##  [4] "L"                     "AA"                    "SIV"                  
##  [7] "A"                     "GCVTGAPSDPRL"          "RQ"                   
## [10] "FL"                    "QKS"                   "LAAAAGKQEL"           
## [13] "AK"                    "Y"                     "AE"                   
## [16] "SEPNQTENDA"            "LEPED"                 "SQAAEQDEMRL"          
## [19] "EL"                    "QRSANSNPAMAPRERKAGCKN" "FF"                   
## [22] "W"                     "KT"                    "FTSC"

4 Isotopic Distribution Of Tryptic Digested Insulin

A common use case of in-silico cleavage is calculating the isotopic distribution of peptides that were enzymatically digested in the in-vitro experimental workflow. Here BRAIN (Claesen et al. 2012; Dittwald et al. 2013) is used to calculate the isotopic distribution of cleaver’s output. Please note that this is only a toy example; for instance, the relative intensities between peptides are not correct.

## load BRAIN library
library("BRAIN")

## cleave insulin (selected by accession rather than by position)
cleavedInsulin <- cleave(sequences["P01308"], enzym="trypsin")[[1]]

## create empty plot area
plot(NA, xlim=c(150, 4300), ylim=c(0, 1),
     xlab="mass", ylab="relative intensity",
     main="tryptic digested insulin - isotopic distribution")

## loop through peptides
for (i in seq_along(cleavedInsulin)) {
  ## count C, H, N, O, S atoms in current peptide
  atoms <- BRAIN::getAtomsFromSeq(cleavedInsulin[[i]])
  ## calculate isotopic distribution
  d <- useBRAIN(atoms)
  ## draw peaks
  lines(d$masses, d$isoDistr, type="h", col=2)
}
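
As a rough plausibility check of where a peptide should appear on the mass axis, its monoisotopic mass can be estimated in base R, independently of BRAIN. The sketch below assumes a neutral, unmodified peptide and uses standard monoisotopic residue masses rounded to five decimals; residueMass, waterMass and monoisotopicMass are hypothetical names introduced only for this illustration.

```r
## monoisotopic residue masses in Da (standard values, rounded)
residueMass <- c(
  G=57.02146, A=71.03711, S=87.03203, P=97.05276, V=99.06841,
  T=101.04768, C=103.00919, L=113.08406, I=113.08406, N=114.04293,
  D=115.02694, Q=128.05858, K=128.09496, E=129.04259, M=131.04049,
  H=137.05891, F=147.06841, R=156.10111, Y=163.06333, W=186.07931
)
## one water is added for the free N- and C-terminus
waterMass <- 18.01056

## sum of residue masses plus water (assumes the 20 standard amino acids)
monoisotopicMass <- function(peptide) {
  aa <- strsplit(peptide, "", fixed=TRUE)[[1]]
  stopifnot(all(aa %in% names(residueMass)))
  sum(residueMass[aa]) + waterMass
}

## masses of the two tryptic peptides of gastric juice peptide 1
vapply(c("LAAGK", "VEDSD"), monoisotopicMass, numeric(1))
```

The result covers only the lightest isotopic variant; the full distribution across heavier isotopes is what BRAIN computes from the atomic composition.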

5 Session Information

## R version 4.6.0 Patched (2026-05-01 r89994)
## Platform: aarch64-apple-darwin23
## Running under: macOS Tahoe 26.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BRAIN_1.59.0        lattice_0.22-9      PolynomF_2.0-8     
##  [4] UniProt.ws_2.53.0   cleaver_1.51.0      Biostrings_2.81.1  
##  [7] Seqinfo_1.3.0       XVector_0.53.0      IRanges_2.47.1     
## [10] S4Vectors_0.51.2    BiocGenerics_0.59.2 generics_0.1.4     
## [13] BiocStyle_2.41.0   
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.53.0      xfun_0.57            bslib_0.11.0        
##  [4] httr2_1.2.2          Biobase_2.73.1       vctrs_0.7.3         
##  [7] rjsoncons_1.3.2      tools_4.6.0          curl_7.1.0          
## [10] tibble_3.3.1         AnnotationDbi_1.75.0 RSQLite_3.52.0      
## [13] blob_1.3.0           pkgconfig_2.0.3      BiocBaseUtils_1.15.1
## [16] dbplyr_2.5.2         lifecycle_1.0.5      compiler_4.6.0      
## [19] progress_1.2.3       tinytex_0.59         htmltools_0.5.9     
## [22] sass_0.4.10          yaml_2.3.12          pillar_1.11.1       
## [25] crayon_1.5.3         jquerylib_0.1.4      cachem_1.1.0        
## [28] magick_2.9.1         tidyselect_1.2.1     digest_0.6.39       
## [31] dplyr_1.2.1          bookdown_0.46        fastmap_1.2.0       
## [34] grid_4.6.0           cli_3.6.6            magrittr_2.0.5      
## [37] prettyunits_1.2.0    filelock_1.0.3       rappdirs_0.3.4      
## [40] bit64_4.8.0          rmarkdown_2.31       httr_1.4.8          
## [43] bit_4.6.0            otel_0.2.0           png_0.1-9           
## [46] hms_1.1.4            memoise_2.0.1        evaluate_1.0.5      
## [49] knitr_1.51           BiocFileCache_3.3.0  rlang_1.2.0         
## [52] Rcpp_1.1.1-1.1       glue_1.8.1           DBI_1.3.0           
## [55] BiocManager_1.30.27  jsonlite_2.0.0       R6_2.6.1

References

Carlson, Marc. n.d. UniProt.ws: R Interface to UniProt Web Services.
Claesen, Jürgen, Piotr Dittwald, Tomasz Burzykowski, and Dirk Valkenborg. 2012. “An Efficient Method to Calculate the Aggregated Isotopic Distribution and Exact Center-Masses.” Journal of The American Society for Mass Spectrometry 23 (4): 753–63.
Dittwald, Piotr, Jürgen Claesen, Tomasz Burzykowski, Dirk Valkenborg, and Anna Gambin. 2013. “BRAIN: A Universal Tool for High-Throughput Calculations of the Isotopic Distribution for Mass Spectrometry.” Analytical Chemistry 85 (4): 1991–94.
Gasteiger, Elisabeth, Christine Hoogland, Alexandre Gattiker, et al. 2005. “Protein Identification and Analysis Tools on the ExPASy Server.” In The Proteomics Protocols Handbook, edited by John M. Walker. Humana Press. https://doi.org/10.1385/1-59259-890-0:571.
Pages, H., P. Aboyoun, R. Gentleman, and S. DebRoy. n.d. Biostrings: String Objects Representing Biological Sequences, and Matching Algorithms.
The UniProt Consortium. 2012. “Reorganizing the Protein Space at the Universal Protein Resource (UniProt).” Nucleic Acids Research 40 (D1): D71–75. https://doi.org/10.1093/nar/gkr981.