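This vignette demonstrates how to use filterSubnetworkByContext() to filter a protein interaction subnetwork by the contextual relevance of its supporting literature. The function:
- fetches the PubMed abstracts that support each edge in the subnetwork,
- scores each abstract against a user-supplied query, by tag count or by cosine similarity, and
- retains only the edges, nodes, and evidence rows whose abstracts score at or above a chosen cutoff.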
This is useful when a subnetwork contains many edges supported by literature from unrelated biological contexts, and you want to focus on edges relevant to a specific research question — in this case, DNA damage repair in cancer.
filterSubnetworkByContext() expects a nodes dataframe and an edges dataframe, typically produced by getSubnetworkFromIndra(). For this example we construct a small representative input table directly, mimicking the results of a proteomics experiment centred on the DNA damage response kinase CHK1.
The input table contains one row per protein with columns for the UniProt mnemonic identifier, the log2 fold-change, and the adjusted p-value from a differential expression analysis.
input <- data.frame(
  Protein = c("CHK1_HUMAN", "RFA1_HUMAN", "CLH1_HUMAN", "CRTC3_HUMAN"),
  log2FC = c(2.31, 1.87, 1.45, 1.12),
  adj.pvalue = c(0.0021, 0.0089, 0.0310, 0.0490),
  stringsAsFactors = FALSE
)
input
##       Protein log2FC adj.pvalue
## 1  CHK1_HUMAN   2.31     0.0021
## 2  RFA1_HUMAN   1.87     0.0089
## 3  CLH1_HUMAN   1.45     0.0310
## 4 CRTC3_HUMAN   1.12     0.0490
All four proteins are up-regulated (positive log2FC) and statistically significant (adj.pvalue < 0.05).
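The two conditions stated above can be checked in code before building the network. The following is a minimal base R sketch that rebuilds the same input table and tests both conditions; the names all_up and all_significant are illustrative and not part of the package.

```r
input <- data.frame(
  Protein = c("CHK1_HUMAN", "RFA1_HUMAN", "CLH1_HUMAN", "CRTC3_HUMAN"),
  log2FC = c(2.31, 1.87, 1.45, 1.12),
  adj.pvalue = c(0.0021, 0.0089, 0.0310, 0.0490),
  stringsAsFactors = FALSE
)

# TRUE when every protein has a positive log2 fold-change
all_up <- all(input$log2FC > 0)

# TRUE when every adjusted p-value is below the 0.05 threshold
all_significant <- all(input$adj.pvalue < 0.05)

c(all_up = all_up, all_significant = all_significant)
```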
annotateProteinInfoFromIndra() maps UniProt mnemonics to HGNC gene identifiers and other metadata used downstream by the INDRA query engine. Its result, referred to as annotated_df below, is the input to the network query.
getSubnetworkFromIndra() queries the INDRA database for causal interactions among the annotated proteins and returns a list containing $nodes and $edges dataframes.
Key parameters used here:
- pvalueCutoff = 0.2: relaxed threshold to retain more candidate edges for downstream context filtering
- evidence_count_cutoff = 1: keep edges supported by at least one literature statement
- force_include_other = "HGNC:1925": always include CHK1 (HGNC:1925) regardless of significance, as it is the focal protein of interest
- filter_by_curation = FALSE: include both curated and automatically extracted interactions

subnetwork <- getSubnetworkFromIndra(
  annotated_df,
  pvalueCutoff = 0.2,
  logfc_cutoff = NULL,
  evidence_count_cutoff = 1,
  sources_filter = NULL,
  force_include_other = "HGNC:1925",
  filter_by_curation = FALSE
)
# Inspect the unfiltered network
nrow(subnetwork$nodes)
nrow(subnetwork$edges)

Each tag in the query is compared against every PubMed abstract supporting the network edges. A richer query, one that includes synonyms, abbreviations, and related terms, improves recall, because tag counting relies on exact matching rather than semantic understanding.
The expanded query below was produced with the help of a chatbot and covers the major vocabulary used in the DNA damage repair and cancer literature.
tags <- c(
  "dna damage repair",
  "cancer",
  "oncology",
  "dna repair",
  "genome integrity",
  "genomic instability",
  "double strand break",
  "dsb",
  "single strand break",
  "ssb",
  "base excision repair",
  "ber",
  "nucleotide excision repair",
  "ner",
  "mismatch repair",
  "mmr",
  "homologous recombination",
  "hr",
  "non homologous end joining",
  "nhej",
  "brca1",
  "brca2",
  "atm",
  "atr",
  "p53",
  "tp53",
  "parp",
  "tumor suppressor",
  "oncogene",
  "carcinogenesis",
  "tumorigenesis",
  "chemotherapy resistance",
  "radiation resistance",
  "genotoxic stress",
  "replication stress",
  "oxidative dna damage",
  "somatic mutation",
  "tumor mutational burden",
  "tmb"
)

Tip: You can iteratively refine tags by inspecting the scores in filtered_network$evidence and adding terms that appear frequently in high-scoring abstracts but are absent from your query.
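To make the idea of tag counting concrete, here is a minimal base R sketch that counts how many tags occur as whole words or phrases in a piece of text. It is an illustration under stated assumptions, not the implementation used by filterSubnetworkByContext(): the helper count_tags(), the short tag list, and the two toy sentences are all made up for this example.

```r
# Count how many tags occur as whole words or phrases in one abstract.
# Illustrative helper only; the package may tokenize and match differently.
count_tags <- function(abstract, tags) {
  text <- tolower(abstract)
  hits <- vapply(
    tags,
    # \\b anchors the tag at word boundaries; tags here contain only
    # letters, digits and spaces, so no regex escaping is needed
    function(tag) grepl(paste0("\\b", tag, "\\b"), text, perl = TRUE),
    logical(1)
  )
  sum(hits)
}

toy_tags <- c("atr", "replication stress", "homologous recombination",
              "double strand break", "dsb")

# Toy sentences written for this sketch, not real abstracts
abstract_a <- "ATR activates CHK1 under replication stress and supports homologous recombination."
abstract_b <- "Double-strand breaks accumulate after irradiation."

count_tags(abstract_a, toy_tags)  # 3
count_tags(abstract_b, toy_tags)  # 0
```

The second sentence scores zero even though it is on topic, because the hyphenated "double-strand" does not match the tag "double strand break" exactly. This is the kind of gap that adding spelling variants and abbreviations to the tag list is meant to close.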
filterSubnetworkByContext() ties everything together. The cutoff parameter controls stringency — only edges whose supporting abstracts score at or above this value are retained.
filtered_network <- filterSubnetworkByContext(
  nodes = subnetwork$nodes,
  edges = subnetwork$edges,
  method = "tag_count",
  cutoff = 3,
  query = tags
)

The function prints a progress summary to the console:
Processing N unique statement hashes...
Fetching M abstracts...
Progress: M/M (100.0%)
Done fetching abstracts!
X / M abstracts passed score cutoff (>= 0.10)
Retained: A edges (of B), C nodes (of D), E evidence rows (of F)
Only proteins connected by at least one contextually relevant edge are retained.
Each row represents a causal interaction (e.g. phosphorylation, activation) supported by literature that passed the score threshold.
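The relationship between retained edges and retained nodes can be sketched with toy tables. The example below is only an illustration of the pruning rule described above, not the package's code: the column names id, source and target, the passed flag, and the protein labels P1 to P4 are assumptions made for this sketch.

```r
# Toy node and edge tables; all names and values are hypothetical
nodes <- data.frame(id = c("P1", "P2", "P3", "P4"), stringsAsFactors = FALSE)
edges <- data.frame(
  source = c("P1", "P1", "P3"),
  target = c("P2", "P3", "P4"),
  # hypothetical flag: TRUE if a supporting abstract met the cutoff
  passed = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only edges with contextually relevant support
kept_edges <- edges[edges$passed, ]

# Keep only nodes that are an endpoint of at least one kept edge
endpoints <- c(kept_edges$source, kept_edges$target)
kept_nodes <- nodes[nodes$id %in% endpoints, , drop = FALSE]

kept_nodes$id  # "P1" "P2"
```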
The evidence dataframe contains the following columns:
| Column | Description |
|---|---|
| source | Source protein / gene |
| target | Target protein / gene |
| interaction | Interaction type (e.g. Phosphorylation) |
| site | Modification site if applicable |
| evidenceLink | URL to the INDRA evidence viewer |
| stmt_hash | Unique INDRA statement identifier |
| text | Sentence extracted from the supporting paper |
| pmid | PubMed ID of the source article |
| score | Relevance score of the abstract (number of tags matched) |
You can sort by score to identify the most on-topic supporting evidence. To see the full score distribution before committing to a cutoff, run the function at a permissive threshold and inspect the scores:
# Run with permissive cutoff to see full score distribution
exploratory <- filterSubnetworkByContext(
  nodes = subnetwork$nodes,
  edges = subnetwork$edges,
  cutoff = 0,
  method = "tag_count",
  query = tags
)
summary(exploratory$evidence$score)
hist(exploratory$evidence$score,
     breaks = 30,
     main = "Distribution of abstract scores",
     xlab = "Number of tags matched",
     col = "steelblue")

The query string is compared against each PubMed abstract supporting the network edges. A richer query, one that includes synonyms, abbreviations, and related terms, improves recall under TF-IDF, which relies on exact token matching rather than semantic understanding.
The expanded query below was produced with the help of a chatbot and covers the major vocabulary used in the DNA damage repair and cancer literature.
my_query <- "DNA damage repair cancer oncology DNA repair genome integrity
genomic instability double strand break DSB single strand break SSB
base excision repair BER nucleotide excision repair NER mismatch repair MMR
homologous recombination HR non-homologous end joining NHEJ BRCA1 BRCA2
ATM ATR p53 TP53 PARP tumor suppressor oncogene carcinogenesis tumorigenesis
chemotherapy resistance radiation resistance genotoxic stress replication stress
oxidative DNA damage somatic mutation tumor mutational burden TMB"

Tip: You can iteratively refine my_query by inspecting the scores in filtered_network$evidence and adding terms that appear frequently in high-scoring abstracts but are absent from your query.
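The effect of exact token matching on a TF-IDF cosine score can be seen in a small base R sketch. This is a generic textbook construction (term counts weighted by a smoothed inverse document frequency), not the scoring code inside filterSubnetworkByContext(), whose tokenization and weighting may differ. The short query and the two toy sentences are invented for the example.

```r
# Lower-case a string and split it into alphanumeric tokens
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  tokens[nzchar(tokens)]
}

# Toy texts written for this sketch, not real abstracts
docs <- c(
  query = "dna damage repair replication stress",
  on_topic = "ATR activates CHK1 during replication stress to support DNA repair",
  off_topic = "Clathrin heavy chain mediates endocytosis of membrane receptors"
)

tokens <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per text, one column per vocabulary term
tf <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
colnames(tf) <- vocab

# Smoothed inverse document frequency, then TF-IDF weights
idf <- log((1 + nrow(tf)) / (1 + colSums(tf > 0))) + 1
tfidf <- sweep(tf, 2, idf, "*")

# Cosine similarity between two weight vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

scores <- c(
  on_topic = cosine(tfidf["query", ], tfidf["on_topic", ]),
  off_topic = cosine(tfidf["query", ], tfidf["off_topic", ])
)
round(scores, 2)
```

The off-topic sentence shares no token with the query, so its score is exactly zero, while the on-topic sentence scores well above it. A sentence that said "DSB" would only gain credit if the query also contained the token "dsb", which is why abbreviations are worth listing explicitly.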
filterSubnetworkByContext() ties everything together. The cutoff parameter controls stringency — only edges whose supporting abstracts score at or above this value are retained.
filtered_network <- filterSubnetworkByContext(
  nodes = subnetwork$nodes,
  edges = subnetwork$edges,
  method = "cosine",
  cutoff = 0.10,
  query = my_query
)

The function prints a progress summary to the console:
Processing N unique statement hashes...
Fetching M abstracts...
Progress: M/M (100.0%)
Done fetching abstracts!
X / M abstracts passed score cutoff (>= 0.10)
Retained: A edges (of B), C nodes (of D), E evidence rows (of F)
Only proteins connected by at least one contextually relevant edge are retained.
Each row represents a causal interaction (e.g. phosphorylation, activation) supported by literature that passed the score threshold.
The evidence dataframe contains the following columns:
| Column | Description |
|---|---|
| source | Source protein / gene |
| target | Target protein / gene |
| interaction | Interaction type (e.g. Phosphorylation) |
| site | Modification site if applicable |
| evidenceLink | URL to the INDRA evidence viewer |
| stmt_hash | Unique INDRA statement identifier |
| text | Sentence extracted from the supporting paper |
| pmid | PubMed ID of the source article |
| score | Relevance score (tag count or cosine similarity) |
You can sort by score to identify the most on-topic supporting evidence:
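A minimal base R sketch of such a sort is shown here on a toy evidence table. The rows, identifiers and scores are placeholders invented for the illustration; only the column names follow the table above.

```r
# Toy evidence table with placeholder values, not real INDRA output
evidence <- data.frame(
  source = c("A", "A", "B"),
  target = c("B", "C", "C"),
  pmid = c("pmid_1", "pmid_2", "pmid_3"),
  score = c(0.12, 0.31, 0.18),
  stringsAsFactors = FALSE
)

# Order rows from highest to lowest score
ranked <- evidence[order(evidence$score, decreasing = TRUE), ]

# Show the two best-scoring evidence rows
head(ranked, n = 2)
```

With real results, the same two lines can be applied to filtered_network$evidence in place of the toy table.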
The right cutoff depends on how broadly the query overlaps with the literature in your network. As a rough guide:
| Cutoff | Effect |
|---|---|
| 0.05 | Permissive: removes only completely off-topic abstracts |
| 0.10 | Recommended default for domain-specific queries |
| 0.20 | Stringent: retains only highly on-topic edges |
| > 0.30 | Very stringent: use only with highly specific queries |
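One way to read the table is to count how many abstracts would survive each threshold. The sketch below does this in base R for a small vector of hypothetical cosine scores; the values are made up, and in practice the vector would come from the score column of the evidence dataframe.

```r
# Hypothetical cosine scores, one per abstract
scores <- c(0.02, 0.07, 0.11, 0.15, 0.22, 0.35)

# Candidate thresholds taken from the table above
cutoffs <- c(0.05, 0.10, 0.20, 0.30)

# Number of abstracts scoring at or above each cutoff
n_passing <- vapply(cutoffs, function(k) sum(scores >= k), integer(1))

data.frame(cutoff = cutoffs, n_passing = n_passing)
# n_passing is 5, 4, 2, 1 for these toy scores
```

A sharp drop between two adjacent cutoffs suggests that the lower of the two sits just below the bulk of the scores for that query.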
To explore the score distribution before committing to a cutoff, run the function at a low threshold and inspect the scores:
# Run with permissive cutoff to see full score distribution
exploratory <- filterSubnetworkByContext(
  nodes = subnetwork$nodes,
  edges = subnetwork$edges,
  cutoff = 0.0,
  method = "cosine",
  query = my_query
)
summary(exploratory$evidence$score)
hist(exploratory$evidence$score,
     breaks = 30,
     main = "Distribution of abstract scores",
     xlab = "Cosine score to query",
     col = "steelblue")

sessionInfo()
#> R Under development (unstable) (2026-03-05 r89546)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] shiny_1.13.0 MSstatsBioNet_1.3.5 MSstats_4.19.1
#> [4] BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.3 dplyr_1.2.0
#> [4] farver_2.1.2 S7_0.2.1 bitops_1.0-9
#> [7] fastmap_1.2.0 lazyeval_0.2.2 promises_1.5.0
#> [10] XML_3.99-0.22 digest_0.6.39 mime_0.13
#> [13] lifecycle_1.0.5 survival_3.8-6 statmod_1.5.1
#> [16] magrittr_2.0.4 compiler_4.6.0 r2r_0.1.2
#> [19] rlang_1.1.7 sass_0.4.10 tools_4.6.0
#> [22] yaml_2.3.12 data.table_1.18.2.1 knitr_1.51
#> [25] stopwords_2.3 htmlwidgets_1.6.4 MSstatsConvert_1.21.1
#> [28] marray_1.89.0 xml2_1.5.2 RColorBrewer_1.1-3
#> [31] KernSmooth_2.23-26 purrr_1.2.1 grid_4.6.0
#> [34] preprocessCore_1.73.0 caTools_1.18.3 xtable_1.8-8
#> [37] log4r_0.4.4 ggplot2_4.0.2 scales_1.4.0
#> [40] gtools_3.9.5 MASS_7.3-65 dichromat_2.0-0.1
#> [43] cli_3.6.5 crayon_1.5.3 rmarkdown_2.30
#> [46] reformulas_0.4.4 generics_0.1.4 otel_0.2.0
#> [49] httr_1.4.8 minqa_1.2.8 cachem_1.1.0
#> [52] splines_4.6.0 parallel_4.6.0 BiocManager_1.30.27
#> [55] vctrs_0.7.1 boot_1.3-32 Matrix_1.7-4
#> [58] jsonlite_2.0.0 bookdown_0.46 ggrepel_0.9.7
#> [61] limma_3.67.0 plotly_4.12.0 lgr_0.5.2
#> [64] tidyr_1.3.2 jquerylib_0.1.4 glue_1.8.0
#> [67] nloptr_2.2.1 gtable_0.3.6 later_1.4.8
#> [70] lme4_2.0-1 mlapi_0.1.1 tibble_3.3.1
#> [73] pillar_1.11.1 htmltools_0.5.9 gplots_3.3.0
#> [76] float_0.3-3 rsparse_0.5.3 R6_2.6.1
#> [79] Rdpack_2.6.6 evaluate_1.0.5 lattice_0.22-9
#> [82] rentrez_1.2.4 rbibutils_2.4.1 backports_1.5.0
#> [85] RhpcBLASctl_0.23-42 memoise_2.0.1 httpuv_1.6.16
#> [88] bslib_0.10.0 text2vec_0.6.6 Rcpp_1.1.1
#> [91] nlme_3.1-168 checkmate_2.3.4 xfun_0.56
#> [94] pkgconfig_2.0.3