Contents

0.1 Paul J. McMurdie & Joseph N. Paulson

If you find biomformat and/or its tutorials useful, please acknowledge and cite biomformat in your publications:

Paul J. McMurdie and Joseph N Paulson (2015). biomformat: An interface package for the BIOM file format. R/Bioconductor package version 1.0.0.

0.2 phyloseq Home Page

phyloseq has many more utilities for interacting with this kind of data than are provided in this I/O-oriented package.

0.3 Motivation for the biomformat package

This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages. Further details on R markdown here.

The BIOM file format (canonically pronounced “biome”) is designed to be a general-use format for representing biological sample by observation contingency tables. BIOM is a recognized standard for the Earth Microbiome Project and is a Genomics Standards Consortium candidate project. Please see the biom-format home page for more details.

This demo is designed to provide an overview of the biom package to get you started using it quickly. The biom package itself is intended to be a utility package that will be depended-upon by other packages in the future. It provides I/O functionality, and functions to make it easier to with data from biom-format files. It does not (and probably should not) provide statistical analysis functions. However, it does provide tools to access data from BIOM format files in ways that are extremely common in R (such as "data.frame", "matrix", and "Matrix" classes).

Package versions at the time (Sun Mar 15 17:24:59 2026) of this build:

library("biomformat"); packageVersion("biomformat")
## [1] '1.39.10'

1 Read BIOM format

Here is an example importing BIOM formats of different types into R using the read_biom function. The resulting data objects in R are given names beginning with x.

min_dense_file   = system.file("extdata", "min_dense_otu_table.biom", 
                               package = "biomformat")
min_sparse_file  = system.file("extdata", "min_sparse_otu_table.biom", 
                               package = "biomformat")
rich_dense_file  = system.file("extdata", "rich_dense_otu_table.biom", 
                               package = "biomformat")
rich_sparse_file = system.file("extdata", "rich_sparse_otu_table.biom", 
                               package = "biomformat")
min_dense_file   = system.file("extdata", "min_dense_otu_table.biom", package = "biomformat")
rich_dense_char  = system.file("extdata", "rich_dense_char.biom", package = "biomformat")
rich_sparse_char  = system.file("extdata", "rich_sparse_char.biom", package = "biomformat")
x1 = read_biom(min_dense_file)
x2 = read_biom(min_sparse_file)
x3 = read_biom(rich_dense_file)
x4 = read_biom(rich_sparse_file)
x5 = read_biom(rich_dense_char)
x6 = read_biom(rich_sparse_char)
x1
## biom object. 
## type: OTU table 
## matrix_type: dense 
## 5 rows and 6 columns

It would be hard to interpret and wasteful of RAM to stream all the data from a BIOM format file to the standard out when printed with print or show methods. Instead, a brief summary of the contents BIOM data is shown.

2 Access BIOM data

To get access to the data in a familiar form appropriate for many standard R functions, we will need to use accessor functions, also included in the biom package.

2.0.1 Core observation data

The core “observation” data is stored in either sparse or dense matrices in the BIOM format file, and sparse matrix support is carried through on the R side via the Matrix package. The variables x1 and x2, assigned above from BIOM files, have similar (but not identical) data. They are small enough to observe directly as tables in standard output in an R session:

biom_data(x1)
## 5 x 6 Matrix of class "dgeMatrix"
##          Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
## GG_OTU_1       0       0       1       0       0       0
## GG_OTU_2       5       1       0       2       3       1
## GG_OTU_3       0       0       1       4       2       0
## GG_OTU_4       2       1       1       0       0       1
## GG_OTU_5       0       1       1       0       0       0
biom_data(x2)
## 5 x 6 sparse Matrix of class "dgCMatrix"
##          Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
## GG_OTU_1       .       .       1       .       .       .
## GG_OTU_2       5       1       .       2       3       1
## GG_OTU_3       .       .       1       4       .       2
## GG_OTU_4       2       1       1       .       .       1
## GG_OTU_5       .       1       1       .       .       .

As you can see above, x1 and x2 are represented in R by slightly different matrix classes, assigned automatically based on the data. However, most operations in R will understand this automatically and you should not have to worry about the precise matrix class. However, if the R function you are attempting to use is having a problem with these fancier classes, you can easily coerce the data to the simple, standard "matrix" class using the as function:

as(biom_data(x2), "matrix")
##          Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
## GG_OTU_1       0       0       1       0       0       0
## GG_OTU_2       5       1       0       2       3       1
## GG_OTU_3       0       0       1       4       0       2
## GG_OTU_4       2       1       1       0       0       1
## GG_OTU_5       0       1       1       0       0       0

2.0.2 Observation Metadata

Observation metadata is metadata associated with the individual units being counted/recorded in a sample, as opposed to measurements of properties of the samples themselves. For microbiome census data, for instance, observation metadata is often a taxonomic classification and anything else that might be known about a particular OTU/species. For other types of data, it might be metadata known about a particular genome, gene family, pathway, etc. In this case, the observations are counts of OTUs and the metadata is taxonomic classification, if present. The absence of observation metadata is also supported, as we see for the minimal cases of x1 and x2, where we see instead.

observation_metadata(x1)
## NULL
observation_metadata(x2)
## NULL
observation_metadata(x3)
##            taxonomy1         taxonomy2              taxonomy3
## GG_OTU_1 k__Bacteria p__Proteobacteria c__Gammaproteobacteria
## GG_OTU_2 k__Bacteria  p__Cyanobacteria    c__Nostocophycideae
## GG_OTU_3  k__Archaea  p__Euryarchaeota     c__Methanomicrobia
## GG_OTU_4 k__Bacteria     p__Firmicutes          c__Clostridia
## GG_OTU_5 k__Bacteria p__Proteobacteria c__Gammaproteobacteria
##                     taxonomy4             taxonomy5         taxonomy6
## GG_OTU_1 o__Enterobacteriales f__Enterobacteriaceae    g__Escherichia
## GG_OTU_2        o__Nostocales        f__Nostocaceae g__Dolichospermum
## GG_OTU_3 o__Methanosarcinales f__Methanosarcinaceae g__Methanosarcina
## GG_OTU_4   o__Halanaerobiales   f__Halanaerobiaceae  g__Halanaerobium
## GG_OTU_5 o__Enterobacteriales f__Enterobacteriaceae    g__Escherichia
##                                taxonomy7
## GG_OTU_1                             s__
## GG_OTU_2                             s__
## GG_OTU_3                             s__
## GG_OTU_4 s__Halanaerobiumsaccharolyticum
## GG_OTU_5                             s__
observation_metadata(x4)[1:2, 1:3]
##            taxonomy1         taxonomy2              taxonomy3
## GG_OTU_1 k__Bacteria p__Proteobacteria c__Gammaproteobacteria
## GG_OTU_2 k__Bacteria  p__Cyanobacteria    c__Nostocophycideae
class(observation_metadata(x4))
## [1] "data.frame"

2.0.3 Sample Metadata

Similarly, we can access metadata – if available – that describe properties of the samples. We access this sample metadata using the sample_metadata function.

sample_metadata(x1)
## NULL
sample_metadata(x2)
## NULL
sample_metadata(x3)
##         BarcodeSequence  LinkerPrimerSequence BODY_SITE Description
## Sample1    CGCTTATCGAGA CATGCTGCCTCCCGTAGGAGT       gut   human gut
## Sample2    CATACCAGTAGC CATGCTGCCTCCCGTAGGAGT       gut   human gut
## Sample3    CTCTCTACCTGT CATGCTGCCTCCCGTAGGAGT       gut   human gut
## Sample4    CTCTCGGCCTGT CATGCTGCCTCCCGTAGGAGT      skin  human skin
## Sample5    CTCTCTACCAAT CATGCTGCCTCCCGTAGGAGT      skin  human skin
## Sample6    CTAACTACCAAT CATGCTGCCTCCCGTAGGAGT      skin  human skin
sample_metadata(x4)[1:2, 1:3]
##         BarcodeSequence  LinkerPrimerSequence BODY_SITE
## Sample1    CGCTTATCGAGA CATGCTGCCTCCCGTAGGAGT       gut
## Sample2    CATACCAGTAGC CATGCTGCCTCCCGTAGGAGT       gut
class(sample_metadata(x4))
## [1] "data.frame"

2.0.4 Plots

The data really is accessible to other R functions.

plot(biom_data(x4))

boxplot(as(biom_data(x4), "vector"))

heatmap(as(biom_data(x4), "matrix"))

3 Write BIOM format (JSON / v1)

The biom objects in R can be written to a file/connection using the write_biom function. If you modified the biom object, this may still work as well, but no guarantees about this as we are still working on internal checks. The following example writes x4 to a temporary file, then reads it back using read_biom and stores it as variable y. The exact comparison of these two objects using the identical function shows that they are exactly the same in R.

outfile = tempfile()
write_biom(x4, outfile)
y = read_biom(outfile)
identical(x4, y)
## [1] FALSE

Furthermore, it is possible to invoke standard operating system commands through the R system function – in this case to invoke the diff command available on Unix-like systems or the FC command on Windows – in order to compare the original and temporary files directly. Note that this is shown here for convenience, but not automatically run with the rest of the script because of the OS-dependence. During development, though, this same command is tested privately and no differences are reported between the files.

# On Unix OSes
system(paste0("diff ", rich_sparse_file, outfile))
# On windows
system(paste0("FC ", rich_sparse_file, outfile))

4 HDF5 (BIOM v2) read and write

BIOM v2 uses the HDF5 binary format, which stores both sample-major and observation-major compressed-sparse representations of the count matrix. Reading is handled automatically by read_biom() — it detects HDF5 files by their magic bytes and routes to read_hdf5_biom() internally. Writing to HDF5 is done with write_hdf5_biom().

Both functions require the Bioconductor package rhdf5. If it is not installed they stop with a clear installation message rather than a cryptic C-level error.

hdf5_file <- system.file("extdata", "rich_sparse_otu_table_hdf5.biom",
                          package = "biomformat")
xh <- read_biom(hdf5_file)
xh
## biom object. 
## type: OTU table 
## matrix_type: dense 
## 5 rows and 6 columns
biom_data(xh)
## 5 x 6 Matrix of class "dgeMatrix"
##          Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
## GG_OTU_1       0       0       1       0       0       0
## GG_OTU_2       5       1       0       2       3       1
## GG_OTU_3       0       0       1       4       0       2
## GG_OTU_4       2       1       1       0       0       1
## GG_OTU_5       0       1       1       0       0       0

Round-trip through write_hdf5_biom(): write the object to a temporary HDF5 file, read it back, and confirm the count data and metadata are preserved.

hdf5_out <- tempfile(fileext = ".biom")
write_hdf5_biom(xh, hdf5_out)

xh2 <- read_biom(hdf5_out)
# Count matrices are identical
identical(biom_data(xh), biom_data(xh2))
## [1] TRUE
# Observation metadata (taxonomy) is preserved
identical(observation_metadata(xh), observation_metadata(xh2))
## [1] TRUE
# Sample metadata is preserved
identical(sample_metadata(xh), sample_metadata(xh2))
## [1] TRUE

You can also convert an existing JSON/v1 biom object to HDF5 v2 format:

json_to_hdf5 <- tempfile(fileext = ".biom")
write_hdf5_biom(x4, json_to_hdf5)
x4_hdf5 <- read_biom(json_to_hdf5)
identical(biom_data(x4), biom_data(x4_hdf5))
## [1] FALSE

5 Tidy long-format output

For downstream analysis with tidyverse tools it is often convenient to have the count data in long format: one row per (feature, sample) pair, with sample and observation metadata joined in automatically.

biomformat provides two methods for this:

long_df <- as.data.frame(x4)
head(long_df)
##   feature_id sample_id count BarcodeSequence  LinkerPrimerSequence BODY_SITE
## 1   GG_OTU_1   Sample1     0    CGCTTATCGAGA CATGCTGCCTCCCGTAGGAGT       gut
## 2   GG_OTU_1   Sample2     0    CATACCAGTAGC CATGCTGCCTCCCGTAGGAGT       gut
## 3   GG_OTU_1   Sample3     1    CTCTCTACCTGT CATGCTGCCTCCCGTAGGAGT       gut
## 4   GG_OTU_1   Sample4     0    CTCTCGGCCTGT CATGCTGCCTCCCGTAGGAGT      skin
## 5   GG_OTU_1   Sample5     0    CTCTCTACCAAT CATGCTGCCTCCCGTAGGAGT      skin
## 6   GG_OTU_1   Sample6     0    CTAACTACCAAT CATGCTGCCTCCCGTAGGAGT      skin
##   Description   taxonomy1         taxonomy2              taxonomy3
## 1   human gut k__Bacteria p__Proteobacteria c__Gammaproteobacteria
## 2   human gut k__Bacteria p__Proteobacteria c__Gammaproteobacteria
## 3   human gut k__Bacteria p__Proteobacteria c__Gammaproteobacteria
## 4  human skin k__Bacteria p__Proteobacteria c__Gammaproteobacteria
## 5  human skin k__Bacteria p__Proteobacteria c__Gammaproteobacteria
## 6  human skin k__Bacteria p__Proteobacteria c__Gammaproteobacteria
##              taxonomy4             taxonomy5      taxonomy6 taxonomy7
## 1 o__Enterobacteriales f__Enterobacteriaceae g__Escherichia       s__
## 2 o__Enterobacteriales f__Enterobacteriaceae g__Escherichia       s__
## 3 o__Enterobacteriales f__Enterobacteriaceae g__Escherichia       s__
## 4 o__Enterobacteriales f__Enterobacteriaceae g__Escherichia       s__
## 5 o__Enterobacteriales f__Enterobacteriaceae g__Escherichia       s__
## 6 o__Enterobacteriales f__Enterobacteriaceae g__Escherichia       s__
dim(long_df)   # n_obs * n_samples rows, feature/sample/count + metadata cols
## [1] 30 14

The result has one row per (feature × sample) pair. Columns always include feature_id, sample_id, and count. Any observation metadata (e.g. taxonomy) and sample metadata (e.g. body site, barcode) are joined in as additional columns:

names(long_df)
##  [1] "feature_id"           "sample_id"            "count"               
##  [4] "BarcodeSequence"      "LinkerPrimerSequence" "BODY_SITE"           
##  [7] "Description"          "taxonomy1"            "taxonomy2"           
## [10] "taxonomy3"            "taxonomy4"            "taxonomy5"           
## [13] "taxonomy6"            "taxonomy7"
library(tibble)
long_tbl <- as_tibble.biom(x4)
long_tbl
## # A tibble: 30 × 14
##    feature_id sample_id count BarcodeSequence LinkerPrimerSequence  BODY_SITE
##    <chr>      <chr>     <dbl> <chr>           <chr>                 <chr>    
##  1 GG_OTU_1   Sample1       0 CGCTTATCGAGA    CATGCTGCCTCCCGTAGGAGT gut      
##  2 GG_OTU_1   Sample2       0 CATACCAGTAGC    CATGCTGCCTCCCGTAGGAGT gut      
##  3 GG_OTU_1   Sample3       1 CTCTCTACCTGT    CATGCTGCCTCCCGTAGGAGT gut      
##  4 GG_OTU_1   Sample4       0 CTCTCGGCCTGT    CATGCTGCCTCCCGTAGGAGT skin     
##  5 GG_OTU_1   Sample5       0 CTCTCTACCAAT    CATGCTGCCTCCCGTAGGAGT skin     
##  6 GG_OTU_1   Sample6       0 CTAACTACCAAT    CATGCTGCCTCCCGTAGGAGT skin     
##  7 GG_OTU_2   Sample2       1 CATACCAGTAGC    CATGCTGCCTCCCGTAGGAGT gut      
##  8 GG_OTU_2   Sample3       0 CTCTCTACCTGT    CATGCTGCCTCCCGTAGGAGT gut      
##  9 GG_OTU_2   Sample4       2 CTCTCGGCCTGT    CATGCTGCCTCCCGTAGGAGT skin     
## 10 GG_OTU_2   Sample1       5 CGCTTATCGAGA    CATGCTGCCTCCCGTAGGAGT gut      
## # ℹ 20 more rows
## # ℹ 8 more variables: Description <chr>, taxonomy1 <chr>, taxonomy2 <chr>,
## #   taxonomy3 <chr>, taxonomy4 <chr>, taxonomy5 <chr>, taxonomy6 <chr>,
## #   taxonomy7 <chr>

5.0.1 purrr-style summarisation over samples

With the long-format data frame in hand, standard purrr + dplyr patterns work directly. The chunk below is guarded so it only runs when purrr is available, but the base-R equivalent (using tapply / lapply) is shown alongside each example for clarity.

library(purrr)
library(dplyr)

# Total counts per sample (purrr + dplyr)
long_df |>
  group_by(sample_id) |>
  summarise(total_counts = sum(count), .groups = "drop") |>
  arrange(desc(total_counts))
## # A tibble: 6 × 2
##   sample_id total_counts
##   <chr>            <dbl>
## 1 Sample1              7
## 2 Sample4              6
## 3 Sample3              4
## 4 Sample6              4
## 5 Sample2              3
## 6 Sample5              3
# Per-sample Shannon diversity using purrr::map_dbl
long_df |>
  group_by(sample_id) |>
  group_split() |>
  purrr::map_dbl(function(df) {
    p <- df$count / sum(df$count)
    p <- p[p > 0]
    -sum(p * log(p))
  }) |>
  setNames(unique(long_df$sample_id))
##   Sample1   Sample2   Sample3   Sample4   Sample5   Sample6 
## 0.5982696 1.0986123 1.3862944 0.6365142 0.0000000 1.0397208
# Base-R equivalent when purrr is not available
tapply(long_df$count, long_df$sample_id, function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log(p))
})

6 SummarizedExperiment interoperability

SummarizedExperiment is the standard Bioconductor container for rectangular feature-by-sample assay data. biomformat provides two ways to coerce a biom object into a SummarizedExperiment:

  1. The explicit constructor biom_to_SummarizedExperiment(x)
  2. S4 coercion syntax as(x, "SummarizedExperiment")

Both require the SummarizedExperiment and S4Vectors Bioconductor packages.

library(SummarizedExperiment)

se <- biom_to_SummarizedExperiment(x4)
se
## class: SummarizedExperiment 
## dim: 5 6 
## metadata(0):
## assays(1): counts
## rownames(5): GG_OTU_1 GG_OTU_2 GG_OTU_3 GG_OTU_4 GG_OTU_5
## rowData names(7): taxonomy1 taxonomy2 ... taxonomy6 taxonomy7
## colnames(6): Sample1 Sample2 ... Sample5 Sample6
## colData names(4): BarcodeSequence LinkerPrimerSequence BODY_SITE
##   Description

The count matrix is stored in the "counts" assay slot, observation metadata goes to rowData, and sample metadata goes to colData:

# Count matrix (same as biom_data(x4))
head(assay(se, "counts"))
## 5 x 6 sparse Matrix of class "dgCMatrix"
##          Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
## GG_OTU_1       .       .       1       .       .       .
## GG_OTU_2       5       1       .       2       3       1
## GG_OTU_3       .       .       1       4       .       2
## GG_OTU_4       2       1       1       .       .       1
## GG_OTU_5       .       1       1       .       .       .
# Sample metadata in colData
colData(se)[, 1:3]
## DataFrame with 6 rows and 3 columns
##         BarcodeSequence  LinkerPrimerSequence   BODY_SITE
##             <character>           <character> <character>
## Sample1    CGCTTATCGAGA CATGCTGCCTCCCGTAGGAGT         gut
## Sample2    CATACCAGTAGC CATGCTGCCTCCCGTAGGAGT         gut
## Sample3    CTCTCTACCTGT CATGCTGCCTCCCGTAGGAGT         gut
## Sample4    CTCTCGGCCTGT CATGCTGCCTCCCGTAGGAGT        skin
## Sample5    CTCTCTACCAAT CATGCTGCCTCCCGTAGGAGT        skin
## Sample6    CTAACTACCAAT CATGCTGCCTCCCGTAGGAGT        skin
# Observation (OTU) metadata in rowData
rowData(se)[, 1:3]
## DataFrame with 5 rows and 3 columns
##            taxonomy1         taxonomy2              taxonomy3
##          <character>       <character>            <character>
## GG_OTU_1 k__Bacteria p__Proteobacteria c__Gammaproteobacteria
## GG_OTU_2 k__Bacteria  p__Cyanobacteria    c__Nostocophycideae
## GG_OTU_3  k__Archaea  p__Euryarchaeota     c__Methanomicrobia
## GG_OTU_4 k__Bacteria     p__Firmicutes          c__Clostridia
## GG_OTU_5 k__Bacteria p__Proteobacteria c__Gammaproteobacteria

The S4 as() coercion is equivalent:

se2 <- as(x4, "SummarizedExperiment")
identical(assay(se, "counts"), assay(se2, "counts"))
## [1] TRUE

From here the full Bioconductor ecosystem is available — for example, downstream use with DESeq2, edgeR, limma, etc.

7 Session info

sessionInfo()
## R Under development (unstable) (2026-03-05 r89546)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.41.1 Biobase_2.71.0             
##  [3] GenomicRanges_1.63.1        Seqinfo_1.1.0              
##  [5] IRanges_2.45.0              S4Vectors_0.49.0           
##  [7] BiocGenerics_0.57.0         generics_0.1.4             
##  [9] MatrixGenerics_1.23.0       matrixStats_1.5.0          
## [11] dplyr_1.2.0                 purrr_1.2.1                
## [13] tibble_3.3.1                biomformat_1.39.10         
## [15] knitr_1.51                  BiocStyle_2.39.0           
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10         utf8_1.2.6          SparseArray_1.11.11
##  [4] lattice_0.22-9      digest_0.6.39       magrittr_2.0.4     
##  [7] evaluate_1.0.5      grid_4.6.0          bookdown_0.46      
## [10] fastmap_1.2.0       jsonlite_2.0.0      Matrix_1.7-4       
## [13] tinytex_0.58        BiocManager_1.30.27 jquerylib_0.1.4    
## [16] abind_1.4-8         cli_3.6.5           rlang_1.1.7        
## [19] XVector_0.51.0      cachem_1.1.0        DelayedArray_0.37.0
## [22] yaml_2.3.12         otel_0.2.0          S4Arrays_1.11.1    
## [25] tools_4.6.0         Rhdf5lib_1.33.5     vctrs_0.7.1        
## [28] R6_2.6.1            lifecycle_1.0.5     rhdf5_2.55.16      
## [31] magick_2.9.1        pkgconfig_2.0.3     bslib_0.10.0       
## [34] pillar_1.11.1       glue_1.8.0          Rcpp_1.1.1         
## [37] tidyselect_1.2.1    xfun_0.56           rhdf5filters_1.23.3
## [40] htmltools_0.5.9     rmarkdown_2.30      compiler_4.6.0