This vignette demonstrates the use of the pengls package for high-dimensional data with spatial or temporal autocorrelation. It consists of an iterative loop around the nlme and glmnet packages. Currently, only continuous outcomes and R2 and MSE as performance measure are implemented.
The pengls package is available from BioConductor, and can be installed as follows:
library(BiocManager)
install("pengls")
Once installed, it can be loaded and version info printed.
suppressPackageStartupMessages(library(pengls))
cat("pengls package version", as.character(packageVersion("pengls")), "\n")
## pengls package version 1.10.0
We first create a toy dataset with spatial coordinates.
library(nlme)
25 #Sample size
n <- 50 #Number of features
p <- 15 #Size of the grid
g <-#Generate grid
expand.grid("x" = seq_len(g), "y" = seq_len(g))
Grid <-# Sample points from grid without replacement
Grid[sample(nrow(Grid), n, replace = FALSE),]
GridSample <-#Generate outcome and regressors
matrix(rnorm(p*n), n , p)
b <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
a <-#Compile to a matrix
data.frame("a" = a, "b" = b, GridSample) df <-
The pengls method requires prespecification of a functional form for the autocorrelation. This is done through the corStruct objects defined by the nlme package. We specify a correlation decaying as a Gaussian curve with distance, and with a nugget parameter. The nugget parameter is a proportion that indicates how much of the correlation structure explained by independent errors; the rest is attributed to spatial autocorrelation. The starting values are chosen as reasonable guesses; they will be overwritten in the fitting process.
# Define the correlation structure (see ?nlme::gls), with initial nugget 0.5 and range 5
corGaus(form = ~ x + y, nugget = TRUE, value = c("range" = 5, "nugget" = 0.5)) corStruct <-
Finally the model is fitted with a single outcome variable and large number of regressors, with the chosen covariance structure and for a prespecified penalty parameter λ=0.2.
#Fit the pengls model, for simplicity for a simple lambda
pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE), glsSt = corStruct, lambda = 0.2, verbose = TRUE) penglsFit <-
## Starting iterations...
## Iteration 1
## Iteration 2
## Iteration 3
Standard extraction functions like print(), coef() and predict() are defined for the new “pengls” object.
penglsFit
## pengls model with correlation structure: corGaus
## and 18 non-zero coefficients
coef(penglsFit)
penglsCoef <- predict(penglsFit) penglsPred <-
The method can also account for temporal autocorrelation by defining another correlation structure from the nlme package, e.g. autocorrelation structure of order 1:
set.seed(354509)
100 #Sample size
n <- 10 #Number of features
p <-#Generate outcome and regressors
matrix(rnorm(p*n), n , p)
b <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
a <-#Compile to a matrix
data.frame("a" = a, "b" = b, "t" = seq_len(n))
dfTime <- corAR1(form = ~ t, value = 0.5) corStructTime <-
The fitting command is similar, this time the λ parameter is found through cross-validation of the naive glmnet (for full cross-validation , see below). We choose α=0.5 this time, fitting an elastic net model.
pengls(data = dfTime, outVar = "a", verbose = TRUE,
penglsFitTime <-xNames = grep(names(dfTime), pattern = "b", value =TRUE),
glsSt = corStructTime, nfolds = 5, alpha = 0.5)
## Fitting naieve model...
## Starting iterations...
## Iteration 1
## Iteration 2
Show the output
penglsFitTime
## pengls model with correlation structure: corAR1
## and 2 non-zero coefficients
The pengls package also provides cross-validation for finding the optimal λ value. If the tuning parameter λ is not supplied, the optimal λ according to cross-validation with the naive glmnet function (the one that ignores dependence) is used. Hence we recommend to use the following function to use cross-validation. Multithreading is supported through the BiocParallel package :
library(BiocParallel)
register(MulticoreParam(3)) #Prepare multithereading
3 #Number of cross-validation folds nfolds <-
The function is called similarly to cv.glmnet:
cv.pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE), glsSt = corStruct, nfolds = nfolds) penglsFitCV <-
Check the result:
penglsFitCV
## Cross-validated pengls model with correlation structure: corGaus
## and 2 non-zero coefficients.
## 3 fold cross-validation yielded an estimated R2 of -0.3023641 .
By default, the 1 standard error is used to determine the optimal value of λ :
$lambda.1se #Lambda for 1 standard error rule penglsFitCV
## [1] 1.184277
$cvOpt #Corresponding R2 penglsFitCV
## [1] -0.3023641
Extract coefficients and fold IDs.
head(coef(penglsFitCV))
## [1] -1.614877 0.000000 0.000000 0.348187 0.000000 0.000000
$foldid #The folds used penglsFitCV
## 39 163 26 178 204 78 74 57 104 176 7 207 193 69 1 27 53 56 58 167
## 2 3 2 3 3 1 2 2 2 3 1 3 3 2 1 2 1 2 2 1
## 50 110 47 84 111
## 1 1 1 2 1
By default, blocked cross-validation is used, but random cross-validation is also available (but not recommended for timecourse or spatial data). First we illustrate the different ways graphically, again using the timecourse example:
set.seed(5657)
makeFolds(nfolds = nfolds, dfTime, "random", "t")
randomFolds <- makeFolds(nfolds = nfolds, dfTime, "blocked", "t")
blockedFolds <-plot(dfTime$t, randomFolds, xlab ="Time", ylab ="Fold")
points(dfTime$t, blockedFolds, col = "red")
legend("topleft", legend = c("random", "blocked"), pch = 1, col = c("black", "red"))
To perform random cross-validation
cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, cvType = "random") penglsFitCVtime <-
To negate baseline differences at different timepoints, it may be useful to center or scale the outcomes in the cross validation. For instance for centering only:
cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, cvType = "blocked", transFun = function(x) x-mean(x))
penglsFitCVtimeCenter <-$cvOpt #Better performance penglsFitCVtimeCenter
## [1] 0.9949127
Alternatively, the mean squared error (MSE) can be used as loss function, rather than the default R2:
cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, loss = "MSE") penglsFitCVtime <-
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.38.0 nlme_3.1-164 pengls_1.10.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 knitr_1.46 rlang_1.1.3 xfun_0.43
## [5] highr_0.10 jsonlite_1.8.8 htmltools_0.5.8.1 sass_0.4.9
## [9] glmnet_4.1-8 rmarkdown_2.26 grid_4.4.0 evaluate_0.23
## [13] jquerylib_0.1.4 fastmap_1.1.1 yaml_2.3.8 foreach_1.5.2
## [17] lifecycle_1.0.4 compiler_4.4.0 codetools_0.2-20 Rcpp_1.0.12
## [21] lattice_0.22-6 digest_0.6.35 R6_2.5.1 parallel_4.4.0
## [25] splines_4.4.0 shape_1.4.6.1 bslib_0.7.0 Matrix_1.7-0
## [29] tools_4.4.0 iterators_1.0.14 survival_3.6-4 cachem_1.0.8