Overview

The deepgp package provides a fully-Bayesian implementation of deep Gaussian process (DGP) regression models via Markov Chain Monte Carlo (MCMC). The package was designed with an eye towards surrogate modeling of computer experiments. It supports several acquisition criteria for strategic sequential design and offers optional Vecchia-approximation for faster computations with larger data sizes. Computations are performed in C/C++ under-the-hood, with optional parallelization through OpenMP and SNOW.

This vignette serves as an introduction to the package’s main features. Full methodological details are provided in my Ph.D. Dissertation (Sauer 2023) and in the following manuscripts:

Sauer, Gramacy, et al. (2023): Bayesian DGPs with sequential design targeting minimized posterior variance
Sauer, Cooper, et al. (2023): incorporation of the Vecchia approximation into the Bayesian DGP framework
Gramacy et al. (2022): optimization through expected improvement with strategic selection of candidates
Booth et al. (2025): contour location using entropy (combined with posterior uncertainty)
Barnett et al. (2025): monotonic latent layer warpings
Booth (2025): gradient-enhancement and gradient predictions

While I will showcase simple one-dimensional examples here, the simulated examples provided in these works span up to seven dimensions, with training data sizes up to tens of thousands. Real-world applications to computer experiments include a three-dimensional simulation of the Langley Glide Back Booster (LGBB, Pamadi et al. 2004), the seven-dimensional Test Particle Monte Carlo Simulator (TPM) of a satellite in low earth orbit (Sun et al. 2019), and a seven-dimensional simulation of an RAE-2822 transonic airfoil (Economon et al. 2016), with training data sizes ranging from \(n = 300\) to \(n = 100,000\).

All exercises are supported by code in a public git repository: https://bitbucket.org/gramacylab/deepgp-ex/. Further details on function arguments and default settings (including prior distributions) are provided in the function documentation.

In the first section, I’ll discuss model fitting (training through MCMC then generating posterior predictions). I’ll review a typical one-layer Gaussian process before introducing the DGP functionality. I’ll also discuss extensions to the original DGP framework, including the utilization of Vecchia approximation (Vecchia 1988) for faster computations, monotonic warpings, gradient-enhancement, and gradient predictions. In the second section, I’ll introduce sequential design criteria for three different objectives: optimizing the response, locating a contour, and minimizing posterior variance.

To start, let’s define a one-dimensional function that we can use for demonstration and visualization. The following code defines the “Booth” function, an adaptation of the “Higdon” function (Higdon 2002) which is found on the pages of the Virtual Library of Simulation Experiments (Surjanovic and Bingham 2013).

booth <- function(x) {
  y <- rep(NA, length(x))
  for (i in 1:length(x)) {
    if (x[i] <= 0.58) {
      y[i] <- sin(pi * x[i] * 6) + cos(pi * x[i] * 12)
    } else y[i] <- 5 * x[i] - 4.9
  }
  return(y)
}

Next, we generate training and testing data. For now, we will work in a noise-free setting (to mimic a deterministic computer simulation).

# Training data
n <- 20
x_booth <- seq(0, 1, length = n)
y_booth <- booth(x_booth)

# Testing data
np <- 100
xp_booth <- seq(0, 1, length = np)
yp_booth <- booth(xp_booth)

plot(xp_booth, yp_booth, type = "l", col = 4, xlab = "X", ylab = "Y", 
     main = "Booth function")
points(x_booth, y_booth)

The piecewise form of this function is an example of “nonstationarity”. There are two distinct regimes (one wiggly and one linear), with an abrupt shift near x = 0.6. We will soon see how a one-layer GP suffers from an assumption of stationarity which restricts it from fitting these two separate regimes simultaneously. DGPs are better equipped to handle such complex surfaces.

Model Fitting

One-layer Shallow GP

Traditional Gaussian processes (GPs) are popular nonlinear regression models, preferred for their closed-form posterior predictive moments, with relevant applications to Machine Learning (ML, Rasmussen and Williams 2005) and computer experiment surrogate modeling (Santner et al. 2018; Gramacy 2020). GP models place a multivariate normal prior distribution for the response, \[ Y \sim \mathcal{N}\left(0, \Sigma(X)\right), \] with \[ \Sigma(X)^{ij} = \Sigma(x_i, x_j) = \tau^2\left(k\left(\frac{||x_i - x_j||^2}{\theta}\right) + g\mathbb{I}_{i=j}\right), \] where \(k(\cdot)\) represents a stationary kernel function (deepgp offers both the squared exponential and Matérn kernels). The package only supports mean-zero priors (aside from a special case in the two-layer DGP), but non-zero means may be addressed by pre-scaling the response appropriately. In this canonical form, GPs suffer from the limitation of stationarity due to reliance on covariance kernels \(k(\cdot)\) that are strictly functions of Euclidean distance.

This covariance depends on hyperparameters \(\tau^2\), \(\theta\), and \(g\) controlling the scale, lengthscale, and noise respectively. [Notice how \(\theta\) operates on squared distances under this parameterization.] Looking forward to our DGP implementation, we embrace an MCMC scheme utilizing Metropolis Hastings sampling of these unknown hyperparameters (except \(\tau^2\) which can be marginialized analytically under an inverse gamma reference prior, see Gramacy (2020), Chapter 5).

Ok, let’s get to some code. MCMC sampling of theta and g for a one-layer GP is built into the fit_one_layer function. Since our function is deterministic, we will fix g at a small value with true_g argument below. The following code collects 2,000 Metropolis Hastings samples (using the Matérn kernel by default and creating an S3 object of the gp class), then displays trace plots of the log likelihood and corresponding samples.

NOTE: To keep run times low, we use 2,000 MCMC samples throughout this vignette. This is NOT the general recommendation. More MCMC sampling is often required to reach sufficient burn-in, particularly with more complicated deeper models. The software default is 10,000.

gp_booth <- fit_one_layer(x_booth, y_booth, nmcmc = 2000, true_g = 1e-6, 
                          verb = FALSE)
plot(gp_booth)

There is randomization involved in MCMC proposals and acceptances, so these trace plots may show slight variations upon different renderings. In this simple case they burn-in quickly. Before we do anything else with these samples, we may want to trim off burn-in and thin the remaining samples. This functionality is built into the trim function (which will work on any of our models, not just the one layer).

gp_booth <- trim(gp_booth, 1000, 2) # remove 1000 as burn-in, thin by half

Now that we have a set of posterior samples of kernel hyperparameters, we can generate posterior predictions for the testing locations xp. Under a GP prior, posterior predictions \(Y^\star\) at \(X^\star\) locations follow \[ \begin{aligned} Y^\star \mid X, Y \sim \mathcal{N}\left(\mu^\star, \Sigma^\star\right) \quad\textrm{where}\quad \mu^\star &= \Sigma(X^\star, X)\Sigma(X)^{-1}Y \\ \Sigma^\star &= \Sigma(X^\star) - \Sigma(X^\star, X)\Sigma(X)^{-1}\Sigma(X, X^\star) \end{aligned} \]

Each \(\Sigma(\cdot)\) above relies on values of \(\tau^2\), \(\theta\), and \(g\). Since we have MCMC samples \(\{\tau2^t, \theta^t\}\) for \(t\in\mathcal{T}\) (with g fixed), we can evaluate \(\mu^{\star(t)}\) and \(\Sigma^{\star(t)}\) for each \(t\in\mathcal{T}\) and aggregate the results using the law of total variance, \[ \mu^\star = \sum_{t=1}^\mathcal{T} \mu^{\star(t)} \quad\textrm{and}\quad \Sigma^\star = \sum_{t=1}^\mathcal{T} \Sigma^{\star(t)} + \mathbb{C}\mathrm{ov}\left(\mu^{\star(t)}\right) \]

This process is neatly wrapped in the predict function (tailored for each S3 class object), with convenient plotting in one-dimension provided in the plot function.

gp_booth <- predict(gp_booth, xp_booth, lite = FALSE)
plot(gp_booth)

The GP provides a nice nonlinear fit with uncertainty estimates, but it has inflated uncertainty in the linear region. It is unable to accommodate different regimes simultaneously.

Two-layer DGP

DGPs (Damianou and Lawrence 2013) address this stationarity limitation through functional compositions of Gaussian layers. Intermediate layers act as warped versions of the original inputs, allowing for nonstationary flexibility. A “two-layer” DGP may be formulated as \[ \begin{aligned} Y \mid W &\sim \mathcal{N}\left(0, \Sigma(W)\right) \\ W_i &\sim \mathcal{N}\left(0, \Sigma_i(X)\right) \quad\textrm{for}\quad i = 1, \dots, D \end{aligned} \] where \(\Sigma(\cdot)\) represents a stationary kernel function as defined in the previous subsection and \(W = [W_1, W_2, \dots, W_D]\) is the column-combined matrix of individual \(W_i\) “nodes”. Latent layers are noise-free with unit scale (i.e. \(g = 0\) and \(\tau^2 = 1\) for each \(W_i\)). The intermediate latent layer \(W\) is unobserved and presents an inferential challenge (it can not be marginialized from the posterior analytically). Many have embraced variational inference for fast, thrifty approximations Marmin and Filippone (2022). The deepgp package offers an alternative, fully-Bayesian sampling-based inferential scheme as detailed in Sauer, Gramacy, et al. (2023). This sampling approach provides full uncertainty quantification (UQ), which is crucial for sequential design.

Latent Gaussian layers are sampled through elliptical slice sampling (ESS, Murray et al. 2010). Kernel hyperparameters are sampled through Metropolis Hastings. All of these are iterated in a Gibbs scheme. MCMC sampling for two-layer DGPs is wrapped in the fit_two_layer function.

dgp_booth <- fit_two_layer(x_booth, y_booth, nmcmc = 1000, true_g = 1e-6, 
                           verb = FALSE)
plot(dgp_booth)

Notice there are now two lengthscale parameters, one for each GP layer (there will be additional theta_w[i] in higher dimensions, with each \(\Sigma_i(X)\) having its own). Again there is some randomization involved in this sampling. These trace plots are helpful in diagnosing burn-in. If you need to run further MCMC samples, you may use the continue function.

dgp_booth <- continue(dgp_booth, 1000, verb = FALSE)

There are now 10,000 total samples stored in fit2,

dgp_booth$nmcmc
#> [1] 2000

In one dimension we may visualize the ESS samples of W by specifying hidden = TRUE to the plot call.

plot(dgp_booth, trace = FALSE, hidden = TRUE)

Each line represents a single ESS sample of \(W\). Since 10,000 lines would be a lot to plot, an evenly spaced selection of 100 samples is shown. The samples are colored on a gradient – red lines are the earliest iterations and yellow lines are the latest. In this simple example these lines don’t look all too interesting, but we see that they are “stretching” inputs in the left region (indicated by a steeper slope) and “squishing” inputs in the right region (indicated by the flatter slope). This accommodates the fact that there is more signal for x < 0.6.

We may then trim/thin and predict using the same function calls. Predictions follow the same structure of the one-layer GP, they just involve an additional mapping of \(X^\star\) through \(W\) before obtaining \(\mu^\star\) and \(\Sigma^\star\). We utilize lite = FALSE here so that full posterior predictive covariances are returned (by default, only point-wise variances are returned).

dgp_booth <- trim(dgp_booth, 1000, 2)
dgp_booth <- predict(dgp_booth, xp_booth, lite = FALSE)
plot(dgp_booth)

The two-layer DGP was more easily able to identify the “wiggly” regions on the left, while simultaneously predicting the linear fit on the right.

Three-layer DGP

A three-layer DGP involves an additional functional composition, \[ \begin{aligned} Y \mid W &\sim \mathcal{N}\left(0, \Sigma(W)\right) \\ W_i \mid Z &\sim \mathcal{N}\left(0, \Sigma_i(Z)\right) \quad\textrm{for}\quad i = 1, \dots, D \\ Z_j &\sim \mathcal{N}\left(0, \Sigma_j(X)\right) \quad\textrm{for}\quad j = 1, \dots, D \end{aligned} \] We conduct additional Metropolis Hastings sampling of lengthscale parameters for the innermost layer and additional elliptical slice sampling for \(Z\). In prediction, we incorporate an additional mapping \(X^\star \rightarrow Z \rightarrow W\) before obtaining predictions for \(Y^\star\). Training/prediction utilizes the same functionality, simply with the fit_three_layer function.

dgp3_booth <- fit_three_layer(x_booth, y_booth, nmcmc = 2000, true_g = 1e-6, 
                              verb = FALSE)
dgp3_booth <- trim(dgp3_booth, 1000, 2)
dgp3_booth <- predict(dgp3_booth, xp_booth, lite = FALSE)
plot(dgp3_booth)

You may notice that the three-layer fit is similar to the two-layer fit (although it requires significantly more computation). There are diminishing returns for additional depth. We will formally compare the predictions next.

Evaluation Metrics

To objectively compare various models, we can evaluate predictions against the true response at the testing locations. There are three evaluation metrics incorporated in the package:

rmse: root mean squared prediction error (lower is better)
crps: continuous rank probability score (requires point-wise predictive variances, lower is better)
score: predictive log likelihood (requires full predictive covariance, higher is better)

Let’s evaluate these metrics for the three models we have entertained.

metrics <- data.frame("RMSE" = c(rmse(yp_booth, gp_booth$mean), 
                                 rmse(yp_booth, dgp_booth$mean),
                                 rmse(yp_booth, dgp3_booth$mean)),
                      "CRPS" = c(crps(yp_booth, gp_booth$mean, diag(gp_booth$Sigma)),
                                 crps(yp_booth, dgp_booth$mean, diag(dgp_booth$Sigma)),
                                 crps(yp_booth, dgp3_booth$mean, diag(dgp3_booth$Sigma))),
                      "SCORE" = c(score(yp_booth, gp_booth$mean, gp_booth$Sigma),
                                  score(yp_booth, dgp_booth$mean, dgp_booth$Sigma),
                                  score(yp_booth, dgp3_booth$mean, dgp3_booth$Sigma)))
rownames(metrics) <- c("One-layer GP", "Two-layer DGP", "Three-layer DGP")
metrics
#>                       RMSE       CRPS    SCORE
#> One-layer GP    0.06312500 0.04393267 6.866264
#> Two-layer DGP   0.05737324 0.03290125 8.032227
#> Three-layer DGP 0.09039607 0.05572757 7.559084

Since we used a small number of MCMC iterations to keep computation time low, results may vary depending on random initializations. They will stabilize with more MCMC. Generally, the two-layer DGP outperforms. The additional flexibility of the three-layer DGP is not necessary in this setting.

Extensions

Noisy Modeling

Previously, we fixed the g parameter using the true_g argument. If we wanted to estimate noisy data, we could leave true_g = NULL (the default), and the g parameter would be inferred through Metropolis Hastings sampling. For example:

dgp_booth_noisy <- fit_two_layer(x_booth, y_booth, nmcmc = 2000, verb = FALSE)
dgp_booth_noisy <- trim(dgp_booth_noisy, 1000, 2)
dgp_booth_noisy <- predict(dgp_booth_noisy, xp_booth, lite = FALSE)
plot(dgp_booth_noisy)

Vecchia Approximation

The Bayesian DGP suffers from the computational bottlenecks that are endemic in typical GPs. Inverses of dense covariance matrices experience \(\mathcal{O}(n^3)\) costs. The Vecchia approximation (Vecchia 1988) circumvents this bottleneck by inducing sparsity in the precision matrix (Katzfuss et al. 2020; Katzfuss and Guinness 2021). The sparse Cholesky decomposition of the precision matrix is easily populated in parallel (the package uses OpenMP for this), making for fast evaluation of Gaussian likelihoods and posterior predictive moments. In addition to the original, un-approximated implementation, the deepgp package offers the option to utilize Vecchia approximation in all under-the-hood calculations. Vecchia approximation is triggered by specifying vecchia = TRUE in any of the fit functions.

An additional m argument controls the fidelity of the approximation. Details are provided in Sauer, Cooper, et al. (2023). While the original implementation was only suitable for data sizes in the hundreds, the Vecchia option allows for feasible computations with data sizes up to a hundred thousand.

The to_vec function may be called on a non-Vecchia approximated fit before prediction in order to trigger Vecchia-approximated predictions (when MCMC samples were not collected with Vecchia approximation). This is useful if the training data size is small, but the testing data size is large.

Note, this Vecchia implementation is targeted towards no-noise or low-noise settings. It has not been adapted for high-noise, low-signal situations yet.

Joint posterior sampling

The post_sample function provides full joint posterior sampling (rather than the summarized posterior moments that are provided in predict). Vecchia approximation will be used if the fit was evaluated with vecchia = TRUE or the object has been passed through the to_vec function. I will provide an example in the “Gradients” section below.

Monotonic Warpings

In its original form, the only assumption placed on the latent layer \(W\) is the Gaussian process prior. This is a very flexible assumption, and in some situations it is advantageous to reign in this flexibility. In particular, we may want to constrain \(W\) so that it monotonically warps \(X\). In our previous plots of ESS samples, monotonicity would ensure that no line folds back in on itself, thus ensuring an injective mapping from x to each sample of w.

To enforce monotonicity, we must warp each dimension of \(X\) individually. We accomplish a monotonic warping by applying a transformation to the originally implemented ESS samples.ESS proposals proceed normally, but only the transformed values are plugged into the outer Gaussian likelihood. Further methodological details are provided in Barnett et al. (2025).

Let’s see this in action. Monotonic warpings are most effective when the nonstationarity in the response surface is axis-aligned. To highlight this, consider the two-dimensional “cross-in-tray” function (also found on the VLSE).

tray <- function(x) {
  x <- x * 4 - 2
  p1 <- abs(100 - sqrt(apply(x^2, 1, sum)) / pi)
  p2 <- abs(apply(sin(x), 1, prod) * exp(p1)) + 1
  y <- -0.0001 * (p2)^(0.1)
  return((y + 1.9) / 0.2)
}

The following code visualizes this two-dimensional surface. The steep ridges are axis-aligned, making this a prime candidate for monotonic warpings.

grid <- seq(0, 1, length = 30)
xp_tray <- as.matrix(expand.grid(grid, grid))
yp_tray <- tray(xp_tray)
par(mar = c(1, 1, 1, 1))
persp(grid, grid, matrix(yp_tray, nrow = length(grid)), xlab = "x1", ylab = "x2",
      zlab = "y", theta = 30, phi = 30, r = 30)

We then get some training data and fit a two-layer DGP as usual, but specify monowarp = TRUE to trigger the monotonic warpings.

x_tray <- matrix(runif(50 * 2), ncol = 2)
y_tray <- tray(x_tray)
dgp_tray <- fit_two_layer(x_tray, y_tray, nmcmc = 2000, monowarp = TRUE, 
                          true_g = 1e-6, verb = FALSE)
dgp_tray <- trim(dgp_tray, 1000, 2)

When we visualize the samples of the latent layer, we now have w1 as a monotonic function over x1 and w2 as a monotonic function over x2. Each of these “nodes” is stretching the input values near the center and squishing the input values near the edges, which allows the outer GP to better model the cross shape in the center of the space.

plot(dgp_tray, trace = FALSE, hidden = TRUE)

Finally, we may visualize the predicted surface. [Although the code is not featured here, these predictions compare favoribly to the original two-layer DGP and to a stationary GP.]

dgp_tray <- predict(dgp_tray, xp_tray, lite = TRUE)
plot(dgp_tray)

Separable Lengthscales for One-layer GP

To allow for nonstationarity along axis-alignment, it is common to use a vectorized lengthscale with separate \(\theta_h\) for each dimension of \(X\) (see Gramacy (2020), Chapter 5). \[ \Sigma(X)^{ij} = \Sigma(x_i, x_j) = \tau^2\left(k\left(\sum_{h=1}^d \left(\frac{(x_{ih} - x_{jh})^2}{\theta_h}\right)\right) + g\mathbb{I}_{i=j}\right), \] While this is not the primary functionality of the deepgp package, a one-layer GP with separable lengthscales (sampled through additional Metropolis Hastings) may be a handy benchmark. Separable (anisotropic) lengthscales are triggered by the argument sep = TRUE to fit_one_layer.

Note, separable lengthscales are not an option for two- and three-layer models as the latent layers provide enough flexibility to mimic vectorized \(\theta\). Latent layer nodes are modeled isotropically to preserve identifiability and parsimony.

Gradients

Version 1.2.0 now offers gradient-enhancement and gradient predictions for one-layer GP and two-layer DGP models, following Booth (2025). Gradients require the "exp2" kernel. To demonstrate, we will use a simple step function.

step <- function(x) {
  y <- pnorm((x - 0.5)/0.04)
  dy <- dnorm((x - 0.5)/0.04)/0.04
  return(list(y = y, dy = dy))
}

# Training data
x_step <- seq(0, 1, length = 6) 
y_step <- step(x_step)

# Testing data
xp_step <- seq(0, 1, length = 100)
yp_step <- step(xp_step)

par(mfrow = c(1, 2))
plot(xp_step, yp_step$y, type = "l", col = 4, xlab = "X", ylab = "Y",
     main = "Step function")
points(x_step, y_step$y)
plot(xp_step, yp_step$dy, type = "l", col = 4, xlab = "X", ylab = "Y",
     main = "Step function Gradient")
points(x_step, y_step$dy)

Gradient-enhancement is triggered when dydx is provided in the fit_one_layer or fit_two_layer functions. Predictions of the gradient are triggered by specifying grad = TRUE in either the predict or post_sample functions. These may be combined or used separately.

For example, the following code fits a one-layer gradient-enhanced GP, then draws joint posterior samples of the response and the gradient at the testing locations.

gp_step <- fit_one_layer(x_step, y_step$y, y_step$dy, nmcmc = 2000, 
                          true_g = 1e-6, cov = "exp2", verb = FALSE)
gp_step <- trim(gp_step, 1000, 10) # leave 100 iterations
gp_samples <- post_sample(gp_step, xp_step, grad = TRUE)

par(mfrow = c(1, 2))
matplot(xp_step, t(gp_samples$y), type = "l", xlab = "x", ylab = "y", main = "GP")
points(x_step, y_step$y, pch = 20)
matplot(xp_step, t(gp_samples$dy), type = "l", xlab = "x", ylab = "dy", main = "GP")
points(x_step, y_step$dy, pch = 20)

We may do the same with a two-layer DGP.

dgp_step <- fit_two_layer(x_step, y_step$y, y_step$dy, nmcmc = 2000, 
                          true_g = 1e-6, cov = "exp2", verb = FALSE)
dgp_step <- trim(dgp_step, 1000, 10) # leave 100 iterations
dgp_samples <- post_sample(dgp_step, xp_step, grad = TRUE)

par(mfrow = c(1, 2))
matplot(xp_step, t(dgp_samples$y), type = "l", xlab = "x", ylab = "y", main = "DGP")
points(x_step, y_step$y, pch = 20)
matplot(xp_step, t(dgp_samples$dy), type = "l", xlab = "x", ylab = "dy", main = "DGP")
points(x_step, y_step$dy, pch = 20)

Active Learning

Active learning (also called “sequential design”) is the process of sequentially selecting training locations using a statistical “surrogate” model to inform acquisitions. Acquisitions may target specific objectives including optimizing the response or minimizing posterior variance. The nonstationary flexibility of a DGP, combined with full UQ through Bayesian MCMC, is well-suited for sequential design tasks. Where a typical stationary GP may end up space-filling, DGPs are able to target areas of high signal/high interest (Sauer, Gramacy, et al. 2023).

Optimization through Expected Improvement

If the sequential design objective is to minimize the response, the expected improvement criterion (EI, Jones et al. 1998) is offered by specifying the argument EI = TRUE within the predict function. For the Booth function example, we can calculate EI using the testing locations xp as candidates with

dgp_booth <- predict(dgp_booth, xp_booth, EI = TRUE)

To visualize EI with the corresponding fit, the following code overlays EI (green line) atop the two-layer DGP fit.

plot(dgp_booth)
par(new = TRUE)
plot(xp_booth, dgp_booth$EI, type = "l", lwd = 2, col = 3, axes = FALSE, 
     xlab = "", ylab = "")
points(xp_booth[which.max(dgp_booth$EI)], max(dgp_booth$EI), pch = 17, 
       cex = 1.5, col = 3)

The next acquisition is chosen as the candidate that maximized EI (highlighted by the green triangle). In this example, the EI criterion is accurately identifying the two local minimums in the response surface. This implementation relies on optimization through candidate evaluation. Strategically choosing candidates is crucial; see Gramacy et al. (2022) for discussion on candidate optimization of EI with DGPs.

Contour Location through Entropy

If the sequential design objective is to locate an entire contour or level set in the response surface, the entropy criterion is offered by specification of an entropy_limit within the predict function. The entropy_limit represents the value of the response that separates pass/fail regions. For the Booth function example, if we define a contour at y = 0 and again use the testing grid as candidates, this yields

dgp_booth <- predict(dgp_booth, xp_booth, entropy_limit = 0)
plot(dgp_booth)
par(new = TRUE)
plot(xp_booth, dgp_booth$entropy, type = "l", lwd = 2, col = 3, axes = FALSE, 
     xlab = "", ylab = "")
points(xp_booth[which.max(dgp_booth$entropy)], max(dgp_booth$entropy), pch = 17, 
       cex = 1.5, col = 3)

Notice entropy is highest where the predicted mean crosses the limit threshold. The entropy criterion alone, with its extreme peakiness and limited incorporation of posterior uncertainty, is not an ideal acquisition criteria. Instead, it is best combined with additional metrics that encourage exploration. See (booth2024contour?) for further discussion. Again, this implementation relies on candidates, and strategic allocation of candidate locations is crucial.

Minimizing Posterior Variance

If instead, we simply want to obtain the best surrogate fit we may choose to conduct sequential design targeting minimal posterior variance; the deepgp package offers two acquisition criteria to target minimal posterior variance. The current implementation of these criteria requires the "exp2" kernel and does not support Vecchia approximation. To demonstrate, we will revisit the step function, creating some fresh surrogates (without gradient-enhancement).

gp_step <- fit_one_layer(x_step, y_step$y, nmcmc = 2000, true_g = 1e-6, 
                         cov = "exp2", verb = FALSE)
gp_step <- trim(gp_step, 1000, 2)
gp_step <- predict(gp_step, xp_step, lite = TRUE)
plot(gp_step)

dgp_step <- fit_two_layer(x_step, y_step$y, nmcmc = 2000, true_g = 1e-6, 
                         cov = "exp2", verb = FALSE)
dgp_step <- trim(dgp_step, 1000, 2)
dgp_step <- predict(dgp_step, xp_step, lite = TRUE)
plot(dgp_step)

The Integrated Mean Squared Error (IMSE) acquisition criterion (Sacks et al. 1989) is implemented in the IMSE function. The following code evaluates this criterion for the one- and two-layer models to the step function, using the predictive grid as candidates. The next acquisition is chosen as the candidate that produced the minimum IMSE (highlighted by the blue triangle).

gp_imse <- IMSE(gp_step, xp_step)
dgp_imse <- IMSE(dgp_step, xp_step)
par(mfrow = c(1, 2))
plot(xp_step, gp_imse$value, type = "l", ylab = "IMSE", main = "One-layer")
points(xp_step[which.min(gp_imse$value)], min(gp_imse$value), pch = 17, 
       cex = 1.5, col = 4)
plot(xp_step, dgp_imse$value, type = "l", ylab = "IMSE", main = "Two-layer")
points(xp_step[which.min(dgp_imse$value)], min(dgp_imse$value), pch = 17, cex = 1.5, col = 4)

Whereas the one-layer model is inclined to space-fill with low IMSE across most of the domain, the two-layer model appropriately targets acquisitions in the middle region, where uncertainty is highest. The sum approximation to IMSE, referred to as Active learning Cohn in the ML literature, is similarly implemented in the ALC function (although the ALC acquisition is chosen as the candidate that yields the maximum criterion value).

gp_alc <- ALC(gp_step, xp_step)
dgp_alc <- ALC(dgp_step, xp_step)

When conducting a full sequential design, it is helpful to initialize the DGP chain after an acquisition at the last sampled values of latent layers and kernel hyperparameters. Initial values of MCMC sampling for a two-layer DGP may be specified by the theta_y_0, theta_w_0, g_0, and w_0 arguments to fit_two_layer. Similar arguments are implemented for fit_one_layer and fit_three_layer. Examples of this are provided in the “active_learning” folder of the git repository (https://bitbucket.org/gramacylab/deepgp-ex/).

Computational Considerations

Fully-Bayesian DGPs are hefty models that require a good bit of computation. While I have integrated tools to assist with fast computations, there are still some relevant considerations.

If you are not using the Vecchia implementation, you may substantially speed-up under-the-hood matrix calculations by using a fast linear algebra library. R’s default BLAS/LAPACK is not optimized for speedy matrix calculations. I recommend Intel’s Math Kernel Library (MKL) when possible. Other options include OpenBLAS and the Accelerate framework on OSX. Check out Appendix A in Gramacy (2020) for further discussion and illustration of linear algebra library alternatives. Even with these, I recommend using vecchia = TRUE for data sizes above the low hundreds.
If you are using the Vecchia implementation, computation speed is reliant on parallelizaton through OpenMP. When packages are installed from CRAN on a Mac, they are not compiled with OpenMP by default. To set up OpenMP parallelization on a Mac, download the package source code and install using the gcc/g++ compiler. The fit functions will produce a warning message when vecchia = TRUE is called if OpenMP is not setup.

References

Barnett, Steven D, Lauren J Beesley, Annie S Booth, Robert B Gramacy, and Dave Osthus. 2025. “Monotonic Warpings for Additive and Deep Gaussian Processes.” Statistics and Computing 35 (3): 65.

Booth, Annie S. 2025. “Deep Gaussian Processes with Gradients.” arXiv Preprint arXiv:2512.18066.

Booth, Annie S, S Ashwin Renganathan, and Robert B Gramacy. 2025. “Contour Location for Reliability in Airfoil Simulation Experiments Using Deep Gaussian Processes.” The Annals of Applied Statistics 19 (1): 191–211.

Damianou, Andreas, and Neil D Lawrence. 2013. “Deep Gaussian Processes.” Artificial Intelligence and Statistics, 207–15.

Economon, Thomas D, Francisco Palacios, Sean R Copeland, Trent W Lukaczyk, and Juan J Alonso. 2016. “SU2: An Open-Source Suite for Multiphysics Simulation and Design.” Aiaa Journal 54 (3): 828–46.

Gramacy, Robert B. 2020. Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences. Chapman Hall/CRC.

Gramacy, Robert B, Annie Sauer, and Nathan Wycoff. 2022. “Triangulation Candidates for Bayesian Optimization.” Advances in Neural Information Processing Systems 35: 35933–45.

Higdon, Dave. 2002. “Space and Space-Time Modeling Using Process Convolutions.” In Quantitative Methods for Current Environmental Issues. Springer.

Jones, DR, M Schonlau, and WJ Welch. 1998. “Efficient Global Optimization of Expensive Black-Box Functions.” Journal of Global Optimization 13 (4): 455–92.

Katzfuss, Matthias, and Joseph Guinness. 2021. “A General Framework for Vecchia Approximations of Gaussian Processes.” Statistical Science 36 (1): 124–41.

Katzfuss, Matthias, Joseph Guinness, Wenlong Gong, and Daniel Zilber. 2020. “Vecchia Approximations of Gaussian-Process Predictions.” Journal of Agricultural, Biological and Environmental Statistics 25 (3): 383–414.

Marmin, Sébastien, and Maurizio Filippone. 2022. “Deep Gaussian Processes for Calibration of Computer Models.” Bayesian Analysis, 1–30. https://doi.org/10.1214/21-BA1293.

Murray, Iain, Ryan Prescott Adams, and David J. C. MacKay. 2010. “Elliptical Slice Sampling.” The Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, JMLR: w&CP, vol. 9: 541–48.

Pamadi, Bandu, Peter Covell, Paul Tartabini, and Kelly Murphy. 2004. “Aerodynamic Characteristics and Glide-Back Performance of Langley Glide-Back Booster.” 22nd Applied Aerodynamics Conference and Exhibit, 5382.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2005. Gaussian Processes for Machine Learning. MIT Press.

Sacks, J, WJ Welch, TJ Mitchell, and HP Wynn. 1989. “Design and Analysis of Computer Experiments.” Statistical Science 4 (4): 409–23.

Salimbeni, Hugh, and Marc Deisenroth. 2017. “Doubly Stochastic Variational Inference for Deep Gaussian Processes.” Advances in Neural Information Processing Systems 30.

Santner, TJ, BJ Williams, and W Notz. 2018. The Design and Analysis of Computer Experiments, Second Edition. Springer–Verlag.

Sauer, Annie. 2023. “Deep Gaussian Process Surrogates for Computer Experiments.” PhD thesis, Virginia Tech.

Sauer, Annie, Andrew Cooper, and Robert B Gramacy. 2023. “Vecchia-Approximated Deep Gaussian Processes for Computer Experiments.” Journal of Computational and Graphical Statistics 32: 824–37.

Sauer, Annie, Robert B Gramacy, and David Higdon. 2023. “Active Learning for Deep Gaussian Process Surrogates.” Technometrics 65 (1): 4–18.

Sun, Furong, Robert B Gramacy, Benjamin Haaland, Earl Lawrence, and Andrew Walker. 2019. “Emulating Satellite Drag from Large Simulation Experiments.” SIAM/ASA Journal on Uncertainty Quantification 7 (2): 720–59.

Surjanovic, S, and D Bingham. 2013. Virtual Library of Simulation Experiments: Test Functions and Datasets. Https://www.sfu.ca/~ssurjano/.

Vecchia, Aldo V. 1988. “Estimation and Model Identification for Continuous Spatial Processes.” Journal of the Royal Statistical Society: Series B (Methodological) 50 (2): 297–312.

`deepgp`: an R-package for Bayesian Deep Gaussian Processes

Annie S. Booth (annie_booth@vt.edu)

Overview

Model Fitting

One-layer Shallow GP

Two-layer DGP

Three-layer DGP

Evaluation Metrics

Extensions

Noisy Modeling

Vecchia Approximation

Joint posterior sampling

Monotonic Warpings

Separable Lengthscales for One-layer GP

Gradients

Active Learning

Optimization through Expected Improvement

Contour Location through Entropy

Minimizing Posterior Variance

Computational Considerations

References

deepgp: an R-package for Bayesian Deep Gaussian Processes

Annie S. Booth (annie_booth@vt.edu)

Overview

Model Fitting

One-layer Shallow GP

Two-layer DGP

Three-layer DGP

Evaluation Metrics

Extensions

Noisy Modeling

Vecchia Approximation

Joint posterior sampling

Monotonic Warpings

Separable Lengthscales for One-layer GP

Gradients

Active Learning

Optimization through Expected Improvement

Contour Location through Entropy

Minimizing Posterior Variance

Computational Considerations

References

`deepgp`: an R-package for Bayesian Deep Gaussian Processes