Flexible nonlinear spatial autoregressive models: a gradient boosting approach with closed-form estimation

Scientific motivation

Spatial autoregressive models are a central tool in spatial econometrics because observations located close to each other often interact. The spatial lag model introduced in the classical spatial econometric literature (Ord, 1975; Anselin, 1988) formalizes this idea through a row-standardized spatial weights matrix $W$, which defines spatial lags such as $Wy$. In empirical work, the parameter attached to $Wy$ is usually interpreted as a spillover or diffusion effect: outcomes observed at one location partly depend on outcomes observed at neighbouring locations.

The difficulty is that spatial dependence and nonlinear misspecification are hard to distinguish empirically. McMillen (2003) shows that residual spatial autocorrelation may arise because the regression function is misspecified, not necessarily because there is a genuine spatial spillover. This point is especially important when covariates are themselves spatially structured. A flexible regression component can correctly recover nonlinear covariate effects, but if it is too flexible it can also absorb the same low-frequency spatial signal that the researcher wants to attribute to a structural spatial parameter — a phenomenon analysed as the geoadditive identification trap in Geniaux (2026, Section 3.4). Recent work on identification in spatial spillover models emphasizes this tension and the need to distinguish functional-form flexibility from structural spatial dependence (Debarsy, 2025).

spboost is motivated by this identification problem, building on the methodological developments in Geniaux (2026), “Flexible nonlinear spatial autoregressive models: a gradient boosting approach with closed-form estimation”. It combines nonlinear regression tools with spatial autoregressive models while keeping the spatial parameter explicitly identified and amenable to standard diagnostics. The nonlinear component $f(X)$ can be fitted with component-wise boosting (mboost), penalized splines (mgcv), MARS (earth), or tree/linear boosting (xgboost). Component-wise boosting performs regularized variable selection while fitting additive effects (Bühlmann and van de Geer, 2011), thereby connecting the variable-selection literature in statistical learning to spatial econometric estimation.

The second ingredient is computational. Likelihood-based spatial models require repeated evaluation of log-determinants such as $\log |I-\rho W|$, which becomes expensive for large samples. Smirnov (2020, doi:10.1111/gean.12268) proposed a Closed-Form Estimator (CFE) for linear spatial dependence models that avoids this repeated determinant calculation. The CFE is computationally very fast because it relies on three auxiliary regressions without spatial dependence, while achieving accuracy close to maximum-likelihood estimation in the settings studied by Smirnov (2020, doi:10.1111/gean.12268) for the estimation of the spatial parameter. The methodological question behind spboost is whether this determinant-free idea remains valid when $f(X)$ is estimated nonparametrically. Geniaux (2026) gives a positive answer for additive estimators that admit a fixed smoothing-matrix or projection representation: a formal asymptotic result establishes the consistency of the Closed-Form Estimator and the validity of its asymptotic variance formula under standard regularity conditions on the additive components, with no additional regularity requirement induced by the spatial autoregressive structure. For boosting-based estimators, the formal proof for the greedy operator remains open, but extensive finite-sample evidence from the Monte Carlo experiments of Geniaux (2026) supports the same conclusion.

The package focuses on three Gaussian spatial data-generating structures:

\[ y = \rho Wy + f(X) + \varepsilon \qquad \text{(SAR)} \]

\[ y = f(X) + (I - \lambda W)^{-1}\varepsilon \qquad \text{(SEM)} \]

\[ y = \rho Wy + f(X) + (I - \lambda W_2)^{-1}\varepsilon \qquad \text{(SARAR)}. \]

Here $f(X)$ is an additive or machine-learning regression component. In the boosting backend, it is written as

\[ f(X) = \beta_0 + \sum_j h_j(X_j), \]

where each $h_j$ is selected and regularized through component-wise gradient boosting. The SARAR specification supports two distinct row-standardized weight matrices — $W$ for the spatial lag and $W_2$ for the error process — the common case $W_2 = W$ corresponding to the SARAR model studied in Geniaux (2026). The main interface, spbgam(), allows the user to specify the spatial model (DGP) and the estimation method (method).

Two families of spatial-parameter estimators are implemented. Methods ending in ML use a profile maximum-likelihood criterion and evaluate the spatial log-determinant. Methods ending in CFE use the determinant-free Closed-Form Estimator of Smirnov (2020, doi:10.1111/gean.12268), extended to nonlinear additive regression in Geniaux (2026). In the nonlinear setting, CFE proceeds by estimating auxiliary residuals from regressions involving $y$, $Wy$, and, for SAR, $W^2y$, and then solving a quadratic estimating equation for the spatial parameter.

The theoretical status of the estimators differs across estimation backends. The formal asymptotic result applies to nonlinear additive estimators that admit a fixed smoothing-matrix representation, including penalized spline GAM estimators and MARS estimators when the retained basis dimension is fixed. Component-wise boosting estimators are supported by consistent finite-sample evidence from the simulations of Geniaux (2026) and by their asymptotic connection to penalized spline regularization, but the formal proof for the greedy boosting operator remains open. Tree-based xgboost variants are provided as experimental predictive benchmarks whose validity in the CFE framework requires empirical validation on a case-by-case basis.

Main interface

The executable function signature is:

args(spbgam)
#> function (formula, data, W, W2 = NULL, DGP = "SAR", method = "gamboost_ML", 
#>     control = list(), debug = NULL, debug_fit_each_iter = NULL, 
#>     debug_print = NULL) 
#> NULL

The key arguments are:

Argument	Meaning
`formula`	Regression formula. Its syntax must match the backend: `mboost` base learners for `BSPA`, `mgcv` smooths for `GAM`, ordinary formula terms for `MARS` and `XGBOOST`.
`data`	Data frame containing the response and covariates.
`W`	Row-standardized spatial weights matrix for SAR or SEM dependence.
`W2`	Optional second row-standardized matrix for the error process in SARAR models.
`DGP`	Spatial model: `"SAR"`, `"SEM"`, or `"SARAR"`.
`method`	Estimator/backend combination, such as `"BSPA_SAR_CFE"` or `"MARS_SEM_ML"`.
`control`	List of backend and tuning controls.

Model and estimator catalogue

The method names follow a compact grammar:

<backend>_<model>_<spatial estimator>

The backend determines how $f(X)$ is estimated; the model determines the spatial structure; the suffix determines how the spatial parameter is estimated.

Prefix	Regression component	Typical formula syntax	Returned object
`BSPA`	Component-wise boosting with spline base learners from `mboost`	`Y ~ bbs(X1) + bbs(X2) + bols(X3)`	`mboost` object augmented with class `spboost`
`BLA`	Component-wise boosting with linear base learners	`Y ~ bols(X1) + bols(X2)`	`mboost` object augmented with class `spboost`
`GAM`	Penalized spline GAM from `mgcv`	`Y ~ s(X1) + s(X2)`	`gam`/`bam` object augmented with class `spboost`
`MARS`	Multivariate Adaptive Regression Splines through `earth::earth`	`Y ~ X1 + X2 + X3`	`earth` object augmented with class `spboost`
`XGBOOST`	Linear or tree-based gradient boosting through `xgboost`	`Y ~ X1 + X2 + X3`	`xgb.Booster` object augmented with class `spboost`
`LM`	Linear benchmark	`Y ~ X1 + X2 + X3`	Linear spatial model object augmented with spatial fields

Available combinations through spbgam() are:

Spatial model	Equation	Available methods
`SAR`	$y=\rho Wy+f(X)+\varepsilon$	`BSPA_SAR_ML`, `BSPA_SAR_CFE`, `BLA_SAR_CFE`, `MARS_SAR_ML`, `MARS_SAR_CFE`, `GAM_SAR_ML`, `GAM_SAR_CFE`, `GAM_SAR_FIVA`, `GAM_SAR_2SLSA`, `XGBOOST_SAR_ML`, `XGBOOST_SAR_CFE`, `LM_SAR_ML`, `BLA_SAR_ML`
`SEM`	$y=f(X)+(I-\lambda W)^{-1}\varepsilon$	`BSPA_SEM_ML`, `BSPA_SEM_CFE`, `BSPA_SEM_CFE_iter`, `BSPA_SEM_CFE_BRUT`, `MARS_SEM_ML`, `MARS_SEM_CFE`, `GAM_SEM_CFE`, `BLA_SEM_ML`
`SARAR`	$y=\rho Wy+f(X)+(I-\lambda W_2)^{-1}\varepsilon$	`BSPA_SARAR_ML`, `BSPA_SARAR_CFE`, `BLA_SARAR_ML`

In user-facing discussions, spboost_earth_cfe corresponds to the MARS_*_CFE branch, because the implementation uses earth::earth; spboost_xgboost_cfe corresponds to the XGBOOST_SAR_CFE branch.

Theoretical status

The table below is intentionally conservative. It separates formal asymptotic results, finite-sample empirical validation, and experimental methods.

Status	Estimators	Interpretation
Formal asymptotic guarantees	`GAM_SAR_CFE`, `GAM_SEM_CFE`; `MARS_SAR_CFE` and `MARS_SEM_CFE` when `nprune` is fixed	The CFE residual projection argument applies to additive estimators with a fixed smoothing-matrix or projection representation. Consistency and asymptotic variance validity follow under the regularity and identification conditions stated in the methodological paper.
Finite-sample validated by simulation evidence	`BSPA_SAR_CFE`, `BSPA_SEM_CFE`, `BSPA_SEM_CFE_iter`, `BSPA_SEM_CFE_BRUT`; `MARS_*_CFE` when `nprune` is selected by cross-validation	The formal proof does not directly cover the discrete model-selection path, but Monte Carlo evidence shows behaviour close to the corresponding GAM/MARS reference estimators in the studied designs.
Likelihood reference estimators	`*_ML` methods such as `BSPA_SAR_ML`, `MARS_SAR_ML`, `BSPA_SEM_ML`, `BSPA_SARAR_ML`	These are natural robustness checks for CFE results. They are usually more computationally demanding because they rely on profile likelihood and log-determinant evaluation.
Experimental spatial estimators	`GAM_SAR_FIVA`, `GAM_SAR_2SLSA`, `XGBOOST_SAR_CFE`, `XGBOOST_SAR_ML`, `BSPA_SARAR_CFE`	These are useful for benchmarking, prediction, or exploratory analysis, but their theoretical status is more limited in the current framework. Tree-based `xgboost` methods do not admit the smoothing-matrix representation used by the CFE proof.

For SEM models with high spatial error dependence near the boundary, CFE fallbacks may be triggered. In that case, it is good practice to compare against the corresponding ML estimator and to inspect the rho_method field stored in the fitted object.

Returned objects

All successful fits are given class spboost, in addition to the class of the underlying backend. Common fields include:

Field	Meaning
`rho`	Estimated spatial autoregressive parameter for SAR; in SEM routines this field often stores the estimated error parameter $\lambda$.
`lambda`	Error-process parameter for SARAR methods, when available.
`fitted`	Full fitted value on the response scale.
`fittedYS`	Fitted nonlinear component after the spatial transformation used internally.
`residuals`	Full-model residuals.
`rmse`	Root mean squared error of full-model residuals.
`var_rho`	CFE asymptotic variance estimate when available and numerically stable.
`rho_method`	Diagnostic label for the CFE branch, such as `cfe_exact`, `cfe_ratio_R`, or another fallback.
`spbgam_meta`	Metadata added by `spbgam()`, including the method, DGP, stopping rule, and CV information.

The package provides summary.spboost(), predict.spboost(), and fitted_decomp_spboost() methods for model inspection, prediction, and decomposition of fitted values by variable.

Formula syntax

The formula must match the backend.

For component-wise boosting:

formula_bspa <- Y ~ bbs(X1) + bbs(X2) + bbs(X3) + bols(X4)

For mgcv GAM backends:

formula_gam <- Y ~ s(X1) + s(X2) + s(X3)

For earth/MARS and xgboost:

formula_plain <- Y ~ X1 + X2 + X3 + X4

The MARS backend converts compatible formulas to an earth::earth formula. For reproducible inference with the current asymptotic argument, fix control_earth$nprune. If use_cv_nprune = TRUE, the estimator is still available and useful, but its theoretical status becomes empirical rather than formally covered by the fixed-projection proof.

Example 1: SAR model with fixed hyperparameters

The first example simulates a nonlinear SAR process and fits two BSPA estimators with known hyperparameters. The CFE version is fast and determinant-free; the ML version is a useful reference because it estimates $\rho$ by profile likelihood.

library(spboost)
library(mboost)

set.seed(1)
sim <- dgp(
  n = 1000,
  rho = 0.6,
  betas = c(0, 0.5, 1, -1),
  sigma2 = 1,
  model = "SAR",
  nonlin = TRUE,
  myseed = 1
)

fit_sar_cfe <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "BSPA_SAR_CFE",
  control = list(
    control_gamboost = boost_control(mstop = 150, nu = 0.1)
  )
)

fit_sar_cfe$rho
#> [1] 0.6073672
fit_sar_cfe$rmse
#> [1] 1.032868
fit_sar_cfe$rho_method
#> [1] "cfe_exact"
summary(fit_sar_cfe)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim$data, 
#>     W = sim$W, DGP = "SAR", method = "BSPA_SAR_CFE", control = list(control_gamboost = boost_control(mstop = 150, 
#>         nu = 0.1)))
#> Estimator: BSPA_SAR_CFE 
#> mstop: 150 (fixed)
#> Spatial parameter (rho/lambda): 0.607367 
#> Spatial estimator: cfe_exact 
#> Var(rho/lambda): 0.001110 
#> SE(rho/lambda): 0.033318 
#> Wald z-statistic: 18.229407 
#> p-value (H0: rho/lambda = 0): < 2.2e-16 
#> Elapsed (s): 0.240 
#> 
#> Full model diagnostics (original Y scale)
#> n = 1000 
#> RMSE: 1.032868 
#> MAE : 0.821090 
#> R2  : 0.567521 
#> Deviance explained (full): 0.567521 
#> 
#> Underlying model summary (spatially filtered equation)
#> 
#>   Model-based Boosting
#> 
#> Call:
#> gamboost(formula = "Y ~ bbs(X1) + bbs(X2) + bbs(X3)", data = data,     control = control)
#> 
#> 
#>   Squared Error (Regression) 
#> 
#> Loss function: (y - f)^2 
#>  
#> 
#> Number of boosting iterations: mstop = 150 
#> Step size:  0.1 
#> Offset:  0.4862967 
#> Number of baselearners:  3 
#> 
#> Selection frequencies:
#>   bbs(X2)   bbs(X3)   bbs(X1) 
#> 0.4600000 0.3333333 0.2066667

The likelihood counterpart uses the same regression formula and stopping iteration:

fit_sar_ml <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "BSPA_SAR_ML",
  control = list(
    control_gamboost = boost_control(mstop = 150, nu = 0.1)
  )
)

c(CFE = fit_sar_cfe$rho, ML = fit_sar_ml$rho)
#>       CFE        ML 
#> 0.6073672 0.6022003
summary(fit_sar_ml)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim$data, 
#>     W = sim$W, DGP = "SAR", method = "BSPA_SAR_ML", control = list(control_gamboost = boost_control(mstop = 150, 
#>         nu = 0.1)))
#> Estimator: BSPA_SAR_ML 
#> mstop: 150 (fixed)
#> Spatial parameter (rho/lambda): 0.602200 
#> 
#> Full model diagnostics (original Y scale)
#> n = 1000 
#> RMSE: 1.033908 
#> MAE : 0.821900 
#> R2  : 0.566650 
#> Deviance explained (full): 0.566650 
#> 
#> Underlying model summary (spatially filtered equation)
#> 
#>   Model-based Boosting
#> 
#> Call:
#> gamboost(formula = formula2, data = data, control = control_final)
#> 
#> 
#>   Squared Error (Regression) 
#> 
#> Loss function: (y - f)^2 
#>  
#> 
#> Number of boosting iterations: mstop = 150 
#> Step size:  0.1 
#> Offset:  0.4927991 
#> Number of baselearners:  3 
#> 
#> Selection frequencies:
#>   bbs(X2)   bbs(X3)   bbs(X1) 
#> 0.4600000 0.3333333 0.2066667

The fitted nonlinear components can be inspected visually. Because additive components are identifiable up to constants, the true and estimated component-specific curves are centered before plotting.

decomp_cfe_ex1 <- fitted_decomp_spboost(fit_sar_cfe)
decomp_ml_ex1 <- fitted_decomp_spboost(fit_sar_ml)

center_curve <- function(x) as.numeric(x - mean(x, na.rm = TRUE))
plot_component_ex1 <- function(true_col, fit_col, title) {
  ord <- order(sim$data[[fit_col]])
  x_component <- sim$data[[fit_col]][ord]
  true_component <- center_curve(sim$data[[true_col]])[ord]
  cfe_component <- center_curve(decomp_cfe_ex1[[fit_col]])[ord]
  ml_component <- center_curve(decomp_ml_ex1[[fit_col]])[ord]

  plot(
    x_component,
    ml_component,
    type = "l",
    lwd = 1,
    col = "#D95F02",
    xlab = fit_col,
    ylab = "Centered component",
    main = title,
    ylim = range(c(true_component, cfe_component, ml_component), na.rm = TRUE)
  )
  lines(x_component, cfe_component, col = "#1B9E77", lwd = 1)
  lines(x_component, true_component, col = "black", lwd = 3)
}

old_par_ex1 <- par(mfrow = c(3, 1), mar = c(4, 4, 2, 1))
plot_component_ex1("f1", "X1", "Component f1(X1)")
legend(
  "topleft",
  legend = c("True component", "BSPA_SAR_CFE", "BSPA_SAR_ML"),
  col = c("black", "#1B9E77", "#D95F02"),
  lwd = c(3, 1, 1),
  bty = "n"
)
plot_component_ex1("f2", "X2", "Component f2(X2)")
plot_component_ex1("f3", "X3", "Component f3(X3)")

par(old_par_ex1)

Example 2: Internal cross-validation, random and spatial

The same SAR model can be estimated with internal cross-validation for the boosting stopping iteration. With cv_mode = "random", folds are random. With cv_mode = "spatial_block", each fold holds out one spatial cluster built with blockCV::cv_cluster(). With cv_mode = "spatial_hex", folds are defined as spatial hexagonal blocks built from the coordinate columns x and y. However, this strategy is generally not recommended for autoregressive models because it may strongly disrupt the underlying neighborhood graph structure (see Geniaux, 2026). The spatial options require blockCV and sf; if they are not installed, the package falls back to a safe random split. Here mstop = 500 is not the final fixed stopping iteration: it is the maximum value searched by the internal CV procedure. Use cv_plot = TRUE to display the spatial folds as points coloured by block; the plotted coordinates are also stored in fit$mstop_selection$cv_plot_data.

fit_sar_cv_random <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "BSPA_SAR_CFE",
  control = list(
    # Maximum mstop considered by the internal CV search.
    control_gamboost = boost_control(mstop = 500, nu = 0.1),
    mstop_init = 100,
    mstop_criterion = "CV",
    nfold = 5,
    ncore = 1,
    cv_mode = "random",
    cv_seed_spatial = 1001
  )
)
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.

fit_sar_cv_spatial <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "BSPA_SAR_CFE",
  control = list(
    # Maximum mstop considered by the internal CV search.
    control_gamboost = boost_control(mstop = 500, nu = 0.1),
    mstop_init = 100,
    mstop_criterion = "CV",
    nfold = 5,
    ncore = 1,
    cv_mode = "spatial_block",
    cv_coord_vars_spatial = c("x", "y"),
    cv_hex_size_spatial = 0.15,
    cv_seed_spatial = 1001,
    cv_plot = TRUE
  )
)
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.

fit_sar_cv_random$spbgam_meta$mstop_final
#> [1] 152
fit_sar_cv_spatial$spbgam_meta$mstop_final
#> [1] 192
summary(fit_sar_cv_random)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim$data, 
#>     W = sim$W, DGP = "SAR", method = "BSPA_SAR_CFE", control = list(control_gamboost = boost_control(mstop = 500, 
#>         nu = 0.1), mstop_init = 100, mstop_criterion = "CV", 
#>         nfold = 5, ncore = 1, cv_mode = "random", cv_seed_spatial = 1001))
#> Estimator: BSPA_SAR_CFE 
#> mstop: 152 (CV)
#> CV mode: random 
#> CV RMSE on spatially filtered Y (at best mstop): 1.048652 
#> CV RMSE on Y (OOF BPN, at best mstop): 1.046238 
#> OOF predictions used: 1000 / 1000 
#> Spatial parameter (rho/lambda): 0.607354 
#> Spatial estimator: cfe_exact 
#> Var(rho/lambda): 0.001110 
#> SE(rho/lambda): 0.033317 
#> Wald z-statistic: 18.229361 
#> p-value (H0: rho/lambda = 0): < 2.2e-16 
#> Elapsed (s): 0.208 
#> 
#> Full model diagnostics (original Y scale)
#> n = 1000 
#> RMSE: 1.032814 
#> MAE : 0.821066 
#> R2  : 0.567567 
#> Deviance explained (full): 0.567567 
#> 
#> Underlying model summary (spatially filtered equation)
#> 
#>   Model-based Boosting
#> 
#> Call:
#> gamboost(formula = "Y ~ bbs(X1) + bbs(X2) + bbs(X3)", data = data,     control = control)
#> 
#> 
#>   Squared Error (Regression) 
#> 
#> Loss function: (y - f)^2 
#>  
#> 
#> Number of boosting iterations: mstop = 152 
#> Step size:  0.1 
#> Offset:  0.4863137 
#> Number of baselearners:  3 
#> 
#> Selection frequencies:
#>   bbs(X2)   bbs(X3)   bbs(X1) 
#> 0.4605263 0.3355263 0.2039474
summary(fit_sar_cv_spatial)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim$data, 
#>     W = sim$W, DGP = "SAR", method = "BSPA_SAR_CFE", control = list(control_gamboost = boost_control(mstop = 500, 
#>         nu = 0.1), mstop_init = 100, mstop_criterion = "CV", 
#>         nfold = 5, ncore = 1, cv_mode = "spatial_block", cv_coord_vars_spatial = c("x", 
#>             "y"), cv_hex_size_spatial = 0.15, cv_seed_spatial = 1001, 
#>         cv_plot = TRUE))
#> Estimator: BSPA_SAR_CFE 
#> mstop: 192 (CV)
#> CV mode: spatial_block 
#> CV RMSE on spatially filtered Y (at best mstop): 1.051404 
#> CV RMSE on Y (OOF BPN, at best mstop): 1.247931 
#> OOF predictions used: 1000 / 1000 
#> Spatial parameter (rho/lambda): 0.607215 
#> Spatial estimator: cfe_exact 
#> Var(rho/lambda): 0.001109 
#> SE(rho/lambda): 0.033307 
#> Wald z-statistic: 18.231059 
#> p-value (H0: rho/lambda = 0): < 2.2e-16 
#> Elapsed (s): 0.227 
#> 
#> Full model diagnostics (original Y scale)
#> n = 1000 
#> RMSE: 1.031971 
#> MAE : 0.820704 
#> R2  : 0.568272 
#> Deviance explained (full): 0.568272 
#> 
#> Underlying model summary (spatially filtered equation)
#> 
#>   Model-based Boosting
#> 
#> Call:
#> gamboost(formula = "Y ~ bbs(X1) + bbs(X2) + bbs(X3)", data = data,     control = control)
#> 
#> 
#>   Squared Error (Regression) 
#> 
#> Loss function: (y - f)^2 
#>  
#> 
#> Number of boosting iterations: mstop = 192 
#> Step size:  0.1 
#> Offset:  0.4864878 
#> Number of baselearners:  3 
#> 
#> Selection frequencies:
#>   bbs(X2)   bbs(X3)   bbs(X1) 
#> 0.4947917 0.3072917 0.1979167

The same fold assignment can be plotted explicitly from the fitted object. This is useful when knitting the vignette because the spatial CV diagnostic is then a standard top-level figure.

Example 3: Comparison table for SAR estimators

The next block compares six estimators on the same simulated SAR sample: linear lagsarlm, non-spatial gamboost, BSPA_SAR_CFE, BSPA_SAR_ML, GAM_SAR_CFE, and GAM_SAR_ML. The table reports the response RMSE, the error on $\rho$, and the RMSE of the estimated structural component $f(X)$. For the non-spatial gamboost benchmark, $\rho$ is not estimated.

The GAM_SAR_* versions are especially useful once the specification is known or has been selected in a preliminary BSPA workflow. They avoid the variable-selection boosting loop and are therefore much faster on large samples; in particular, GAM_SAR_CFE can switch to mgcv::bam() internally.

library(spatialreg)
#> Loading required package: spData
#> Loading required package: sf
#> Warning: package 'sf' was built under R version 4.3.3
#> Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(spdep)
#> 
#> Attaching package: 'spdep'
#> The following objects are masked from 'package:spatialreg':
#> 
#>     get.ClusterOption, get.VerboseOption, get.ZeroPolicyOption,
#>     get.coresOption, get.mcOption, set.ClusterOption,
#>     set.VerboseOption, set.ZeroPolicyOption, set.coresOption,
#>     set.mcOption

listw <- spdep::mat2listw(sim$W, style = "W")

fit_lagsarlm <- spatialreg::lagsarlm(
  Y ~ X1 + X2 + X3,
  data = sim$data,
  listw = listw,
  method = "LU",
  zero.policy = TRUE
)

fit_gamboost <- mboost::gamboost(
  Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data,
  control = boost_control(mstop = 150, nu = 0.1)
)

fit_gam_sar_cfe <- spbgam(
  formula = Y ~ s(X1, k = 5) + s(X2, k = 5) + s(X3, k = 5),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "GAM_SAR_CFE",
  control = list(ncore = 1)
)

fit_gam_sar_ml <- spbgam(
  formula = Y ~ s(X1, k = 5) + s(X2, k = 5) + s(X3, k = 5),
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "GAM_SAR_ML",
  control = list(ncore = 1)
)

true_f <- rowSums(sim$data[, c("f0", "f1", "f2", "f3")])
wy <- as.numeric(sim$W %*% sim$data$Y)

f_lagsarlm <- as.numeric(
  {
    x_lag <- model.matrix(Y ~ X1 + X2 + X3, sim$data)
    b_lag <- coef(fit_lagsarlm)[colnames(x_lag)]
    x_lag %*% b_lag
  }
)
f_gamboost <- as.numeric(fitted(fit_gamboost))
f_cfe <- as.numeric(fit_sar_cfe$fittedYS)
f_ml <- as.numeric(fit_sar_ml$fittedYS)
f_gam_cfe <- as.numeric(fit_gam_sar_cfe$fittedYS)
f_gam_ml <- as.numeric(fit_gam_sar_ml$fittedYS)

comparison_sar <- data.frame(
  estimator = c(
    "SAR(lagsarlm)", "gamboost", "BSPA_SAR_CFE", "BSPA_SAR_ML",
    "GAM_SAR_CFE", "GAM_SAR_ML"
  ),
  rmse_y = c(
    sqrt(mean((sim$data$Y - suppressMessages(fitted(fit_lagsarlm)))^2)),
    sqrt(mean((sim$data$Y - fitted(fit_gamboost))^2)),
    fit_sar_cfe$rmse,
    fit_sar_ml$rmse,
    fit_gam_sar_cfe$rmse,
    fit_gam_sar_ml$rmse
  ),
  rmse_rho = c(
    abs(fit_lagsarlm$rho - 0.6),
    NA_real_,
    abs(fit_sar_cfe$rho - 0.6),
    abs(fit_sar_ml$rho - 0.6),
    abs(fit_gam_sar_cfe$rho - 0.6),
    abs(fit_gam_sar_ml$rho - 0.6)
  ),
  rmse_f = c(
    sqrt(mean((true_f - f_lagsarlm)^2)),
    sqrt(mean((true_f - f_gamboost)^2)),
    sqrt(mean((true_f - f_cfe)^2)),
    sqrt(mean((true_f - f_ml)^2)),
    sqrt(mean((true_f - f_gam_cfe)^2)),
    sqrt(mean((true_f - f_gam_ml)^2))
  )
)

comparison_sar
#>       estimator   rmse_y     rmse_rho    rmse_f
#> 1 SAR(lagsarlm) 1.173212 2.798108e-02 0.5592926
#> 2      gamboost 1.320820           NA 0.7724488
#> 3  BSPA_SAR_CFE 1.032868 7.367201e-03 0.1268243
#> 4   BSPA_SAR_ML 1.033908 2.200315e-03 0.1261273
#> 5   GAM_SAR_CFE 1.033243 7.861567e-03 0.1208463
#> 6    GAM_SAR_ML 1.034861 9.937418e-05 0.1198516

Example 4: SEM model with spatial-error filtering

For SEM models, the spatial parameter governs the error process. The SEM-CFE methods build auxiliary residual regressions and can optionally tune the boosting stopping iteration using a SEM-filtered criterion. In this example, the simulated DGP is SEM and the target parameter is $\lambda$, stored in the rho field by the current SEM routines.

sim_sem <- dgp(
  n = 1000,
  rho = 0.5,
  betas = c(0, 0.5, 1, -1),
  sigma2 = 1,
  model = "SEM",
  nonlin = TRUE,
  myseed = 2
)

fit_sem_cfe <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim_sem$data,
  W = sim_sem$W,
  DGP = "SEM",
  method = "BSPA_SEM_CFE",
  control = list(
    control_gamboost = boost_control(mstop = 250, nu = 0.1),
    mstop_criterion = "CV",
    cv_mode = "random",
    nfold = 5,
    ncore = 1
  )
)
#> Warning in spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data =
#> sim_sem$data, : BSPA_SEM_CFE with SEM CV: enabling internal CV on CFE auxiliary
#> regressions (Y and WY).

fit_sem_cfe$rho
#> [1] 0.4665136
fit_sem_cfe$rmse
#> [1] 1.130323
fit_sem_cfe$rho_method
#> [1] "cfe_exact"
summary(fit_sem_cfe)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim_sem$data, 
#>     W = sim_sem$W, DGP = "SEM", method = "BSPA_SEM_CFE", control = list(control_gamboost = boost_control(mstop = 250, 
#>         nu = 0.1), mstop_criterion = "CV", cv_mode = "random", 
#>         nfold = 5, ncore = 1))
#> Estimator: BSPA_SEM_CFE 
#> mstop: 130 (CV)
#> CV mode: random 
#> CV RMSE on spatially filtered Y (at best mstop): 1.022093 
#> Spatial parameter (rho/lambda): 0.466514 
#> Spatial estimator: cfe_exact 
#> CFE exact root selected: 1 
#> Var(rho/lambda): 0.001541 
#> SE(rho/lambda): 0.039250 
#> Wald z-statistic: 11.885648 
#> p-value (H0: rho/lambda = 0): < 2.2e-16 
#> Elapsed (s): 0.088 
#> 
#> Full model diagnostics (original Y scale)
#> n = 1000 
#> RMSE: 1.130323 
#> MAE : 0.910778 
#> R2  : 0.268590 
#> Deviance explained (full): 0.268590 
#> 
#> Underlying model summary (spatially filtered equation)
#> 
#>   Model-based Boosting
#> 
#> Call:
#> gamboost(formula = formula, data = data, family = cust_loss_sem(I_rhoW),     control = control)
#> 
#> 
#>   SEM with I_rhoW=Diagonal(nrow(W))-rho*W 
#> 
#> Loss function: as.numeric((I_rhoW %*% (y - f))^2) 
#>  
#> 
#> Number of boosting iterations: mstop = 130 
#> Step size:  0.1 
#> Offset:  0.3461594 
#> Number of baselearners:  3 
#> 
#> Selection frequencies:
#>   bbs(X3)   bbs(X2)   bbs(X1) 
#> 0.4076923 0.4000000 0.1923077

When the estimated spatial error parameter is close to the boundary, compare with BSPA_SEM_ML:

fit_sem_ml <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim_sem$data,
  W = sim_sem$W,
  DGP = "SEM",
  method = "BSPA_SEM_ML",
  control = list(
    control_gamboost = boost_control(mstop = 250, nu = 0.1)
  )
)

c(CFE = fit_sem_cfe$rho, ML = fit_sem_ml$rho)
#>       CFE        ML 
#> 0.4665136 0.4659820
summary(fit_sem_ml)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim_sem$data, 
#>     W = sim_sem$W, DGP = "SEM", method = "BSPA_SEM_ML", control = list(control_gamboost = boost_control(mstop = 250, 
#>         nu = 0.1)))
#> Estimator: BSPA_SEM_ML 
#> mstop: 250 (fixed)
#> Spatial parameter (rho/lambda): 0.465982 
#> 
#> Full model diagnostics (original Y scale)
#> n = 1000 
#> RMSE: 1.128568 
#> MAE : 0.909587 
#> R2  : 0.270859 
#> Deviance explained (full): 0.270859 
#> 
#> Underlying model summary (spatially filtered equation)
#> 
#>   Model-based Boosting
#> 
#> Call:
#> gamboost(formula = formula, data = data, family = cust_loss_sem(I_rhoW),     control = control_final)
#> 
#> 
#>   SEM with I_rhoW=Diagonal(nrow(W))-rho*W 
#> 
#> Loss function: as.numeric((I_rhoW %*% (y - f))^2) 
#>  
#> 
#> Number of boosting iterations: mstop = 250 
#> Step size:  0.1 
#> Offset:  0.346143 
#> Number of baselearners:  3 
#> 
#> Selection frequencies:
#> bbs(X2) bbs(X3) bbs(X1) 
#>   0.492   0.340   0.168

A SEM comparison table can be built in the same spirit as the SAR table. Here errorsarlm is the linear spatial-error reference, while gamboost remains a non-spatial nonlinear benchmark.

fit_errorsarlm <- spatialreg::errorsarlm(
  Y ~ X1 + X2 + X3,
  data = sim_sem$data,
  listw = spdep::mat2listw(sim_sem$W, style = "W"),
  method = "LU",
  zero.policy = TRUE
)

fit_gamboost_sem <- mboost::gamboost(
  Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim_sem$data,
  control = boost_control(mstop = 250, nu = 0.1)
)

true_f_sem <- rowSums(sim_sem$data[, c("f0", "f1", "f2", "f3")])
f_errorsarlm <- as.numeric(
  {
    x_err <- model.matrix(Y ~ X1 + X2 + X3, sim_sem$data)
    b_err <- coef(fit_errorsarlm)[colnames(x_err)]
    x_err %*% b_err
  }
)
f_gamboost_sem <- as.numeric(fitted(fit_gamboost_sem))
f_sem_cfe <- as.numeric(fit_sem_cfe$fittedYS)
f_sem_ml <- as.numeric(fit_sem_ml$fittedYS)
lambda_errorsarlm <- if (!is.null(fit_errorsarlm$lambda)) {
  fit_errorsarlm$lambda
} else {
  fit_errorsarlm$lambda.se
}
if (length(lambda_errorsarlm) == 0L) lambda_errorsarlm <- NA_real_
scalar_or_na <- function(x) {
  if (is.null(x) || length(x) == 0L) return(NA_real_)
  as.numeric(x[1])
}

comparison_sem <- data.frame(
  estimator = c("SEM(errorsarlm)", "gamboost", "BSPA_SEM_CFE", "BSPA_SEM_ML"),
  rmse_y = c(
    sqrt(mean((sim_sem$data$Y - suppressMessages(fitted(fit_errorsarlm)))^2)),
    sqrt(mean((sim_sem$data$Y - fitted(fit_gamboost_sem))^2)),
    scalar_or_na(fit_sem_cfe$rmse),
    scalar_or_na(fit_sem_ml$rmse)
  ),
  rmse_lambda = c(
    abs(as.numeric(lambda_errorsarlm[1]) - 0.5),
    NA_real_,
    abs(scalar_or_na(fit_sem_cfe$rho) - 0.5),
    abs(scalar_or_na(fit_sem_ml$rho) - 0.5)
  ),
  rmse_f = c(
    sqrt(mean((true_f_sem - f_errorsarlm)^2)),
    sqrt(mean((true_f_sem - f_gamboost_sem)^2)),
    sqrt(mean((true_f_sem - f_sem_cfe)^2)),
    sqrt(mean((true_f_sem - f_sem_ml)^2))
  )
)

comparison_sem
#>         estimator   rmse_y rmse_lambda    rmse_f
#> 1 SEM(errorsarlm) 1.166148  0.09506476 0.5696819
#> 2        gamboost 1.128129          NA 0.1873721
#> 3    BSPA_SEM_CFE 1.130323  0.03348644 0.1787908
#> 4     BSPA_SEM_ML 1.128568  0.03401802 0.1759179

Example 5: MARS / earth CFE branch

The MARS_*_CFE methods use earth::earth for the nonlinear component. This branch is useful when piecewise-linear effects are expected and a compact model is desired.

fit_mars <- spbgam(
  formula = Y ~ X1 + X2 + X3,
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "MARS_SAR_CFE",
  control = list(
    control_earth = list(
      degree = 1,
      nprune = 12,
      trace = 0
    )
  )
)

fit_mars$rho
#> [1] 0.6039672
fit_mars$rmse
#> [1] 1.036029
fit_mars$rho_method
#> [1] "cfe_exact"

For theory-oriented work, fixing nprune keeps the estimator within the projection argument described above. For predictive work, nprune can also be selected by internal cross-validation:

fit_mars_cv <- spbgam(
  formula = Y ~ X1 + X2 + X3,
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "MARS_SAR_CFE",
  control = list(
    control_earth = list(
      degree = 1,
      use_cv_nprune = TRUE,
      cv_nfold = 5,
      cv_ncore = 1,
      trace = 0
    )
  )
)

Example 6: Experimental xgboost CFE branch

The XGBOOST_SAR_CFE estimator can be useful as a predictive benchmark, but it falls outside the formal smoothing-matrix framework. Use it as an experimental method and validate it empirically against CFE/ML additive estimators.

fit_xgb <- spbgam(
  formula = Y ~ X1 + X2 + X3,
  data = sim$data,
  W = sim$W,
  DGP = "SAR",
  method = "XGBOOST_SAR_CFE",
  control = list(
    mstop_init = 100,
    myparams_xgboost = list(
      booster = "gbtree",
      eta = 0.05,
      max_depth = 2,
      subsample = 0.8,
      colsample_bytree = 0.8,
      nfold = 5,
      early_stopping_rounds = 5,
      nthread = 1,
      verbose = 0
    )
  )
)

fit_xgb$rho
#> [1] 0.6035466
fit_xgb$rmse
#> [1] 0.9983839

Example 7: Prediction and decomposition

For fitted spboost objects, use the S3 prediction method. Out-of-sample prediction with spatial correction requires the full-sample W matrix covering both training and prediction locations.

pred_in <- predict(fit_sar_cfe)
head(pred_in)
#>            [,1]
#> [1,]  1.4229374
#> [2,]  0.3184280
#> [3,]  1.1425926
#> [4,] -0.1092532
#> [5,]  0.1610168
#> [6,]  1.0677343

decomp <- fitted_decomp_spboost(fit_sar_cfe)
head(decomp)
#>   Intercept          X1         X2          X3    fitted
#> 1 0.4862967  0.39833445  0.2019544  0.33635195 2.4327536
#> 2 0.4862967  0.50759936 -0.7541467  0.07867869 0.6900655
#> 3 0.4862967  0.39272386  0.1642198  0.09935227 2.3494308
#> 4 0.4862967 -0.05307168 -0.6423254  0.09984719 0.9543365
#> 5 0.4862967 -0.27366328  0.2495608 -0.30117731 1.1089248
#> 6 0.4862967 -0.44735972  0.7070805  0.32171686 2.0627377

For a new dataset, refit on a training subset and provide a full-sample weights matrix covering both training and prediction locations:

train_id <- 1:800
test_id <- 801:1000

W_train <- sim$W[train_id, train_id, drop = FALSE]
row_sum_train <- Matrix::rowSums(W_train)
W_train <- Matrix::Diagonal(
  x = ifelse(row_sum_train > 0, 1 / row_sum_train, 0)
) %*% W_train

fit_sar_train <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim$data[train_id, ],
  W = W_train,
  DGP = "SAR",
  method = "BSPA_SAR_CFE",
  control = list(
    control_gamboost = boost_control(mstop = 100, nu = 0.1)
  )
)
#> Warning in spb_warn_w_quality(W, name = "W"): W contains 2 island row(s) (row
#> sum ~ 0).

pred_new <- predict(
  fit_sar_train,
  newdata = sim$data[test_id, ],
  data = sim$data[train_id, ],
  W = sim$W,
  type = "BPN"
)
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.
#> Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
#> args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
#> Linear extrapolation used.

head(pred_new)
#> [1] 1.7598530 1.8834854 0.6036237 0.4816779 1.6679928 2.1017742

# diff RMSE train - test
fit_sar_train$rmse
#> [1] 1.078054
rmse_test<-sqrt(mean((pred_new-sim$data[test_id, 'Y'])^2))
rmse_test
#> [1] 1.036808

Example 8: SARAR models

SARAR models include both a spatial lag in the response and a spatially structured error process. They require two spatial matrices unless W2 = W is acceptable for the application.

sim_sarar <- dgp(
  n = 1000,
  rho = 0.5,
  lambda = -0.3,
  betas = c(0, 0.5, 1, -1),
  sigma2 = 1,
  model = "SARAR",
  nonlin = TRUE,
  myseed = 3
)

fit_sarar <- spbgam(
  formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3),
  data = sim_sarar$data,
  W = sim_sarar$W,
  W2 = sim_sarar$W2,
  DGP = "SARAR",
  method = "BSPA_SARAR_CFE",
  control = list(
    control_gamboost = boost_control(mstop = 200, nu = 0.1),
    iter_max = 3,
    tol_lambda = 1e-4
  )
)

fit_sarar$rho
#> [1] 0.450844
fit_sarar$lambda
#> [1] -0.2373298
fit_sarar$rmse
#> [1] 1.188259

Because SARAR-CFE involves joint determinant-free estimation of two spatial parameters, it should be treated more cautiously than the marginal SAR and SEM branches.

Practical recommendations

Start with the simplest estimator that matches the scientific question. For inference-oriented nonlinear SAR or SEM work with known smooth structure, GAM_*_CFE or fixed-nprune MARS_*_CFE are the closest to the formal asymptotic theory. For automatic variable selection, use BSPA_* methods and compare CFE results with their ML counterparts. For prediction-oriented benchmarks, MARS and xgboost can be valuable, but xgboost CFE should be reported as experimental.

For very large datasets, a practical workflow is to search for the model specification with BSPA_*_CFE on representative subsamples, using boosting as a variable-selection and formula-discovery device. Once the relevant covariates, nonlinear terms, and spatial specification are known, refit a GAM_*_CFE model on the full sample with the selected formula. On large samples, the GAM-CFE branch can use mgcv::bam(), which is extremely fast for large additive models. This combines the exploratory strength of boosting with the stronger asymptotic status and computational stability of the GAM-CFE branch at full scale.

Always inspect:

fit <- fit_sar_cfe
summary(fit)
#> Call (spbgam):
#> spbgam(formula = Y ~ bbs(X1) + bbs(X2) + bbs(X3), data = sim$data, 
#>     W = sim$W, DGP = "SAR", method = "BSPA_SAR_CFE", control = list(control_gamboost = boost_control(mstop = 150, 
#>         nu = 0.1)))
#> Estimator: BSPA_SAR_CFE 
#> mstop: 150 (fixed)
#> Spatial parameter (rho/lambda): 0.607367 
#> Spatial estimator: cfe_exact 
#> Var(rho/lambda): 0.001110 
#> SE(rho/lambda): 0.033318 
#> Wald z-statistic: 18.229407 
#> p-value (H0: rho/lambda = 0): < 2.2e-16 
#> Elapsed (s): 0.240 
#> 
#> Full model diagnostics (original Y scale)
#> n = 1000 
#> RMSE: 1.032868 
#> MAE : 0.821090 
#> R2  : 0.567521 
#> Deviance explained (full): 0.567521 
#> 
#> Underlying model summary (spatially filtered equation)
#> 
#>   Model-based Boosting
#> 
#> Call:
#> gamboost(formula = "Y ~ bbs(X1) + bbs(X2) + bbs(X3)", data = data,     control = control)
#> 
#> 
#>   Squared Error (Regression) 
#> 
#> Loss function: (y - f)^2 
#>  
#> 
#> Number of boosting iterations: mstop = 150 
#> Step size:  0.1 
#> Offset:  0.4862967 
#> Number of baselearners:  3 
#> 
#> Selection frequencies:
#>   bbs(X2)   bbs(X3)   bbs(X1) 
#> 0.4600000 0.3333333 0.2066667
fit$rho
#> [1] 0.6073672
fit$lambda
#> NULL
fit$rho_method
#> [1] "cfe_exact"
fit$rmse
#> [1] 1.032868
fit$spbgam_meta
#> $method
#> [1] "BSPA_SAR_CFE"
#> 
#> $DGP
#> [1] "SAR"
#> 
#> $mstop_criterion_requested
#> [1] "none"
#> 
#> $cv_mode_requested
#> [1] "random"
#> 
#> $control_mstop
#> [1] 150
#> 
#> $mstop_final
#> [1] 150
#> 
#> $mstop_source
#> [1] "fixed"
#> 
#> $spectral_overlap_requested
#> [1] FALSE
#> 
#> $spectral_k_requested
#> [1] 50

When rho_method is not an exact CFE branch, report the fallback and compare with the likelihood counterpart. When covariates or geographic smoothers are strongly spatially structured, keep the model deliberately parsimonious and check sensitivity to mstop, formula specification, and the choice of cross-validation regime.

References

Anselin, L. (1988). Spatial Econometrics: Methods and Models. Kluwer.

Buhlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794.

Debarsy, N. and Le Gallo, J. (2025). Identification of spatial spillovers: Do’s and don’ts. Journal of Economic Surveys, 39(5), 2152-2173.

Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1-67.

Geniaux, G. (2026). Flexible nonlinear spatial autoregressive models: a gradient boosting approach with closed-form estimation, 20th WORLD CONFERENCE OF THE SPATIAL ECONOMETRICS ASSOCIATION & 24th INTERNATIONAL WORKSHOP ON SPATIAL ECONOMETRICS AND STATISTICS, Paris, June 17-18, 2026

McMillen, D. P. (2003). Spatial autocorrelation or model misspecification? International Regional Science Review, 26(2), 208-217.

Ord, K. (1975). Estimation methods for models of spatial interaction. Journal of the American Statistical Association, 70(349), 120-126.

Smirnov, O. A. (2020). A closed-form consistent estimator for linear models with spatial dependence. Geographical Analysis. doi:10.1111/gean.12268.

Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B, 73(1), 3-36.

spboost: Nonlinear Spatial Autoregressive Models with Boosting and CFE

Ghislain Geniaux