| Language: | en-US |
| Type: | Package |
| Title: | Component-Wise Gradient Boosting after Multiple Imputation |
| Version: | 0.1.2 |
| Description: | Component-wise gradient boosting for analysis of multiply imputed datasets. Implements the algorithm Boosting after Multiple Imputation (MIBoost), which enforces uniform variable selection across imputations and provides utilities for pooling. Includes a cross-validation workflow that first splits the data into training and validation sets and then performs imputation on the training data, applying the learned imputation models to the validation data to avoid information leakage. Supports Gaussian and logistic loss. Methods relate to gradient boosting and multiple imputation as in Buehlmann and Hothorn (2007) <doi:10.1214/07-STS242>, Friedman (2001) <doi:10.1214/aos/1013203451>, and van Buuren (2018, ISBN:9781138588318) and Groothuis-Oudshoorn (2011) <doi:10.18637/jss.v045.i03>; see also Kuchen (2025) <doi:10.48550/arXiv.2507.21807>. |
| License: | MIT + file LICENSE |
| URL: | https://arxiv.org/abs/2507.21807, https://github.com/RobertKuchen/booami |
| BugReports: | https://github.com/RobertKuchen/booami/issues |
| Encoding: | UTF-8 |
| Depends: | R (≥ 4.0) |
| Imports: | MASS, stats, utils, withr |
| Suggests: | mice, miceadds, Matrix, knitr, rmarkdown, testthat (≥ 3.0.0), spelling |
| Config/testthat/edition: | 3 |
| RoxygenNote: | 7.3.2 |
| LazyData: | TRUE |
| NeedsCompilation: | no |
| Packaged: | 2026-02-19 14:46:02 UTC; rokuchen |
| Author: | Robert Kuchen [aut, cre] |
| Maintainer: | Robert Kuchen <rokuchen@uni-mainz.de> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-19 15:10:07 UTC |
Boosting with Multiple Imputation (booami)
Description
booami provides component-wise gradient boosting tailored for analysis with multiply imputed datasets. Its core contribution is MIBoost, an algorithm that couples base-learner selection across imputed datasets by minimizing an aggregated loss at each iteration, yielding a single, unified regularization path and improved model stability. For comparison, booami also includes per-dataset boosting with post-hoc pooling (estimate averaging or selection-frequency thresholding).
Details
What is MIBoost?
In each boosting iteration, candidate base-learners are fit separately within each imputed dataset, but selection is made jointly via the aggregated loss across datasets. The selected base-learner is then updated in every imputed dataset, and fitted contributions are averaged to form a single combined predictor. This enforces uniform variable selection while preserving dataset-specific gradients and updates.
Cross-validation without leakage
booami implements a leakage-avoiding CV protocol:
data are first split into training and validation subsets; the training
covariates are multiply imputed; validation covariates are imputed using the
training imputation models; and (if enabled) centering uses a fold-specific
grand mean \mu_\star computed from the training imputations and applied
consistently to all imputed training and validation matrices. Errors are
averaged across imputations and folds to select the optimal number of boosting
iterations (mstop). Use cv_boost_raw for raw data with
missing covariates (imputation inside CV), or cv_boost_imputed
when imputed datasets are already prepared.
Note: In the recommended predictive workflow implemented by
cv_boost_raw(), rows with missing outcomes y are removed before
fold assignment, and the outcome is not used for imputation (covariates X
are imputed without including y as a predictor).
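The split-then-impute step can be sketched directly with mice (a minimal illustration for one fold, assuming covariate-only data frames X_tr and X_va and M = 2 imputations; the grand-mean definition shown is one plausible reading, and booami's CV functions handle all of this internally):
# Minimal sketch of the leakage-avoiding split-then-impute step for one fold.
# Assumes covariate-only data frames X_tr (training) and X_va (validation); M = 2.
M <- 2
pm <- mice::quickpred(X_tr, mincor = 0.30, minpuc = 0.60)      # train-only predictor matrix
imp_tr <- mice::mice(X_tr, m = M, predictorMatrix = pm,
                     maxit = 1, printFlag = FALSE)             # impute training covariates
imp_va <- mice::mice.mids(imp_tr, newdata = X_va,
                          maxit = 1, printFlag = FALSE)        # apply training models to validation
X_tr_m <- lapply(seq_len(M), function(m) data.matrix(mice::complete(imp_tr, m)))
X_va_m <- lapply(seq_len(M), function(m) data.matrix(mice::complete(imp_va, m)))
mu_star <- Reduce(`+`, lapply(X_tr_m, colMeans)) / M           # fold-specific grand mean (one plausible definition)
X_tr_m  <- lapply(X_tr_m, function(X) sweep(X, 2, mu_star))    # center training ...
X_va_m  <- lapply(X_va_m, function(X) sweep(X, 2, mu_star))    # ... and validation consistently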
Key features
- MIBoost (uniform selection): Joint base-learner selection via aggregated loss across imputed datasets; averaged fitted functions yield a single model.
- Per-dataset boosting (with pooling): Independent boosting in each imputed dataset, with pooling by estimate averaging or by selection-frequency thresholding.
- Flexible losses and learners: Supports Gaussian and logistic losses with component-wise base-learners; extensible to other learners.
- Leakage-safe CV: Training/validation split → train-only imputation of covariates → fold-wise grand-mean centering (\mu_\star) → error aggregation across imputations and folds.
Main functions
- impu_boost — Core routine implementing MIBoost as well as per-dataset boosting with pooling.
- cv_boost_raw — Leakage-safe k-fold CV starting from a single dataset with missing covariates (imputation performed inside each fold).
- cv_boost_imputed — CV when imputed datasets (and splits) are already available.
Typical workflow
- Raw data with missing covariates: use cv_boost_raw() to impute within folds, select mstop, and fit the final model (see the sketch below).
- Already imputed datasets: use cv_boost_imputed() to select mstop and fit.
- Direct control: call impu_boost() when you want to run MIBoost (or per-dataset boosting) directly, optionally followed by pooling.
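A minimal end-to-end sketch of the raw-data workflow (small settings chosen only to keep the run short; booami_sim ships with the package):
utils::data(booami_sim)
X <- booami_sim[, 1:25]
y <- booami_sim[, 26]
fit <- cv_boost_raw(
  X, y, k = 2, mstop = 50, seed = 123,
  impute_args = list(m = 2, maxit = 1, printFlag = FALSE),
  show_progress = FALSE
)
fit$best_mstop                        # selected number of boosting iterations
head(fit$final_model)                 # intercept followed by pooled coefficients
X_cc <- X[stats::complete.cases(X), , drop = FALSE]
head(booami_predict(fit, X_new = X_cc, family = "gaussian"))   # predictions on complete rows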
Mathematical sketch
At boosting iteration t, for each candidate base-learner r and
each imputed dataset m = 1,\dots,M, let
RSS_r^{(m)[t]} denote the residual sum of squares.
The aggregated loss is
L_r^{[t]} = \sum_{m=1}^M RSS_r^{(m)[t]}.
The base-learner r^* with minimal aggregated loss is selected jointly,
updated in all imputed datasets, and the fitted contributions are averaged to
form the combined predictor. After t_{\mathrm{stop}} iterations, this
yields a single final model.
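As a toy illustration of this selection rule (not booami's internal code; all names below are made up for the example), one can compute the aggregated loss over M imputed datasets with simple least-squares base-learners and pick the joint minimizer:
set.seed(1)
M <- 3; n <- 50; p <- 4
X_list <- replicate(M, matrix(rnorm(n * p), n, p), simplify = FALSE)   # imputed covariate matrices
u_list <- lapply(X_list, function(X) X[, 1] + rnorm(n))                # current negative gradients
RSS <- sapply(seq_len(p), function(r)                                  # rows: datasets, cols: base-learners
  sapply(seq_len(M), function(m) {
    fit <- lm.fit(cbind(1, X_list[[m]][, r]), u_list[[m]])             # component-wise least squares
    sum(fit$residuals^2)                                               # RSS_r^(m)[t]
  }))
aggregated_loss <- colSums(RSS)          # L_r^[t] = sum over m of RSS_r^(m)[t]
r_star <- which.min(aggregated_loss)     # jointly selected base-learner (here: variable 1)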
References
Buehlmann, P. and Hothorn, T. (2007). "Boosting Algorithms: Regularization, Prediction and Model Fitting." doi:10.1214/07-STS242
Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." doi:10.1214/aos/1013203451
van Buuren, S. and Groothuis-Oudshoorn, K. (2011). "mice: Multivariate Imputation by Chained Equations in R." doi:10.18637/jss.v045.i03
Citation
For details, see: Kuchen, R. (2025). "MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation." doi:10.48550/arXiv.2507.21807 https://arxiv.org/abs/2507.21807.
See also
- mboost: General framework for component-wise gradient boosting in R.
- miselect: Implements MI-extensions of LASSO and elastic nets for variable selection after multiple imputation.
- mice: Standard tool for multiple imputation of missing data.
Author(s)
Maintainer: Robert Kuchen rokuchen@uni-mainz.de
See Also
Useful links:
- https://github.com/RobertKuchen/booami
- https://arxiv.org/abs/2507.21807
- Report bugs at https://github.com/RobertKuchen/booami/issues
Predict with booami models
Description
Minimal, dependency-free predictor for models fitted by
cv_boost_raw, cv_boost_imputed, or a
pooled impu_boost fit. Supports Gaussian (identity)
and logistic (logit) models, returning either the linear predictor
or, for logistic, predicted probabilities.
Usage
booami_predict(
object,
X_new,
family = NULL,
type = c("response", "link"),
center_means = NULL
)
Arguments
object | A fit returned by cv_boost_raw, cv_boost_imputed, or a pooled impu_boost fit.
X_new | New data (matrix or data.frame) with the same p columns as the training covariates.
family | Model family; one of "gaussian" or "logistic".
type | Prediction type; one of "response" or "link". For logistic models, "response" returns predicted probabilities and "link" the linear predictor.
center_means | Optional numeric vector of length p with the covariate means used for centering during training (see Details).
Details
This function is deterministic and involves no random number generation.
Coefficients are extracted either from $final_model (intercept first,
then coefficients) or from $INT and $BETA (a pooled impu_boost fit).
If X_new has column names and the model has named coefficients, columns
are aligned by name; otherwise they are used in order.
If your training pipeline centered covariates (e.g., center = "auto"),
providing the same center_means here yields numerically consistent
predictions. If not supplied but object$center_means exists, it will
be used automatically. If both are supplied, the explicit center_means
argument takes precedence.
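For example (fit and X_new are placeholders for a fitted object and new covariates; whether fit$center_means is present depends on how the model was fitted):
mu_star <- fit$center_means                                          # training grand mean, if stored
p_auto  <- booami_predict(fit, X_new, family = "gaussian")           # uses fit$center_means when present
p_expl  <- booami_predict(fit, X_new, family = "gaussian",
                          center_means = mu_star)                    # explicit vector takes precedence
all.equal(p_auto, p_expl)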
Value
A numeric vector of predictions (length nrow(X_new)). If
X_new has row names, they are propagated to the returned vector.
See Also
cv_boost_raw, cv_boost_imputed, impu_boost
Examples
# 1) Fit on data WITH missing values
set.seed(123)
sim_tr <- simulate_booami_data(
n = 120, p = 12, p_inf = 3,
type = "gaussian",
miss = "MAR", miss_prop = 0.20
)
X_tr <- sim_tr$data[, 1:12]
y_tr <- sim_tr$data$y
fit <- cv_boost_raw(
X_tr, y_tr,
k = 2, mstop = 50, seed = 123,
impute_args = list(m = 2, maxit = 1, printFlag = FALSE, seed = 1),
quickpred_args = list(method = "spearman", mincor = 0.30, minpuc = 0.60),
show_progress = FALSE
)
# 2) Predict on a separate data set WITHOUT missing values (same p)
sim_new <- simulate_booami_data(
n = 5, p = 12, p_inf = 3,
type = "gaussian",
miss = "MCAR", miss_prop = 0 # <- complete data with existing API
)
X_new <- sim_new$data[, 1:12, drop = FALSE]
preds <- booami_predict(fit, X_new = X_new, family = "gaussian", type = "response")
round(preds, 3)
Example dataset for 'booami' (Gaussian, MAR)
Description
A simulated dataset with predictors X1...X25 and a continuous
outcome y. Missing values are generated under a MAR mechanism in the
predictors (covariates) only; the outcome y is fully observed (no NAs).
The object is a data.frame and carries attributes describing the
data-generating process (true coefficients, informative indices, etc.).
Format
A data frame with 300 rows and 26 variables:
- X1-X25: numeric predictors
- y: numeric outcome (fully observed)
Details
Generated by simulate_booami_data with typical settings (see
?simulate_booami_data). The following attributes are attached to
booami_sim:
- "true_beta": numeric length-25 vector of true coefficients (non-zeros in positions 1-5).
- "informative": integer vector 1:5.
- "type": "gaussian".
- "corr_structure": "all_ar1"; "rho": 0.3.
- "intercept": 1; "noise_sd": 1 (Gaussian; NA otherwise).
- "mar_scale": TRUE; "keep_mar_drivers": TRUE.
See Also
simulate_booami_data,
impu_boost, cv_boost_raw, cv_boost_imputed
Examples
utils::data(booami_sim)
dim(booami_sim)
mean(colSums(is.na(booami_sim)) > 0) # fraction of columns with any NAs
sum(is.na(booami_sim$y)) # should be 0
head(attr(booami_sim, "true_beta"))
attr(booami_sim, "informative")
Cross-validation for boosting after multiple imputation (pre-imputed inputs)
Description
Performs k-fold cross-validation for impu_boost on datasets that have already been imputed. To avoid data leakage, each CV fold should first be split into training and validation subsets, after which imputation is performed; the imputation models learned on the training data should then be applied to the validation data. For the final model, the full dataset should be imputed independently.
Usage
cv_boost_imputed(
X_train_list,
y_train_list,
X_val_list,
y_val_list,
X_full,
y_full,
ny = 0.1,
mstop = 250,
type = c("gaussian", "logistic"),
MIBoost = TRUE,
pool = TRUE,
pool_threshold = 0,
show_progress = TRUE,
center = c("auto", "off", "force")
)
Arguments
X_train_list | A list of length k; element cv is itself a list of the M imputed training covariate matrices for fold cv.
y_train_list | A list of length k; element cv is a list of the M corresponding training outcome vectors.
X_val_list | A list of length k; element cv is a list of the M imputed validation covariate matrices for fold cv (imputed using the training imputation models).
y_val_list | A list of length k; element cv is a list of the M corresponding validation outcome vectors.
X_full | A list of length M with imputations of the full covariate matrix, used to fit the final model.
y_full | A list of length M with the corresponding full outcome vectors.
ny | Learning rate. Defaults to 0.1.
mstop | Maximum number of boosting iterations to evaluate during cross-validation. The selected mstop is the iteration minimizing the mean CV error. Default is 250.
type | Type of loss function. One of "gaussian" or "logistic".
MIBoost | Logical. If TRUE, the MIBoost algorithm is used, enforcing uniform variable selection across the imputed datasets.
pool | Logical. If TRUE, the M models are pooled into a single final model.
pool_threshold | Only used when MIBoost = FALSE and pool = TRUE; minimum selection frequency across imputations required for a variable to enter the pooled model. Default is 0.
show_progress | Logical; print fold-level progress and summary timings. Default TRUE.
center | One of "auto", "off", or "force". If centering is applied, a single grand mean vector \mu_\star is computed from the imputed training covariates of the corresponding fold and subtracted from all imputed training and validation matrices in that fold.
Details
The recommended workflow is illustrated in the examples.
Centering affects only X; y is left unchanged. For
type = "logistic", responses are treated as numeric 0/1
via the logistic link. Validation loss is averaged over
imputations and then over folds.
Value
A list with:
- CV_error: numeric vector of length mstop with the mean cross-validated loss across folds (and imputations).
- best_mstop: integer index of the minimizing entry in CV_error.
- final_model: numeric vector of length 1 + p containing the intercept followed by the p coefficients of the final pooled model fitted at best_mstop on X_full / y_full.
- center_means: (optional) numeric vector of length p containing the centering means used for X (when available).
References
Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807 https://arxiv.org/abs/2507.21807.
See Also
impu_boost, cv_boost_raw, mice
Examples
set.seed(123)
utils::data(booami_sim)
k <- 2; M <- 2
# Separate X and y; drop missing y (policy)
X_all <- booami_sim[, 1:25, drop = FALSE]
y_all <- booami_sim[, 26]
keep <- !is.na(y_all)
X_all <- X_all[keep, , drop = FALSE]
y_all <- y_all[keep]
n <- nrow(X_all); p <- ncol(X_all)
folds <- sample(rep(seq_len(k), length.out = n))
X_train_list <- vector("list", k)
y_train_list <- vector("list", k)
X_val_list <- vector("list", k)
y_val_list <- vector("list", k)
for (cv in seq_len(k)) {
tr <- folds != cv
va <- !tr
Xtr <- X_all[tr, , drop = FALSE]; ytr <- y_all[tr]
Xva <- X_all[va, , drop = FALSE]; yva <- y_all[va]
# Impute X only (y is never used for imputation)
pm_tr <- mice::quickpred(Xtr, method = "spearman", mincor = 0.30, minpuc = 0.60)
imp_tr <- mice::mice(Xtr, m = M, predictorMatrix = pm_tr, maxit = 1, printFlag = FALSE)
imp_va <- mice::mice.mids(imp_tr, newdata = Xva, maxit = 1, printFlag = FALSE)
X_train_list[[cv]] <- vector("list", M)
y_train_list[[cv]] <- vector("list", M)
X_val_list[[cv]] <- vector("list", M)
y_val_list[[cv]] <- vector("list", M)
for (m in seq_len(M)) {
tr_m <- mice::complete(imp_tr, m)
va_m <- mice::complete(imp_va, m)
X_train_list[[cv]][[m]] <- data.matrix(tr_m)
y_train_list[[cv]][[m]] <- ytr
X_val_list[[cv]][[m]] <- data.matrix(va_m)
y_val_list[[cv]][[m]] <- yva
}
}
# Full-data imputations (X only)
pm_full <- mice::quickpred(X_all, method = "spearman", mincor = 0.30, minpuc = 0.60)
imp_full <- mice::mice(X_all, m = M, predictorMatrix = pm_full, maxit = 1, printFlag = FALSE)
X_full <- lapply(seq_len(M), function(m) data.matrix(mice::complete(imp_full, m)))
y_full <- lapply(seq_len(M), function(m) y_all)
res <- cv_boost_imputed(
X_train_list, y_train_list,
X_val_list, y_val_list,
X_full, y_full,
ny = 0.1, mstop = 50, type = "gaussian",
MIBoost = TRUE, pool = TRUE, center = "auto",
show_progress = FALSE
)
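# Heavier demo: more folds, imputations, and iterations (intended for local runs)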
set.seed(2025)
utils::data(booami_sim)
k <- 5; M <- 10
X_all <- booami_sim[, 1:25, drop = FALSE]
y_all <- booami_sim[, 26]
keep <- !is.na(y_all)
X_all <- X_all[keep, , drop = FALSE]
y_all <- y_all[keep]
n <- nrow(X_all); p <- ncol(X_all)
folds <- sample(rep(seq_len(k), length.out = n))
X_train_list <- vector("list", k)
y_train_list <- vector("list", k)
X_val_list <- vector("list", k)
y_val_list <- vector("list", k)
for (cv in seq_len(k)) {
tr <- folds != cv; va <- !tr
Xtr <- X_all[tr, , drop = FALSE]; ytr <- y_all[tr]
Xva <- X_all[va, , drop = FALSE]; yva <- y_all[va]
pm_tr <- mice::quickpred(Xtr, method = "spearman", mincor = 0.20, minpuc = 0.40)
imp_tr <- mice::mice(Xtr, m = M, predictorMatrix = pm_tr, maxit = 5, printFlag = TRUE)
imp_va <- mice::mice.mids(imp_tr, newdata = Xva, maxit = 1, printFlag = FALSE)
X_train_list[[cv]] <- vector("list", M)
y_train_list[[cv]] <- vector("list", M)
X_val_list[[cv]] <- vector("list", M)
y_val_list[[cv]] <- vector("list", M)
for (m in seq_len(M)) {
tr_m <- mice::complete(imp_tr, m)
va_m <- mice::complete(imp_va, m)
X_train_list[[cv]][[m]] <- data.matrix(tr_m)
y_train_list[[cv]][[m]] <- ytr
X_val_list[[cv]][[m]] <- data.matrix(va_m)
y_val_list[[cv]][[m]] <- yva
}
}
pm_full <- mice::quickpred(X_all, method = "spearman", mincor = 0.20, minpuc = 0.40)
imp_full <- mice::mice(X_all, m = M, predictorMatrix = pm_full, maxit = 5, printFlag = TRUE)
X_full <- lapply(seq_len(M), function(m) data.matrix(mice::complete(imp_full, m)))
y_full <- lapply(seq_len(M), function(m) y_all)
res_heavy <- cv_boost_imputed(
X_train_list, y_train_list,
X_val_list, y_val_list,
X_full, y_full,
ny = 0.1, mstop = 250, type = "gaussian",
MIBoost = TRUE, pool = TRUE, center = "auto",
show_progress = TRUE
)
str(res_heavy)
Cross-Validated Component-Wise Gradient Boosting with Multiple Imputation Performed Inside Each Fold
Description
Performs k-fold cross-validation for impu_boost on data with
missing values. Within each fold, multiple imputation, centering, model
fitting, and validation are performed in a leakage-avoiding manner to select
the optimal number of boosting iterations (mstop). The final model is
then fitted on multiple imputations of the full dataset at the selected
stopping iteration.
Usage
cv_boost_raw(
X,
y,
k = 5,
ny = 0.1,
mstop = 250,
type = c("gaussian", "logistic"),
MIBoost = TRUE,
pool = TRUE,
pool_threshold = 0,
impute_args = list(m = 10, maxit = 5, printFlag = FALSE),
impute_method = NULL,
use_quickpred = TRUE,
quickpred_args = list(mincor = 0.1, minpuc = 0.5, method = NULL, include = NULL,
exclude = NULL),
seed = 123,
show_progress = TRUE,
return_full_imputations = FALSE,
center = "auto"
)
Arguments
X | A data.frame or matrix of predictors of size n x p; may contain missing values.
y | A vector of length n containing the outcome. Rows with missing y are removed before fold assignment.
k | Number of cross-validation folds. Default is 5.
ny | Learning rate. Defaults to 0.1.
mstop | Maximum number of boosting iterations to evaluate during cross-validation. The selected mstop is the iteration minimizing the mean CV error. Default is 250.
type | Type of loss function. One of "gaussian" or "logistic".
MIBoost | Logical. If TRUE, the MIBoost algorithm is used, enforcing uniform variable selection across the imputed datasets.
pool | Logical. If TRUE, the M models are pooled into a single final model.
pool_threshold | Only used when MIBoost = FALSE and pool = TRUE; minimum selection frequency across imputations required for a variable to enter the pooled model. Default is 0.
impute_args | A named list of arguments forwarded to mice::mice() (e.g., m, maxit, printFlag). Default is list(m = 10, maxit = 5, printFlag = FALSE).
impute_method | Optional named character vector of per-variable imputation methods passed to mice::mice(); names must match the column names of X (see Details).
use_quickpred | Logical. If TRUE, the predictor matrix is built with mice::quickpred(); otherwise mice::make.predictorMatrix() is used.
quickpred_args | A named list of arguments forwarded to mice::quickpred() (e.g., mincor, minpuc, method, include, exclude).
seed | Base random seed for fold assignment. Default is 123.
show_progress | Logical. If TRUE, fold-level progress and summary timings are printed.
return_full_imputations | Logical. If TRUE, the M full-data imputations used for the final model are returned (see Value).
center | One of "auto", "off", or "force". If centering is applied, a single grand mean vector \mu_\star is computed from the imputed training covariates of the corresponding fold (and from the full-data imputations for the final model) and subtracted from all imputed covariate matrices.
Details
Rows with missing outcomes y are removed before fold assignment.
Within each CV fold, the remaining data are first split into a training subset
and a validation subset. Multiple imputation is then performed on the
covariates X only (the outcome is never imputed and is not used as a
predictor in the imputation models). The training covariates are multiply
imputed M times using mice, producing M imputed training
datasets. The corresponding validation covariates are then imputed M
times using the imputation models learned from the training data (leakage-avoiding).
If centering is applied, a single grand mean vector \mu_\star is
computed from the imputed training covariates in the corresponding fold and
subtracted from all imputed training and validation covariate matrices
in that fold.
impu_boost is run on the imputed training datasets for up to
mstop boosting iterations. At each iteration, prediction errors are
computed on the corresponding validation datasets and averaged across
imputations. This yields an aggregated error curve per fold, which is then
averaged across folds. The optimal stopping iteration is chosen as the
mstop value minimizing the mean CV error.
Finally, the full covariate matrix X is multiply imputed M times.
If centering is applied, it uses a grand mean \mu_\star computed across
the M full-data imputations. impu_boost is applied to these
datasets for the selected number of boosting iterations to obtain the final model.
Imputation control. All key mice settings can be passed via
impute_args (a named list forwarded to mice::mice()) and/or
impute_method (a named character vector of per-variable methods).
Internally, the function builds a full default method vector from the actual
data given to mice(), then merges any user-supplied entries
by name. The names in impute_method must exactly match the
column names in X (i.e., the data passed to mice()). Partial
vectors are allowed; variables not listed fall back to defaults; unknown names
are ignored with a warning. The function sets and may override data,
method (after merging overrides), predictorMatrix, and
ignore (to enforce train-only learning). Predictor matrices can be built
with mice::quickpred() (see use_quickpred, quickpred_args)
or with mice::make.predictorMatrix().
Value
A list with:
- CV_error: numeric vector (length mstop) of mean CV loss.
- best_mstop: integer index minimizing CV_error.
- final_model: numeric vector of length 1 + p with the intercept and pooled coefficients of the final fit on the full-data imputations at best_mstop.
- full_imputations: (optional) when return_full_imputations = TRUE, a list list(X = <list of length m>, y = <list of length m>) containing the full-data imputations used for the final model.
- folds: integer vector of length n giving the CV fold id for each observation (1..k).
References
Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807 https://arxiv.org/abs/2507.21807.
See Also
impu_boost, cv_boost_imputed, mice
Examples
utils::data(booami_sim)
X <- booami_sim[, 1:25]
y <- booami_sim[, 26]
res <- cv_boost_raw(
X = X, y = y,
k = 2, seed = 123,
impute_args = list(m = 2, maxit = 1, printFlag = FALSE, seed = 1),
quickpred_args = list(mincor = 0.30, minpuc = 0.60),
mstop = 50,
show_progress = FALSE
)
# Partial custom imputation method override (X variables only)
meth <- c(X1 = "pmm")
res2 <- cv_boost_raw(
X = X, y = y,
k = 2, seed = 123,
impute_args = list(m = 2, maxit = 1, printFlag = FALSE, seed = 456),
quickpred_args = list(mincor = 0.30, minpuc = 0.60),
mstop = 50,
impute_method = meth,
show_progress = FALSE
)
Component-Wise Gradient Boosting Across Multiply Imputed Datasets
Description
Applies component-wise gradient boosting to multiply imputed datasets. Depending on the settings, either a separate model is reported for each imputed dataset, or the M models are pooled to yield a single final model. For pooling, one can choose the novel MIBoost algorithm, which enforces a uniform variable-selection scheme across all imputed datasets, or the more conventional ad-hoc approaches of estimate-averaging and selection-frequency thresholding.
Usage
impu_boost(
X_list,
y_list,
X_list_val = NULL,
y_list_val = NULL,
ny = 0.1,
mstop = 250,
type = c("gaussian", "logistic"),
MIBoost = TRUE,
pool = TRUE,
pool_threshold = 0,
center = c("auto", "force", "off")
)
Arguments
X_list | List of length M; each element is an n x p covariate matrix from one imputed dataset.
y_list | List of length M; each element is a length-n outcome vector corresponding to the same imputed dataset.
X_list_val | Optional validation list (same structure as X_list), used to compute validation errors at each iteration.
y_list_val | Optional validation list (same structure as y_list).
ny | Learning rate. Defaults to 0.1.
mstop | Number of boosting iterations (default 250).
type | Type of loss function. One of "gaussian" or "logistic".
MIBoost | Logical. If TRUE, the MIBoost algorithm is used, enforcing uniform variable selection across the imputed datasets; if FALSE, boosting is run separately in each imputed dataset.
pool | Logical. If TRUE, the M models are pooled into a single final model; if FALSE, a separate model is reported for each imputed dataset.
pool_threshold | Only used when MIBoost = FALSE and pool = TRUE; minimum selection frequency across imputations required for a variable to enter the pooled model. Default is 0.
center | One of "auto", "force", or "off".
Details
This function supports MIBoost, which enforces uniform variable selection across multiply imputed datasets. For full methodology, see Kuchen (2025).
Value
A list with elements:
- INT: intercept(s). A scalar if pool = TRUE, otherwise a length-M vector.
- BETA: coefficient estimates. A length-p vector if pool = TRUE, otherwise an M \times p matrix.
- CV_error: vector of validation errors (if validation data were provided), otherwise NULL.
References
Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807 https://arxiv.org/abs/2507.21807.
See Also
simulate_booami_data, cv_boost_raw, cv_boost_imputed
Examples
set.seed(123)
utils::data(booami_sim)
M <- 2
n <- nrow(booami_sim)
x_cols <- grepl("^X\\d+$", names(booami_sim))
tr_idx <- sample(seq_len(n), floor(0.8 * n))
dat_tr <- booami_sim[tr_idx, , drop = FALSE]
dat_va <- booami_sim[-tr_idx, , drop = FALSE]
pm_tr <- mice::quickpred(dat_tr, method = "spearman",
mincor = 0.30, minpuc = 0.60)
imp_tr <- mice::mice(dat_tr, m = M, predictorMatrix = pm_tr,
maxit = 1, printFlag = FALSE)
imp_va <- mice::mice.mids(imp_tr, newdata = dat_va, maxit = 1, printFlag = FALSE)
X_list <- vector("list", M)
y_list <- vector("list", M)
X_list_val <- vector("list", M)
y_list_val <- vector("list", M)
for (m in seq_len(M)) {
tr_m <- mice::complete(imp_tr, m)
va_m <- mice::complete(imp_va, m)
X_list[[m]] <- data.matrix(tr_m[, x_cols, drop = FALSE])
y_list[[m]] <- tr_m$y
X_list_val[[m]] <- data.matrix(va_m[, x_cols, drop = FALSE])
y_list_val[[m]] <- va_m$y
}
fit <- impu_boost(
X_list, y_list,
X_list_val = X_list_val, y_list_val = y_list_val,
ny = 0.1, mstop = 50, type = "gaussian",
MIBoost = TRUE, pool = TRUE, center = "auto"
)
which.min(fit$CV_error)
head(fit$BETA)
fit$INT
## Not run:
# Heavier demo (more imputed datasets and iterations; for local runs)
set.seed(2025)
utils::data(booami_sim)
M <- 10
n <- nrow(booami_sim)
x_cols <- grepl("^X\\d+$", names(booami_sim))
tr_idx <- sample(seq_len(n), floor(0.8 * n))
dat_tr <- booami_sim[tr_idx, , drop = FALSE]
dat_va <- booami_sim[-tr_idx, , drop = FALSE]
pm_tr <- mice::quickpred(dat_tr, method = "spearman",
mincor = 0.20, minpuc = 0.40)
imp_tr <- mice::mice(dat_tr, m = M, predictorMatrix = pm_tr,
maxit = 5, printFlag = TRUE)
imp_va <- mice::mice.mids(imp_tr, newdata = dat_va, maxit = 1, printFlag = FALSE)
X_list <- vector("list", M)
y_list <- vector("list", M)
X_list_val <- vector("list", M)
y_list_val <- vector("list", M)
for (m in seq_len(M)) {
tr_m <- mice::complete(imp_tr, m)
va_m <- mice::complete(imp_va, m)
X_list[[m]] <- data.matrix(tr_m[, x_cols, drop = FALSE])
y_list[[m]] <- tr_m$y
X_list_val[[m]] <- data.matrix(va_m[, x_cols, drop = FALSE])
y_list_val[[m]] <- va_m$y
}
fit_heavy <- impu_boost(
X_list, y_list,
X_list_val = X_list_val, y_list_val = y_list_val,
ny = 0.1, mstop = 250, type = "gaussian",
MIBoost = TRUE, pool = TRUE, center = "auto"
)
str(fit_heavy)
## End(Not run)
Predict from booami objects
Description
Predict responses (link or response scale) from fitted booami models.
Usage
## S3 method for class 'booami_cv'
predict(object, newdata, type = c("link", "response"), ...)
## S3 method for class 'booami_pooled'
predict(object, newdata, type = c("link", "response"), ...)
## S3 method for class 'booami_multi'
predict(object, newdata, type = c("link", "response"), ...)
Arguments
object | A fitted booami object of class "booami_cv", "booami_pooled", or "booami_multi" (see Usage).
newdata | A data.frame or matrix of predictors (same columns/order as training).
type | Either "link" or "response". For logistic models, "response" returns predicted probabilities.
... | Further arguments passed to the underlying prediction routine.
Value
A numeric vector of predictions.
See Also
booami_predict, cv_boost_raw, cv_boost_imputed, impu_boost
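A brief usage sketch (it assumes the object returned by cv_boost_raw() carries the "booami_cv" class, so that predict() dispatches to the method documented above):
utils::data(booami_sim)
X <- booami_sim[, 1:25]
y <- booami_sim[, 26]
fit <- cv_boost_raw(
  X, y, k = 2, mstop = 50, seed = 123,
  impute_args = list(m = 2, maxit = 1, printFlag = FALSE),
  show_progress = FALSE
)
X_cc <- X[stats::complete.cases(X), , drop = FALSE]   # predict on complete-case rows
head(predict(fit, newdata = X_cc, type = "response"))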
Simulate a Booami Example Dataset with Missing Values
Description
Generates a dataset with p predictors, of which the first p_inf
are informative. Predictors are drawn from a multivariate normal with a chosen
correlation structure, and the outcome can be continuous (type = "gaussian")
or binary (type = "logistic"). Missing values are introduced in the
predictors via MAR or MCAR; the outcome y is always fully observed (no NAs).
Usage
simulate_booami_data(
n = 300,
p = 25,
p_inf = 5,
rho = 0.3,
type = c("gaussian", "logistic"),
beta_range = c(1, 2),
intercept = 1,
corr_structure = c("all_ar1", "informative_cs", "blockdiag", "none"),
rho_noise = NULL,
noise_sd = 1,
miss = c("MAR", "MCAR"),
miss_prop = 0.25,
mar_drivers = c(1, 2, 3),
gamma_vec = NULL,
calibrate_mar = FALSE,
mar_scale = TRUE,
keep_observed = integer(0),
jitter_sd = 0.25,
keep_mar_drivers = TRUE
)
Arguments
n | Number of observations (default 300).
p | Total number of predictors (default 25).
p_inf | Number of informative predictors (default 5).
rho | Correlation parameter (interpretation depends on corr_structure; default 0.3).
type | Either "gaussian" or "logistic".
beta_range | Length-2 numeric; coefficients for the first p_inf predictors are drawn from this range (default c(1, 2)).
intercept | Intercept added to the linear predictor (default 1).
corr_structure | One of "all_ar1", "informative_cs", "blockdiag", or "none" (see Details).
rho_noise | Optional correlation for the noise block when corr_structure = "blockdiag"; defaults to rho.
noise_sd | Std. dev. of Gaussian noise added to the outcome when type = "gaussian" (default 1).
miss | Missingness mechanism: "MAR" or "MCAR".
miss_prop | Target marginal missingness proportion (default 0.25).
mar_drivers | Indices of predictors that drive MAR (default c(1, 2, 3)).
gamma_vec | Coefficients for the MAR drivers; length must equal the number of MAR drivers actually used.
calibrate_mar | If TRUE, the MAR intercept is calibrated to target the proportion miss_prop; otherwise qlogis(miss_prop) is used (default FALSE).
mar_scale | Logical; controls scaling of the MAR drivers when computing the missingness logit (default TRUE); see Details.
keep_observed | Indices of predictors kept fully observed (values outside 1:p are ignored; default integer(0)).
jitter_sd | Standard deviation of the per-row jitter added to the MAR logit to induce heterogeneity (default 0.25).
keep_mar_drivers | Logical; if TRUE, the MAR drivers themselves are kept fully observed (default TRUE); see Details.
Details
Correlation structures:
- "all_ar1": AR(1) correlation with parameter rho across all p predictors.
- "informative_cs": compound symmetry (exchangeable) within the first p_inf predictors with parameter rho; others independent.
- "blockdiag": block-diagonal AR(1): the informative block (size p_inf) has AR(1) with rho; the noise block (size p - p_inf) has AR(1) with rho_noise (defaults to rho).
- "none": independent predictors.
Missingness (predictors only):
- "MAR": for each row, a logit missingness score is computed from the selected MAR drivers (see mar_drivers, gamma_vec, mar_scale); an intercept is set via calibrate_mar to target the proportion miss_prop (otherwise qlogis(miss_prop)), and per-row jitter N(0, jitter_sd) adds heterogeneity. The resulting probability is used to mask predictors (except those in keep_observed and, if keep_mar_drivers = TRUE, the drivers themselves). The outcome y is not masked.
- "MCAR": each predictor (except those in keep_observed) is masked independently with probability miss_prop. The outcome y is not masked.
Note: In the simulation, missingness probabilities are computed using the
fully observed latent covariates before masking. From an analyst’s perspective after
masking, allowing the MAR drivers themselves to be missing makes missingness depend on
unobserved values—i.e., effectively non-ignorable (MNAR). Setting
keep_mar_drivers = TRUE keeps those drivers observed and yields a MAR mechanism.
Value
A list with elements:
- data: data.frame with columns X1..Xp and y. Missing values are introduced in the predictors X1..Xp; y is fully observed.
- beta: numeric length-p vector of true coefficients (non-zeros in the first p_inf positions).
- informative: integer vector 1:p_inf.
- type: character, outcome type ("gaussian" or "logistic").
- intercept: numeric intercept used.
The data element additionally carries attributes:
"true_beta", "informative",
"type", "corr_structure", "rho", "rho_noise" (if set),
"intercept", "noise_sd" (Gaussian; NA otherwise), "mar_scale",
and "keep_mar_drivers".
Reproducing the shipped dataset booami_sim
set.seed(123)
sim <- simulate_booami_data(
  n = 300, p = 25, p_inf = 5, rho = 0.3,
  type = "gaussian", beta_range = c(1, 2), intercept = 1,
  corr_structure = "all_ar1", rho_noise = NULL, noise_sd = 1,
  miss = "MAR", miss_prop = 0.25, mar_drivers = c(1, 2, 3),
  gamma_vec = NULL, calibrate_mar = FALSE, mar_scale = TRUE,
  keep_observed = integer(0), jitter_sd = 0.25, keep_mar_drivers = TRUE
)
booami_sim <- sim$data
See Also
booami_sim, cv_boost_raw,
cv_boost_imputed, impu_boost
Examples
set.seed(42)
sim <- simulate_booami_data(
n = 200, p = 15, p_inf = 4, rho = 0.25,
type = "gaussian", miss = "MAR", miss_prop = 0.20
)
d <- sim$data
dim(d)
mean(colSums(is.na(d)) > 0) # fraction of columns with any NAs
sum(is.na(d$y)) # should be 0
head(attr(d, "true_beta"))
attr(d, "informative")
# Example with block-diagonal correlation and protected MAR drivers
sim2 <- simulate_booami_data(
n = 150, p = 12, p_inf = 3, rho = 0.40, rho_noise = 0.10,
corr_structure = "blockdiag", miss = "MAR", miss_prop = 0.30,
mar_drivers = c(1, 2), keep_mar_drivers = TRUE
)
colSums(is.na(sim2$data))[1:4]
# Binary outcome example
sim3 <- simulate_booami_data(
n = 100, p = 10, p_inf = 2, rho = 0.2,
type = "logistic", miss = "MCAR", miss_prop = 0.15
)
table(sim3$data$y, useNA = "ifany")
sum(is.na(sim3$data$y)) # should be 0
utils::data(booami_sim)
dim(booami_sim)
head(attr(booami_sim, "true_beta"))
attr(booami_sim, "informative")