| Title: | Missingness Benchmark for Continuous Glucose Monitoring Data |
| Version: | 0.0.1 |
| Description: | Evaluates predictive performance under feature-level missingness in repeated-measures continuous glucose monitoring-like data. The benchmark injects missing values at user-specified rates, imputes incomplete feature matrices using an iterative chained-equations approach inspired by multivariate imputation by chained equations (MICE; Azur et al. (2011) <doi:10.1002/mpr.329>), fits Random Forest regression models (Breiman (2001) <doi:10.1023/A:1010933404324>) and k-nearest-neighbor regression models (Zhang (2016) <doi:10.21037/atm.2016.03.37>), and reports mean absolute percentage error and R-squared across missingness rates. |
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
| Encoding: | UTF-8 |
| Depends: | R (≥ 4.3) |
| RoxygenNote: | 7.3.3 |
| Imports: | mice, FNN, Metrics, ranger |
| Suggests: | testthat (≥ 3.0.0), spelling, knitr, rmarkdown |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Language: | en-US |
| URL: | https://github.com/saraswatsh/CGMissingDataR, https://saraswatsh.github.io/CGMissingDataR/ |
| BugReports: | https://github.com/saraswatsh/CGMissingDataR/issues |
| LazyData: | true |
| VignetteBuilder: | knitr |
| Packaged: | 2026-01-29 02:57:52 UTC; shubh |
| Author: | Shubh Saraswat |
| Maintainer: | Shubh Saraswat <shubh.saraswat00@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-03 10:30:15 UTC |
CGMissingDataR: Missingness Benchmark for Continuous Glucose Monitoring Data
Description
Evaluates predictive performance under feature-level missingness in repeated-measures continuous glucose monitoring-like data. The benchmark injects missing values at user-specified rates, imputes incomplete feature matrices using an iterative chained-equations approach inspired by multivariate imputation by chained equations (MICE; Azur et al. (2011) doi:10.1002/mpr.329), fits Random Forest regression models (Breiman (2001) doi:10.1023/A:1010933404324) and k-nearest-neighbor regression models (Zhang (2016) doi:10.21037/atm.2016.03.37), and reports mean absolute percentage error and R-squared across missingness rates.
Author(s)
Maintainer: Shubh Saraswat shubh.saraswat00@gmail.com (ORCID) [copyright holder]
Authors:
Hasin Shahed Shad hasin.shad@uky.edu
Xiaohua Douglas Zhang douglas.zhang@uky.edu (ORCID)
See Also
Useful links:
Report bugs at https://github.com/saraswatsh/CGMissingDataR/issues
Example dataset for CGMissingDataR
Description
A small synthetic dataset intended for examples and tests of
run_missingness_benchmark().
Usage
CGMExampleData
Format
A data frame with 250 rows and 6 variables:
- LBORRES
Laboratory Observed Result for Glucose (numeric).
- TimeSeries
Numeric feature representing time series data.
- TimeDifferenceMinutes
Time difference in minutes between measurements (numeric).
- USUBJID
Numeric subject identifier.
- SiteID
Site identifier (character).
- Visit
Visit label (character).
Examples
data("CGMExampleData")
Run missingness benchmark
Description
Benchmarks model performance under feature missingness. The function:
- Filters to complete cases for target_col and feature_cols (baseline complete data),
- Splits into training/validation,
- Masks feature values at each rate using Bernoulli (cell-wise) missingness,
- Imputes missing features using MICE on training data and applies the fitted imputation model to validation data via mice::mice.mids(newdata = ...) (reduces leakage),
- Trains Random Forest (ranger) and kNN regression (FNN::knn.reg),
- Returns MAPE and R-squared for each model and mask rate.
Feature columns must be numeric (or coercible to numeric without introducing new missing values). This mirrors workflows where features are treated as numeric arrays.
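The cell-wise Bernoulli masking step can be illustrated with a minimal base R sketch. The helper name mask_cells and the synthetic feature matrix are hypothetical and are not part of the package; the package's internal masking code may differ in detail.

```r
# Hypothetical helper (not part of CGMissingDataR): mask each cell independently.
mask_cells <- function(x, rate, seed = 42) {
  stopifnot(is.numeric(rate), length(rate) == 1, rate > 0, rate < 1)
  x <- as.matrix(x)
  set.seed(seed)
  # One Bernoulli(rate) draw per cell; TRUE marks a cell to set to NA.
  hit <- matrix(runif(length(x)) < rate, nrow = nrow(x), ncol = ncol(x))
  x[hit] <- NA
  x
}

set.seed(1)
features <- matrix(rnorm(2000), nrow = 500, ncol = 4)  # synthetic stand-in data
masked <- mask_cells(features, rate = 0.2)

# The realized share of missing cells is close to, but usually not exactly, 0.2.
mean(is.na(masked))
```

Because each cell is drawn independently, the masked fraction varies around the requested rate from run to run, and some rows may lose several features while others lose none.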
Usage
run_missingness_benchmark(
data,
target_col,
feature_cols = NULL,
mask_rates = c(0.05, 0.1, 0.2, 0.3),
rf_n_estimators = 200,
knn_k = 5,
test_size = 0.2,
seed = 42
)
Arguments
data |
A data.frame (or object coercible to data.frame) containing the dataset. |
target_col |
Single character string: name of the outcome column. |
feature_cols |
Character vector of feature column names. If |
mask_rates |
Numeric vector of values in (0, 1): the masking rates to benchmark, each giving the proportion of feature entries to mask. |
rf_n_estimators |
Integer: number of trees for the random forest. |
knn_k |
Integer: number of neighbors for kNN regression. |
test_size |
Numeric in (0, 1): fraction of rows assigned to validation split. |
seed |
Integer: seed for data split and model reproducibility. |
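To make the roles of test_size and seed concrete, here is a minimal base R sketch of a seeded row split. The helper split_rows is hypothetical, and the way the package draws its split internally is an assumption that is not confirmed by this documentation.

```r
# Hypothetical helper (not part of CGMissingDataR): seeded train/validation split.
split_rows <- function(n, test_size = 0.2, seed = 42) {
  stopifnot(test_size > 0, test_size < 1)
  set.seed(seed)
  # Number of validation rows, with at least one row held out.
  n_valid <- max(1, round(n * test_size))
  valid_idx <- sort(sample.int(n, n_valid))
  list(train = setdiff(seq_len(n), valid_idx), valid = valid_idx)
}

idx <- split_rows(250, test_size = 0.2)
lengths(idx)  # train = 200, valid = 50
```

Fixing the seed makes the same rows land in the validation set on every call, which is what allows results at different mask rates to be compared on identical data.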
Details
Validation imputation is performed using mice::mice.mids(newdata = ...), which generates imputations
for new data according to the model stored in the training mids object.
MAPE is computed using Metrics::mape() on non-zero targets only to avoid instability when actual values are zero.
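The two reported metrics can be sketched in base R as follows. The helper names are hypothetical. The MAPE helper mirrors the non-zero filtering described above and returns a fraction, not a percentage. The R-squared helper assumes the usual 1 - SS_res / SS_tot definition, which this documentation does not spell out.

```r
# Hypothetical helpers (not part of CGMissingDataR).
mape_nonzero <- function(actual, predicted) {
  keep <- actual != 0  # drop zero targets to avoid division by zero
  mean(abs((actual[keep] - predicted[keep]) / actual[keep]))
}

# Assumed definition: 1 - residual sum of squares / total sum of squares.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

actual    <- c(100, 0, 150, 200)  # made-up values for illustration
predicted <- c(110, 5, 135, 190)

mape_nonzero(actual, predicted)  # 0.0833..., i.e. about 8.3%
r_squared(actual, predicted)     # about 0.979
```

Without the filter, the second observation would contribute 5 / 0 = Inf and the mean would be infinite, which is the instability the non-zero restriction avoids.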
Value
A data.frame with columns MaskRate, Model, MAPE, and R2.
Author(s)
Shubh Saraswat, Hasin Shahed Shad, and Xiaohua Douglas Zhang
Examples
data("CGMExampleData")
run_missingness_benchmark(
CGMExampleData,
target_col = "LBORRES",
feature_cols = c("TimeDifferenceMinutes", "TimeSeries", "USUBJID"),
mask_rates = c(0.05, 0.10),
rf_n_estimators = 100,
knn_k = 3
)