ldmppr Workflow on
Simulated Datalibrary(ldmppr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, unionTo illustrate the use of the ldmppr package, we will be
using the included small_example_data dataset. This dataset
consists of locations and sizes for points in a (25m x 25m) square
domain.
data("small_example_data")
nrow(small_example_data)
#> [1] 121
head(small_example_data)
#> x y size
#> 1 10.000000 14.0000000 900.7306
#> 2 5.636737 20.3068181 523.1122
#> 3 24.086291 6.2358462 519.3012
#> 4 21.191344 14.0034707 490.1228
#> 5 14.742128 0.3113149 400.2289
#> 6 20.160199 3.6875049 395.8505We also include a collection of example raster files obtained from an
ESS-DIVE repository (see the help file for
small_example_data for details).
The ldmppr package provides tools for:
As of version 1.0.4.9000, the main workflow functions
return S3 objects with standard methods
(print(), plot(), coef(),
as.data.frame(), etc.). In most cases you should no longer
need helper plotting functions like plot_mpp().
We start by estimating the parameters of a self-correcting
spatio-temporal point process using the
estimate_process_parameters function. This function makes
use of user specified optimization algorithms in the nloptr
function to perform estimation through log-likelihood optimization. The
function can be run with three different optimization strategies
("local", "global_local", "multires_global_local") that
allow the user to control the approach. The "local"
strategy performs a local optimization, where the
"global_local" first runs a global optimization and then
uses the solution as a starting point for a local optimization. Finally,
the "multires_global_local" strategy performs similarly to
the "global_local", but iterates further at the local level
by increasing the resolution of the spatio-temporal grid and repeating
the local optimization with increasing degrees of polish. The function
requires the observed locations and observed arrival times, which are
generated from the observed sizes using a power law mapping function.
These inputs are accepted as:
(time, x, y),
or(x, y, size) plus a
mapping hyperparameter delta (or a vector
delta_values for a delta-search).The power law mapping function depends on the hyperparameter
delta, which controls the shape of the mapping relationship
between sizes and arrival times. The function also requires the grid
values for the optimization process, the initial parameter values, the
upper bounds for the parameters, and the optimization algorithm (either
global and local, or just local depending on strategy) to use. The
function returns the optimal parameter estimates for the self-correcting
point process.
Below we show the common case where you start from
(x, y, size) with delta = 1.
# Use the (x, y, size) form and let estimate_process_parameters() construct time via delta
parameter_estimation_data <- small_example_data %>%
dplyr::select(x, y, size)
# Define the integration / grid values used by the likelihood approximation
x_grid <- seq(0, 25, length.out = 5)
y_grid <- seq(0, 25, length.out = 5)
t_grid <- seq(0, 1, length.out = 5)
# Parameter initialization values: (alpha1, beta1, gamma1, alpha2, beta2, alpha3, beta3, gamma3)
parameter_inits <- c(1.5, 8.5, 0.015, 1.5, 3.2, 0.75, 3, 0.08)
# Upper bounds for (t, x, y)
upper_bounds <- c(1, 25, 25)
fit_sc <- estimate_process_parameters(
data = parameter_estimation_data,
process = "self_correcting",
x_grid = x_grid,
y_grid = y_grid,
t_grid = t_grid,
upper_bounds = upper_bounds,
parameter_inits = parameter_inits,
delta = 1,
parallel = FALSE,
strategy = c("global_local"),
global_algorithm = "NLOPT_GN_CRS2_LM",
local_algorithm = "NLOPT_LN_BOBYQA",
global_options = list(maxeval = 150),
local_options = list(maxeval = 300, xtol_rel = 1e-5, maxtime = NULL),
verbose = FALSE
)
# Print method for ldmppr_fit objects
print(fit_sc)
#> <ldmppr_fit>
#> process: self_correcting
#> engine: nloptr
#> n_obs: 121
#> delta*: 1
#> objective(best): 182.9202
# Extract parameters
estimated_parameters <- coef(fit_sc)
estimated_parameters
#> [1] -4.63813041 14.98628572 0.04031436 1.20493243 7.88640050 11.78765565
#> [7] 0.07854220 0.33342012Notes
delta_values = c(...) and set
parallel = TRUE to run one optimizer per delta (optionally
in parallel).Next, we use the train_mark_model function to train a
suitably flexible mark model using location-specific covariates
(rasters), (optionally) semi-distance dependent competition indices, and
the generated arrival times to predict sizes. The function uses
either:
model_type = "xgboost"),
ormodel_type = "ranger"),and may be run in parallel if desired. The user has control over the choice of a Gradient Boosting Machine (GBM) or Random Forest (RF) model, the bounds of the spatial domain, the inclusion of competition indices, the competition radius, the correction method, the final model selection metric, the number of cross validation folds, and the size of the parameter tuning grid for the model.
# Load raster covariates shipped with the package
raster_paths <- list.files(system.file("extdata", package = "ldmppr"),
pattern = "\\.tif$", full.names = TRUE
)
raster_paths <- raster_paths[!grepl("_med\\.tif$", raster_paths)]
rasters <- lapply(raster_paths, terra::rast)
# Scale rasters once up front
scaled_rasters <- scale_rasters(rasters)
# Create training data with a time column (consistent with the delta used above)
model_training_data <- small_example_data %>%
mutate(time = power_law_mapping(size, delta = 1))
mark_model <- train_mark_model(
data = model_training_data,
raster_list = scaled_rasters,
scaled_rasters = TRUE,
model_type = "xgboost",
xy_bounds = c(0, 25, 0, 25),
parallel = FALSE,
include_comp_inds = TRUE,
competition_radius = 10,
correction = "none",
selection_metric = "rmse",
cv_folds = 5,
tuning_grid_size = 20,
verbose = TRUE
)
#> Processing data...
#> Training XGBoost model...
#> → A | warning: A correlation computation is required, but `estimate` is constant and has 0
#> standard deviation, resulting in a divide by 0 error. `NA` will be returned.
#> There were issues with some computations A: x8
#> There were issues with some computations A: x24
#> There were issues with some computations A: x40
#> There were issues with some computations A: x40
#>
#> Training complete!
# S3 print method
mark_model
#> <ldmppr_mark_model>
#> engine: xgboost
#> has fit_engine: TRUE
#> has xgb_raw: FALSE
#> n_features: 13If the user wants to save the trained mark model (and reload it at a
later time), the package provides the save_mark_model and
load_mark_model functions (which handle model-engine
specifics appropriately):
Now that we have (i) an estimated process model and (ii) a trained
mark model, we can check the model fit using the
check_model_fit function. This function provides a variety
of non-parametric summaries for point processes and global envelope
tests to assess the goodness of fit of the model. The package provides
individual envelope tests for the L, F, G, J, E, and V statistics, or a
combined envelope test for all statistics simultaneously by making use
of the functionality of the spatstat and GET
packages. By plotting the combined envelope test, we can visually assess
the goodness of fit of the model and obtain a \(p\)-value measuring how well the estimated
model captures the relationships observed in the reference data.
Typically a \(p\)-value less than 0.05
indicates a poor fit of the model to the data, and the authors of the
GET package recommend a minimum of 2500 simulations to
obtain a valid \(p\)-value at the .05
level.
check_model_fit() returns an S3 object with a
plot() method (by default it plots the combined global
envelope test).
# Create a marked point pattern from the observed data
reference_mpp <- generate_mpp(
locations = small_example_data[, c("x", "y")],
marks = small_example_data$size,
xy_bounds = c(0, 25, 0, 25)
)
#> Registered S3 method overwritten by 'spatstat.geom':
#> method from
#> print.metric yardstick
# Anchor point (condition on an observed location)
M_n <- as.matrix(small_example_data[1, c("x", "y")])
model_check <- check_model_fit(
reference_data = reference_mpp,
t_min = 0,
t_max = 1,
sc_params = estimated_parameters,
anchor_point = M_n,
raster_list = scaled_rasters,
scaled_rasters = TRUE,
mark_model = mark_model,
xy_bounds = c(0, 25, 0, 25),
include_comp_inds = TRUE,
thinning = TRUE,
correction = "none",
competition_radius = 10,
n_sim = 100,
save_sims = FALSE,
verbose = TRUE,
seed = 90210
)
#> Beginning Data Simulations...
#> Data Simulations Complete...
# S3 plot method (combined global envelope test)
plot(model_check)
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the GET package.
#> Please report the issue at <https://github.com/myllym/GET/issues>.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: `aes_()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`
#> ℹ The deprecated feature was likely used in the GET package.
#> Please report the issue at <https://github.com/myllym/GET/issues>.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.You can still access the underlying GET objects if you
want (e.g., model_check$combined_env), but for most
workflows plot(model_check) is sufficient.
Finally, we can simulate a new marked point process realization from the fitted model.
simulate_mpp() returns an object of class
ldmppr_sim with plot() and
as.data.frame() methods.
simulated <- simulate_mpp(
sc_params = estimated_parameters,
t_min = 0,
t_max = 1,
anchor_point = M_n,
raster_list = scaled_rasters,
scaled_rasters = TRUE,
mark_model = mark_model,
xy_bounds = c(0, 25, 0, 25),
include_comp_inds = TRUE,
competition_radius = 10,
correction = "none",
thinning = TRUE
)
# Plot method for ldmppr_sim
plot(simulated)
# Data-frame view of the simulated realization
head(as.data.frame(simulated))
#> time x y marks
#> 1 0.0000000 10.000000 14.000000 570.8120
#> 2 0.5213940 13.418784 6.459381 525.7977
#> 3 0.5475949 5.723242 3.441871 490.7401
#> 4 0.5993501 4.807220 17.550734 526.4841
#> 5 0.6048312 21.538434 2.632380 440.5092
#> 6 0.6319068 14.873970 10.073311 370.6411This vignette demonstrates the core ldmppr workflow
using the updated S3-based interfaces:
estimate_process_parameters() → returns a fitted
process object (ldmppr_fit)train_mark_model() → returns a fitted mark model object
(ldmppr_mark_model)check_model_fit() → returns a model-check object
(ldmppr_model_check)simulate_mpp() → returns a simulation object
(ldmppr_sim)These objects are designed to be composable and to behave like standard R model objects through S3 methods.