Example ldmppr Workflow on Simulated Data

library(ldmppr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Data

To illustrate the use of the ldmppr package, we will be using the included small_example_data dataset. This dataset consists of locations and sizes for points in a 25 m x 25 m square domain.

data("small_example_data")
nrow(small_example_data)
#> [1] 121
head(small_example_data)
#>           x          y     size
#> 1 10.000000 14.0000000 900.7306
#> 2  5.636737 20.3068181 523.1122
#> 3 24.086291  6.2358462 519.3012
#> 4 21.191344 14.0034707 490.1228
#> 5 14.742128  0.3113149 400.2289
#> 6 20.160199  3.6875049 395.8505
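
A quick base-R plot of the pattern, with symbol size scaled by the mark, helps orient the rest of the workflow; a minimal sketch:

# Sketch: plot the locations with symbol size proportional to the mark
plot(small_example_data$x, small_example_data$y,
  cex = 2 * small_example_data$size / max(small_example_data$size),
  xlab = "x", ylab = "y", asp = 1
)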

We also include a collection of example raster files obtained from an ESS-DIVE repository (see the help file for small_example_data for details).

Introduction

The ldmppr package provides tools for:

  1. estimating a mechanistic spatio-temporal point process model for a reference marked point pattern (by mapping marks to event times),
  2. training a mark model that relates covariates (and optional competition indices) to marks,
  3. checking model fit using non-parametric summaries and global envelope tests, and
  4. simulating from the fitted model.

As of version 1.0.4.9000, the main workflow functions return S3 objects with standard methods (print(), plot(), coef(), as.data.frame(), etc.). In most cases you should no longer need helper plotting functions like plot_mpp().


Example workflow

Step 1: Estimate the point process parameters

We start by estimating the parameters of a self-correcting spatio-temporal point process using the estimate_process_parameters function. This function performs the estimation by optimizing the log-likelihood with user-specified algorithms from the nloptr package. The function can be run with three different optimization strategies ("local", "global_local", "multires_global_local") that allow the user to control the approach. The "local" strategy performs a single local optimization, whereas "global_local" first runs a global optimization and then uses its solution as the starting point for a local optimization. Finally, "multires_global_local" behaves like "global_local" but iterates further at the local level, increasing the resolution of the spatio-temporal grid and repeating the local optimization with increasing degrees of polish. The function requires the observed locations and arrival times, where the arrival times are generated from the observed sizes using a power law mapping function. These inputs are accepted as either:

  1. a data frame of arrival times and locations (time, x, y), or
  2. a data frame of locations and sizes (x, y, size), in which case the arrival times are constructed internally via the power law mapping.

The power law mapping function depends on the hyperparameter delta, which controls the shape of the mapping between sizes and arrival times. The function also requires the grid values used in the likelihood approximation, the initial parameter values, the upper bounds for the parameters, and the optimization algorithm(s) to use (global and local, or local only, depending on the strategy). The function returns the optimal parameter estimates for the self-correcting point process.
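
To build intuition for delta before fitting, the mapping can be applied directly; a minimal sketch using power_law_mapping() (the same helper used in Step 2 below; delta = 0.5 is purely illustrative):

# Sketch: apply the size-to-time mapping directly to inspect its shape
times_delta_1 <- power_law_mapping(small_example_data$size, delta = 1)
range(times_delta_1) # the t_grid defined below spans [0, 1]

times_delta_05 <- power_law_mapping(small_example_data$size, delta = 0.5)
plot(small_example_data$size, times_delta_1,
  xlab = "size", ylab = "mapped arrival time", pch = 1
)
points(small_example_data$size, times_delta_05, pch = 2)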

Below we show the common case where you start from (x, y, size) with delta = 1.

# Use the (x, y, size) form and let estimate_process_parameters() construct time via delta
parameter_estimation_data <- small_example_data %>%
  dplyr::select(x, y, size)

# Define the integration / grid values used by the likelihood approximation
x_grid <- seq(0, 25, length.out = 5)
y_grid <- seq(0, 25, length.out = 5)
t_grid <- seq(0, 1,  length.out = 5)

# Parameter initialization values: (alpha1, beta1, gamma1, alpha2, beta2, alpha3, beta3, gamma3)
parameter_inits <- c(1.5, 8.5, 0.015, 1.5, 3.2, 0.75, 3, 0.08)

# Upper bounds for (t, x, y)
upper_bounds <- c(1, 25, 25)


fit_sc <- estimate_process_parameters(
  data = parameter_estimation_data,
  process = "self_correcting",
  x_grid = x_grid,
  y_grid = y_grid,
  t_grid = t_grid,
  upper_bounds = upper_bounds,
  parameter_inits = parameter_inits,
  delta = 1,
  parallel = FALSE,
  strategy = "global_local",
  global_algorithm = "NLOPT_GN_CRS2_LM",
  local_algorithm = "NLOPT_LN_BOBYQA",
  global_options = list(maxeval = 150),
  local_options = list(maxeval = 300, xtol_rel = 1e-5, maxtime = NULL),
  verbose = FALSE
)

# Print method for ldmppr_fit objects
print(fit_sc)
#> <ldmppr_fit>
#>   process: self_correcting
#>   engine:  nloptr
#>   n_obs:   121
#>   delta*:  1
#>   objective(best): 182.9202

# Extract parameters
estimated_parameters <- coef(fit_sc)
estimated_parameters
#> [1] -4.63813041 14.98628572  0.04031436  1.20493243  7.88640050 11.78765565
#> [7]  0.07854220  0.33342012
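
The introduction notes that fitted objects carry standard S3 methods; a minimal sketch of one more accessor (assuming the as.data.frame() method mentioned there is defined for ldmppr_fit objects):

# Sketch: data-frame view of the estimates (method assumed per the introduction)
as.data.frame(fit_sc)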

Step 2: Train a mark model

Next, we use the train_mark_model function to train a suitably flexible mark model using location-specific covariates (rasters), (optionally) semi-distance-dependent competition indices, and the generated arrival times to predict sizes. The function uses either:

  1. a Gradient Boosting Machine (GBM, via XGBoost), or
  2. a Random Forest (RF),

and may be run in parallel if desired. The user has control over the choice of model, the bounds of the spatial domain, the inclusion of competition indices, the competition radius, the correction method, the final model selection metric, the number of cross-validation folds, and the size of the parameter tuning grid for the model.

# Load raster covariates shipped with the package
raster_paths <- list.files(system.file("extdata", package = "ldmppr"),
  pattern = "\\.tif$", full.names = TRUE
)
raster_paths <- raster_paths[!grepl("_med\\.tif$", raster_paths)]
rasters <- lapply(raster_paths, terra::rast)

# Scale rasters once up front
scaled_rasters <- scale_rasters(rasters)

# Create training data with a time column (consistent with the delta used above)
model_training_data <- small_example_data %>%
  mutate(time = power_law_mapping(size, delta = 1))
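
Before calling train_mark_model, a quick sanity check that the rasters really were scaled can be useful; a minimal sketch, assuming scale_rasters() returns a list of terra SpatRaster objects:

# Sketch: inspect the scaled rasters (list of SpatRaster objects assumed)
length(scaled_rasters)
terra::minmax(scaled_rasters[[1]]) # value range of the first raster after scaling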

mark_model <- train_mark_model(
  data = model_training_data,
  raster_list = scaled_rasters,
  scaled_rasters = TRUE,
  model_type = "xgboost",
  xy_bounds = c(0, 25, 0, 25),
  parallel = FALSE,
  include_comp_inds = TRUE,
  competition_radius = 10,
  correction = "none",
  selection_metric = "rmse",
  cv_folds = 5,
  tuning_grid_size = 20,
  verbose = TRUE
)
#> Processing data...
#> Training XGBoost model...
#> → A | warning: A correlation computation is required, but `estimate` is constant and has 0
#>                standard deviation, resulting in a divide by 0 error. `NA` will be returned.
#> There were issues with some computations   A: x8
#> There were issues with some computations   A: x24
#> There were issues with some computations   A: x40
#> There were issues with some computations   A: x40
#> 
#> Training complete!

# S3 print method
mark_model
#> <ldmppr_mark_model>
#>   engine: xgboost
#>   has fit_engine: TRUE
#>   has xgb_raw: FALSE
#>   n_features: 13
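
The same call can instead target a random forest; a minimal sketch, where the model_type label "random_forest" is an assumption (check the train_mark_model help for the exact accepted values):

# Sketch: random forest variant (model_type label assumed; see ?train_mark_model)
rf_mark_model <- train_mark_model(
  data = model_training_data,
  raster_list = scaled_rasters,
  scaled_rasters = TRUE,
  model_type = "random_forest",
  xy_bounds = c(0, 25, 0, 25),
  parallel = FALSE,
  include_comp_inds = TRUE,
  competition_radius = 10,
  correction = "none",
  selection_metric = "rmse",
  cv_folds = 5,
  tuning_grid_size = 20,
  verbose = FALSE
)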

If the user wants to save the trained mark model (and reload it at a later time), the package provides the save_mark_model and load_mark_model functions (which handle model-engine specifics appropriately):

save_path <- tempfile(fileext = ".rds")
save_mark_model(mark_model, save_path)

mark_model2 <- load_mark_model(save_path)
mark_model2

Step 3: Check model fit

Now that we have (i) an estimated process model and (ii) a trained mark model, we can check the model fit using the check_model_fit function. This function provides a variety of non-parametric summaries for point processes and global envelope tests to assess the goodness of fit of the model. The package provides individual envelope tests for the L, F, G, J, E, and V statistics, or a combined envelope test for all statistics simultaneously, making use of functionality from the spatstat and GET packages. By plotting the combined envelope test, we can visually assess the goodness of fit and obtain a \(p\)-value measuring how well the estimated model captures the relationships observed in the reference data. Typically, a \(p\)-value less than 0.05 indicates a poor fit, and the authors of the GET package recommend a minimum of 2500 simulations to obtain a valid \(p\)-value at the 0.05 level.

check_model_fit() returns an S3 object with a plot() method (by default it plots the combined global envelope test). For speed, the example below uses n_sim = 100; in practice you would use a much larger number of simulations, per the GET recommendation above.

# Create a marked point pattern from the observed data
reference_mpp <- generate_mpp(
  locations = small_example_data[, c("x", "y")],
  marks = small_example_data$size,
  xy_bounds = c(0, 25, 0, 25)
)
#> Registered S3 method overwritten by 'spatstat.geom':
#>   method       from     
#>   print.metric yardstick

# Anchor point (condition on an observed location)
M_n <- as.matrix(small_example_data[1, c("x", "y")])

model_check <- check_model_fit(
  reference_data = reference_mpp,
  t_min = 0,
  t_max = 1,
  sc_params = estimated_parameters,
  anchor_point = M_n,
  raster_list = scaled_rasters,
  scaled_rasters = TRUE,
  mark_model = mark_model,
  xy_bounds = c(0, 25, 0, 25),
  include_comp_inds = TRUE,
  thinning = TRUE,
  correction = "none",
  competition_radius = 10,
  n_sim = 100,
  save_sims = FALSE,
  verbose = TRUE,
  seed = 90210
)
#> Beginning Data Simulations...
#> Data Simulations Complete...

# S3 plot method (combined global envelope test)
plot(model_check)
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the GET package.
#>   Please report the issue at <https://github.com/myllym/GET/issues>.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: `aes_()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`
#> ℹ The deprecated feature was likely used in the GET package.
#>   Please report the issue at <https://github.com/myllym/GET/issues>.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

You can still access the underlying GET objects if you want (e.g., model_check$combined_env), but for most workflows plot(model_check) is sufficient.
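
For example, a minimal sketch for pulling the numeric \(p\)-value out of the combined test (assuming combined_env is a standard GET global envelope object, which stores its \(p\)-value as an attribute):

# Sketch: extract the p-value from the underlying GET envelope object
attr(model_check$combined_env, "p")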

Step 4: Simulate from the fitted model

Finally, we can simulate a new marked point process realization from the fitted model.

simulate_mpp() returns an object of class ldmppr_sim with plot() and as.data.frame() methods.

simulated <- simulate_mpp(
  sc_params = estimated_parameters,
  t_min = 0,
  t_max = 1,
  anchor_point = M_n,
  raster_list = scaled_rasters,
  scaled_rasters = TRUE,
  mark_model = mark_model,
  xy_bounds = c(0, 25, 0, 25),
  include_comp_inds = TRUE,
  competition_radius = 10,
  correction = "none",
  thinning = TRUE
)

# Plot method for ldmppr_sim
plot(simulated)


# Data-frame view of the simulated realization
head(as.data.frame(simulated))
#>        time         x         y    marks
#> 1 0.0000000 10.000000 14.000000 570.8120
#> 2 0.5213940 13.418784  6.459381 525.7977
#> 3 0.5475949  5.723242  3.441871 490.7401
#> 4 0.5993501  4.807220 17.550734 526.4841
#> 5 0.6048312 21.538434  2.632380 440.5092
#> 6 0.6319068 14.873970 10.073311 370.6411
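
A quick way to eyeball the realization against the reference data is to compare the mark distributions; a minimal sketch:

# Sketch: compare observed sizes with simulated marks
summary(small_example_data$size)
summary(as.data.frame(simulated)$marks)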

Summary

This vignette demonstrates the core ldmppr workflow using the updated S3-based interfaces:

  1. estimate_process_parameters() returns an ldmppr_fit holding the estimated self-correcting process parameters,
  2. train_mark_model() returns an ldmppr_mark_model wrapping the trained engine,
  3. check_model_fit() returns an object whose plot() method displays the combined global envelope test, and
  4. simulate_mpp() returns an ldmppr_sim with plot() and as.data.frame() methods.

These objects are designed to be composable and to behave like standard R model objects through S3 methods.
