Feature samplers are a core component of perturbation-based feature importance methods (PFI, CFI, RFI) and of other methods based on some form of marginalization, such as SAGE. They determine how features are replaced or perturbed when evaluating feature importance. This vignette introduces the different types of feature samplers available in xplainfi and demonstrates their use.
We create two tasks: one with mixed features (the penguins data), and one with all-numeric features.
All feature samplers inherit from the FeatureSampler
base class, which provides a common interface for sampling features.
Feature Type Support: Each sampler declares which feature types it supports:
# Check supported feature types for different samplers
task_mixed$feature_types
#> Key: <id>
#> id type
#> <char> <char>
#> 1: bill_depth numeric
#> 2: bill_length numeric
#> 3: body_mass integer
#> 4: flipper_length integer
#> 5: island factor
#> 6: sex factor
#> 7: year integer
permutation = MarginalPermutationSampler$new(task_mixed)
permutation$feature_types
#> [1] "numeric" "factor" "ordered" "integer" "logical" "Date"
#> [7] "POSIXct" "character"Two Sampling Methods:
$sample(feature, row_ids) - Sample from the stored
task$sample_newdata(feature, newdata) - Sample using
external dataLet’s demonstrate both with the permutation sampler:
# Sample from stored task (using row_ids)
sampled_task = permutation$sample(
feature = "bill_length",
row_ids = 40:45
)
sampled_task
#> species bill_depth bill_length body_mass flipper_length island sex year
#> <fctr> <num> <num> <int> <int> <fctr> <fctr> <int>
#> 1: Adelie 19.1 37.0 4650 184 Dream male 2007
#> 2: Adelie 18.0 36.5 3150 182 Dream female 2007
#> 3: Adelie 18.4 44.1 3900 195 Dream male 2007
#> 4: Adelie 18.5 36.0 3100 186 Dream female 2007
#> 5: Adelie 19.7 39.8 4400 196 Dream male 2007
#> 6: Adelie 16.9 40.8 3000 185 Dream female 2007
# Sample from "external" data
test_data = task_mixed$data(rows = 40:45)
sampled_external = permutation$sample_newdata(
feature = "bill_length",
newdata = test_data
)
sampled_external
#> species bill_depth bill_length body_mass flipper_length island sex year
#> <fctr> <num> <num> <int> <int> <fctr> <fctr> <int>
#> 1: Adelie 19.1 37.0 4650 184 Dream male 2007
#> 2: Adelie 18.0 36.0 3150 182 Dream female 2007
#> 3: Adelie 18.4 44.1 3900 195 Dream male 2007
#> 4: Adelie 18.5 39.8 3100 186 Dream female 2007
#> 5: Adelie 19.7 36.5 4400 196 Dream male 2007
#> 6: Adelie 16.9 40.8 3000 185 Dream female 2007

Notice that in both cases only the requested feature (`bill_length`) is resampled; all other columns remain unchanged.
The MarginalPermutationSampler performs simple random
permutation of features, breaking their relationship with the target and
other features. This is the classic approach used in Permutation Feature
Importance (PFI).
How it works:
# Create permutation sampler
permutation = MarginalPermutationSampler$new(task_mixed)
# Sample a continuous feature
original = task_mixed$data(rows = 1:10)
sampled = permutation$sample("bill_length", row_ids = 1:10)
# Compare original and sampled values
data.table(
original_bill = original$bill_length,
sampled_bill = sampled$bill_length,
sex = original$sex # Unchanged
)
#> original_bill sampled_bill sex
#> <num> <num> <fctr>
#> 1: 39.1 36.7 male
#> 2: 39.5 39.5 female
#> 3: 40.3 34.1 female
#> 4: NA NA <NA>
#> 5: 36.7 39.1 female
#> 6: 39.3 38.9 male
#> 7: 38.9 40.3 female
#> 8: 39.2 42.0 male
#> 9: 34.1 39.3 <NA>
#> 10: 42.0 39.2 <NA>

Note that the permutation is only performed within the requested `row_ids`.
Use in PFI: The permutation sampler is used by default in Permutation Feature Importance (PFI).
The MarginalReferenceSampler is another type of marginal
sampler that samples complete rows from reference data. Unlike
MarginalPermutationSampler which shuffles feature values
independently, this sampler preserves within-row dependencies by
sampling intact observations.
Key differences from MarginalPermutationSampler:

- rows are drawn intact from a reference pool, so within-row dependencies between jointly sampled features are preserved
- the reference pool can be a subsample of the task data (via `n_samples`) rather than the evaluated rows themselves
This is the approach used in SAGE (Shapley Additive Global importancE) for marginal feature importance, where it is implemented by the MarginalImputer. The “imputer” name comes from the “imputation” of model predictions using out-of-coalition features sampled from their marginal distribution. In xplainfi, we separate the “feature sampling” and “model prediction” steps (for now), which is why we keep the sampling infrastructure independent.
How it works:
# Create marginal reference sampler with n_samples reference pool
marginal_ref = MarginalReferenceSampler$new(task_mixed, n_samples = 30L)
# Sample a feature - each row gets values from a randomly sampled reference row
original = task_mixed$data(rows = 1:5)
sampled = marginal_ref$sample("bill_length", row_ids = 1:5)
# Compare
data.table(
original_bill = original$bill_length,
sampled_bill = sampled$bill_length,
sex = original$sex # Unchanged
)
#> original_bill sampled_bill sex
#> <num> <num> <fctr>
#> 1: 39.1 44.4 male
#> 2: 39.5 36.6 female
#> 3: 40.3 34.6 female
#> 4: NA 45.1 <NA>
#> 5: 36.7 47.3 female

Parameters:

- `n_samples`: controls the size of the reference data pool
- `NULL`: uses all task data

Preserving within-row correlations:
To demonstrate the difference, consider features that are correlated
as in task_numeric. Here, x1 and
x2 are correlated, and if we sample them jointly, only
MarginalReferenceSampler retains their original correlation
(approximately).
# Sample with MarginalPermutationSampler (breaks correlations)
perm = MarginalPermutationSampler$new(task_numeric)
sampled_perm = perm$sample(c("x1", "x2"), row_ids = 1:10)
# Sample with MarginalReferenceSampler (preserves within-row correlations)
ref = MarginalReferenceSampler$new(task_numeric, n_samples = 50L)
sampled_ref = ref$sample(c("x1", "x2"), row_ids = 1:10)
# Check correlations
cor_original = cor(task_numeric$data()$x1, task_numeric$data()$x2)
cor_perm = cor(sampled_perm$x1, sampled_perm$x2)
cor_ref = cor(sampled_ref$x1, sampled_ref$x2)
data.table(
method = c("Original", "Permutation", "Reference"),
correlation = c(cor_original, cor_perm, cor_ref)
)
#> method correlation
#> <char> <num>
#> 1: Original 0.8937600
#> 2: Permutation -0.1390767
#> 3: Reference 0.9352523

The reference sampler better preserves the correlation structure because it samples complete rows, while permutation completely breaks the dependency.
Conditional samplers account for dependencies between features by sampling from \(P(X_j | X_{-j})\) rather than the marginal \(P(X_j)\). This is relevant when features are correlated.
All conditional samplers inherit from ConditionalSampler and support:

- a `conditioning_set` argument specifying the features to condition on
- the `$sample()` and `$sample_newdata()` methods

The ConditionalGaussianSampler assumes features follow a
multivariate Gaussian distribution and uses closed-form conditional
distributions.
Advantages:

- very fast, since the conditional distributions are available in closed form
- exact if the Gaussian assumption holds

Limitations:

- assumes a multivariate Gaussian distribution
- supports continuous features only
# Create Gaussian conditional sampler
gaussian = ConditionalGaussianSampler$new(task_numeric)
# Sample x1 conditioned on other features
sampled = gaussian$sample(
feature = "x1",
row_ids = 1:10,
conditioning_set = c("x2", "x3", "x4")
)
# Compare original and conditionally sampled values
original = task_numeric$data(rows = 1:10)
data.table(
original = original$x1,
sampled = sampled$x1,
x2 = original$x2 # Conditioning feature (unchanged)
)
#> original sampled x2
#> <num> <num> <num>
#> 1: 0.16263041 0.33337129 0.05484016
#> 2: -1.12534736 -0.01459085 -0.56460970
#> 3: 1.13740298 1.65973385 1.23103526
#> 4: -0.17251462 -0.74192788 -0.06467517
#> 5: 0.23242358 0.53626850 0.68399963
#> 6: -0.89523133 0.01776016 -0.65218453
#> 7: 1.37772605 0.67430459 1.36146493
#> 8: 0.01994107 0.08830762 0.18946096
#> 9: -0.38992835 -0.24925197 -0.90303808
#> 10: 0.23092028 0.58651185 0.23138698

Notice that the sampled values respect the conditional distribution - they’re different from the original but plausible given the conditioning features.
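For intuition, the closed-form conditional that a Gaussian sampler draws from can be sketched in a few lines of numpy. This is a toy illustration of the math only; the helper `sample_x1_given_x2` is hypothetical and is not ConditionalGaussianSampler's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate two correlated features and fit mean/covariance,
# as a Gaussian conditional sampler would do from task data.
n = 5000
true_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], true_cov, size=n)
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)

def sample_x1_given_x2(x2_value):
    """Draw x1 | x2 = x2_value from the fitted bivariate Gaussian."""
    # Closed-form conditional:
    #   mean     = mu_1 + S_12 / S_22 * (x2_value - mu_2)
    #   variance = S_11 - S_12^2 / S_22
    m = mu[0] + S[0, 1] / S[1, 1] * (x2_value - mu[1])
    v = S[0, 0] - S[0, 1] ** 2 / S[1, 1]
    return rng.normal(m, np.sqrt(v))

# Conditioning on a large x2 pulls the sampled x1 values up with it:
draws = np.array([sample_x1_given_x2(2.0) for _ in range(2000)])
print(round(draws.mean(), 2))  # close to 0.8 * 2.0 = 1.6
```

The conditional variance (here about 0.36 instead of the marginal 1.0) is why conditionally sampled values look "plausible": they are shrunk toward what the conditioning features predict.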
The ConditionalARFSampler uses Adversarial Random
Forests to model complex conditional distributions. It’s the most
flexible conditional sampler.
Advantages:

- handles all feature types, including categorical features
- makes no distributional assumptions and can model complex dependencies

Limitations:

- moderate speed, since an Adversarial Random Forest must be fitted
# Create ARF sampler (works with full task including categorical features)
arf = ConditionalARFSampler$new(task_mixed, num_trees = 20, verbose = FALSE)
# Sample island conditioned on body measurements
sampled = arf$sample(
feature = "island",
row_ids = 1:10,
conditioning_set = c("bill_length", "body_mass")
)
# Compare original and sampled island
original = task_mixed$data(rows = 1:10)
data.table(
original_island = original$island,
sampled_island = sampled$island,
bill_length = original$bill_length, # Conditioning feature
body_mass = original$body_mass # Conditioning feature
)
#> original_island sampled_island bill_length body_mass
#> <fctr> <fctr> <num> <int>
#> 1: Torgersen Torgersen 39.1 3750
#> 2: Torgersen Torgersen 39.5 3800
#> 3: Torgersen Dream 40.3 3250
#> 4: Torgersen Dream NA NA
#> 5: Torgersen Biscoe 36.7 3450
#> 6: Torgersen Torgersen 39.3 3650
#> 7: Torgersen Dream 38.9 3625
#> 8: Torgersen Torgersen 39.2 4675
#> 9: Torgersen Dream 34.1 3475
#> 10: Torgersen Biscoe 42.0 4250

Use in CFI: ConditionalARFSampler is the default for Conditional Feature Importance since it can be used with any task, unlike other samplers.
The ConditionalCtreeSampler uses conditional inference
trees to partition the feature space and sample from local
neighborhoods.
Advantages:

- handles all feature types
- interpretable: sampling happens within explicit partitions of the data

Limitations:

- moderate speed
- conditioning is limited to the partitions the tree finds
# Create ctree sampler
ctree = ConditionalCtreeSampler$new(task_mixed)
# Sample with default parameters
sampled = ctree$sample(
feature = "bill_length",
row_ids = 1:10,
conditioning_set = "island"
)
original = task_mixed$data(rows = 1:10)
data.table(
island = original$island, # Conditioning feature
original = original$bill_length,
sampled = sampled$bill_length
)
#> island original sampled
#> <fctr> <num> <num>
#> 1: Torgersen 39.1 38.7
#> 2: Torgersen 39.5 36.6
#> 3: Torgersen 40.3 46.0
#> 4: Torgersen NA 36.2
#> 5: Torgersen 36.7 35.9
#> 6: Torgersen 39.3 39.7
#> 7: Torgersen 38.9 40.3
#> 8: Torgersen 39.2 34.1
#> 9: Torgersen 34.1 39.3
#> 10: Torgersen 42.0 36.2

The ctree sampler partitions observations based on the conditioning features and samples from within the same partition (terminal node).
The ConditionalKNNSampler finds k nearest neighbors
based on conditioning features and samples from them.
Advantages:

- fast and simple
- handles all feature types (the distance metric is selected automatically)

Limitations:

- captures only simple local structure
- results depend on the choice of k
Distance metric:
The sampler automatically selects the appropriate distance metric based on conditioning features:
# Create kNN sampler with k=5 neighbors
knn_numeric = ConditionalKNNSampler$new(task_numeric, k = 5)
# Sample x1 based on nearest neighbors in (x2, x3) space
sampled_numeric = knn_numeric$sample(
feature = "x1",
row_ids = 1:5,
conditioning_set = c("x2", "x3")
)
original_numeric = task_numeric$data(rows = 1:5)
data.table(
x2 = original_numeric$x2,
x3 = original_numeric$x3,
original_x1 = original_numeric$x1,
sampled_x1 = sampled_numeric$x1
)
#> x2 x3 original_x1 sampled_x1
#> <num> <num> <num> <num>
#> 1: 0.05484016 -0.6513588 0.1626304 -0.5121550
#> 2: -0.56460970 -0.3418385 -1.1253474 -0.2887009
#> 3: 1.23103526 -1.2727419 1.1374030 0.5183023
#> 4: -0.06467517 -0.7588294 -0.1725146 0.1626304
#> 5: 0.68399963 -2.8095712 0.2324236 1.2354913

# Use task with categorical features
knn_mixed = ConditionalKNNSampler$new(task_mixed, k = 5)
# Sample bill_length conditioning on island (categorical) and body_mass (numeric)
sampled_mixed = knn_mixed$sample(
feature = "bill_length",
row_ids = 1:5,
conditioning_set = c("island", "body_mass")
)
original_mixed = task_mixed$data(rows = 1:5)
data.table(
island = original_mixed$island,
body_mass = original_mixed$body_mass,
original_bill = original_mixed$bill_length,
sampled_bill = sampled_mixed$bill_length
)
#> island body_mass original_bill sampled_bill
#> <fctr> <int> <num> <num>
#> 1: Torgersen 3750 39.1 NA
#> 2: Torgersen 3800 39.5 39.5
#> 3: Torgersen 3250 40.3 38.8
#> 4: Torgersen NA NA 38.8
#> 5: Torgersen 3450 36.7 38.7

The kNN sampler finds the k most similar observations (based on conditioning features) and samples from their feature values. The distance metric is chosen automatically based on feature types.
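The mechanics can be sketched in a few lines of numpy. This is a toy illustration: the helper `knn_conditional_sample` is hypothetical and not part of xplainfi.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: x1 depends strongly on the conditioning feature x2.
n = 200
x2 = rng.normal(size=n)
x1 = 2.0 * x2 + rng.normal(scale=0.1, size=n)

def knn_conditional_sample(i, k=5):
    """Resample x1[i] from the k nearest neighbors in x2 (absolute distance)."""
    dist = np.abs(x2 - x2[i])
    dist[i] = np.inf                  # exclude the row itself
    neighbors = np.argsort(dist)[:k]  # indices of the k closest rows
    return rng.choice(x1[neighbors])  # draw one neighbor's x1 value

i = int(np.argmin(np.abs(x2)))  # pick a row in a dense region of x2
sampled = knn_conditional_sample(i)
# The sampled value stays close to the conditional mean 2 * x2[i],
# because the neighbors all have similar x2 values.
print(x1[i], sampled)
```

Because the neighborhood is defined only by the conditioning features, this approximates a draw from P(x1 | x2) without fitting any parametric model.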
Now that we’ve seen conditional samplers, we can understand an important limitation of knockoff samplers: unlike the conditional samplers above, knockoffs don’t support arbitrary conditioning sets.
Knockoff samplers create synthetic features (knockoffs) that satisfy specific statistical properties. They must fulfill the knockoff swap property: swapping a feature with its knockoff should not change the joint distribution.
Knockoffs are a separate category because:

- they do not accept an arbitrary conditioning_set; knockoffs are generated jointly for all features
- they must satisfy the swap property rather than model one specific conditional distribution
For multivariate Gaussian data, we can construct exact knockoffs:
# Create Gaussian knockoff sampler (using task_numeric from earlier)
knockoff = KnockoffGaussianSampler$new(task_numeric)
# Generate knockoffs
original = task_numeric$data(rows = 1:5)
knockoffs = knockoff$sample(
feature = task_numeric$feature_names,
row_ids = 1:5
)
# Original vs knockoff values
data.table(
x1_original = original$x1,
x1_knockoff = knockoffs$x1,
x2_original = original$x2,
x2_knockoff = knockoffs$x2
)
#> x1_original x1_knockoff x2_original x2_knockoff
#> <num> <num> <num> <num>
#> 1: 0.1626304 0.2797155 0.05484016 0.30939036
#> 2: -1.1253474 -0.5596204 -0.56460970 -1.01756047
#> 3: 1.1374030 1.4330571 1.23103526 1.33693941
#> 4: -0.1725146 -0.2972019 -0.06467517 -0.41006558
Key properties of knockoffs:

- swapping any feature with its knockoff leaves the joint distribution unchanged (the swap property)
- knockoffs replicate the covariance structure of the original features while carrying no information about the target beyond what the original features contain
Conditional Independence Testing: Knockoffs are particularly relevant for conditional independence testing as implemented in the cpi package. You can combine knockoff samplers with CFI and perform inference:
# CFI with knockoff sampler for conditional independence testing
cfi_knockoff = CFI$new(
task = task_numeric,
learner = lrn("regr.ranger"),
measure = msr("regr.mse"),
sampler = knockoff
)
# Compute importance with CPI-based inference
cfi_knockoff$compute()
cfi_knockoff$importance(ci_method = "cpi")

See vignette("inference") for more details on statistical inference with feature importance.
Key takeaways:
| Sampler | Feature Types | Assumptions | Speed | Use Case |
|---|---|---|---|---|
| MarginalPermutationSampler | All | None | Very fast | PFI, uncorrelated features |
| KnockoffGaussianSampler | Continuous | Multivariate normal | Fast | Model-X knockoffs |
| ConditionalGaussianSampler | Continuous | Multivariate normal | Very fast | CFI with continuous features |
| ConditionalARFSampler | All | None | Moderate | CFI, complex dependencies |
| ConditionalCtreeSampler | All | None | Moderate | CFI, interpretable sampling |
| ConditionalKNNSampler | All | None (auto-selects distance) | Fast | CFI, simple local structure |
General guidelines:

- if features are (approximately) independent, a marginal sampler is sufficient and fastest
- if features are correlated, use a conditional sampler; ConditionalARFSampler is a safe default since it works with any task
- for statistical inference such as conditional independence testing (via CPI), use a knockoff sampler