TransHDM Tutorial

Overview of TransHDM

This work addresses the problem of high-dimensional mediation analysis under limited sample sizes. To overcome the challenges of high dimensionality and low statistical power, we adopt a transfer learning framework that leverages external source data to improve mediation inference in the target population.

The proposed method, TransHDM, follows a three-stage procedure. It first applies Sure Independence Screening (SIS) to reduce the dimensionality of candidate mediators. It then estimates exposure–mediator and mediator–outcome effects and combines them to obtain indirect (mediation) effects. Finally, overall mediation effects, including indirect, direct, and total effects, are summarized for inference.

Specifically, SIS ranks mediators based on marginal associations and retains a manageable subset, improving computational efficiency and screening accuracy. Exposure–mediator and mediator–outcome effects are then estimated using high-dimensional regression techniques, optionally incorporating transferable source data. These estimates are combined through the product-of-coefficients approach to quantify mediator-specific indirect effects and overall mediation effects.
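As a minimal illustration of the product-of-coefficients idea (a sketch with made-up numbers, not the package implementation), mediator-specific indirect effects are elementwise products of the two coefficient vectors:

```r
# Hypothetical alpha/beta estimates for five mediators
alpha <- c(0.64, 0.65, 0.48, 0.53, 0.00)  # exposure-mediator effects
beta  <- c(0.68, 0.51, 0.66, 0.52, 0.30)  # mediator-outcome effects

indirect <- alpha * beta   # mediator-specific indirect effects
overall  <- sum(indirect)  # overall indirect (mediation) effect
```

A mediator contributes to the overall indirect effect only when both its alpha and beta are nonzero, which is why the fifth mediator above contributes nothing.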

Through this staged design and the integration of transferable information, TransHDM more accurately identifies true mediators and reduces false discoveries, particularly in small-sample, high-dimensional settings.

[Figure: TransHDM framework]

The methodology implemented in TransHDM is based on the framework proposed in Pan L, Liu Y, Huang C, Lin R, Yu Y, Qin G. Transfer learning reveals the mediating mechanisms of cross-ethnic lipid metabolic pathways in the association between APOE gene and Alzheimer’s disease. Brief Bioinform. 2025;26(5):bbaf460. doi:10.1093/bib/bbaf460

Basic Workflow

library(TransHDM)

Simulation Data Generation Design

We generate synthetic data for high-dimensional mediation analysis under different settings.

The target data follow a homogeneous data-generating process, where treatment, mediators, and outcome depend on covariates. Additional source datasets with larger sample sizes are generated under both transferable and non-transferable settings, with and without covariate shift (the heterogeneous and homogeneous settings, respectively).

This setup is used to illustrate how the method behaves under different data scenarios.

seed <- 1
set.seed(seed)

# ---------------- Simulation Parameters ---------------- #
p_m <- 50     # num of mediators
n <- 100      # num of target samples
rho <- 0.1    # rho for simulation data generation
p_x <- 5      # num of covariates
n_s <- 300    # num of source samples

# ---------------- Target Data Generation ---------------- #
target_sim <- gen_simData_homo(n = n, p_x = p_x, p_m = p_m, rho = rho)
target_data <- target_sim$data

# true mediator-specific indirect effects (product of coefficients)
true_effect <- target_sim$coef$beta2 * target_sim$coef$alpha1

# column names
M_col <- paste0("M", 1:p_m)
X_col <- paste0("X", 1:p_x)

# ---------------- Source Data Generation ---------------- #
# source, transferable, homogeneous
s_data <- gen_simData_homo(n = n_s, p_x = p_x, p_m = p_m, rho = rho,
                          source = TRUE, transferable = TRUE, h = 2, seed = seed)$data

# source, not transferable, homogeneous
s_f_data <- gen_simData_homo(n = n_s, p_x = p_x, p_m = p_m, rho = rho,
                            source = TRUE, transferable = FALSE, h = 2, seed = seed)$data

# source, transferable, heterogeneous
s_h_data <- gen_simData_hetero(n = n_s, p_x = p_x, p_m = p_m, rho = rho,
                              source = TRUE, transferable = TRUE, h = 2, seed = seed)$data

# source, not transferable, heterogeneous
s_hf_data <- gen_simData_hetero(n = n_s, p_x = p_x, p_m = p_m, rho = rho,
                               source = TRUE, transferable = FALSE, h = 2, seed = seed)$data

Display the true mediator effects used in the simulation and briefly inspect the structure of the target dataset.

# ---------------- Show Data ---------------- #
# show true mediator effect
true_effect
#>  [1] 0.2400 0.2475 0.2925 0.2500 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
#> [11] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
#> [21] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
#> [31] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
#> [41] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

# show target data (uncomment to inspect)
# head(target_data)

Source Data Detection

We first apply the source detection procedure to identify which external datasets are transferable to the target data.

The function source_detection() compares prediction performance across candidate source datasets using cross-validation, with the target-only model as a baseline.

In this example, four source datasets are considered under different data settings. The summary output reports the validation losses and indicates which sources are selected as transferable.
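The decision rule suggested by the summary output can be sketched with plain numbers. This is a hypothetical reconstruction using the validation losses printed below, assuming T_index is the source-minus-target loss and the threshold is C0 times the target loss:

```r
# Hypothetical reconstruction of the transferability rule
target_loss <- 31.9718
source_loss <- c(31.4739, 35.1045, 31.2694, 34.3462)
C0 <- 0.01

T_index <- source_loss - target_loss  # excess loss relative to target
transferable <- T_index <= C0 * target_loss
which(transferable)  # sources 1 and 3
```

A source is retained when its cross-validated loss does not exceed the target-only baseline by more than the tolerance, so sources 2 and 4 (the non-transferable settings) are excluded.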

detect_all <- source_detection(
  target_data = target_data,
  source_data = list(s_data, s_f_data, s_h_data, s_hf_data),
  Y = "Y",
  D = "D",
  M = M_col,
  X = X_col,
  kfold = 5,
  C0 = 0.01,
  verbose = TRUE
)
#> Transfer candidate sources: 1, 3
summary(detect_all)
#> Source detection summary
#> ------------------------
#> Number of sources: 4 
#> Number of transferable sources: 2 
#> 
#> Target validation loss:
#>   mean = 31.9718 
#> 
#> Threshold:
#>   0.3197 
#> 
#> Source-wise comparison:
#>  Source SourceLoss T_index Transferable
#>       1    31.4739 -0.4978          YES
#>       2    35.1045  3.1327           NO
#>       3    31.2694 -0.7024          YES
#>       4    34.3462  2.3744           NO

High-Dimensional Mediation Analysis with TransHDM

We illustrate the use of TransHDM for conducting high-dimensional mediation analysis under a transfer learning framework.

We first apply TransHDM to the target data only as a baseline analysis.

set.seed(seed)

# mediation analysis without transfer learning
res_n <- TransHDM(
  target_data = target_data,
  source_data = NULL,
  Y = "Y",
  D = "D",
  M = M_col,
  X = X_col,
  transfer = FALSE,
  topN = NULL,
  dblasso_SIS = FALSE,
  verbose = TRUE,
  ncore = 1
)
#> Step 1: Sure Independence Screening ...  (21:07:57)
#> Top 44 mediators selected: M1, M2, M3, M4, M42, M7, M26, M5, M49, M16, M35, M24, M11, M33, M12, M38, M6, M13, M21, M20, M36, M22, M39, M27, M50, M19, M43, M46, M8, M9, M10, M14, M15, M17, M18, M23, M25, M28, M29, M30, M31, M32, M34, M37  (21:07:59)
#> After SIS, 44 / 50 mediators are retained.
#> Step 2: De-biased Lasso Estimates ... (21:07:59)
#> Estimation of mediator-outcome effects in the outcome model completed.
#> Estimation of exposure-mediator effects in the mediator model completed.
#> Step 3: Multiple-testing procedure ... (21:08:18)
#> Identified mediator(s): M1, M2, M3, M4, M42
summary(res_n)
#> ====================================================
#>         Summary of TransHDM Mediation Analysis        
#> ====================================================
#> 
#> Overall Effects:
#>    effect estimate
#>  indirect   1.4594
#>    direct   1.0103
#>     total   2.4697
#>        pe   0.5909
#> 
#> Identified Mediators:
#>   Number of selected mediators: 5 
#> 
#> Top 5 mediators by |alpha * beta|:
#> 
#>  mediator  alpha   alpha_pv   beta    beta_pv alpha_beta      ab_pv     pa
#>        M1 0.6422 5.3018e-14 0.6805 1.6636e-29     0.4370 5.3018e-14 0.1770
#>        M2 0.6519 1.4928e-15 0.5096 2.2062e-15     0.3322 2.2062e-15 0.1345
#>        M3 0.4811 6.5654e-09 0.6564 2.3574e-24     0.3158 6.5654e-09 0.1279
#>        M4 0.5282 5.3849e-11 0.5156 9.6466e-13     0.2723 5.3849e-11 0.1103
#>       M42 0.3725 3.7118e-06 0.2738 4.0122e-04     0.1020 4.0122e-04 0.0413
#> 
#> Note:
#>   alpha: exposure-mediator effect
#>   beta : mediator-outcome effect
#>   pa   : proportion of total effect explained
#>   P-values are shown in scientific notation
#> 
#> ====================================================

We then incorporate external source data to enable transfer learning, considering both homogeneous and heterogeneous source settings.

The outputs summarize the selected mediators and their estimated indirect effects, allowing comparison between analyses with and without transfer learning.

# mediation analysis with transfer learning (using homogeneous data)
res_t <- TransHDM(
  target_data = target_data,
  source_data = s_data,
  Y = "Y",
  D = "D",
  M = M_col,
  X = X_col,
  transfer = TRUE,
  topN = NULL,
  dblasso_SIS = FALSE,
  verbose = TRUE,
  ncore = 1
)
#> Step 1: Sure Independence Screening ...  (21:08:18)
#> Top 44 mediators selected: M3, M1, M4, M2, M11, M7, M32, M5, M9, M35, M17, M27, M49, M20, M13, M30, M43, M25, M26, M21, M23, M45, M6, M8, M10, M12, M14, M15, M16, M18, M19, M22, M24, M28, M29, M31, M33, M34, M36, M37, M38, M39, M40, M41  (21:08:32)
#> After SIS, 44 / 50 mediators are retained.
#> Step 2: De-biased Lasso Estimates ... (21:08:32)
#> Estimation of mediator-outcome effects in the outcome model completed.
#> Estimation of exposure-mediator effects in the mediator model completed.
#> Step 3: Multiple-testing procedure ... (21:09:08)
#> Identified mediator(s): M1, M2, M3, M4
summary(res_t)
#> ====================================================
#>         Summary of TransHDM Mediation Analysis        
#> ====================================================
#> 
#> Overall Effects:
#>    effect estimate
#>  indirect   1.0095
#>    direct   0.9790
#>     total   1.9885
#>        pe   0.5077
#> 
#> Identified Mediators:
#>   Number of selected mediators: 4 
#> 
#> Top 4 mediators by |alpha * beta|:
#> 
#>  mediator  alpha   alpha_pv   beta    beta_pv alpha_beta      ab_pv     pa
#>        M1 0.6240 1.2345e-14 0.5275 1.4378e-12     0.3292 1.4378e-12 0.1655
#>        M3 0.5234 9.7402e-11 0.4725 1.1895e-09     0.2473 1.1895e-09 0.1244
#>        M4 0.4692 6.5982e-09 0.4743 9.8147e-10     0.2226 6.5982e-09 0.1119
#>        M2 0.5476 7.3236e-12 0.3844 4.2838e-07     0.2105 4.2838e-07 0.1059
#> 
#> Note:
#>   alpha: exposure-mediator effect
#>   beta : mediator-outcome effect
#>   pa   : proportion of total effect explained
#>   P-values are shown in scientific notation
#> 
#> ====================================================

# mediation analysis with transfer learning (using heterogeneous data)
res_h <- TransHDM(
  target_data = target_data,
  source_data = s_h_data,
  Y = "Y",
  D = "D",
  M = M_col,
  X = X_col,
  transfer = TRUE,
  topN = NULL,
  dblasso_SIS = FALSE,
  verbose = TRUE,
  ncore = 1
)
summary(res_h)

With limited target sample size, some spurious mediators may be selected. Incorporating transferable source data leads to more stable estimates and fewer false discoveries.
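This can be seen directly by comparing the identified mediator sets with the truly active mediators. The following is a post-hoc check with the names copied from the verbose output above; M1-M4 are the mediators with nonzero true indirect effects:

```r
true_med  <- paste0("M", 1:4)                  # nonzero true indirect effects
sel_no_tl <- c("M1", "M2", "M3", "M4", "M42")  # identified without transfer
sel_tl    <- c("M1", "M2", "M3", "M4")         # identified with transfer

setdiff(sel_no_tl, true_med)  # "M42": a false discovery without transfer
setdiff(sel_tl, true_med)     # character(0): no false discoveries
```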

Parallel Computation

The TransHDM function supports parallel computation through the ncore argument, which allows mediator models to be fitted using multiple CPU cores.

res_p <- TransHDM(
  target_data = target_data,
  source_data = s_data,
  Y = "Y",
  D = "D",
  M = M_col,
  X = X_col,
  transfer = TRUE,
  topN = NULL,
  dblasso_SIS = FALSE,
  verbose = TRUE,
  ncore = 4
)
summary(res_p)

Utility functions

Sure Independence Screening (SIS)

We use Sure Independence Screening (SIS) as the first step to reduce the number of candidate mediators in high-dimensional settings. Mediators are ranked by their marginal associations with the exposure and the outcome, and only the top mediators are retained. SIS can be applied with or without transfer learning.

# SIS without transfer learning
SIS_n <- SIS(
  target_data = target_data,
  source_data = NULL,
  Y = "Y",
  D = "D",
  M = M_col,
  X = X_col,
  topN = 10,
  transfer = FALSE,
  verbose = TRUE,
  ncore = 1,
  dblasso_method = FALSE
)
summary(SIS_n)

# SIS with transfer learning
SIS_t <- SIS(
  target_data = target_data,
  source_data = s_data,
  Y = "Y",
  D = "D",
  M = M_col,
  X = X_col,
  topN = 10,
  transfer = TRUE,
  verbose = TRUE,
  ncore = 1,
  dblasso_method = FALSE
)
summary(SIS_t)

Linear Regression Models for Transfer Learning

This section demonstrates two transfer learning methods for linear regression implemented in the package: standard lasso transfer learning and double/debiased lasso transfer learning.

library(MASS)
n_target <- 1000
n_source <- 2000
p <- 20

Sigma <- 0.2^abs(outer(1:p, 1:p, "-"))  # Autocorrelation structure, weak correlation
X_target <- mvrnorm(n_target, mu = rep(0, p), Sigma = Sigma)
X_source <- mvrnorm(n_source, mu = rep(0, p), Sigma = Sigma)

# Construct signal coefficients
# First 3 variables are strong signals, next 2 are weak signals, rest are zero
beta <- c(1.5, -1, 1.0, 0.5, -0.5, rep(0, p-5))

# Construct response variables with noise
y_target <- X_target %*% beta + rnorm(n_target, sd = 1)
y_source <- X_source %*% beta + rnorm(n_source, sd = 1)

# Build target/source lists
target <- list(x = X_target, y = y_target)
source <- list(x = X_source, y = y_source)

Lasso for transfer learning

Primarily used by the source_detection() and SIS() functions. Fits a lasso regression, optionally leveraging transferable source data.

# Fit lasso without transfer learning
coef_n_l <- lasso(target = target, transfer = FALSE, lambda = 'lambda.1se')
summary(coef_n_l)
#> 
#> Lasso regression summary
#> -------------------------
#> Intercept:
#>   -0.0176811 
#> 
#> Model size:
#>   Number of predictors: 20 
#>   Nonzero coefficients: 5 
#> 
#> Selected variables:
#>   X1, X2, X3, X4, X5 
#> 
#> Coefficient norms (excluding intercept):
#>   L1 norm:  3.8859 
#>   L2 norm: 1.93817 
#>   Max |coef|: 1.36963 
#> -------------------------

# Fit lasso with transfer learning
coef_t_l <- lasso(target = target, source = source, transfer = TRUE, lambda = 'lambda.1se')
summary(coef_t_l)
#> 
#> Lasso regression summary
#> -------------------------
#> Intercept:
#>   -0.0176995 
#> 
#> Model size:
#>   Number of predictors: 20 
#>   Nonzero coefficients: 5 
#> 
#> Selected variables:
#>   X1, X2, X3, X4, X5 
#> 
#> Coefficient norms (excluding intercept):
#>   L1 norm: 4.06288 
#>   L2 norm: 1.98887 
#>   Max |coef|: 1.38344 
#> -------------------------

Debiased Lasso for transfer learning

Primarily used in the estimation step of the TransHDM() function. This two-stage debiasing procedure addresses regularization bias:

1. First debiasing step: transfer-learning bias correction using the source data.
2. Second debiasing step: lasso bias correction within the target data.

[Figure: dblasso algorithm]

The method produces debiased coefficient estimates with valid confidence intervals, making it suitable for high-dimensional inference.

# Fit dblasso without transfer learning
coef_n_d <- dblasso(target = target, transfer = FALSE, lambda = 'lambda.1se')
summary(coef_n_d)
#> Debiased Lasso Inference Summary
#> --------------------------------
#>                Estimate   Std.Error     Z.value    Pr...z..    CI.Lower
#> (Intercept)  -1.814e-02   3.162e-02  -5.738e-01   5.661e-01  -8.012e-02
#> X1            1.468e+00   3.116e-02   4.710e+01   0.000e+00   1.406e+00
#> X2           -9.969e-01   3.057e-02  -3.261e+01  2.833e-233  -1.057e+00
#> X3            1.016e+00   3.191e-02   3.185e+01  1.412e-222   9.537e-01
#> X4            4.655e-01   3.240e-02   1.437e+01   8.152e-47   4.020e-01
#> X5           -4.615e-01   3.271e-02  -1.411e+01   3.288e-45  -5.256e-01
#> X6           -1.671e-02   3.277e-02  -5.100e-01   6.100e-01  -8.093e-02
#> X7            4.255e-02   3.160e-02   1.346e+00   1.782e-01  -1.939e-02
#> X8            1.061e-03   3.194e-02   3.321e-02   9.735e-01  -6.154e-02
#> X9           -5.090e-02   3.182e-02  -1.600e+00   1.097e-01  -1.133e-01
#> X10          -3.948e-02   3.332e-02  -1.185e+00   2.360e-01  -1.048e-01
#> X11          -1.586e-02   3.291e-02  -4.820e-01   6.298e-01  -8.037e-02
#> X12          -7.776e-04   3.132e-02  -2.483e-02   9.802e-01  -6.216e-02
#> X13          -5.478e-03   3.331e-02  -1.644e-01   8.694e-01  -7.077e-02
#> X14          -5.139e-02   3.166e-02  -1.623e+00   1.045e-01  -1.134e-01
#> X15           7.210e-03   3.147e-02   2.291e-01   8.188e-01  -5.446e-02
#> X16           4.631e-02   3.225e-02   1.436e+00   1.510e-01  -1.690e-02
#> X17          -4.588e-02   3.125e-02  -1.468e+00   1.421e-01  -1.071e-01
#> X18          -1.964e-02   3.194e-02  -6.147e-01   5.387e-01  -8.224e-02
#> X19          -1.764e-03   3.091e-02  -5.706e-02   9.545e-01  -6.234e-02
#> X20           6.787e-02   3.158e-02   2.149e+00   3.162e-02   5.975e-03
#>             CI.Upper
#> (Intercept)    0.044
#> X1             1.529
#> X2            -0.937
#> X3             1.079
#> X4             0.529
#> X5            -0.397
#> X6             0.048
#> X7             0.104
#> X8             0.064
#> X9             0.011
#> X10            0.026
#> X11            0.049
#> X12            0.061
#> X13            0.060
#> X14            0.011
#> X15            0.069
#> X16            0.110
#> X17            0.015
#> X18            0.043
#> X19            0.059
#> X20            0.130

# Fit dblasso with transfer learning
coef_t_d <- dblasso(target = target, source = source, transfer = TRUE, lambda = 'lambda.1se')
summary(coef_t_d)
#> Debiased Lasso Inference Summary
#> --------------------------------
#>                Estimate   Std.Error     Z.value    Pr...z..    CI.Lower
#> (Intercept)  -1.740e-02   3.162e-02  -5.501e-01   5.822e-01  -7.938e-02
#> X1            1.477e+00   3.124e-02   4.728e+01   0.000e+00   1.416e+00
#> X2           -1.018e+00   3.130e-02  -3.254e+01  2.953e-232  -1.080e+00
#> X3            1.021e+00   3.206e-02   3.186e+01  9.559e-223   9.585e-01
#> X4            4.706e-01   3.209e-02   1.467e+01   1.052e-48   4.077e-01
#> X5           -4.702e-01   3.181e-02  -1.478e+01   1.926e-49  -5.326e-01
#> X6           -6.993e-03   3.241e-02  -2.158e-01   8.292e-01  -7.052e-02
#> X7            5.027e-02   3.172e-02   1.585e+00   1.130e-01  -1.190e-02
#> X8            4.651e-03   3.245e-02   1.433e-01   8.860e-01  -5.894e-02
#> X9           -4.835e-02   3.195e-02  -1.513e+00   1.303e-01  -1.110e-01
#> X10          -3.885e-02   3.278e-02  -1.185e+00   2.359e-01  -1.031e-01
#> X11          -1.253e-02   3.212e-02  -3.901e-01   6.964e-01  -7.549e-02
#> X12          -1.797e-03   3.151e-02  -5.703e-02   9.545e-01  -6.355e-02
#> X13          -8.051e-03   3.244e-02  -2.482e-01   8.040e-01  -7.164e-02
#> X14          -5.384e-02   3.209e-02  -1.678e+00   9.338e-02  -1.167e-01
#> X15           3.258e-03   3.252e-02   1.002e-01   9.202e-01  -6.047e-02
#> X16           3.713e-02   3.197e-02   1.161e+00   2.456e-01  -2.554e-02
#> X17          -4.685e-02   3.215e-02  -1.457e+00   1.451e-01  -1.099e-01
#> X18          -2.069e-02   3.164e-02  -6.539e-01   5.132e-01  -8.270e-02
#> X19           4.055e-04   3.120e-02   1.300e-02   9.896e-01  -6.074e-02
#> X20           6.437e-02   3.156e-02   2.040e+00   4.137e-02   2.521e-03
#>             CI.Upper
#> (Intercept)    0.045
#> X1             1.538
#> X2            -0.957
#> X3             1.084
#> X4             0.533
#> X5            -0.408
#> X6             0.057
#> X7             0.112
#> X8             0.068
#> X9             0.014
#> X10            0.025
#> X11            0.050
#> X12            0.060
#> X13            0.056
#> X14            0.009
#> X15            0.067
#> X16            0.100
#> X17            0.016
#> X18            0.041
#> X19            0.062
#> X20            0.126
