Type: | Package |
Title: | Tools for Applying Distribution Mapping Based Transfer Learning |
Version: | 0.1.2 |
Description: | Implementation of a transfer learning framework employing distribution mapping based domain transfer. Uses the renowned concept of histogram matching (see Gonzalez and Fittes (1977) <doi:10.1016/0094-114X(77)90062-3>, Gonzalez and Woods (2008) <isbn:9780131687288>) and extends it to include distribution measures like kernel density estimates (KDE; see Wand and Jones (1995) <isbn:978-0-412-55270-0>, Jones et al. (1996) <doi:10.2307/2291420>). In the typical application scenario, one can use the underlying sample distributions (histogram or KDE) to generate a map between two distinct but related domains to transfer the target data to the source domain and utilize the available source data for better predictive modeling design. Suitable for the case where a one-to-one sample matching is not possible, thus one needs to transform the underlying data distribution to utilize the more available data for modeling. |
Encoding: | UTF-8 |
Depends: | R (≥ 3.6) |
Imports: | caret (≥ 6.0-86), glmnet (≥ 4.1), kernlab (≥ 0.9-29), ks (≥ 1.11.7), randomForest (≥ 4.6-14) |
License: | GPL-3 |
URL: | https://github.com/dhruba018/DMTL |
LazyData: | true |
RoxygenNote: | 7.1.1 |
NeedsCompilation: | no |
Packaged: | 2021-02-17 19:05:38 UTC; SRDhruba |
Author: | Saugato Rahman Dhruba |
Maintainer: | Saugato Rahman Dhruba <dhruba018@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2021-02-18 10:50:02 UTC |
Distribution Mapping based Transfer Learning
Description
This function performs distribution mapping based transfer learning (DMTL) regression for given target (primary) and source (secondary) datasets. The data available in the source domain are used to design an appropriate predictive model. The target features with unknown response values are transferred to the source domain via distribution matching, and the corresponding response values in the source domain are then predicted using this model. The predicted response values are finally transferred back to the original target space by applying distribution matching again. Hence, the target datasets (features and response values) may be unmatched, whereas the source datasets must be a matched pair.
Usage
DMTL(
target_set,
source_set,
use_density = FALSE,
pred_model = "RF",
model_optimize = FALSE,
sample_size = 1000,
random_seed = NULL,
all_pred = FALSE,
get_verbose = FALSE,
allow_parallel = FALSE
)
Arguments
target_set |
List containing the target datasets. A named list with
components |
source_set |
List containing the source datasets. A named list with
components |
use_density |
Flag for using kernel density as distribution estimate
instead of histogram counts. Defaults to |
pred_model |
String indicating the underlying predictive model. The currently available options are -
|
model_optimize |
Flag for model parameter tuning. If |
sample_size |
Sample size for estimating distributions of target and
source datasets. Defaults to |
random_seed |
Seed for random number generator (for reproducible
outcomes). Defaults to |
all_pred |
Flag for returning the prediction values in the source space.
If |
get_verbose |
Flag for displaying the progress when optimizing the
predictive model i.e., |
allow_parallel |
Flag for allowing parallel processing when performing
grid search i.e., |
Value
If all_pred = FALSE, a vector containing the final prediction values.
If all_pred = TRUE, a named list with two components, target and source, i.e., predictions in the original target space and in the source space, respectively.
Note
The datasets in target_set (i.e., X and y) do not need to be matched (i.e., have the same number of rows) since the response values are used only to estimate the distribution for mapping, while the feature values are used for both mapping and final prediction. In contrast, the datasets in source_set (i.e., X and y) must have matched samples.
It is recommended to normalize the two response values (y) so that they will be in the same range. If normalization is not performed, DMTL() uses the range of target y values as the prediction range.
Examples
set.seed(8644)
## Generate two datasets with different underlying distributions...
x1 <- matrix(rnorm(3000, 0.3, 0.6), ncol = 3)
dimnames(x1) <- list(paste0("sample", 1:1000), paste0("f", 1:3))
y1 <- 0.3*x1[, 1] + 0.1*x1[, 2] - x1[, 3] + rnorm(1000, 0, 0.05)
x2 <- matrix(rnorm(3000, 0, 0.5), ncol = 3)
dimnames(x2) <- list(paste0("sample", 1:1000), paste0("f", 1:3))
y2 <- -0.2*x2[, 1] + 0.3*x2[, 2] - x2[, 3] + rnorm(1000, 0, 0.05)
## Model datasets using DMTL & compare with a baseline model...
library(DMTL)
target <- list(X = x1, y = y1)
source <- list(X = x2, y = y2)
y1_pred <- DMTL(target_set = target, source_set = source, pred_model = "RF")
y1_pred_bl <- RF_predict(x_train = x2, y_train = y2, x_test = x1)
print(performance(y1, y1_pred, measures = c("MSE", "PCC")))
print(performance(y1, y1_pred_bl, measures = c("MSE", "PCC")))
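The three-step workflow described for DMTL() (map target features to the source domain, predict there, map predictions back) can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: quantile_map() is a hypothetical helper, plain empirical-CDF quantile matching stands in for the package's distribution matching, and lm() stands in for the RF/SVM/EN models.

```r
set.seed(1)

# Hypothetical helper (not part of DMTL): send each value of `x` to the
# quantile of `ref` that matches its empirical-CDF position within `from`.
quantile_map <- function(x, from, ref) {
  p <- ecdf(from)(x)
  as.numeric(quantile(ref, probs = p, names = FALSE))
}

# Toy data: matched source pairs, unmatched target features and responses.
n <- 500
X_src <- matrix(rnorm(n * 2, 0, 0.5), ncol = 2, dimnames = list(NULL, c("f1", "f2")))
y_src <- X_src[, 1] - X_src[, 2] + rnorm(n, 0, 0.05)
X_tgt <- matrix(rnorm(n * 2, 0.3, 0.6), ncol = 2, dimnames = list(NULL, c("f1", "f2")))
y_tgt <- runif(300)  # used only to estimate the target response distribution

# 1. Transfer each target feature to the source domain.
X_t2s <- sapply(colnames(X_tgt), function(j) {
  quantile_map(X_tgt[, j], from = X_tgt[, j], ref = X_src[, j])
})

# 2. Fit a model on the source data (lm() is a stand-in) and predict.
fit <- lm(y ~ ., data = data.frame(X_src, y = y_src))
y_pred_src <- predict(fit, newdata = data.frame(X_t2s))

# 3. Transfer the predictions back to the target response space.
y_pred_tgt <- quantile_map(y_pred_src, from = y_src, ref = y_tgt)
print(range(y_pred_tgt))
```

Because the final step reads quantiles off the target responses, the predictions stay within the range of the target y values, which is consistent with the Note above.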
Predictive Modeling using Elastic Net
Description
This function trains an Elastic Net regressor using the training data provided and predicts the response for the test features. This implementation depends on the glmnet package.
Usage
EN_predict(
x_train,
y_train,
x_test,
lims,
optimize = FALSE,
alpha = 0.8,
seed = NULL,
verbose = FALSE,
parallel = FALSE
)
Arguments
x_train |
Training features for designing the EN regressor. |
y_train |
Training response for designing the EN regressor. |
x_test |
Test features for which response values are to be predicted.
If |
lims |
Vector providing the range of the response values for modeling. If missing, these values are estimated from the training response. |
optimize |
Flag for model tuning. If |
alpha |
EN mixing parameter with |
seed |
Seed for random number generator (for reproducible outcomes).
Defaults to |
verbose |
Flag for printing the tuning progress when |
parallel |
Flag for allowing parallel processing when performing grid
search i.e., |
Value
If x_test is missing, the trained EN regressor.
If x_test is provided, the predicted values using the model.
Note
The response values are filtered to be bounded by the range in lims.
Examples
set.seed(86420)
x <- matrix(rnorm(3000, 0.2, 1.2), ncol = 3); colnames(x) <- paste0("x", 1:3)
y <- 0.3*x[, 1] + 0.1*x[, 2] - x[, 3] + rnorm(1000, 0, 0.05)
## Get the model only...
model <- EN_predict(x_train = x[1:800, ], y_train = y[1:800], alpha = 0.6)
## Get predictive performance...
y_pred <- EN_predict(x_train = x[1:800, ], y_train = y[1:800], x_test = x[801:1000, ])
y_test <- y[801:1000]
print(performance(y_test, y_pred, measures = "RSQ"))
Predictive Modeling using Random Forest Regression
Description
This function trains a Random Forest regressor using the training data provided and predicts the response for the test features. This implementation depends on the randomForest package.
Usage
RF_predict(
x_train,
y_train,
x_test,
lims,
optimize = FALSE,
n_tree = 300,
m_try = 0.3333,
seed = NULL,
verbose = FALSE,
parallel = FALSE
)
Arguments
x_train |
Training features for designing the RF regressor. |
y_train |
Training response for designing the RF regressor. |
x_test |
Test features for which response values are to be predicted.
If |
lims |
Vector providing the range of the response values for modeling. If missing, these values are estimated from the training response. |
optimize |
Flag for model tuning. If |
n_tree |
Number of decision trees to be built in the forest. Defaults
to |
m_try |
Fraction of the features to be used for building each tree.
Defaults to |
seed |
Seed for random number generator (for reproducible outcomes).
Defaults to |
verbose |
Flag for printing the tuning progress when |
parallel |
Flag for allowing parallel processing when performing grid
search i.e., |
Value
If x_test is missing, the trained RF regressor.
If x_test is provided, the predicted values using the model.
Note
The response values are filtered to be bounded by the range in lims.
Examples
set.seed(86420)
x <- matrix(rnorm(3000, 0.2, 1.2), ncol = 3); colnames(x) <- paste0("x", 1:3)
y <- 0.3*x[, 1] + 0.1*x[, 2] - x[, 3] + rnorm(1000, 0, 0.05)
## Get the model only...
model <- RF_predict(x_train = x[1:800, ], y_train = y[1:800], n_tree = 300)
## Get predictive performance...
y_pred <- RF_predict(x_train = x[1:800, ], y_train = y[1:800], x_test = x[801:1000, ])
y_test <- y[801:1000]
print(performance(y_test, y_pred, measures = "RSQ"))
Predictive Modeling using Support Vector Machine
Description
This function trains a Support Vector Machine regressor using the training data provided and predicts the response for the test features. This implementation depends on the kernlab package.
Usage
SVM_predict(
x_train,
y_train,
x_test,
lims,
kernel = "rbf",
optimize = FALSE,
C = 2,
kpar = list(sigma = 0.1),
eps = 0.01,
seed = NULL,
verbose = FALSE,
parallel = FALSE
)
Arguments
x_train |
Training features for designing the SVM regressor. |
y_train |
Training response for designing the SVM regressor. |
x_test |
Test features for which response values are to be predicted.
If |
lims |
Vector providing the range of the response values for modeling. If missing, these values are estimated from the training response. |
kernel |
Kernel function for SVM implementation. The available options
are |
optimize |
Flag for model tuning. If |
C |
Cost of constraints violation. This is the constant "C" of the
regularization term in the Lagrange formulation. Defaults to |
kpar |
List of kernel parameters. This is a named list that contains the parameters to be used with the specified kernel. The valid parameters for the existing kernels are -
Valid only when |
eps |
Epsilon in the insensitive-loss function used for epsilon-SVR. Defaults to
|
seed |
Seed for random number generator (for reproducible outcomes).
Defaults to |
verbose |
Flag for printing the tuning progress when |
parallel |
Flag for allowing parallel processing when performing grid
search i.e., |
Value
If x_test is missing, the trained SVM regressor.
If x_test is provided, the predicted values using the model.
Note
The response values are filtered to be bounded by the range in lims.
Examples
set.seed(86420)
x <- matrix(rnorm(3000, 0.2, 1.2), ncol = 3); colnames(x) <- paste0("x", 1:3)
y <- 0.3*x[, 1] + 0.1*x[, 2] - x[, 3] + rnorm(1000, 0, 0.05)
## Get the model only...
model <- SVM_predict(x_train = x[1:800, ], y_train = y[1:800], kernel = "rbf")
## Get predictive performance...
y_pred <- SVM_predict(x_train = x[1:800, ], y_train = y[1:800], x_test = x[801:1000, ])
y_test <- y[801:1000]
print(performance(y_test, y_pred, measures = "RSQ"))
Restrict data in a given interval
Description
This function restricts a data vector to a given interval: values inside the interval remain unchanged, any value less than the left endpoint is replaced by that endpoint, and any value greater than the right endpoint is replaced by that endpoint.
Usage
confined(x, lims = c(0, 1))
Arguments
x |
Vector containing data. |
lims |
Limit for the values. Values falling within this limit will pass
without any change. Any value |
Value
The filtered vector.
Examples
x <- rnorm(100, 0, 1)
x_filt <- confined(x, lims = c(-0.5, 0.5))
print(range(x_filt))
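The clamping behavior described for confined() can be reproduced with base R's pmin() and pmax(). The helper name clamp() below is hypothetical and is shown only to make the rule concrete; it is not the package's code.

```r
# Hypothetical helper: clamp values to [lims[1], lims[2]].
# Values inside the interval pass through unchanged.
clamp <- function(x, lims = c(0, 1)) {
  pmin(pmax(x, lims[1]), lims[2])
}

x <- c(-2, -0.5, 0, 0.3, 0.5, 4)
x_clamped <- clamp(x, lims = c(-0.5, 0.5))
print(x_clamped)  # -0.5 -0.5  0.0  0.3  0.5  0.5
```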
Distribution Matching for Source and Reference Datasets
Description
This function matches a source distribution to a given reference distribution so that the data in the source space can effectively be transferred to the reference space, i.e., domain transfer via distribution matching.
Usage
dist_match(
src,
ref,
src_cdf,
ref_cdf,
lims,
density = FALSE,
samples = 1e+06,
seed = NULL
)
Arguments
src |
Vector containing the source data to be matched. |
ref |
Vector containing the reference data to estimate the reference distribution for matching. |
src_cdf |
Vector containing source distribution values. If missing,
these values are estimated from the source data using |
ref_cdf |
Vector containing reference distribution values. If missing,
these values are estimated from the reference data using |
lims |
Vector providing the range of the knot values for mapping. If missing, these values are estimated from the reference data. |
density |
Flag for using kernel density estimates for matching instead
of histogram counts. Defaults to |
samples |
Sample size for estimating distributions if |
seed |
Seed for random number generator (for reproducible outcomes).
Defaults to |
Value
A vector containing the matched values corresponding to src.
Examples
set.seed(7531)
x1 <- rnorm(100, 0.2, 0.6)
x2 <- runif(200)
matched <- dist_match(src = x1, ref = x2, lims = c(0, 1))
## Plot histograms...
opar <- par(mfrow = c(1, 3))
hist(x1); hist(x2); hist(matched)
par(opar) # Reset par
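The idea behind dist_match() can be illustrated with plain empirical CDFs in base R: find where each source value sits in the source distribution, then read off the reference value at the same cumulative probability. This is a simplified sketch that assumes no bootstrapping, no KDE, and no lims handling, so it will not reproduce the package's output exactly.

```r
set.seed(7531)
src <- rnorm(100, 0.2, 0.6)
ref <- runif(200)

# Position of each source value within the source empirical CDF, in (0, 1].
p <- ecdf(src)(src)

# Reference quantile at the same cumulative probability.
matched <- as.numeric(quantile(ref, probs = p, names = FALSE))

print(range(matched))  # lies within range(ref)
```

The mapping is monotone, so the ordering of the source values is preserved while their distribution takes on the shape of the reference sample.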
Estimate Cumulative Distribution
Description
This function estimates the values of the cumulative distribution function (CDF) for a vector.
Usage
estimate_cdf(
x,
bootstrap = TRUE,
samples = 1e+06,
density = FALSE,
binned = TRUE,
grids = 10000,
unit_range = FALSE,
seed = NULL,
...
)
Arguments
x |
Vector containing data. |
bootstrap |
Flag for performing bootstrapping on |
samples |
Sample size for bootstrapping. Defaults to |
density |
Flag for calculating kernel density estimates (KDE) instead
of histogram counts. Depends on the |
binned |
Flag for calculating binned KDE. Defaults to |
grids |
Size parameter for the estimation grid when |
unit_range |
Flag for unity data range (i.e., data is normalized
between 0 and 1). Defaults to |
seed |
Seed for random number generator (for reproducible outcomes).
Defaults to |
... |
Other options relevant for distribution estimation. |
Value
If density = FALSE, a function of class ecdf, inheriting from the stepfun class, and hence inheriting a knots() method.
If density = TRUE, an object of class kcde which has the fields eval.points and estimate necessary for calculating a map.
Examples
x <- runif(100)
x_hist_cdf <- estimate_cdf(x, samples = 1000, unit_range = TRUE)
x_kde_cdf <- estimate_cdf(x, density = TRUE, unit_range = TRUE)
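The two kinds of CDF estimate described for estimate_cdf() can be approximated in base R. In this sketch the histogram-style estimate is a bootstrap followed by ecdf(), and the smooth estimate integrates stats::density() over a grid; the latter is a stand-in chosen for self-containment and is not the estimator the package uses.

```r
set.seed(42)
x <- runif(100)

# Histogram-style estimate: bootstrap the sample, then take the empirical CDF.
x_boot <- sample(x, size = 1000, replace = TRUE)
cdf_hist <- ecdf(x_boot)

# Smooth estimate: accumulate a kernel density over an evenly spaced grid
# (stats::density is a stand-in here) and normalize so the CDF ends at 1.
kde <- density(x, n = 512, from = 0, to = 1)
cdf_kde <- cumsum(kde$y) / sum(kde$y)

# Compare the two estimates near the middle of the range.
print(cdf_hist(0.5))
print(cdf_kde[which.min(abs(kde$x - 0.5))])
```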
Estimate Inverse Mapping
Description
This function estimates an inverse map g for a given set of knots (input) and values (output) corresponding to a certain map f, i.e., given x, y | f: x --> y, match_func() estimates g: y --> x using linear interpolation.
Usage
match_func(knots, vals, new_vals, lims, get_func = FALSE)
Arguments
knots |
Vector containing knots for the distribution estimate. |
vals |
Vector containing distribution values corresponding to the knots. |
new_vals |
Vector containing distribution values for which the knots
are unknown. If missing, |
lims |
Vector providing the range of the knot values for mapping. If missing, these values are estimated from the given knots. |
get_func |
Flag for returning the map function if |
Value
If new_vals is missing, a function performing interpolation (linear or constant) of the given data points.
If get_func = FALSE, a vector containing the matched knots that will produce new_vals for the map f.
If get_func = TRUE, a named list with two components, mapped and func (mapped knots for new_vals and the mapping function, respectively).
Examples
set.seed(654321)
x <- rnorm(100, 1, 0.5)
F <- ecdf(x)
fval <- F(x)
map <- match_func(knots = x, vals = fval)
x2 <- rnorm(20, 0.8, 0.5)
F2 <- ecdf(x2)
fval2 <- F2(x2)
matched <- match_func(knots = x, vals = fval, new_vals = fval2)
## Plot histograms...
opar <- par(mfrow = c(1, 3))
hist(x); hist(x2); hist(matched)
par(opar) # Reset par
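The inverse map g: y --> x that match_func() estimates can be sketched with base R's approxfun() by swapping the roles of knots and values. This is an illustration of the interpolation idea only; how the package handles lims and ties is not shown here, and the rule = 2 choice below is an assumption.

```r
set.seed(654321)
x <- rnorm(100, 1, 0.5)
Fx <- ecdf(x)

# Forward map f: knots -> CDF values.
knots <- sort(x)
vals <- Fx(knots)

# Inverse map g: CDF values -> knots, by linear interpolation.
# rule = 2 (an assumption) holds the end values outside the range of `vals`.
g <- approxfun(x = vals, y = knots, method = "linear", rule = 2, ties = "ordered")

new_vals <- c(0.1, 0.5, 0.9)
mapped <- g(new_vals)
print(mapped)
print(Fx(mapped))  # close to new_vals, up to the 1/n step of the ECDF
```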
Normalize vector in [0, 1]
Description
This function normalizes a given vector between 0 and 1.
Usage
norm01(x)
Arguments
x |
Vector containing data. |
Value
The normalized vector.
Examples
x <- rnorm(100, 0.2, 0.3)
x_norm <- norm01(x)
print(range(x_norm))
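Normalizing a vector to [0, 1] is usually done by min-max scaling, which can be written directly in base R. The helper name minmax() is hypothetical, and the sketch assumes the vector is not constant (otherwise the denominator is zero).

```r
# Hypothetical helper: min-max scaling of a numeric vector to [0, 1].
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(2, 4, 6, 10)
x_norm <- minmax(x)
print(x_norm)  # 0.00 0.25 0.50 1.00
```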
Normalize matrix per column in [0, 1]
Description
This function normalizes each column of a dataframe or matrix (-alike) between 0 and 1.
Usage
norm_data(X)
Arguments
X |
Dataframe or matrix (-alike) containing data. |
Value
The normalized dataframe.
Examples
X <- matrix(rnorm(1000, 0.2, 0.3), nrow = 100)
X_norm <- norm_data(X)
print(range(X_norm))
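Column-wise normalization applies the same min-max scaling to every column independently. A base R sketch using apply() follows; note that it returns a matrix, whereas norm_data() is documented as returning a dataframe.

```r
X <- cbind(a = c(1, 2, 3), b = c(10, 30, 20))

# Scale each column to [0, 1] using its own minimum and maximum.
X_norm <- apply(X, 2, function(col) (col - min(col)) / (max(col) - min(col)))

print(X_norm)
print(apply(X_norm, 2, range))  # every column spans 0 to 1
```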
Evaluate Regression Model Performance using Various Metrics
Description
This function produces the predictive performance for a regression model using various common performance metrics such as MSE, R-squared, or Correlation coefficients.
Usage
performance(y_obs, y_pred, measures = c("NRMSE", "NMAE", "PCC"))
Arguments
y_obs |
Observed response values |
y_pred |
Predicted response values |
measures |
Performance measures. One can specify a single measure or a vector containing multiple measures in terms of common error or similarity metrics. The available options are roughly divided into 3 categories -
Defaults to |
Value
A vector containing the performance metric values.
Examples
set.seed(654321)
x <- rnorm(1000, 0.2, 0.5)
y <- x^2 + rnorm(1000, 0, 0.1)
y_fit <- predict(lm(y ~ x))
print(performance(y, y_fit, measures = c("MSE", "RSQ")))
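Three of the metrics named in the description (MSE, R-squared, and the Pearson correlation coefficient) can be computed by hand in base R. The formulas below are the common textbook definitions and are assumed here for illustration; they may differ in detail from what performance() computes.

```r
set.seed(654321)
x <- rnorm(1000, 0.2, 0.5)
y <- x^2 + rnorm(1000, 0, 0.1)
y_fit <- predict(lm(y ~ x))

# Mean squared error.
mse <- mean((y - y_fit)^2)

# Pearson correlation between observed and predicted values.
pcc <- cor(y, y_fit, method = "pearson")

# Coefficient of determination: 1 - residual SS / total SS.
rsq <- 1 - sum((y - y_fit)^2) / sum((y - mean(y))^2)

print(c(MSE = mse, PCC = pcc, RSQ = rsq))
```

For an in-sample least-squares fit with an intercept, as here, R-squared equals the squared Pearson correlation, which gives a quick sanity check on both values.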
Standardize matrix per column
Description
This function standardizes each column of a dataframe or matrix (-alike) to have mean = 0 and sd = 1.
Usage
zscore(X)
Arguments
X |
Dataframe or matrix (-alike) containing data. |
Value
The standardized dataframe.
Examples
X <- matrix(rnorm(100, 0.2, 0.3), nrow = 20)
X_std <- zscore(X)
print(apply(X_std, 2, mean))
print(apply(X_std, 2, sd))
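Column-wise standardization is also available in base R through scale(), which can serve as a cross-check. This sketch is an equivalent of the described behavior, not the package's code; scale() returns a matrix carrying centering and scaling attributes rather than a dataframe.

```r
X <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)

# Subtract each column's mean and divide by its sample standard deviation.
X_std <- scale(X, center = TRUE, scale = TRUE)

print(colMeans(X_std))      # approximately 0 for every column
print(apply(X_std, 2, sd))  # 1 for every column
```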