Type: | Package |
Title: | Fast Computation of Latent Correlations for Mixed Data |
Version: | 2.0.1 |
Maintainer: | Mingze Huang <mingzehuang@gmail.com> |
Description: | The first stand-alone R package for computation of latent correlation that takes into account all variable types (continuous/binary/ordinal/zero-inflated), comes with an optimized memory footprint, and is computationally efficient, essentially making latent correlation estimation almost as fast as rank-based correlation estimation. The estimation is based on latent copula Gaussian models. For continuous/binary types, see Fan, J., Liu, H., Ning, Y., and Zou, H. (2017). For ternary type, see Quan X., Booth J.G. and Wells M.T. (2018) <doi:10.48550/arXiv.1809.06255>. For truncated type or zero-inflated type, see Yoon G., Carroll R.J. and Gaynanova I. (2020) <doi:10.1093/biomet/asaa007>. For approximation method of computation, see Yoon G., Müller C.L. and Gaynanova I. (2021) <doi:10.1080/10618600.2021.1882468>. The latter method uses multi-linear interpolation originally implemented in the R package https://cran.r-project.org/package=chebpol. |
Depends: | R (≥ 3.0.0) |
Imports: | stats, pcaPP, fMultivar, mnormt, Matrix, MASS, heatmaply, ggplot2, plotly, graphics, geometry, doFuture, foreach, future, doRNG, microbenchmark |
License: | GPL-3 |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.1 |
Suggests: | rmarkdown, markdown, knitr, testthat (≥ 3.0.0), lattice, cubature, plot3D, covr |
VignetteBuilder: | knitr |
Config/testthat/edition: | 3 |
NeedsCompilation: | yes |
Packaged: | 2022-09-05 20:16:14 UTC; mingz |
Author: | Mingze Huang |
Repository: | CRAN |
Date/Publication: | 2022-09-05 20:50:02 UTC |
Numerical evaluation for different estimation methods.
Description
Speed and accuracy comparison of two different estimation methods.
Usage
evaluation(
genfun,
estfun_1,
estfun_2,
grid_list,
nrep = 100,
showplot = FALSE,
cores = detectCores(),
...
)
Arguments
genfun |
A data generation function. |
estfun_1 |
A function for first estimation method. |
estfun_2 |
A function for second estimation method. |
grid_list |
A list for grid points to be evaluated (each element of list is a vector represents ticklabels on a dimension). The number of list elements are the dimension of function inputs. |
nrep |
Number of replications in simulation. |
showplot |
Logical indicator. |
cores |
The numbers of cores (threads) of your machine to conduct parallel computing. |
... |
Other inputs for data generation or estimation functions to be passed through. |
Value
evaluation
returns
meanAE_1: An array for mean absolute error of first estimation method.
meanAE_2: An array for mean absolute error of second estimation method.
medianAE_1: An array for median absolute error of first estimation method.
medianAE_2: An array for median absolute error of second estimation method.
maxAE_1: An array for maximum absolute error of first estimation method.
maxAE_2: An array for maximum absolute error of second estimation method.
meanAE_diff: An array for mean absolute error of difference between two estimations.
medianAE_diff: An array for median absolute error of difference between two estimations.
maxAE_diff: An array for maximum absolute error of difference between two estimations.
mediantime_1: An array for median time of first estimation method.
mediantime_2: An array for median time of second estimation method.
plot_meanAE_1: A plot for mean absolute error of first estimation method.
plot_meanAE_2: A plot for mean absolute error of second estimation method.
plot_medianAE_1: A plot for median absolute error of first estimation method.
plot_medianAE_2: A plot for median absolute error of second estimation method.
plot_maxAE_1: A plot for maximum absolute error of first estimation method.
plot_maxAE_2: A plot for maximum absolute error of second estimation method.
plot_meanAE_diff: A plot for mean absolute error of difference between two estimations.
plot_medianAE_diff: A plot for median absolute error of difference between two estimations.
plot_maxAE_diff: A plot for maximum absolute error of difference between two estimations.
plot_mediantime_1: A plot for median time of first estimation method.
plot_mediantime_2: A plot for median time of second estimation method.
Mixed type simulation data generator
Description
Generates data of mixed types from the latent Gaussian copula model.
Usage
gen_data(
n = 100,
types = c("ter", "con"),
rhos = 0.5,
copulas = "no",
XP = NULL,
showplot = FALSE
)
Arguments
n |
A positive integer indicating the sample size. The default value is 100. |
types |
A vector indicating the type of each variable, could be |
rhos |
A vector with lower-triangular elements of desired correlation matrix, e.g. |
copulas |
A vector indicating the copula transformation f for each of the p variables, e.g. U = f(Z). Each element can take value |
XP |
A list of length p indicating proportion of zeros (for binary and truncated), and proportions of zeros and ones (for ternary) for each of the variables. For continuous variable, NA should be supplied. If |
showplot |
Logical indicator. If TRUE, generates the plot of the data when number of variables p is no more than 3. The default value is FALSE. |
Value
gen_data
returns a list containing
X: Generated data matrix (n by p) of observed variables.
plotX: Visualization of the data matrix X. Histogram if
p=1
. 2D Scatter plot ifp=2
. 3D scatter plot ifp=3
. Returns NULL ifshowplot = FALSE
.
References
Fan J., Liu H., Ning Y. and Zou H. (2017) "High dimensional semiparametric latent graphicalmodel for mixed data" doi:10.1111/rssb.12168.
Yoon G., Carroll R.J. and Gaynanova I. (2020) "Sparse semiparametric canonical correlation analysis for data of mixed types" doi:10.1093/biomet/asaa007.
Examples
# Generate single continuous variable with exponential transformation (always greater than 0)
# and show histogram.
simdata = gen_data(n = 100, copulas = "expo", types = "con", showplot = FALSE)
X = simdata$X; plotX = simdata$plotX
# Generate a pair of variables (ternary and continuous) with default proportions
# and without copula transformation.
simdata = gen_data()
X = simdata$X
# Generate 3 variables (binary, ternary and truncated)
# corresponding copulas for each variables are "no" (no transformation),
# "cube" (cube transformation) and "cube" (cube transformation).
# binary variable has 30% of zeros, ternary variable has 20% of zeros
# and 40% of ones, truncated variable has 50% of zeros.
# Then show the 3D scatter plot (data points project on either 0 or 1 on Axis X1;
# on 0, 1 or 2 on Axas X2; on positive domain on Axis X3)
simdata = gen_data(n = 100, rhos = c(.3, .4, .5), copulas = c("no", "cube", "cube"),
types = c("bin", "ter", "tru"), XP = list(.3, c(.2, .4), .5), showplot = TRUE)
X = simdata$X; plotX = simdata$plotX
# Check the proportion of zeros for the binary variable.
sum(simdata$X[ , 1] == 0)
# Check the proportion of zeros and ones for the ternary variable.
sum(simdata$X[ , 2] == 0); sum(simdata$X[ , 2] == 1)
# Check the proportion of zeros for the truncated variable.
sum(simdata$X[ , 3] == 0)
Automatically determine types of each variable (continuous/binary/ternary/truncated) in a data matrix.
Description
Automatically determine types of each variable (continuous/binary/ternary/truncated) in a data matrix.
Usage
get_types(X, tru_prop = 0.05)
Arguments
X |
A numeric data matrix (n by p), where n is number of samples, and p is number of variables. Missing values (NA) are allowed. |
tru_prop |
A scalar between 0 and 1 indicating the minimal proportion of zeros that should be present in a variable to be treated as |
Value
get_types
returns
types: A vector of length p indicating the type of each of the p variables in
X
. Each element is one of"con"
(continuous),"bin"
(binary),"ter"
(ternary) or"tru"
(truncated).
Examples
X = gen_data(types = c("ter", "con"))$X
get_types(X)
Evaluate an interpolant in a point
Description
An interpolant is a function returned by ipol
which has prespecified values in some
points, and which fills in between with some reasonable values.
Arguments
x |
The argument of the function. A function of more then one variable takes a
vector. |
threads |
The number of threads to use for evaluation. All interpolants created by
|
... |
Other parameters. Currently used for simplex linear interpolants with the logical argument
The |
Value
A numeric. If more than one point was evaluated, a vector.
Author(s)
Simen Gaure
Parallel version of multilinear interpolation generator for function approximation
Description
Parallel version of multilinear interpolation generator for function approximation
The primary method is ipol
which
dispatches to some other method. All the generated
interpolants accept as an argument a matrix of column
vectors. The generated functions also accept an argument
threads=getOption('ipol.threads')
to utilize more than
one CPU if a matrix of column vectors is evaluated. The option
ipol.threads
is initialized from the environment variable
IPOL_THREADS
upon loading of the package. It defaults to 1
.
Usage
interpolation(evalfun, grid_list, cores = detectCores(), int = FALSE, ...)
Arguments
evalfun |
The objective function to be approximated. |
grid_list |
A list for grid points (each element of list is a vector represents ticklabels on a dimension). The number of list elements are the dimension of function inputs. |
cores |
The numbers of cores (threads) of your machine to conduct parallel computing. |
int |
Logical indicator. |
... |
Other inputs for objective functions to be passed through. |
Details
The interpolants are ordinary R-objects and can be saved with save()
and loaded
later with load()
or serialized/unserialized with other tools, just like any R-object.
However, they contain calls to functions in the package, and while the author will make efforts
to ensure that generated interpolants are compatible with future versions of ipol,
I can issue no such absolute guarantee.
Value
interpolation
returns
value: A list of of length p corresponding to each variable. Returns NA for continuous variable; proportion of zeros for binary/truncated variables; the cumulative proportions of zeros and ones (e.g. first value is proportion of zeros, second value is proportion of zeros and ones) for ternary variable.
interpolant: An interpolant function generated by
chebpol::chebpol
for interplation.
See Also
Examples
## Not run: grid_list = list(seq(-0.5, 0.5, by = 0.5), seq(-0.5, 0.5, by = 0.5))
objfun = function(x, y) {x^2 + sqrt(y)}
evalfun = function(X) {objfun(X[1], X[2])}
value = interpolation(evalfun = evalfun, grid_list = grid_list)$value
interpolant = interpolation(evalfun = evalfun, grid_list = grid_list)$interpolant
## End(Not run)
Create an interpolating function from given values. Several interpolation methods are supported.
Description
Create an interpolating function from given values. Several interpolation methods are supported.
Usage
ipol(val, grid = NULL, ...)
Arguments
val |
array or function. Function values on a grid. |
grid |
list. Each element is a vector of ordered grid-points for a dimension. |
... |
Further arguments to the function, if |
Value
A function(x, threads=getOption('chebpol.threads'))
defined on a hypercube, an interpolant
for the given function. The argument x
can be a matrix of column
vectors which are evaluated in parallel in a number of threads. The
function yields values for arguments outside the hypercube as well, though
it will typically be a poor approximation. threads
is an integer
specifying the number of parallel threads which should be used when
evaluating a matrix of column vectors.
Author(s)
Simen Gaure
Examples
## evenly spaced grid-points
su <- seq(0,1,length.out=10)
## irregularly spaced grid-points
s <- su^3
## create approximation on the irregularly spaced grid
ml1 <- ipol(exp(s), grid=list(s))
## test it, since exp is convex, the linear approximation lies above
## the exp between the grid points
ml1(su) - exp(su)
## multi dimensional approximation
f <- function(x) 10/(1+25*mean(x^2))
# a 3-dimensional 10x10x10 grid, first and third coordinate are non-uniform
grid <- list(s, su, sort(1-s))
# make multilinear spline.
ml2 <- ipol(array(apply(expand.grid(grid), 1, f), c(10, 10, 10)), grid=grid)
# make 7 points in R3 to test them on
m <- matrix(runif(3*7),3)
rbind(true=apply(m,2,f), ml=ml2(m))
Estimate latent correlation for mixed types.
Description
Estimation of latent correlation matrix from observed data of (possibly) mixed types (continuous/binary/truncated/ternary) based on the latent Gaussian copula model. Missing values (NA) are allowed. The estimation is based on pairwise complete observations.
Usage
latentcor(
X,
types = NULL,
method = c("approx", "original"),
use.nearPD = TRUE,
nu = 0.001,
tol = 1e-08,
ratio = 0.9,
showplot = FALSE
)
Arguments
X |
A numeric matrix or numeric data frame (n by p), where n is number of samples, and p is number of variables. Missing values (NA) are allowed, in which case the estimation is based on pairwise complete observations. |
types |
A vector of length p indicating the type of each of the p variables in |
method |
The calculation method for latent correlations. Either |
use.nearPD |
Logical indicator. |
nu |
Shrinkage parameter for the correlation matrix, must be between 0 and 1. Guarantees that the minimal eigenvalue of returned correlation matrix is greater or equal to |
tol |
When |
ratio |
When |
showplot |
Logical indicator. |
Details
The function estimates latent correlation by calculating inverse bridge function (Fan et al., 2017) evaluated at the value of sample Kendall's tau (\hat \tau
). The bridge function F connects Kendall's tau to latent correlation r so that F(r) = E(\hat \tau)
. The form of function F depends on the variable types (continuous/binary/truncated/ternary), but is exact. The exact form of inverse is not available, so has to be evaluated numerically for each pair of variables leading to Rpointwise
.
When method = "original"
, the inversion is done numerically by solving
minimize_r (F(r) - \hat \tau)^2
using optimize
. The parameter tol
is used to control the accuracy of the solution.
When method = "approx"
, the inversion is done approximately by interpolating previously calculated and stored values of F^{-1}(\hat \tau)
. This is significantly faster than the original method (Yoon et al., 2021) for binary/ternary/truncated cases, however the approximation errors may be non-negligible on some regions of the space. The parameter ratio
controls the region where the interpolation is performed with default recommended value of 0.9 giving a good balance of accuracy and computational speed . Increasing the value of ratio may improve speed (but possibly sacrifice the accuracy), whereas decreasing the value of ratio my improve accuracy (but possibly sacrifice the speed). See Yoon et al. 2021 and vignette for more details.
In case the pointwise estimator Rpointwise
is has negative eigenvalues, it is projected onto the space of positive semi-definite matrices using nearPD
. The parameter nu
further allows to perform additional shrinkage towards identity matrix (desirable in cases where the number of variables p is very large) as
R = (1 - \nu) \tilde R + \nu I,
where \tilde R
is Rpointwise
after projection by nearPD
.
Value
latentcor
returns
zratios: A list of of length p corresponding to each variable. Returns NA for continuous variable; proportion of zeros for binary/truncated variables; the cumulative proportions of zeros and ones (e.g. first value is proportion of zeros, second value is proportion of zeros and ones) for ternary variable.
K: (p x p) Kendall Tau (Tau-a) Matrix for
X
R: (p x p) Estimated latent correlation matrix for
X
Rpointwise: (p x p) Point-wise estimates of latent correlations for
X
. This matrix is not guaranteed to be semi-positive definite. This is the original estimated latent correlation matrix without adjustment for positive-definiteness.plotR: Heatmap plot of latent correlation matrix
R
, NULL ifshowplot = FALSE
References
Fan J., Liu H., Ning Y. and Zou H. (2017) "High dimensional semiparametric latent graphical model for mixed data" doi:10.1111/rssb.12168.
Yoon G., Carroll R.J. and Gaynanova I. (2020) "Sparse semiparametric canonical correlation analysis for data of mixed types" doi:10.1093/biomet/asaa007.
Yoon G., Müller C.L., Gaynanova I. (2021) "Fast computation of latent correlations" doi:10.1080/10618600.2021.1882468.
Examples
# Example 1 - truncated data type, same type for all variables
# Generate data
X = gen_data(n = 300, types = rep("tru", 5))$X
# Estimate latent correlation matrix with original method and check the timing
start_time = proc.time()
R_org = latentcor(X = X, types = "tru", method = "original")$R
proc.time() - start_time
# Estimate latent correlation matrix with approximation method and check the timing
start_time = proc.time()
R_approx = latentcor(X = X, types = "tru", method = "approx")$R
proc.time() - start_time
# Heatmap for latent correlation matrix.
Heatmap_R_approx = latentcor(X = X, types = "tru", method = "approx",
showplot = TRUE)$plotR
# Example 2 - ternary/continuous case
X = gen_data()$X
# Estimate latent correlation matrix with original method
R_nc_org = latentcor(X = X, types = c("ter", "con"), method = "original")$R
# Estimate latent correlation matrix with aprroximation method
R_nc_approx = latentcor(X = X, types = c("ter", "con"), method = "approx")$R
Multilinear interpolation on a grid
Description
Multilinear interpolation on an arbitrary Cartesian product.
Usage
mlappx.real(val, grid, ...)
Arguments
val |
Array or function. Function values on a grid, or the function
itself. If it is the values, the |
grid |
A list. Each element is a vector of ordered grid-points for a dimension. These need not be Chebyshev-knots, nor evenly spaced. |
... |
Further arguments to the function, if |
Details
A call fun <- mlappx(val,grid)
creates a multilinear interpolant on
the grid. The value on the grid points will be exact, the value between the
grid points is a convex combination of the values in the corners of the
hypercube surrounding it.
If val
is a function it will be evaluated on the grid.
Value
A function(x)
defined on the hypercube, approximating the
given function. The function yields values for arguments outside the
hypercube as well, as a linear extension.
Port function to call multilinear interpolants for continuous developers.
Description
Port function to call multilinear interpolants for continuous developers.
Usage
r_ml_wrapper(K, zratio1, zratio2, comb)
Arguments
K |
Kendall's tau, can be provided as a number or a vector for interpolation in batch. |
zratio1 |
zratio for first variable. NA for continuous variables, zeros proportions for binary, truncated variables. For ternary variables, a vector of proportion of zeros as first element, proportions of zeros and ones as second element. It can be provided as a vector (matrix) for interpolation in batch. See vignettes for detail. |
zratio2 |
zratio for second variable. NA for continuous variables, zeros proportions for binary, truncated variables. For ternary variables, a vector of proportion of zeros as first element, proportions of zeros and ones as second element. It can be provided as a vector (matrix) for interpolation in batch. See vignettes for detail. |
comb |
Numeric code for types: "10" for binary/continuous; "11" for binary/binary; "20" for truncated/continuous; "21" for truncated/binary; "22" for truncated/truncated; "30" for ternary/continuous; "31" for ternary/binary; "32" for ternary/truncated; "33" for ternary/ternary. |
Value
A number or vector of interpolation results.