Type: Package
Title: Measure Dependence Between Categorical and Continuous Variables
Version: 0.1.0
Date: 2023-11-19
Description: Semi-distance and mean-variance (MV) index are proposed to measure the dependence between a categorical random variable and a continuous variable. Test of independence and feature screening for classification problems can be implemented via the two dependence measures. For the details of the methods, see Zhong et al. (2023) <doi:10.1080/01621459.2023.2284988>; Cui and Zhong (2019) <doi:10.1016/j.csda.2019.05.004>; Cui, Li and Zhong (2015) <doi:10.1080/01621459.2014.920256>.
License: MIT + file LICENSE
Encoding: UTF-8
Imports: energy, FNN, furrr, purrr, Rcpp, stats
RoxygenNote: 7.2.3
LinkingTo: Rcpp, RcppArmadillo
Suggests: testthat (>= 3.0.0)
Config/testthat/edition: 3
URL: https://github.com/wzhong41/semidist
BugReports: https://github.com/wzhong41/semidist/issues
NeedsCompilation: yes
Packaged: 2023-11-20 19:43:13 UTC; Chain
Author: Wei Zhong [aut], Zhuoxi Li [aut, cre, cph], Wenwen Guo [aut], Hengjian Cui [aut], Runze Li [aut]
Maintainer: Zhuoxi Li <chainchei@gmail.com>
Repository: CRAN
Date/Publication: 2023-11-21 06:50:02 UTC

semidist: Measure Dependence Between Categorical and Continuous Variables

Description

Semi-distance and mean-variance (MV) index are proposed to measure the dependence between a categorical random variable and a continuous variable. Test of independence and feature screening for classification problems can be implemented via the two dependence measures. For the details of the methods, see Zhong et al. (2023) doi:10.1080/01621459.2023.2284988; Cui and Zhong (2019) doi:10.1016/j.csda.2019.05.004; Cui, Li and Zhong (2015) doi:10.1080/01621459.2014.920256.

Author(s)

Maintainer: Zhuoxi Li chainchei@gmail.com [copyright holder]

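Authors:

  • Wei Zhong

  • Wenwen Guo

  • Hengjian Cui

  • Runze Li

See Also

Useful links:

  • https://github.com/wzhong41/semidist

  • Report bugs at https://github.com/wzhong41/semidist/issues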


Mutual information independence test (categorical-continuous case)

Description

Implement the mutual information independence test (MINT) (Berrett and Samworth, 2019), with a modification in how the mutual information (MI) between a categorical random variable and a continuous variable is estimated. The modification is based on the idea of Ross (2014).

MINTsemiperm() implements the permutation independence test via mutual information; the parameter k must be specified in advance.

MINTsemiauto() automatically selects an appropriate k based on a data-driven procedure, and conducts MINTsemiperm() with the k chosen.

Usage

MINTsemiperm(X, y, k, B = 1000)

MINTsemiauto(X, y, kmax, B1 = 1000, B2 = 1000)

Arguments

X

Data of multivariate continuous variables, which should be an n-by-p matrix, or a vector of length n (for a univariate variable).

y

Data of categorical variables, which should be a factor of length n.

k

Number of nearest neighbors. See References for details.

B, B1, B2

Number of permutations to use. Defaults to 1000.

kmax

Maximum k in the automatic search for optimal k.

Value

A list with class "indtest" containing the following components

For MINTsemiauto(), the list also contains

References

  1. Berrett, Thomas B., and Richard J. Samworth. "Nonparametric independence testing via mutual information." Biometrika 106, no. 3 (2019): 547-566.

  2. Ross, Brian C. "Mutual information between discrete and continuous data sets." PloS one 9, no. 2 (2014): e87357.

Examples

X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])

MINTsemiperm(X, y, 5)
MINTsemiauto(X, y, kmax = 32)
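
For intuition, a nearest-neighbor MI estimate in the spirit of Ross (2014) can be sketched in base R. This is an illustrative sketch only and not the package's implementation: the helper name ross_mi_sketch is hypothetical, the neighbor-counting convention is one simple choice among several, and every class is assumed to contain more than k observations.

```r
# Hypothetical helper (not part of semidist): k-nearest-neighbor MI estimate
# between continuous data X and a factor y, following the recipe of Ross (2014).
ross_mi_sketch <- function(X, y, k = 3) {
  X <- as.matrix(X)
  y <- droplevels(as.factor(y))
  n <- nrow(X)
  code <- as.integer(y)
  n_same <- tabulate(code, nbins = nlevels(y))[code]  # class size per observation
  stopifnot(all(n_same > k))
  D <- as.matrix(dist(X))  # pairwise Euclidean distances
  m <- numeric(n)
  for (i in seq_len(n)) {
    same <- setdiff(which(code == code[i]), i)
    d_k <- sort(D[i, same])[k]    # distance to the k-th nearest same-class point
    m[i] <- sum(D[i, -i] <= d_k)  # points of any class within that distance
  }
  digamma(n) - mean(digamma(n_same)) + digamma(k) - mean(digamma(m))
}

set.seed(1)
X <- c(rnorm(50), rnorm(50, mean = 10))  # two well-separated groups
y <- factor(rep(1:2, each = 50))
mi_hat <- ross_mi_sketch(X, y, k = 3)
print(mi_hat)
```

A permutation test would then compare this estimate with the estimates obtained after shuffling y.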


Mean Variance (MV) statistics

Description

Compute the sample statistic of the mean variance (MV) index, which measures the dependence between a univariate continuous variable and a categorical variable. See Cui, Li and Zhong (2015) and Cui and Zhong (2019) for details.

Usage

mv(x, y, return_mat = FALSE)

Arguments

x

Data of univariate continuous variables, which should be a vector of length n.

y

Data of categorical variables, which should be a factor of length n.

return_mat

A boolean. If FALSE (the default), only the calculated statistic is returned. If TRUE, also return the matrix of the indicator for x <= x_i, which is useful for the permutation test.

Value

The value of the corresponding sample statistic.

If the argument return_mat of mv() is set as TRUE, a list with elements

will be returned.

See Also

Examples

x <- mtcars[, "mpg"]
y <- factor(mtcars[, "am"])
print(mv(x, y))

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; prob <- rep(1/R, R)
x <- rnorm(n)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
print(mv(x, y))

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
print(mv(x, y))
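
As a rough illustration of what such a statistic looks like, the sketch below computes a plug-in MV index in base R. It assumes the sample form (1/n) * sum over r and i of p_r * (F_r(x_i) - F(x_i))^2, with empirical CDFs F_r (within class r) and F (pooled), as I understand it from Cui, Li and Zhong (2015). The helper name mv_sketch is hypothetical, and the result is not guaranteed to match mv() exactly.

```r
# Hypothetical helper (not part of semidist): plug-in MV index from
# empirical CDFs, built on the indicator matrix I(x_j <= x_i).
mv_sketch <- function(x, y) {
  y <- droplevels(as.factor(y))
  n <- length(x)
  ind <- outer(x, x, "<=")           # ind[j, i] = I(x_j <= x_i)
  F_all <- colMeans(ind)             # pooled empirical CDF at each x_i
  p_hat <- as.vector(table(y)) / n   # class proportions
  total <- 0
  for (r in seq_len(nlevels(y))) {
    in_r <- y == levels(y)[r]
    F_r <- colMeans(ind[in_r, , drop = FALSE])  # class-r empirical CDF at each x_i
    total <- total + p_hat[r] * sum((F_r - F_all)^2)
  }
  total / n
}

x <- mtcars[, "mpg"]
y <- factor(mtcars[, "am"])
print(mv_sketch(x, y))
```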


Feature screening via MV Index

Description

Implement the feature screening for the classification problem via MV index.

Usage

mv_sis(X, y, d = NULL, parallel = FALSE)

Arguments

X

Data of multivariate covariates, which should be an n-by-p matrix.

y

Data of categorical response, which should be a factor of length n.

d

An integer specifying how many features should be kept after screening. Defaults to NULL. If NULL, then it will be set as [n / log(n)], where [x] denotes the integer part of x.

parallel

A boolean indicating whether to run the computation in parallel via furrr::future_map. Defaults to FALSE.

Value

A list of the objects about the implemented feature screening:

Examples

X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
y <- factor(mtcars[, "am"])

mv_sis(X, y, d = 4)
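
The default for d and the idea of keeping the top-ranked features can be sketched as follows. The utility values in omega are made up for illustration (they are not MV indices computed from mtcars), and the ranking step is an assumption about how screening by a marginal utility usually works rather than a description of the internals of mv_sis().

```r
# Default number of features to keep: the integer part of n / log(n).
n <- nrow(mtcars)
d_default <- floor(n / log(n))
print(d_default)

# Hypothetical marginal utilities for six features (made-up numbers).
omega <- c(mpg = 0.30, disp = 0.25, hp = 0.10, drat = 0.35, wt = 0.40, qsec = 0.05)

# Keep the d features with the largest utility (capped at the number available).
d <- 4
selected <- names(sort(omega, decreasing = TRUE))[seq_len(min(d, length(omega)))]
print(selected)
```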


MV independence test

Description

Implement the MV independence test via permutation test, or via the asymptotic approximation.

Usage

mv_test(x, y, test_type = "perm", num_perm = 10000)

Arguments

x

Data of univariate continuous variables, which should be a vector of length n.

y

Data of categorical variables, which should be a factor of length n.

test_type

Type of the test:

  • "perm" (the default): Implement the test via permutation test;

  • "asym": Implement the test via the asymptotic approximation.

See the Reference for details.

num_perm

The number of replications in the permutation test. Defaults to 10000.

Value

A list with class "indtest" containing the following components

Examples

x <- mtcars[, "mpg"]
y <- factor(mtcars[, "am"])
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; prob <- rep(1/R, R)
x <- rnorm(n)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)


Print Method for Independence Tests Between Categorical and Continuous Variables

Description

Print an object of class "indtest" using a simple print method.

Usage

## S3 method for class 'indtest'
print(x, digits = getOption("digits"), ...)

Arguments

x

"indtest" class object.

digits

minimal number of significant digits.

...

further arguments passed to or from other methods.

Value

None

Examples

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)


Feature screening via semi-distance correlation

Description

Implement the (grouped) feature screening for the classification problem via semi-distance correlation.

Usage

sd_sis(X, y, group_info = NULL, d = NULL, parallel = FALSE)

Arguments

X

Data of multivariate covariates, which should be an n-by-p matrix.

y

Data of categorical response, which should be a factor of length n.

group_info

A list specifying the group information, with elements being sets of indices of covariates in the same group. For example, list(c(1, 2, 3), c(4, 5)) specifies that covariates 1, 2, 3 are in one group and covariates 4, 5 are in another group.

Defaults to NULL. If NULL, then it will be set as list(1, 2, ..., p), that is, each single covariate is treated as a group.

If X has colnames, then the colnames can be used to specify the group_info. For example, list(c("a", "b"), c("c", "d")).

The names of the list can help recognize the groups. For example, list(grp_ab = c("a", "b"), grp_cd = c("c", "d")). If names of the list are not specified, c("Grp 1", "Grp 2", ..., "Grp J") will be applied.

d

An integer specifying the minimum number of (single) features to keep after screening. For example, if group_info = list(c(1, 2), c(3, 4)) and d = 3, then features 1, 2, 3, 4 are all selected, because keeping only one group would leave fewer than 3 features.

Defaults to NULL. If NULL, then it will be set as [n / log(n)], where [x] denotes the integer part of x.

parallel

A boolean indicating whether to run the computation in parallel via furrr::future_map. Defaults to FALSE.

Value

A list of the objects about the implemented feature screening:

See Also

sdcor() for calculating the sample semi-distance correlation.

Examples

X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
y <- factor(mtcars[, "am"])

sd_sis(X, y, d = 4)

# Suppose we have prior information for the group structure as
# ("mpg", "drat"), ("disp", "hp") and ("wt", "qsec")
group_info <- list(
  mpg_drat = c("mpg", "drat"),
  disp_hp = c("disp", "hp"),
  wt_qsec = c("wt", "qsec")
)
sd_sis(X, y, group_info, d = 4)
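
The "at least d features" rule for grouped screening can be illustrated with a small base R sketch. It assumes that groups are ranked by some group-level utility and then added whole, in rank order, until at least d single features are kept. The helper name select_groups_sketch and the utility values are hypothetical, and sd_sis() may differ in detail.

```r
# Hypothetical helper (not part of semidist): keep whole groups, in order of
# decreasing utility, until at least d single features are selected.
select_groups_sketch <- function(group_info, utility, d) {
  ord <- order(utility, decreasing = TRUE)         # rank groups by utility
  sizes <- lengths(group_info)[ord]
  n_keep <- which(cumsum(sizes) >= d)[1]           # first point where >= d features
  if (is.na(n_keep)) n_keep <- length(ord)         # fewer than d features in total
  sort(unlist(group_info[ord[seq_len(n_keep)]], use.names = FALSE))
}

# With two groups of size 2 and d = 3, one group is not enough,
# so both groups (features 1 to 4) are kept.
kept <- select_groups_sketch(list(c(1, 2), c(3, 4)), utility = c(0.8, 0.5), d = 3)
print(kept)
```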


Semi-distance independence test

Description

Implement the semi-distance independence test via permutation test, or via the asymptotic approximation when the dimensionality of continuous variables p is high.

Usage

sd_test(X, y, test_type = "perm", num_perm = 10000)

Arguments

X

Data of multivariate continuous variables, which should be an n-by-p matrix, or a vector of length n (for a univariate variable).

y

Data of categorical variables, which should be a factor of length n.

test_type

Type of the test:

  • "perm" (the default): Implement the test via permutation test;

  • "asym": Implement the test via the asymptotic approximation when the dimension of continuous variables p is high.

See the Reference for details.

num_perm

The number of replications in permutation test. Defaults to 10000. See Details and Reference.

Details

The semi-distance independence test statistic is

T_n = n \cdot \widetilde{\text{SDcov}}_n(X, y),

where \widetilde{\text{SDcov}}_n(X, y) can be computed by sdcov(X, y, type = "U").

For the permutation test (test_type = "perm"), a total of K permutation replications are conducted, where K is given by the argument num_perm. The p-value of the permutation test is computed by

\text{p-value} = (\sum_{k=1}^K I(T^{\ast (k)}_{n} \ge T_{n}) + 1) / (K + 1),

where T_{n} is the semi-distance test statistic and T^{\ast (k)}_{n} is the test statistic computed on the k-th permutation sample.
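
The p-value formula above can be sketched generically in base R. The helper perm_pvalue_sketch and the stand-in statistic mean_gap are hypothetical names used only for illustration. In the actual test the statistic is T_n, whereas the stand-in below is simply the absolute difference of two group means.

```r
# Hypothetical helper: permutation p-value (sum(T*_k >= T) + 1) / (K + 1)
# for any statistic stat_fun(X, y), permuting the labels y.
perm_pvalue_sketch <- function(stat_fun, X, y, K = 199) {
  t_obs <- stat_fun(X, y)
  t_perm <- replicate(K, stat_fun(X, sample(y)))  # statistic on K permuted samples
  (sum(t_perm >= t_obs) + 1) / (K + 1)
}

# Stand-in statistic for a two-level factor: absolute gap between group means.
mean_gap <- function(X, y) abs(diff(tapply(X, y, mean)))

set.seed(1)
x <- c(1:10, 101:110)               # strongly dependent on the grouping
y <- factor(rep(1:2, each = 10))
p_val <- perm_pvalue_sketch(mean_gap, x, y, K = 199)
print(p_val)
```

Adding 1 to both numerator and denominator keeps the p-value strictly positive, so it can never be smaller than 1 / (K + 1).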

When the dimension of the continuous variables is high, the asymptotic approximation approach can be applied (test_type = "asym"), which is computationally faster since no permutation is needed.

Value

A list with class "indtest" containing the following components

See Also

sdcov() for computing the statistic of semi-distance covariance.

Examples

X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
test <- sd_test(X, y)
print(test)

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; p <- 3; prob <- rep(1/R, R)
X <- matrix(rnorm(n*p), n, p)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
test <- sd_test(X, y)
print(test)

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)

# Man-made high-dimensionally independent data ------------------------------
n <- 30; R <- 3; p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)

test <- sd_test(X, y, test_type = "asym")
print(test)

# Man-made high-dimensionally dependent data --------------------------------
n <- 30; R <- 3; p <- 100
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)

test <- sd_test(X, y, test_type = "asym")
print(test)


Semi-distance covariance and correlation statistics

Description

Compute the statistics (or sample estimates) of semi-distance covariance and correlation. The semi-distance correlation is a standardized version of semi-distance covariance, and it can measure the dependence between a multivariate continuous variable and a categorical variable. See Details for the definition of semi-distance covariance and semi-distance correlation.

Usage

sdcov(X, y, type = "V", return_mat = FALSE)

sdcor(X, y)

Arguments

X

Data of multivariate continuous variables, which should be an n-by-p matrix, or a vector of length n (for a univariate variable).

y

Data of categorical variables, which should be a factor of length n.

type

Type of statistic: "V" (the default) or "U". See Details.

return_mat

A boolean. If FALSE (the default), only the calculated statistic is returned. If TRUE, also return the matrix of the distances of X and the divergences of y, which is useful for the permutation test.

Details

For \bm{X} \in \mathbb{R}^{p} and Y \in \{1, 2, \cdots, R\}, the (population-level) semi-distance covariance is defined as

\mathrm{SDcov}(\bm{X}, Y) = \mathrm{E}\left[\|\bm{X}-\widetilde{\bm{X}}\|\left(1-\sum_{r=1}^R I(Y=r,\widetilde{Y}=r)/p_r\right)\right],

where p_r = P(Y = r) and (\widetilde{\bm{X}}, \widetilde{Y}) is an iid copy of (\bm{X}, Y). The (population-level) semi-distance correlation is defined as

\mathrm{SDcor}(\bm{X}, Y) = \dfrac{\mathrm{SDcov}(\bm{X}, Y)}{\mathrm{dvar}(\bm{X})\sqrt{R-1}},

where \mathrm{dvar}(\bm{X}) is the distance variance (Szekely, Rizzo, and Bakirov 2007) of \bm{X}.

With n observations \{(\bm{X}_i, Y_i)\}_{i=1}^{n}, sdcov() and sdcor() can compute the sample estimates for the semi-distance covariance and correlation.

If type = "V", the semi-distance covariance statistic is computed as a V-statistic, which takes a form very similar to the energy-based statistic with double centering, and is always non-negative. Specifically,

\text{SDcov}_n(\bm{X}, y) = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} A_{kl} B_{kl},

where

A_{kl} = a_{kl} - \bar{a}_{k.} - \bar{a}_{.l} + \bar{a}_{..}

is the double centering (Szekely, Rizzo, and Bakirov 2007) of a_{kl} = \| \bm{X}_k - \bm{X}_l \|, and

B_{kl} = 1 - \sum_{r=1}^{R} I(Y_k = r) I(Y_l = r) / \hat{p}_r

with \hat{p}_r = n_r / n = n^{-1}\sum_{i=1}^{n} I(Y_i = r). The semi-distance correlation statistic is

\text{SDcor}_n(\bm{X}, y) = \dfrac{\text{SDcov}_n(\bm{X}, y)}{\text{dvar}_n(\bm{X})\sqrt{R - 1}},

where \text{dvar}_n(\bm{X}) is the V-statistic of distance variance of \bm{X}.
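
The V-statistic formulas above translate almost line by line into base R. The sketch below is an illustration under stated assumptions and not the package's code: the helper name sdcov_v_sketch is hypothetical, and \text{dvar}_n is taken to be the square root of the mean of A_{kl}^2 (the sample distance variance of Szekely, Rizzo, and Bakirov), which is the choice that bounds the correlation by 1.

```r
# Hypothetical helper (not part of semidist): V-statistic versions of the
# semi-distance covariance and correlation.
sdcov_v_sketch <- function(X, y) {
  X <- as.matrix(X)
  y <- droplevels(as.factor(y))
  n <- nrow(X)
  code <- as.integer(y)
  a <- as.matrix(dist(X))                                  # a_kl = ||X_k - X_l||
  A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)  # double centering
  p_hat <- tabulate(code, nbins = nlevels(y)) / n          # p_hat_r = n_r / n
  B <- 1 - outer(code, code, "==") / p_hat[code]           # B_kl
  sdcov <- mean(A * B)
  dvar <- sqrt(mean(A^2))                                  # assumed form of dvar_n
  c(sdcov = sdcov, sdcor = sdcov / (dvar * sqrt(nlevels(y) - 1)))
}

# Functionally dependent data: each class sits on its own coordinate axis.
n <- 30; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
res <- sdcov_v_sketch(X, y)
print(res)
```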

If type = "U", then the semi-distance covariance statistic is computed as an “estimated U-statistic”, which is utilized in the independence test statistic and is not necessarily non-negative. Specifically,

\widetilde{\text{SDcov}}_n(\bm{X}, y) = \frac{1}{n(n-1)} \sum_{i \ne j} \| \bm{X}_i - \bm{X}_j \| \left(1 - \sum_{r=1}^{R} I(Y_i = r) I(Y_j = r) / \tilde{p}_r\right),

where \tilde{p}_r = (n_r-1) / (n-1) = (n-1)^{-1}(\sum_{i=1}^{n} I(Y_i = r) - 1). Note that the test statistic of the semi-distance independence test is

T_n = n \cdot \widetilde{\text{SDcov}}_n(\bm{X}, y).
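
The "estimated U-statistic" and the test statistic T_n can be sketched in the same way. This is a hypothetical base R illustration (the name sdcov_u_sketch is not part of the package) that assumes every class has at least two observations, so that \tilde{p}_r is positive.

```r
# Hypothetical helper (not part of semidist): "estimated U-statistic" version
# of the semi-distance covariance.
sdcov_u_sketch <- function(X, y) {
  X <- as.matrix(X)
  y <- droplevels(as.factor(y))
  n <- nrow(X)
  code <- as.integer(y)
  n_r <- tabulate(code, nbins = nlevels(y))
  stopifnot(all(n_r >= 2))                     # needed for p_tilde_r > 0
  p_tilde <- (n_r - 1) / (n - 1)
  a <- as.matrix(dist(X))                      # ||X_i - X_j||, zero on the diagonal
  w <- 1 - outer(code, code, "==") / p_tilde[code]
  sum(a * w) / (n * (n - 1))                   # i == j terms vanish since a_ii = 0
}

n <- 30; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
sd_u <- sdcov_u_sketch(X, y)
T_n <- n * sd_u                                # test statistic of the independence test
print(c(sdcov_u = sd_u, T_n = T_n))
```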

Value

The value of the corresponding sample statistic.

If the argument return_mat of sdcov() is set as TRUE, a list with elements

will be returned.

See Also

Examples

X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
print(sdcov(X, y))
print(sdcor(X, y))

# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; p <- 3; prob <- rep(1/R, R)
X <- matrix(rnorm(n*p), n, p)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
print(sdcov(X, y))
print(sdcor(X, y))

# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
print(sdcov(X, y))
print(sdcor(X, y))


Switch the representation of a categorical object

Description

Categorical data with n observations and R levels can typically be represented in two forms in R: a factor of length n, or an n-by-R indicator matrix with elements 0 or 1. This function switches a categorical object from one form to the other.

Usage

switch_cat_repr(obj)

Arguments

obj

an object representing categorical data, either a factor or an indicator matrix with each row representing an observation.

Value

The categorical object in the other form.
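
The two representations and the conversion between them can be sketched in base R. The helper names below are hypothetical and only illustrate the idea. switch_cat_repr() itself may handle names, levels, and input checks differently.

```r
# Hypothetical helpers (not part of semidist) for the two conversions.
factor_to_indicator <- function(f) {
  f <- as.factor(f)
  M <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1L  # n-by-R matrix of 0/1
  colnames(M) <- levels(f)
  M
}

indicator_to_factor <- function(M) {
  lev <- if (is.null(colnames(M))) seq_len(ncol(M)) else colnames(M)
  factor(lev[max.col(M, ties.method = "first")], levels = lev)  # column holding the 1
}

f <- factor(c("a", "b", "a", "c"))
M <- factor_to_indicator(f)
print(M)
f_back <- indicator_to_factor(M)
print(f_back)
```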


Estimate the trace of the covariance matrix and its square

Description

For a design matrix \mathbf{X}, estimate the trace of its covariance matrix \Sigma = \mathrm{cov}(\mathbf{X}) and the trace of the squared covariance matrix \Sigma^2.

Usage

tr_estimate(X)

Arguments

X

The design matrix.

Value

A list with elements:
