Type: | Package |
Title: | Stochastic Complexity-Based Conditional Independence Test for Discrete Data |
Version: | 1.2 |
Date: | 2019-06-04 |
Author: | Alexander Marx [aut, cre], Jilles Vreeken [aut] |
Maintainer: | Alexander Marx <amarx@mpi-inf.mpg.de> |
Description: | An efficient implementation of SCCI using 'Rcpp'. SCCI is short for the Stochastic Complexity-based Conditional Independence criterium (Marx and Vreeken, 2019). SCCI is an asymptotically unbiased and L2 consistent estimator of (conditional) mutual information for discrete data. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Imports: | Rcpp (≥ 0.12.13) |
LinkingTo: | Rcpp |
Suggests: | pcalg, Rgraphviz |
Encoding: | UTF-8 |
NeedsCompilation: | yes |
Packaged: | 2019-06-04 14:44:27 UTC; mbk-97-72 |
Repository: | CRAN |
Date/Publication: | 2019-06-04 15:00:06 UTC |
Stochastic Complexity-based Conditional Independence Criterium
Description
Calculates whether two random variables X and Y are independent given a set of variables Z using SCCI. A score of 0 denotes that independence holds, and values greater than 0 mean that X is not independent of Y given Z. For details on SCCI, we refer to Marx and Vreeken (AISTATS, 2019). If you use SCCI in your work, please cite Marx and Vreeken (AISTATS, 2019).
The output of SCCI(..) is the difference in the number of bits between conditioning X only on Z and conditioning X on both Z and Y. For the variant of SCCI whose output can be interpreted as a p-value, please refer to pSCCI.
Usage
SCCI(x, y, Z, score="fNML", sym=FALSE)
Arguments
x |
A discrete vector. |
y |
A discrete vector. |
Z |
A data frame consisting of zero or more columns of discrete vectors. |
score |
The score to use; either "fNML" (default) or "qNML". |
sym |
Logical; if TRUE, the symmetric version of SCCI is computed (taking both directions between x and y into account). Default: FALSE. |
References
Alexander Marx and Jilles Vreeken; Testing Conditional Independence on Discrete Data using Stochastic Complexity, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2019
Examples
set.seed(1)
x = round((runif(1000, min=0, max=5)))
y = round((runif(1000, min=0, max=5)))
Z = data.frame(round((runif(1000, min=0, max=5))), round((runif(1000, min=0, max=5))))
SCCI(x=x,y=y,Z=Z,score="fNML",sym=FALSE) ## 0
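## Additional illustrative sketch (not part of the original example): with a
## strongly dependent pair, SCCI is expected to return a value greater than 0.
x2 = round(runif(1000, min=0, max=5))
y2 = x2  ## y2 is an exact copy of x2, hence maximally dependent on x2
SCCI(x=x2, y=y2, Z=Z, score="fNML", sym=FALSE) ## expected to be > 0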
Conditional Shannon Entropy
Description
Calculates the Shannon entropy of a discrete random variable X conditioned on a discrete (possibly multivariate) random variable Y.
Usage
conditionalShannonEntropy(x, y)
Arguments
x |
A discrete vector. |
y |
A data frame of one or more discrete vectors to condition x on. |
Examples
set.seed(1)
x = round((runif(1000, min=0, max=5)))
Y = data.frame(round((runif(1000, min=0, max=5))), round((runif(1000, min=0, max=5))))
conditionalShannonEntropy(x=x,y=Y) ## 2.411972
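## For comparison (illustrative): conditioning can never increase Shannon
## entropy, so H(X | Y) <= H(X); compare with the unconditional entropy.
shannonEntropy(x=x) ## 2.522265 (see shannonEntropy), versus 2.411972 above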
Conditional Stochastic Complexity for Multinomials
Description
Calculates the Stochastic Complexity of a discrete random variable X conditioned on a discrete (possibly multivariate) random variable Y. Variants for both factorized NML (fNML, Silander et al. 2008) and quotient NML (qNML, Silander et al. 2018) are included.
Usage
conditionalStochasticComplexity(x, y, score="fNML")
Arguments
x |
A discrete vector. |
y |
A discrete vector or a data frame containing multiple discrete vectors to condition X on. |
score |
The score to use; either "fNML" (default) or "qNML". |
References
Tomi Silander, Janne Leppä-aho, Elias Jääsaari, Teemu Roos; Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures, Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2018
Tomi Silander, Teemu Roos, Petri Kontkanen and Petri Myllymäki; Factorized Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures, Proceedings of the 4th European Workshop on Probabilistic Graphical Models, 2008
Examples
set.seed(1)
x = round((runif(1000, min=0, max=5)))
Y = data.frame(round((runif(1000, min=0, max=5))), round((runif(1000, min=0, max=5))))
conditionalStochasticComplexity(x=x,y=Y,score="fNML") ## 2779.477
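## Illustrative: the qNML variant (Silander et al. 2018) can be selected via
## the score argument; its value will generally differ from the fNML score.
conditionalStochasticComplexity(x=x, y=Y, score="qNML")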
Stochastic Complexity-based Conditional Independence Criterium (p-value)
Description
This is an adapted version of SCCI for which the output can be interpreted as a p-value. For this, we adapted SCCI such that if SCCI = 0 (X is independent of Y given Z) it returns a p-value greater than 0.01, and if SCCI > 0 (X is not independent of Y given Z) it returns a p-value smaller than or equal to 0.01.
Note that we merely transform the output of SCCI and do not obtain a real p-value. In essence, we define the artificial p-value as follows. Let v be the output of SCCI divided by the number of samples n. Then p = 2^{-(6.643855 + v)}, which is equal to 0.01000001 if v = 0. Further, p ≤ 0.01 for v ≥ 0.000001. We restrict the p-values to lie between 0 and 1.
Unlike SCCI, pSCCI is currently only instantiated with fNML.
pSCCI can be used directly in the PC algorithm developed by Spirtes et al. (2000), which is implemented in the 'pcalg' R package by Kalisch et al. (2012), as shown in the example.
Usage
pSCCI(x, y, S, suffStat)
Arguments
x |
Position of the variable x in the data matrix (integer). |
y |
Position of the variable y in the data matrix (integer). |
S |
Integer vector with the positions of zero or more conditioning variables in the data matrix. |
suffStat |
A list containing the discrete data matrix as element dm (i.e. suffStat = list(dm = data_matrix)). This format allows pSCCI to be used in the PC algorithm and other algorithms from the 'pcalg' package. |
References
Markus Kalisch, Martin Mächler, Diego Colombo, Marloes H. Maathuis, Peter Bühlmann; Causal Inference Using Graphical Models with the R Package pcalg, Journal of Statistical Software, 2012
Alexander Marx and Jilles Vreeken; Testing Conditional Independence on Discrete Data using Stochastic Complexity, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2019
Peter Spirtes, Clark N. Glymour, Richard Scheines, David Heckerman, Christopher Meek, Gregory Cooper and Thomas Richardson; Causation, Prediction, and Search, MIT press, 2000
Examples
set.seed(1)
x = round((runif(1000, min=0, max=5)))
y = round((runif(1000, min=0, max=5)))
Z = data.frame(round((runif(1000, min=0, max=5))), round((runif(1000, min=0, max=5))))
## create data matrix
data_matrix = as.matrix(data.frame(x,y,S1=Z[,1], S2=Z[,2]))
suffStat = list(dm=data_matrix)
pSCCI(x=1,y=2,S=c(3,4),suffStat=suffStat) ## 0.01000001
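## Sanity check of the artificial p-value transform described above
## (illustrative): with v = SCCI / n = 0, p = 2^(-(6.643855 + v)) is about 0.01.
v = 0
2^(-(6.643855 + v)) ## 0.01000001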
### Using pSCCI within the PC algorithm
if(require(pcalg)){
## Load data
data(gmD)
V <- colnames(gmD$x)
## define sufficient statistics
suffStat <- list(dm = gmD$x, nlev = c(3,2,3,4,2), adaptDF = FALSE)
## estimate CPDAG
pc.D <- pc(suffStat,
## independence test: SCCI using fNML
indepTest = pSCCI, alpha = 0.01, labels = V, verbose = TRUE)
}
if (require(pcalg) && require(Rgraphviz)) {
## show estimated CPDAG
par(mfrow = c(1,2))
plot(pc.D, main = "Estimated CPDAG")
plot(gmD$g, main = "True DAG")
}
Multinomial Regret Term
Description
Calculates the multinomial regret term for a discrete random variable with domain size k and sample size n (see Silander et al. 2018). Note that we use the logarithm to base 2, so the result is given in bits. To compare the results to Silander et al. (2018), multiply the result by log(2), i.e. convert it from bits to nats.
Usage
regret(n,k)
Arguments
n |
Integer (sample size) |
k |
Integer (domain size) |
References
Tomi Silander, Janne Leppä-aho, Elias Jääsaari, Teemu Roos; Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures, Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2018
Examples
regret(50,10) ## 19.1
regret(50,10) * log(2) ## 13.24 (see Silander et al. 2018)
Shannon Entropy
Description
Calculates the Shannon entropy of a discrete random variable X from observed data.
Usage
shannonEntropy(x)
Arguments
x |
A discrete vector. |
Examples
set.seed(1)
x = round((runif(1000, min=0, max=5)))
shannonEntropy(x=x) ## 2.522265
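## Illustrative sketch: the same value can be reproduced with the plug-in
## estimator, assuming base-2 logarithms as used elsewhere in the package.
p = table(x) / length(x)
-sum(p * log2(p)) ## approx. 2.522265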
Stochastic Complexity for Multinomials
Description
Efficient implementation of the exact computation of stochastic complexity for multinomials (Silander et al. 2008) for data over a discrete random variable X.
Usage
stochasticComplexity(x)
Arguments
x |
A discrete vector. |
References
Tomi Silander, Teemu Roos, Petri Kontkanen and Petri Myllymäki; Factorized Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures, Proceedings of the 4th European Workshop on Probabilistic Graphical Models, 2008
Examples
set.seed(1)
x = round((runif(1000, min=0, max=5)))
stochasticComplexity(x=x) ## 2544.698
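## Illustrative decomposition (an assumption, not stated in this documentation):
## the stochastic complexity should correspond to the maximum log-likelihood in
## bits (n times the empirical Shannon entropy) plus the regret term for the
## observed domain size.
n = length(x)
k = length(unique(x))
n * shannonEntropy(x=x) + regret(n, k) ## approx. 2544.698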