Help for package MD2sample

Title:

Various Methods for the Two Sample Problem in D>1 Dimensions

Version:

1.0.0

Description:

The routine twosample_test() in this package runs the two-sample test using various test statistic for multivariate data. The user can also run several tests and then find a p value adjusted for simultaneous inference. The p values are found via permutation or via the parametric bootstrap. The routine twosample_power() allows the estimation of the power of the tests. The routine run.studies() allows a user to quickly study the power of a new method and how it compares to those included in the package. For details of the methods and references see the included vignettes.

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

Encoding:

UTF-8

RoxygenNote:

7.3.2

LinkingTo:

Rcpp

Imports:

Rcpp, parallel, stats, FNN, copula, ade4, gTests, igraph, lsa, microbenchmark, mvtnorm, Ball

Suggests:

rmarkdown, knitr, ggplot2, egg, testthat (≥ 3.0.0)

VignetteBuilder:

knitr

Depends:

R (≥ 3.5)

LazyData:

true

NeedsCompilation:

yes

Packaged:

2025-07-21 17:52:08 UTC; Wolfgang

Author:

Wolfgang Rolke

[aut, cre]

Maintainer:

Wolfgang Rolke <wolfgang.rolke@upr.edu>

Repository:

CRAN

Date/Publication:

2025-07-22 11:40:11 UTC

MD2sample: Various Methods for the Two Sample Problem in D>1 Dimensions

Description

Author(s)

Maintainer: Wolfgang Rolke wolfgang.rolke@upr.edu (ORCID)

Friedman-Rafsky (FR) test

Description

FR test is a multivariate generalization of nonparametric two-sample test. This function is an implementation with customized options, including a visualization of the minimum spanning tree (MST).

Usage

FR.test(
  samp1,
  samp2,
  use.cosine = FALSE,
  binary = FALSE,
  binary.cutoff = 2,
  plot.MST = FALSE,
  col = c("#F0E442", "#56B4E9"),
  label.names = c("Sample 1", "Sample 2"),
  vertex.size = 5,
  edge.width = 1,
  ...
)

Arguments

samp1

Numeric matrix or data frame for Sample 1. Rows are multivariate dimensions, and columns are samples. E.g. gene by cell.

samp2

Numeric matrix or data frame for Sample 2.

use.cosine

An option if to use cosine distance. Logical variable. By default (FALSE), Euclidean distance is used.

binary

An option if to use binary values. Logical variable. Default: FALSE. If TRUE, use binary.cutoff to dichotomize samp1 and samp2.

binary.cutoff

Numeric value for binary cutoff. Binary value = 1 if greater than binary.cutoff, 0 otherwise. Default: 2.

plot.MST

Boolean variable indicating if to plot the minimum spanning tree (MST). Default: FALSE.

col

Character vector of length two for customized colors of the nodes in MST. Default: c("#F0E442", "#56B4E9").

label.names

Character vector of length two for customized names of the two samples. Default: c("Sample 1","Sample 2").

vertex.size, edge.width, ...

Additional plotting parammeters passed to plot.igraph. Default: vertex.size=5, edge.width=1.

Value

Test statistics and p-values.

runs

Total number of subtrees.

runs.samp1

Number of subtrees of Sample 1.

runs.samp2

Number of subtrees of Sample 2.

stat

The standardized FR statistic.

p.value

P-value of the FR test.

equivalent to R command order

Description

equivalent to R command order

Usage

Order(x)

Arguments

x

numeric vector

Value

integer vector

equivalent to R command rank

Description

equivalent to R command rank

Usage

Rank(x)

Arguments

x

numeric vector

Value

integer vector

Find test statistics for continuous data

Description

Find test statistics for continuous data

Usage

TS_cont(x, y, TSextra)

Arguments

x

A matrix.

y

A matrix

TSextra

list with some helper functions

Value

A numeric vector with test statistics

This function runs a number of two sample tests which find their own p value.

Description

This function runs a number of two sample tests which find their own p value.

Usage

TS_cont_pval(x, y)

Arguments

x

a matrix of numbers if data is continuous or of counts if data is discrete.

y

a matrix of numbers if data is continuous or of counts if data is discrete.

Value

A list of two numeric vectors, the test statistics and the p values.

Find test statistics for discrete data

Description

Find test statistics for discrete data

Usage

TS_disc(x, y, vx, vy, TSextra)

Arguments

x

a vector of integers.

y

a vector of integers.

vx

a vector of real.

vy

a vector of real.

TSextra

list with extra infos

Value

A numeric vector with test statistics

Count Events in Bins

Description

This function discretizes continuous data in two dimensions

Usage

bincounterR(
  x,
  y,
  nbins = c(5, 5),
  Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2)
)

Arguments

x

a matrix of numbers or a list

y

a matrix of numbers

nbins

=c(5,5) number of bins in x and in y direction

Ranges

=matrix(c(-Inf, Inf, -Inf, Inf),2,2) a 2x2 matrix with lower and upper bounds

Value

a list counts and vals

This function calculates the test statistics

Description

This function calculates the test statistics

Usage

calcTS(dta, TS, typeTS, TSextra)

Arguments

dta

data set

TS

routine

typeTS

format of TS

TSextra

list passed to TS function

Value

A vector of numbers

Create case studies

Description

This function creates the functions needed to run the various case studies.

Usage

case.studies(
  which,
  n = 200,
  nx = n,
  ny = n,
  nbins = -1,
  Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
  ReturnCaseNames = FALSE
)

Arguments

which

name of the case study.

n

=200, sample size for both data sets

nx

=n, sample size for first data sets

ny

=n, sample size for second data sets

nbins

=-1, number of bins for chi-square test, 2D only

Ranges

= matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), ranges of variables

ReturnCaseNames

=FALSE, should list of case studies be returned?

Value

a list of functions and vectors

Change Marginals of 2D Data

Description

This function modifies the marginals of some data set

Usage

change.marginals(dta, which, theta)

Arguments

dta

a list with two data sets

which

method for modifying the data sets.

theta

how much to modify

Value

a list of functions

Chi-square test for continuous data, example of user supplied function

Description

This function does the chi square two sample test for continuous data in two dimensions

Usage

chiTS.cont(x, y, TSextra)

Arguments

x

a matrix of numbers

y

a matrix of numbers

TSextra

list with some info

Value

either a test statistic or a p value

Chi-square test for discrete data, example of user supplied function

Description

This function does the chi square two sample test for discrete data in two dimensions

Usage

chiTS.disc(x, y, vals_x, vals_y, TSextra)

Arguments

x

a vector with counts

y

a vector with counts

vals_x

a vector with points

vals_y

a vector with points

TSextra

list with some info

Value

either a test statistic or a p value

This function does the chi square two sample test for continuous data in two dimensions

Description

This function does the chi square two sample test for continuous data in two dimensions

Usage

chisq2D_test_cont(
  dta_x,
  dta_y,
  Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
  nbins = c(5, 5),
  minexpcount = 5
)

Arguments

dta_x

a matrix of numbers

dta_y

a matrix of numbers

Ranges

=matrix(c(-Inf, Inf, -Inf, Inf),2,2) a 2x2 matrix with lower and upper bounds

nbins

=c(5,5) number of bins in x and y direction

minexpcount

=5 minimum counts required per bin

Value

a list with statistics and p values

This function does the chi square two sample test for continuous data in two dimensions

Description

This function does the chi square two sample test for continuous data in two dimensions

Usage

chisq2D_test_disc(dta, minexpcount = 5)

Arguments

dta

a list with values and counts

minexpcount

=5 minimum counts required per bin

Value

a list with statistics and p values

This function does the tests by Chen and Friedman

Description

This function does the tests by Chen and Friedman

Usage

edge.tests(x, y)

Arguments

x

a matrix of numbers

y

a matrix of numbers

Value

a list with statistics and p values

find pairwise Euclidean distances

Description

find pairwise Euclidean distances

Usage

find_dist(dta)

Arguments

dta

A list of matrices

Value

A matrix

simulate continuous data

Description

simulate continuous data

Usage

gen_cont(x, y, TSextra)

Arguments

x

first data set

y

second data set

TSextra

extra stuff

Value

A list of permuted vectors

simulate new discrete data

Description

simulate new discrete data

Usage

gen_disc(x, y, vals_x, vals_y, TSextra)

Arguments

x

first data set, counts

y

second data set, counts

vals_x

values of discrete random variable x

vals_y

values of discrete random variable y

TSextra

extra stuff

Value

A list of permuted vectors

simulate new data

Description

simulate new data

Usage

gen_sim_data(dta, TSextra)

Arguments

dta

data set

TSextra

extra stuff

Value

A list of permuted vectors

Finds the empirical distribution function

Description

Finds the empirical distribution function

Usage

mdecdf(dta, pts)

Arguments

dta

a matrix of data points

pts

a matrix of evaluation points

Value

a numeric vector

Find the power of various continuous tests via simutation or permutation.

Description

Find the power of various continuous tests via simutation or permutation.

Usage

powerC(rxy, xparam, yparam, TS, typeTS, TSextra, B = 1000L)

Arguments

rxy

a function that generates x and y data.

xparam

arguments for r1.

yparam

arguments for r2.

TS

routine to calculate test statistics for non-chi-square tests

typeTS

indicator for type of test statistics

TSextra

additional info passed to TS, if necessary

B

=1000 number of simulation runs

Value

A list values of test statistics

This function finds the power for tests that find their own p values

Description

This function finds the power for tests that find their own p values

Usage

power_pvals(
  rxy,
  avals,
  bvals,
  TS,
  typeTS,
  TSextra = list(a = 0),
  nbins = c(5, 5),
  minexpcount = 5,
  Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
  alpha = 0.05,
  B = 1000
)

Arguments

rxy

a function to generate data

avals

vector of parameter values

bvals

vector of parameter values

TS

routine that runs the test and returns p values

typeTS

type of problem, continuous or discrete

TSextra

=list(a=0) a list of things passed to TS, if needed

nbins

=c(5, 5), bins for chi square tests.

minexpcount

=5, lowest required count for chi-square test

Ranges

=matrix(c(-Inf, Inf, -Inf, Inf),2,2) a 2x2 matrix with lower and upper bounds

alpha

=0.05 type I error probability of test

B

=1000 number of simulation runs

Value

A matrix of power values

power_studies_results

Description

the results of the included power studies

Usage

power_studies_results

Format

'power_studies_results'

A list of matrices with powers

This function creates data for Dalitz plot

Description

This function creates data for Dalitz plot

Usage

rDalitz(nyadd = 0, which = 1, addon, n = 200, nx = n, ny = n)

Arguments

nyadd

which

type of Daliz plot.

addon

what to add to uniform

n

=200, sample size of both data sets.

nx

=n, sample size of x data set.

ny

=n, sample size of y data set.

Value

a list of functions

Benchmarking for Multivariate Two-Sample Tests

Description

This function runs the case studies included in the package.

Usage

run.studies(
  Continuous = TRUE,
  study,
  TS,
  TSextra,
  With.p.value = FALSE,
  nsample = 200,
  alpha = 0.05,
  param_alt,
  SuppressMessages = FALSE,
  B = 1000,
  maxProcessor
)

Arguments

Continuous

=TRUE, run cases for continuous data.

study

either the name of the study, or its number in the list. If missing all the studies are run.

TS

routine to calculate new test statistics.

TSextra

list passed to TS (optional).

With.p.value

=FALSE, does user supplied routine return p values?

nsample

= 200, desired sample size. 200 is used in included case studies.

alpha

=0.05, type I error probability of tests. 0.05 is used in included case studies.

param_alt

(list of) values of parameter under the alternative hypothesis. If missing included values are used.

SuppressMessages

=FALSE, should informative messages be printed?

B

= 1000, number of simulation runs.

maxProcessor

number of cores to use. If missing the number of physical cores-1 is used. If set to 1 no parallel processing is done.

Details

For details consult vignette(package="MD2sample")

Value

A (list of ) matrices of p.values.

Examples

#The new test is a (included) chi square test:
TSextra=list(which="pval", nbins=rbind(c(3,3), c(4,4)))
run.studies(Continuous=TRUE, study=c("NormalD2", "tD2"), 
          TS=MD2sample::chiTS.cont, TSextra=TSextra, B=100)

This function does some rounding to nice numbers

Description

This function does some rounding to nice numbers

Usage

## S3 method for class 'digits'
signif(x, d = 4)

Arguments

x

a list of two vectors

d

=4 number of digits to round to

Value

A list with rounded vectors

Helper function to find test statistics of simulated data.

Description

Helper function to find test statistics of simulated data.

Usage

simTS(dta, TS, typeTS, TSextra, B)

Arguments

dta

a list

TS

test statistic routine

typeTS

type of routine

TSextra

a list

B

number of simulation runs

Value

a matrix

Helper function to find test statistics of simulated data.

Description

Helper function to find test statistics of simulated data.

Usage

simpvals(
  dta,
  TS,
  typeTS,
  TSextra,
  A,
  Continuous,
  Ranges,
  nbins,
  minexpcount,
  B
)

Arguments

dta

a list

TS

test statistic routine

typeTS

type of routine

TSextra

a list

A

a matrix

Continuous

logical

Ranges

a matrix

nbins

a vector

minexpcount

an integer

B

number of simulation runs

Value

a matrix

run tests using either simulation or permutation.

Description

run tests using either simulation or permutation.

Usage

testC(dta, TS, typeTS, TSextra, B = 5000L)

Arguments

dta

a list with the data

TS

routine to calculate test statistics for non-chi-square tests

typeTS

type of a test statistic

TSextra

additional info passed to TS, if necessary

B

=5000, number of simulation runs.

Value

A list with test statistics and p values

This function checks whether the correct methods have been requested

Description

This function checks whether the correct methods have been requested

Usage

test_methods(doMethods, Continuous, ReturnMethodNames = FALSE)

Arguments

doMethods

="all" Which methods should be included?

Continuous

is data continuous

ReturnMethodNames

= FALSE should names of methods be returned?

Value

TRUE/FALSE or a character vector

test function

Description

test function

Usage

timecheck(dta, TS, typeTS, TSextra)

Arguments

dta

data set

TS

test statistics

typeTS

format of TS

TSextra

additional info TS

Value

Mean computation time

transform data to interval (0,1).

Description

transform data to interval (0,1).

Usage

transform01(dta)

Arguments

dta

a list.

Value

a list

Power Estimation for Multivariate Two-Sample Tests

Description

Estimate the power of various two sample tests using Rcpp and parallel computing.

Usage

twosample_power(
  f,
  ...,
  TS,
  TSextra,
  alpha = 0.05,
  B = 1000,
  nbins = c(5, 5),
  minexpcount = 5,
  Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
  samplingmethod = "Binomial",
  rnull,
  With.p.value = FALSE,
  DoTransform = TRUE,
  SuppressMessages = FALSE,
  LargeSampleOnly = FALSE,
  maxProcessor,
  doMethods = "all"
)

Arguments

f

function to generate a list with data sets x and y for continuous data or a matrix with columns vals_x, vals_y, x and y for discrete data.

...

additional arguments passed to f, up to 2.

TS

routine to calculate test statistics for new tests.

TSextra

additional info passed to TS, if necessary.

alpha

=0.05, the type I error probability of the hypothesis test.

B

=1000, number of simulation runs.

nbins

=c(5, 5), number of bins for chi square test if Dim=2.

minexpcount

=5, lowest required count for chi-square test.

Ranges

=matrix(c(-Inf, Inf, -Inf, Inf),2,2), a 2x2 matrix with lower and upper bounds.

samplingmethod

="Binomial" for Binomial sampling or "independence" for independence sampling in the discrete data case.

rnull

function to generate new data sets for parametric bootstrap.

With.p.value

=FALSE, does user supplied routine return p values?

DoTransform

=TRUE, should data be transformed to to unit hypercube?

SuppressMessages

=FALSE, should messages be printed?

LargeSampleOnly

=FALSE, should only methods with large sample theories be run?

maxProcessor

number of cores to use. If missing the number of physical cores-1 is used. If set to 1 no parallel processing is done.

doMethods

="all", which methods should be included?

Details

For details consult vignette("MD2sample","MD2sample")

Value

A numeric matrix or vector of power values.

Examples

#Note that the resulting power estimates are meaningless because
#of the extremely low number of simulation runs B, required because of CRAN timing rule
#
#Power of tests when one data set comes from a standard normal multivariate distribution function
#and the other data set from a multivariate normal with correlation
#number of simulation runs is ridiculously small because of CRAN submission rules
f=function(a=0) {
 S=diag(2) 
 x=mvtnorm::rmvnorm(100, sigma = S)
 S[1,2]=a
 S[2,1]=a
 y=mvtnorm::rmvnorm(120, sigma = S)
 list(x=x, y=y)
}
twosample_power(f, c(0, 0.5), B=10, maxProcessor=1)
#Power of use supplied test. Example is a (included) chi-square test:
TSextra=list(which="statistics", nbins=rbind(c(3,3), c(4,4)))
twosample_power(f, c(0, 0.5), TS=chiTS.cont, TSextra=TSextra, B=10, maxProcessor=1)
#Same example, but this time the user supplied routine calculates p values:
TSextra=list(which="pvalues", nbins=c(4,4))
twosample_power(f, c(0, 0.5), TS=chiTS.cont, TSextra=TSextra, B=10, 
             With.p.value=TRUE, maxProcessor=1)
#Example for discrete data
g=function(p1, p2) {
  x = table(sample(1:4, size=1000, replace = TRUE))
  y = table(sample(1:4, size=500, replace = TRUE, prob=c(p1,p2,1,1)))
  cbind(vals_x=rep(1:2,2),  vals_y=rep(1:2, each=2), x=x, y=y)
}  
twosample_power(g, 1.5, 1.6, B=10, maxProcessor=1)

Multivariate Two-Sample Tests

Description

This function runs a number of two sample tests using Rcpp and parallel computing.

Usage

twosample_test(
  x,
  y,
  vals_x = NA,
  vals_y = NA,
  TS,
  TSextra,
  B = 5000,
  nbins = c(5, 5),
  minexpcount = 5,
  Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
  DoTransform = TRUE,
  samplingmethod = "Binomial",
  rnull,
  SuppressMessages = FALSE,
  LargeSampleOnly = FALSE,
  maxProcessor,
  doMethods = "all"
)

Arguments

x

Continuous data: either a matrix of numbers, or a list with two matrices called x and y. if it is a matrix Observations are in different rows. Discrete data: a vector of counts or a matrix with columns named vals_x, vals_y, x and y.

y

a matrix of numbers if data is continuous or a vector of counts if data is discrete.

vals_x

=NA, a vector of values for discrete random variables, or NA if data is continuous.

vals_y

=NA, a vector of values for discrete random variables, or NA if data is continuous.

TS

user supplied routine to calculate test statistics for new tests.

TSextra

(optional) additional info passed to TS, if necessary.

B

=5000, number of simulation runs for permutation test.

nbins

=c(5,5), for chi square tests (2D only).

minexpcount

=5, lowest required count for chi-square test (2D only).

Ranges

=matrix(c(-Inf, Inf, -Inf, Inf),2,2), a 2x2 matrix with lower and upper bounds (2D only).

DoTransform

=TRUE, should data be transformed to unit hypercube?

samplingmethod

="Binomial" for Binomial sampling or "independence" for independence sampling.

rnull

function to generate new data sets for simulation as an alternative to the permutation method.

SuppressMessages

=FALSE, should informative messages be printed?

LargeSampleOnly

=FALSE, should only methods with large sample theories be run?

maxProcessor

number of cores to use. If missing the number of physical cores-1 is used. If set to 1 no parallel processing is done.

doMethods

="all", Which methods should be included?

Details

For details consult vignette("MD2sample","MD2sample")

Value

A list of two numeric vectors, the test statistics and the p values.

Examples

#Two continuous data sets from a multivariate normal:
x = mvtnorm::rmvnorm(100, c(0,0))
y = mvtnorm::rmvnorm(120, c(0,0))
twosample_test(x, y, B=100, maxProcessor=1)
#Using a new test, this one is an (included) chi square test. 
#Also enter data as a list:
TSextra=list(which="statistics", nbins=rbind(c(3,3), c(4,4)))
dta=list(x=x, y=y)
twosample_test(dta, TS=chiTS.cont, TSextra=TSextra, B=100, maxProcessor=1)
#Two discrete data sets from some distribution:
x = table(sample(1:4, size=1000, replace = TRUE))
y = table(sample(1:4, size=1000, replace = TRUE, prob=c(1,2,1,1)))
vals_x=rep(1:2,2)
vals_y=rep(1:2, each=2)
twosample_test(x, y, vals_x, vals_y, B=100, maxProcessor=1)
#Run a discrete chi square test and enter the data as a matrix:
TSextra=list(which="statistics")
dta=cbind(x=x, y=y, vals_x=vals_x, vals_y=vals_y)
twosample_test(dta, TS=chiTS.disc, TSextra=TSextra, B=100, maxProcessor=1)

Adjusted p values

Description

This function runs a number of two sample tests using Rcpp and parallel computing and then finds the correct p value for the combined tests.

Usage

twosample_test_adjusted_pvalue(
  x,
  y,
  vals_x = NA,
  vals_y = NA,
  B = c(5000, 1000),
  nbins = c(5, 5),
  minexpcount = 5,
  samplingmethod = "Binomial",
  Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2),
  DoTransform = TRUE,
  rnull,
  SuppressMessages = FALSE,
  maxProcessor,
  doMethods
)

Arguments

x

y

a matrix of numbers if data if data is continuous or a vector of counts if data is discrete.

vals_x

=NA, a vector of values for discrete random variable, or NA if data is continuous.

vals_y

=NA, a vector of values for discrete random variable, or NA if data is continuous.

B

=c(5000, 1000), number of simulation runs for permutation test and for estimation of the empirical distribution function.

nbins

=c(5, 5), number of bins for chi square tests (2D only).

minexpcount

= 5, minimum required expected counts for chi-square tests.

samplingmethod

="Binomial" or "independence" for discrete data.

Ranges

=matrix(c(-Inf, Inf, -Inf, Inf),2,2) a 2x2 matrix with lower and upper bounds.

DoTransform

=TRUE, should data be transformed to interval (0,1)?

rnull

routine for parametric bootstrap.

SuppressMessages

= FALSE, print informative messages?

maxProcessor

number of cores for parallel processing.

doMethods

Which methods should be included? If missing a small number of methods that generally have good power are used.

Details

For details consult the vignette("MD2sample","MD2sample")

Value

NULL, results are printed out.

Examples

#Note that the number of simulation runs B is very small to
#satisfy CRAN's run time constraints. 
#Two continuous data sets from a multivariate normal:
x = mvtnorm::rmvnorm(100, c(0,0))
y = mvtnorm::rmvnorm(120, c(0,0))
twosample_test_adjusted_pvalue(x, y, maxProcessor=1, B=20)
#Two discrete data sets from some distribution:
x = table(sample(1:4, size=1000, replace = TRUE))
y = table(sample(1:4, size=500, replace = TRUE, prob=c(1, 1.5, 1, 1)))
twosample_test_adjusted_pvalue(x, y, rep(1:2,2), rep(1:2, each=2), maxProcessor=1, B=20)

sort vector y by values in vector x

Description

sort vector y by values in vector x

Usage

xysort(y, x)

Arguments

y

numeric vector

x

numeric vector

Value

numeric vector