Title: | Partition Exchangeability Toolkit |
Version: | 1.0.0.1000 |
Description: | Bayesian supervised predictive classifiers, hypothesis testing, and parametric estimation under Partition Exchangeability are implemented. The two classifiers presented are the marginal classifier (which assumes the test data are i.i.d.) and a more computationally costly but more accurate simultaneous classifier (which finds a labelling for the entire test dataset at once, using all of the test data simultaneously to predict each label). We also provide Maximum Likelihood Estimation (MLE) of the single underlying parameter of the partition exchangeability generative model, as well as hypothesis testing statistics for the equality of this parameter with a given value, with that of an alternative sample, or across multiple samples. Functions are included to simulate sequences from the Ewens Sampling Formula as realisations of the Poisson-Dirichlet distribution, and to compute their respective probabilities. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.1.1.9001 |
Suggests: | testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
Imports: | stats (≥ 4.1.0) |
NeedsCompilation: | no |
Packaged: | 2021-11-19 13:16:51 UTC; villekin |
Author: | Ville Kinnula [aut], Jing Tang |
Maintainer: | Ali Amiryousefi <ali.amiryousefi@helsinki.fi> |
Repository: | CRAN |
Date/Publication: | 2021-11-22 08:50:02 UTC |
Maximum Likelihood Estimate of \psi
Description
Numerically searches for the MLE of \psi given an abundance vector, using a binary search algorithm.
Usage
MLEp(abund)
Arguments
abund |
An abundance vector. |
Details
Numerically searches for the MLE of \psi as the root of the equation
K = \sum_{i=1}^n \psi/(\psi+i-1),
where K is the observed number of different species in the sample. The right-hand side of the equation is monotonically increasing in \psi when \psi > 0, so a binary search is used to find the root. A value of \psi is accepted once it sets the right-hand side of the equation within R's smallest possible numerical difference of the actual value of K.
Value
The MLE of \psi.
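For illustration, a minimal sketch of the binary search described above, written directly against the abundance-vector input; the function name MLEp_sketch and the exact stopping rule are illustrative, not the package's implementation.
## Sketch of the binary search for the MLE of psi.
## `abund` is a named vector as returned by abundance(): names are the
## frequencies, values are how many distinct species have that frequency.
MLEp_sketch <- function(abund) {
  K <- sum(abund)                                  # observed number of distinct species
  n <- sum(as.numeric(names(abund)) * abund)       # sample size
  if (K == n) stop("All values are distinct; the MLE of psi is unbounded.")
  ## Right-hand side of K = sum_{i=1}^n psi/(psi + i - 1); increasing in psi.
  rhs <- function(psi) sum(psi / (psi + seq_len(n) - 1))
  lo <- .Machine$double.eps
  hi <- 1
  while (rhs(hi) < K) hi <- hi * 2                 # expand until the root is bracketed
  for (iter in 1:200) {                            # bisection
    mid <- (lo + hi) / 2
    if (rhs(mid) < K) lo <- mid else hi <- mid
  }
  (lo + hi) / 2
}
## Should be close to MLEp(abundance(x)):
# x <- rPD(1000, 10); MLEp_sketch(abundance(x))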
References
W.J. Ewens, The sampling theory of selectively neutral alleles, Theoretical Population Biology, Volume 3, Issue 1, 1972, Pages 87-112, ISSN 0040-5809, <doi: 10.1016/0040-5809(72)90035-4>.
Examples
##Find the MLE of psi of the vector (1,2,2).
##The frequencies of the frequencies of the data vector are given as input:
MLEp(abundance(c(1,2,2)))
##Find the MLE of psi of a sample from the Poisson-Dirichlet distribution:
set.seed(1000)
x<-rPD(n=10000, psi=100)
MLEp(abundance(x))
Bootstrap confidence interval for the MLE of \psi
Description
A bootstrapped confidence interval for the Maximum Likelihood Estimate of \psi.
Usage
MLEp.bsci(x, level = 0.95, rounds = 1000, frac = 0.8)
Arguments
x |
A data vector. |
level |
Level of the confidence interval, as a number between 0 and 1. |
rounds |
Number of bootstrap rounds. Default is 1000. |
frac |
Fraction of the data to be used in each bootstrap round. Default is 0.8. |
Value
The MLE of \psi as well as the lower and upper bounds of the bootstrap confidence interval.
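One way such an interval can be computed is sketched below with MLEp() and a percentile bootstrap; the resampling scheme (here subsampling a fraction frac without replacement) and the function name bsci_sketch are assumptions, and the package's own routine may differ.
## Sketch: percentile bootstrap interval for the MLE of psi.
bsci_sketch <- function(x, level = 0.95, rounds = 1000, frac = 0.8) {
  est  <- MLEp(abundance(x))
  boot <- replicate(rounds, {
    xs <- sample(x, size = floor(frac * length(x)), replace = FALSE)  # resample a fraction of the data
    MLEp(abundance(xs))
  })
  alpha <- (1 - level) / 2
  c(MLE = est, quantile(boot, probs = c(alpha, 1 - alpha)))           # empirical quantiles give the interval
}
# set.seed(1); x <- rPD(1000, 10); bsci_sketch(x, rounds = 200)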
Examples
## Find a 95% confidence interval for the MLE of psi given a sample from the
## Poisson-Dirichlet distribution:
x<-rPD(n=10000, psi=100)
MLEp.bsci(x, 0.95, 100, 0.8)
Vector of frequencies of frequencies
Description
A function to calculate the abundance vector, or the frequencies of the frequencies of a discrete or partly discrete data vector x. The abundance vector is used as input in the functions dPD(), MLEp(), and sample.test().
Usage
abundance(x)
Arguments
x |
Data vector |
Details
This function is equivalent to table(table(x)).
Value
This function returns a named vector with the frequencies of the frequencies in the data vector x. The function base::table(x) returns a contingency table with the frequencies in the input data vector x as values. The names(table(x)) are the unique values in the data vector x. In abundance(x), the unique values in table(x) become the names of the values, while the values themselves are the frequencies of the frequencies of the data vector x.
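The construction can also be spelled out step by step; the intermediate object names below are illustrative.
set.seed(111)
x <- rpois(10, 10)
freqs <- table(x)               # frequencies of the unique values in x
ab <- table(as.vector(freqs))   # frequencies of those frequencies
## names(ab) are the frequencies occurring in x; the values count how many
## unique values of x attain each frequency. Compare with the package function:
ab
abundance(x)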
Examples
set.seed(111)
x<-rpois(10,10)
## The frequency table of x:
print(table(x))
## The frequency table of the frequency table of x:
abundance(x)
Fit the supervised classifier under partition exchangeability
Description
Fits the model to the training data x, which is assumed to follow the Poisson-Dirichlet distribution, and its discrete class labels y.
Usage
classifier.fit(x, y)
Arguments
x |
data vector, or matrix with rows as data points and columns as features. |
y |
training data label vector of length equal to the number of rows (data points) in x. |
Details
This function is used to learn the model parameters from the training data and to gather them into an object that is used by the classification algorithms tMarLab() and tSimLab(). The parameters it learns are the Maximum Likelihood Estimate of \psi for each feature within each class in the training data. It also records the frequencies of the data for each feature within each class. These are used in calculating the predictive probability of each test data point belonging to each of the classes.
Value
Returns an object used as the training data object for the classification algorithms tMarLab() and tSimLab(). If x is multidimensional, the list described below is returned for each dimension. Returns a list of classwise lists, each with the components:
frequencies: the frequencies of values in the class.
psi: the Maximum Likelihood Estimate of \psi for the class.
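The per-class quantities described above can be computed directly; the sketch below mirrors them for one-dimensional x. The function name fit_sketch and the exact layout are illustrative, not the package's actual internal structure; the component names follow the Value section above.
## Sketch: per-class summaries for 1-D training data x with labels y.
fit_sketch <- function(x, y) {
  lapply(split(x, y), function(xc) {
    list(
      frequencies = table(xc),            # frequencies of values within the class
      psi         = MLEp(abundance(xc))   # class-specific MLE of psi
    )
  })
}
# set.seed(111)
# x <- c(rPD(5000, 10), rPD(5000, 100)); y <- rep(c("1", "2"), each = 5000)
# str(fit_sketch(x, y))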
Examples
## Create training data x and its class labels y from Poisson-Dirichlet distributions
## with different psis:
set.seed(111)
x1<-rPD(5000,10)
x2<-rPD(5000,100)
x<-c(x1,x2)
y1<-rep("1", 5000)
y2<-rep("2", 5000)
y<-c(y1,y2)
fit<-classifier.fit(x,y)
## With multidimensional x:
set.seed(111)
x1<-cbind(rPD(5000,10),rPD(5000,50))
x2<-cbind(rPD(5000,100),rPD(5000,500))
x<-rbind(x1,x2)
y1<-rep("1", 5000)
y2<-rep("2", 5000)
y<-c(y1,y2)
fit<-classifier.fit(x,y)
The Poisson-Dirichlet distribution
Description
Probability function for the Poisson-Dirichlet distribution.
Usage
dPD(abund, psi)
Arguments
abund |
An abundance vector. |
psi |
Dispersal parameter |
Details
Given an abundance vector abund, calculates the probability of a data vector x under the Poisson-Dirichlet distribution. The higher the dispersal parameter \psi, the greater the number of distinct observed species. In terms of the paintbox process, a high \psi increases the size of the continuous part p_0 of the process, while a low \psi will increase the size of the discrete parts p_{\neq 0}.
Value
The probability under the Poisson-Dirichlet distribution of the input abundance vector, i.e. an exchangeable random partition, given the dispersal parameter \psi.
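The probability is given by the Ewens sampling formula of the reference below; here is a minimal log-scale sketch for an abundance vector as produced by abundance() (the function name dPD_sketch is illustrative).
## Sketch: Ewens sampling formula on the log scale.
## names(abund) are the frequencies j; values are a_j, the number of distinct
## values observed exactly j times; n = sum(j * a_j) is the sample size.
dPD_sketch <- function(abund, psi, log = FALSE) {
  j  <- as.numeric(names(abund))
  aj <- as.numeric(abund)
  n  <- sum(j * aj)
  logp <- lgamma(n + 1) - (lgamma(psi + n) - lgamma(psi)) +   # n! / rising factorial psi^(n)
    sum(aj * (log(psi) - log(j)) - lgamma(aj + 1))            # prod_j (psi/j)^a_j / a_j!
  if (log) logp else exp(logp)
}
## Should agree with dPD() if it implements the Ewens sampling formula:
# set.seed(111); s <- rPD(100, 5)
# c(dPD_sketch(abundance(s), 5), dPD(abundance(s), 5))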
References
W.J. Ewens, The sampling theory of selectively neutral alleles, Theoretical Population Biology, Volume 3, Issue 1, 1972, Pages 87-112, ISSN 0040-5809, <doi: 10.1016/0040-5809(72)90035-4>.
Examples
## Get a random sample from the Poisson Dirichlet distribution, and
## find the probability of such a sample with psi=5:
set.seed(111)
s <- rPD(n=100,psi=5)
a=abundance(s)
dPD(a, psi=5)
Test for the shape of the distribution
Description
This function performs a statistical test of the null hypothesis that a given sample's underlying distribution is the Poisson-Dirichlet distribution. It calculates a test statistic that is then used to obtain a p-value from an empirical distribution of the statistic, computed from samples simulated under a PD distribution.
Usage
is.PD(x, rounds)
Arguments
x |
A discrete data vector. |
rounds |
How many samples are simulated to obtain the empirical distribution. |
Details
The test statistic
W = \sum_{i=1}^n n_i^2 / n
is calculated from the sample, where the n_i are the frequencies of each unique value in the sample. The MLE of \psi is then estimated from the sample with the function MLEp(), and a number of samples equal to the input parameter rounds is generated with that estimate of \psi and sample size n. The test statistic W is then calculated for each of the simulated samples. The original W is then given a p-value based on what percentage of the simulated W it exceeds.
Value
A p-value.
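A minimal sketch of the Monte Carlo procedure described above; the exact p-value convention used by the package may differ from the two-sided empirical p-value assumed here, and the name isPD_sketch is illustrative.
## Sketch: simulation-based test of the PD assumption via the W statistic.
isPD_sketch <- function(x, rounds = 100) {
  W <- function(v) { ni <- table(v); sum(ni^2) / length(v) }   # test statistic
  W.obs  <- W(x)
  psihat <- MLEp(abundance(x))                                 # estimate psi from the sample
  W.sim  <- replicate(rounds, W(rPD(length(x), psihat)))       # simulate the null distribution
  ## Two-sided empirical p-value from the position of W.obs among the simulated W:
  min(1, 2 * min(mean(W.sim <= W.obs), mean(W.sim >= W.obs)))
}
# set.seed(1); x <- rPD(100, 10); isPD_sketch(x, 200)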
References
Watterson, G.A., (1978), The homozygosity test of neutrality. Genetics. 88(2):405-417.
Examples
##Test whether a typical sample follows PD:
x<-rPD(100,10)
is.PD(x, 100)
##Test a very atypical sample in which the frequencies of the different values
## are similar:
x<-c(rep(1, 200), rep(2, 200), rep(3, 200), rep(4, 200), rep(5,200))
is.PD(x,50)
Test for \psi of multiple samples
Description
Likelihood ratio test for the hypotheses H_0: \psi_1=\psi_2=...=\psi_d and H_1: \psi_1 \neq \psi_2 \neq ... \neq \psi_d, where \psi_1, \psi_2, ..., \psi_d are the dispersal parameters of the d samples in the columns of the input data array x.
Usage
mult.sample.test(x)
Arguments
x |
The data array to be tested. Each column of x contains one sample; shorter samples can be padded with NA, as in the example below. |
Details
Calculates the Likelihood Ratio Test statistic
-2 log( L(\hat{\psi}) / L(\hat{\psi}_1, \hat{\psi}_2, ..., \hat{\psi}_d) ),
where L is the likelihood function of observing the d input samples, given a single \psi in the numerator and d different parameters \psi_1, \psi_2, ..., \psi_d, one for each sample, in the denominator. According to the theory of Likelihood Ratio Tests, this statistic converges in distribution to a \chi_{d-1}^2-distribution when the null hypothesis is true, where d-1 is the difference in the number of parameters between the considered models. To calculate the statistic, the Maximum Likelihood Estimates \psi_1, \psi_2, ..., \psi_d of H_1 and the shared \psi of H_0 are calculated.
Value
Gives a vector with the Likelihood Ratio Test statistic Lambda, as well as the p-value of the test, p.
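The statistic can be sketched with the psi-dependent part of the Ewens sampling formula log-likelihood (the constant terms cancel in the ratio); the numerical optimisation and function name below are illustrative assumptions, not the package's internal routine.
## Sketch: likelihood ratio test for a common psi across the columns of x.
## Per-sample log-likelihood up to a constant: K log(psi) - sum_{i=1}^n log(psi + i - 1).
mult_test_sketch <- function(x) {
  samples <- lapply(seq_len(ncol(x)), function(j) x[!is.na(x[, j]), j])
  ll <- function(psi, s) length(unique(s)) * log(psi) - sum(log(psi + seq_along(s) - 1))
  ## H1: a separate psi maximises each sample's log-likelihood.
  ll1 <- sum(vapply(samples, function(s)
    optimize(ll, c(1e-8, 1e8), s = s, maximum = TRUE)$objective, numeric(1)))
  ## H0: one shared psi maximises the joint log-likelihood.
  ll0 <- optimize(function(p) sum(vapply(samples, function(s) ll(p, s), numeric(1))),
                  c(1e-8, 1e8), maximum = TRUE)$objective
  Lambda <- -2 * (ll0 - ll1)
  c(Lambda = Lambda, p = pchisq(Lambda, df = length(samples) - 1, lower.tail = FALSE))
}
# set.seed(111)
# mult_test_sketch(cbind(rPD(1200, 15), c(rPD(1000, 20), rep(NA, 200))))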
References
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical Or Physical Character, 231(694-706), 289-337. <doi: 10.1098/rsta.1933.0009>.
Examples
##Create samples with different n and psi:
set.seed(111)
x<-rPD(1200, 15)
y<-c( rPD(1000, 20), rep(NA, 200) )
z<-c( rPD(800, 30), rep(NA, 400) )
samples<-cbind(cbind(x, y), z)
##Run test
mult.sample.test(samples)
Random sampling from the Poisson-Dirichlet Distribution
Description
rPD samples randomly from the PD distribution with a given \psi by simulating the Hoppe urn model.
Usage
rPD(n, psi)
Arguments
n |
number of observations. |
psi |
dispersal parameter. |
Details
Samples random values with a given \psi from the Poisson-Dirichlet distribution by simulating the Hoppe urn model.
Value
Returns a vector with a sample of size n from the Hoppe urn model with parameter \psi.
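For intuition, a compact sketch of the Hoppe urn simulation described above; rPD() itself may label the species differently, and the name rPD_sketch is illustrative.
## Sketch: Hoppe urn. At step i a new species appears with probability
## psi / (psi + i - 1); otherwise an earlier observation is copied uniformly at random.
rPD_sketch <- function(n, psi) {
  out <- integer(n)
  n.species <- 0L
  for (i in seq_len(n)) {
    if (runif(1) < psi / (psi + i - 1)) {
      n.species <- n.species + 1L          # the "mutation" ball: a new species
      out[i] <- n.species
    } else {
      out[i] <- out[sample.int(i - 1, 1)]  # copy an earlier observation
    }
  }
  out
}
# set.seed(111); MLEp(abundance(rPD_sketch(1000, 10)))  # should be near 10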
References
Hoppe, F.M. The sampling theory of neutral alleles and an urn model in population genetics. J. Math. Biology 25, 123–159 (1987). <doi: 10.1007/BF00276386>.
W.J. Ewens, The sampling theory of selectively neutral alleles, Theoretical Population Biology, Volume 3, Issue 1, 1972, Pages 87-112, ISSN 0040-5809, <doi: 10.1016/0040-5809(72)90035-4>.
Examples
## Get random sample from the PD distribution with different psi,
## and estimate the psi of the samples:
s1<-rPD(1000, 10)
s2<- rPD(1000, 50)
print(c(MLEp(abundance(s1)), MLEp(abundance(s2))))
Lagrange Multiplier Test for \psi
Description
Performs the Lagrange Multiplier test for the equality of the dispersal parameter \psi of a sample to a given value. The null hypothesis of the test is H_0: \psi = \psi_0, where \psi_0 is given as input.
Usage
sample.test(abund, psi = "a")
Arguments
abund |
An abundance vector of a sample. |
psi |
Target positive number \psi_0 to be tested. The default "a" tests for \psi_0 = 1, while "r" tests for \psi_0 = n, the sample size (see the examples). |
Details
Calculates the Lagrange Multiplier test statistic
S = U(\psi_0)^2 / I(\psi_0),
where U is the score function (the derivative of the log-likelihood with respect to \psi) and I is the Fisher information of \psi. The statistic S follows a \chi^2-distribution with 1 degree of freedom when the null hypothesis H_0: \psi=\psi_0 is true.
Value
The statistic S and a p-value of the two-sided test of the hypothesis.
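The ingredients of the statistic can be written out from the Ewens sampling formula log-likelihood used in MLEp(); the closed-form score and Fisher information below are stated as assumptions about how the test is built, and the package's internal computations may differ.
## Sketch: Lagrange Multiplier (score) test of H0: psi = psi0.
## Log-likelihood up to a constant: K log(psi) - sum_{i=1}^n log(psi + i - 1).
## Score:       U(psi) = K/psi - sum_i 1/(psi + i - 1)
## Fisher info: I(psi) = (1/psi) sum_i 1/(psi + i - 1) - sum_i 1/(psi + i - 1)^2
sample_test_sketch <- function(abund, psi0) {
  j <- as.numeric(names(abund))
  K <- sum(abund)                          # number of distinct values
  n <- sum(j * abund)                      # sample size
  d <- psi0 + seq_len(n) - 1
  U <- K / psi0 - sum(1 / d)
  I <- sum(1 / d) / psi0 - sum(1 / d^2)
  S <- U^2 / I
  c(S = S, p = pchisq(S, df = 1, lower.tail = FALSE))
}
# set.seed(1); x <- rPD(1000, 10); sample_test_sketch(abundance(x), 10)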
References
Radhakrishna Rao, C, (1948), Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 44(1), 50-57. <doi: 10.1017/S0305004100023987>
Examples
## Test the psi of a sample from the Poisson-Dirichlet distribution:
set.seed(10000)
x<-rPD(1000, 10)
## Find the abundance of the data vector:
abund=abundance(x)
## Test for the psi that was used, as well as a higher and a lower one:
sample.test(abund, 10)
sample.test(abund, 15)
sample.test(abund, 5)
sample.test(abund) #test for psi=1
sample.test(abund, "r") #test for psi=n
Marginally predicted labels of the test data given training data classification.
Description
Classifies the test data x based on the training data object. The test data is considered i.i.d., so each data point is classified one by one.
Usage
tMarLab(training, x)
Arguments
training |
A training data object from the function classifier.fit(). |
x |
Test data vector or matrix with rows as data points and columns as features. |
Details
Independently assigns a class label to each test data point according to a maximum a posteriori rule. Assuming the training data of size m_c in class c arises from a Poisson-Dirichlet(\hat{\psi}_c) distribution, the predictive probability of data point x_i arising from class c is
\hat{\psi}_c / (m_c + \hat{\psi}_c),
if no value equal to x_i exists in the training data of class c, and
m_{ci} / (m_c + \hat{\psi}_c),
if there does, where m_{ci} is the frequency of the value of x_i in the training data of class c.
Value
A vector of predicted labels for test data x.
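The rule above can be applied directly to per-class frequency tables and psi estimates such as those sketched under classifier.fit(); the object layout assumed below is illustrative, not the package's actual internal structure.
## Sketch: marginal (i.i.d.) classification of 1-D test data x.
## `classes` is a named list with, per class, `frequencies` (table of training
## values) and `psi` (the class MLE), as in the classifier.fit() sketch.
marlab_sketch <- function(classes, x) {
  scores <- sapply(classes, function(cl) {
    f    <- setNames(as.numeric(cl$frequencies), names(cl$frequencies))
    m.c  <- sum(f)                              # class training sample size
    m.ci <- as.numeric(f[as.character(x)])      # training frequency of each test value
    m.ci[is.na(m.ci)] <- 0
    ## psi_c/(m_c + psi_c) if the value is unseen in the class, m_ci/(m_c + psi_c) otherwise:
    ifelse(m.ci == 0, cl$psi, m.ci) / (m.c + cl$psi)
  })
  names(classes)[max.col(scores)]               # maximum a posteriori label per test point
}
# fit <- fit_sketch(x, y)       # per-class summaries from the classifier.fit() sketch
# marlab_sketch(fit, t)         # compare with tMarLab(classifier.fit(x, y), t)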
References
Amiryousefi A. Asymptotic supervised predictive classifiers under partition exchangeability. 2021. https://arxiv.org/abs/2101.10950.
Corander, J., Cui, Y., Koski, T., and Siren, J.: Have I seen you before? Principles of Bayesian predictive classification revisited. Springer, Stat. Comput. 23, (2011), 59–73, (<doi: 10.1007/s11222-011-9291-7>).
Examples
## Create random samples x from Poisson-Dirichlet distributions with different
## psis, treating each sample as coming from a class of its own:
set.seed(111)
x1<-rPD(10500,10)
x2<-rPD(10500,1000)
test.ind1<-sample.int(10500,500) # Sample test datasets from the
test.ind2<-sample.int(10500,500) # original samples
x<-c(x1[-test.ind1],x2[-test.ind2])
## create training data labels:
y1<-rep("1", 10000)
y2<-rep("2", 10000)
y<-c(y1,y2)
## Test data t, with the first half belonging to class "1" and the second half to class "2":
t1<-x1[test.ind1]
t2<-x2[test.ind2]
t<-c(t1,t2)
fit<-classifier.fit(x,y)
## Run the classifier, which returns the predicted labels:
tM<-tMarLab(fit, t)
##With multidimensional x:
set.seed(111)
x1<-cbind(rPD(5500,10),rPD(5500,50))
x2<-cbind(rPD(5500,100),rPD(5500,500))
test.ind1<-sample.int(5500,500)
test.ind2<-sample.int(5500,500)
x<-rbind(x1[-test.ind1,],x2[-test.ind2,])
y1<-rep("1", 5000)
y2<-rep("2", 5000)
y<-c(y1,y2)
fit<-classifier.fit(x,y)
t1<-x1[test.ind1,]
t2<-x2[test.ind2,]
t<-rbind(t1,t2)
tM<-tMarLab(fit, t)
Simultaneously predicted labels of the test data given the training data classification.
Description
Classifies the test data x based on the training data object. All of the test data is used simultaneously to make the classification.
Usage
tSimLab(training, x)
Arguments
training |
A training data object from the function classifier.fit(). |
x |
Test data vector or matrix with rows as data points and columns as features. |
Details
The test data are first labeled with the marginal classifier. The simultaneous classifier then iterates over all test data, assigning each a label by finding the maximum predictive probability given the current classification structure of the test data as a whole. This is repeated until the classification structure converges after iterating over all data.
Value
A vector of predicted labels for test data x.
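A rough outline of the iteration described above, reusing marlab_sketch() from the tMarLab() page; here each test point is rescored against the training data of each class pooled with the other test points currently assigned to that class. This is an illustrative sketch under those assumptions, not the package's exact algorithm.
## Rough sketch of the simultaneous labelling loop for 1-D data.
simlab_sketch <- function(classes, train.x, train.y, x) {
  labels <- marlab_sketch(classes, x)            # start from the marginal labels
  for (it in 1:100) {                            # iterate until the labelling stops changing
    old <- labels
    for (i in seq_along(x)) {
      score <- sapply(names(classes), function(cl) {
        ## training data of class cl pooled with the other test points currently labelled cl
        pool <- c(train.x[train.y == cl], x[-i][labels[-i] == cl])
        m.c  <- length(pool)
        m.ci <- sum(pool == x[i])
        if (m.ci == 0) classes[[cl]]$psi / (m.c + classes[[cl]]$psi)
        else m.ci / (m.c + classes[[cl]]$psi)
      })
      labels[i] <- names(classes)[which.max(score)]
    }
    if (identical(labels, old)) break
  }
  labels
}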
References
Amiryousefi A. Asymptotic supervised predictive classifiers under partition exchangeability. 2021. https://arxiv.org/abs/2101.10950.
Corander, J., Cui, Y., Koski, T., and Siren, J.: Have I seen you before? Principles of Bayesian predictive classification revisited. Springer, Stat. Comput. 23, (2011), 59–73, (<doi: 10.1007/s11222-011-9291-7>).
Examples
## Create random samples x from Poisson-Dirichlet distributions with different
## psis, treating each sample as coming from a class of its own:
set.seed(111)
x1<-rPD(1050,10)
x2<-rPD(1050,1000)
test.ind1<-sample.int(1050,50) # Sample test datasets from the
test.ind2<-sample.int(1050,50) # original samples
x<-c(x1[-test.ind1],x2[-test.ind2])
## create training data labels:
y1<-rep("1", 1000)
y2<-rep("2", 1000)
y<-c(y1,y2)
## Test data t, with the first half belonging to class "1" and the second half to class "2":
t1<-x1[test.ind1]
t2<-x2[test.ind2]
t<-c(t1,t2)
fit<-classifier.fit(x,y)
## Run the classifier, which returns the predicted labels:
tS<-tSimLab(fit, t)
##With multidimensional x:
set.seed(111)
x1<-cbind(rPD(500,1),rPD(500,5))
x2<-cbind(rPD(500,10),rPD(500,50))
test.ind1<-sample.int(500,50)
test.ind2<-sample.int(500,50)
x<-rbind(x1[-test.ind1,],x2[-test.ind2,])
y1<-rep("1", 450)
y2<-rep("2", 450)
y<-c(y1,y2)
fit<-classifier.fit(x,y)
t1<-x1[test.ind1,]
t2<-x2[test.ind2,]
t<-rbind(t1,t2)
tS<-tSimLab(fit, t)
Two sample test for \psi
Description
Likelihood ratio test for the hypotheses H_0: \psi_1=\psi_2 and H_1: \psi_1 \neq \psi_2, where \psi_1 and \psi_2 are the dispersal parameters of the two input samples s1 and s2.
Usage
two.sample.test(s1, s2)
Arguments
s1, s2 |
The two data vectors to be tested. |
Details
Calculates the Likelihood Ratio Test statistic
-2 log( L(\hat{\psi}) / L(\hat{\psi}_1, \hat{\psi}_2) ),
where L is the likelihood function of observing the two input samples, given a single \psi in the numerator and two different parameters \psi_1 and \psi_2, one for each sample, in the denominator. According to the theory of Likelihood Ratio Tests, this statistic converges in distribution to a \chi_d^2-distribution under the null hypothesis, where d is the difference in the number of parameters between the considered models, which is 1 here. To calculate the statistic, the Maximum Likelihood Estimates \psi_1, \psi_2 of H_1 and the shared \psi of H_0 are calculated.
Value
Gives a vector with the Likelihood Ratio Test statistic Lambda, as well as the p-value of the test, p.
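The same construction as in mult.sample.test(), specialised to two samples; a brief sketch using the psi-dependent part of the Ewens sampling formula log-likelihood (the numerical optimisation and function name are illustrative assumptions).
## Sketch: two-sample likelihood ratio test for a common psi.
two_test_sketch <- function(s1, s2) {
  ll <- function(psi, s) {                       # log-likelihood up to a constant
    length(unique(s)) * log(psi) - sum(log(psi + seq_along(s) - 1))
  }
  ll1 <- optimize(ll, c(1e-8, 1e8), s = s1, maximum = TRUE)$objective +
         optimize(ll, c(1e-8, 1e8), s = s2, maximum = TRUE)$objective   # H1: separate psi
  ll0 <- optimize(function(p) ll(p, s1) + ll(p, s2),
                  c(1e-8, 1e8), maximum = TRUE)$objective               # H0: shared psi
  Lambda <- -2 * (ll0 - ll1)
  c(Lambda = Lambda, p = pchisq(Lambda, df = 1, lower.tail = FALSE))
}
# set.seed(111); two_test_sketch(rPD(500, 15), rPD(1000, 20))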
References
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical Or Physical Character, 231(694-706), 289-337. <doi: 10.1098/rsta.1933.0009>.
Examples
##Create samples with different n and psi:
set.seed(111)
x<-rPD(500, 15)
y<-rPD(1000, 20)
z<-rPD(800, 30)
##Run tests
two.sample.test(x,y)
two.sample.test(x,z)
two.sample.test(y,z)