Type: | Package |
Title: | Goodness-of-Fit Tests using Kernelized Stein Discrepancy |
Version: | 1.0.1 |
Date: | 2021-01-11 |
Description: | An adaptation of Kernelized Stein Discrepancy, this package provides a goodness-of-fit test of whether a given i.i.d. sample is drawn from a given distribution. It works for any distribution as long as its score function (the derivative of the log-density) can be provided. This method is based on "A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation" by Liu, Lee, and Jordan, available at <doi:10.48550/arXiv.1602.03253>. |
License: | MIT + file LICENSE |
LazyData: | TRUE |
RoxygenNote: | 7.1.1 |
Imports: | pryr, graphics, stats |
Suggests: | datasets, ggplot2, gridExtra, mclust, mvtnorm |
NeedsCompilation: | no |
Packaged: | 2021-01-11 07:33:36 UTC; danielkang |
Author: | Min Hyung Kang [aut, cre], Qiang Liu [aut] |
Maintainer: | Min Hyung Kang <Minhyung.Daniel.Kang@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2021-01-11 08:50:16 UTC |
Estimate Kernelized Stein Discrepancy (KSD)
Description
Estimate the kernelized Stein discrepancy (KSD) using U-statistics, and use the bootstrap to test H0: x_i is drawn from p(X) (via KSD = 0).
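To make the description above concrete, here is a minimal base-R sketch of a U-statistic estimate and a bootstrap p-value for one-dimensional data with an RBF kernel. It is not the package's implementation: the helper names (stein_kernel_matrix, ksd_u_stat, bootstrap_pvalue), the fixed bandwidth h = 1, and the centred multinomial bootstrap weights are all assumptions made for illustration, in the spirit of the paper cited in the package Description.

```r
# Illustrative helpers (hypothetical names, not part of the package).

# Stein kernel matrix u[i, j] for 1-d data, an RBF kernel with bandwidth h,
# and a score function score(x) = d/dx log p(x).
stein_kernel_matrix <- function(x, score, h) {
  n <- length(x)
  s_i <- matrix(score(x), n, n)   # s_i[i, j] = score(x_i)
  s_j <- t(s_i)                   # s_j[i, j] = score(x_j)
  d <- outer(x, x, "-")           # d[i, j] = x_i - x_j
  k <- exp(-d^2 / (2 * h^2))      # RBF kernel values
  # score-score term + two score-gradient terms + mixed second derivative
  k * (s_i * s_j + (s_i - s_j) * d / h^2 + 1 / h^2 - d^2 / h^4)
}

# U-statistic: average of u[i, j] over all pairs with i != j.
ksd_u_stat <- function(u) {
  n <- nrow(u)
  (sum(u) - sum(diag(u))) / (n * (n - 1))
}

# Bootstrap p-value using centred multinomial weights (an assumption here).
bootstrap_pvalue <- function(u, nboot = 1000) {
  n <- nrow(u)
  diag(u) <- 0
  stat <- sum(u) / (n * (n - 1))
  boot <- replicate(nboot, {
    w <- as.vector(stats::rmultinom(1, size = n, prob = rep(1 / n, n))) / n - 1 / n
    sum(outer(w, w) * u)
  })
  mean(boot >= stat)              # share of bootstrap values at least as large
}

set.seed(1)
score_std_normal <- function(x) -x          # score of N(0, 1)
x_good <- rnorm(200)                        # drawn from the null N(0, 1)
x_bad <- rnorm(200, mean = 2)               # drawn from a shifted distribution
u_good <- stein_kernel_matrix(x_good, score_std_normal, h = 1)
u_bad <- stein_kernel_matrix(x_bad, score_std_normal, h = 1)
ksd_good <- ksd_u_stat(u_good)
ksd_bad <- ksd_u_stat(u_bad)
p_bad <- bootstrap_pvalue(u_bad, nboot = 500)
```

Under these assumptions the estimate for the shifted sample (ksd_bad) is far larger than the one for the null sample (ksd_good), and the p-value for the shifted sample is small.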
Usage
KSD(x, score_function, kernel = "rbf", width = -1, nboot = 1000)
Arguments
x |
Sample of size Num_Instance x Num_Dimension |
score_function |
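Score function of the target distribution p, i.e. the derivative of log p(x) with respect to x. As in the Examples, either a function that computes the score at x or a precomputed score evaluated at x |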
kernel |
Type of kernel (default = 'rbf') |
width |
Bandwidth of the kernel. When width = -1 or 'median', the bandwidth is set to the median distance between data points |
nboot |
Bootstrap sample size |
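The median-distance rule mentioned for width can be sketched in base R as follows. The helper name median_distance is hypothetical, and whether the package applies any further scaling (for example to squared distances) is not stated here, so treat this as one plausible reading rather than the package's exact rule.

```r
# Hypothetical helper: median Euclidean distance between all pairs of rows.
median_distance <- function(x) {
  x <- as.matrix(x)                 # a vector becomes an n by 1 matrix
  stats::median(stats::dist(x))     # dist() returns the n(n-1)/2 pairwise distances
}

median_distance(c(0, 1, 3))         # pairwise distances are 1, 3, 2, so the median is 2
```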
Value
A list containing the following elements:
"ksd" : Estimated Kernelized Stein Discrepancy (KSD)
"p" : p-value for rejecting the null hypothesis that ksd = 0
"bootstrapSamples" : the bootstrap samples
"info" : other information, including bandwidth, M, nboot, ksd_V
Examples
# Pass in a dataset generated from a Gaussian mixture model,
# use pryr package to pass in score function
model <- gmm()
X <- rgmm(model, n = 100)
score_function <- pryr::partial(scorefunctiongmm, model = model)
result <- KSD(X, score_function = score_function)
# Pass in a dataset generated from a Gaussian mixture model,
# pass in computed score rather than score function
model <- gmm()
X <- rgmm(model, n = 100)
score_function <- scorefunctiongmm(model = model, X = X)
result <- KSD(X, score_function = score_function)
# Pass in a dataset generated from a Gaussian mixture model,
# use pryr package to pass in score function
# Use median_heuristic by specifying width to be -2.0
model <- gmm()
X <- rgmm(model, n = 100)
score_function <- pryr::partial(scorefunctiongmm, model = model)
result <- KSD(X, score_function = score_function, kernel = 'rbf', width = -2.0)
Tests 1-dimensional Gaussian Mixture Models.
Description
Tests 1-dimensional Gaussian Mixture Models.
Usage
demo_gmm()
Tests multidimensional Gaussian Mixture Models.
Description
Tests multidimensional Gaussian Mixture Models.
Usage
demo_gmm_multi()
Fits Gaussian Mixture model and computes the KSD value for the model
Description
We fit a Gaussian Mixture Model to a given dataset (Fisher's Iris) and compute the KSD p-value on the held-out test data. Users may tune the parameters and observe how the results change. The function reports the average of the p-values obtained across the k folds. If only 2 dimensions of the data are used, it also plots the contour for each fold. If a vector is specified for nClust, the code tries each element as the number of clusters and reports the one with the highest p-value as optimal.
Usage
demo_iris(cols = c(1, 2), nClust = 3, kfold = 5)
Arguments
cols |
: Columns of the iris data set to use. If 2 columns are given, the contour is plotted for each fold. |
nClust |
: Number of clusters to estimate. If a vector, each element is tried as the number of clusters and the optimal number is reported. |
kfold |
: Number of folds to use for k-fold cross-validation |
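The k-fold averaging and the "highest p-value wins" selection described above can be sketched in base R. Everything here is illustrative: kfold_average and held_out_fraction are hypothetical names, the per-fold statistic is a placeholder standing in for "fit a GMM on the training folds and return the KSD p-value on the held-out fold", and the candidate scores are toy numbers, not results.

```r
# Hypothetical helper: average a per-fold statistic over k folds.
# stat_fn(train, test) stands in for the real per-fold p-value computation.
kfold_average <- function(x, kfold, stat_fn) {
  fold <- sample(rep(seq_len(kfold), length.out = nrow(x)))  # random fold labels
  per_fold <- vapply(seq_len(kfold), function(f) {
    stat_fn(x[fold != f, , drop = FALSE], x[fold == f, , drop = FALSE])
  }, numeric(1))
  mean(per_fold)
}

set.seed(1)
x <- as.matrix(datasets::iris[, c(1, 2)])   # first two columns, as in cols = c(1, 2)
# Placeholder statistic: fraction of the data held out in each fold.
held_out_fraction <- function(train, test) nrow(test) / (nrow(train) + nrow(test))
avg <- kfold_average(x, kfold = 5, stat_fn = held_out_fraction)

# Choosing among candidate cluster counts by the highest average value.
candidates <- c(2, 3, 4)
toy_scores <- c(0.10, 0.40, 0.25)           # toy numbers, for illustration only
best <- candidates[which.max(toy_scores)]
```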
Shows how the KSD p-value changes with respect to variation in noise
Description
We generate a dataset from a standard normal distribution, add varying Gaussian noise to this dataset, and observe the change in p-values.
Usage
demo_normal_performance()
Tests 1-dimensional Gamma Distribution with customized parameters
Description
We generate a dataset from a gamma distribution with the given parameters and add Gaussian noise to it. We then compute the score of each dataset under the original true distribution.
Usage
demo_simple_gamma(
trueshape = 10,
truescale = 3,
noisemu = 5,
noisesd = 2,
n = 100
)
Arguments
trueshape |
shape of true gamma distribution |
truescale |
scale of true gamma distribution |
noisemu |
mean of Gaussian noise to add |
noisesd |
standard deviation of Gaussian noise to add |
n |
number of samples to generate |
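For reference, the score of a gamma density has a simple closed form, sketched below in base R. The helper name score_gamma is hypothetical, and adding the noise elementwise to the clean sample is an assumption about how the demo perturbs the data.

```r
# Hypothetical helper: score of a Gamma(shape, scale) density at x > 0.
# log p(x) = (shape - 1) * log(x) - x / scale + const, so
# d/dx log p(x) = (shape - 1) / x - 1 / scale.
score_gamma <- function(x, shape, scale) (shape - 1) / x - 1 / scale

set.seed(1)
n <- 100
x_true <- rgamma(n, shape = 10, scale = 3)        # sample from the true distribution
x_noisy <- x_true + rnorm(n, mean = 5, sd = 2)    # same sample with Gaussian noise added
# Both datasets are scored under the original (true) gamma distribution.
s_true <- score_gamma(x_true, shape = 10, scale = 3)
s_noisy <- score_gamma(x_noisy, shape = 10, scale = 3)
```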
Tests 1-dimensional Gaussian Distribution with customized parameters
Description
We generate a dataset from a Gaussian distribution with the given parameters and add Gaussian noise to it. We then compute the score of each dataset under the original true distribution.
Usage
demo_simple_gaussian(truemu = 5, truesd = 1, noisemu = 0, noisesd = 2, n = 100)
Arguments
truemu |
mean of true distribution |
truesd |
standard deviation of true distribution |
noisemu |
mean of Gaussian noise to add |
noisesd |
standard deviation of Gaussian noise to add |
n |
number of samples to generate |
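The Gaussian score used in this kind of demo is linear in x, as the base-R sketch below shows. The helper name score_gaussian is hypothetical, and adding the noise elementwise to the clean sample is an assumption about how the demo perturbs the data.

```r
# Hypothetical helper: score of a N(mu, sd^2) density.
# log p(x) = -(x - mu)^2 / (2 * sd^2) + const, so d/dx log p(x) = -(x - mu) / sd^2.
score_gaussian <- function(x, mu, sd) -(x - mu) / sd^2

set.seed(1)
n <- 100
x_true <- rnorm(n, mean = 5, sd = 1)              # sample from the true distribution
x_noisy <- x_true + rnorm(n, mean = 0, sd = 2)    # same sample with Gaussian noise added
# Both datasets are scored under the original (true) Gaussian.
s_true <- score_gaussian(x_true, mu = 5, sd = 1)
s_noisy <- score_gaussian(x_noisy, mu = 5, sd = 1)
```

Because the noisy sample is more spread out around the true mean, its scores tend to be larger in magnitude.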
Returns a Gaussian Mixture Model
Description
Returns a Gaussian Mixture Model
Usage
gmm(nComp = NULL, mu = NULL, sigma = NULL, weights = NULL, d = NULL)
Arguments
nComp |
(scalar) : number of components |
mu |
(d by k): mean of each component |
sigma |
(d by d by k): covariance of each component |
weights |
(1 by k) : mixing weight of each component (optional) |
d |
: number of dimensions of vector (optional) |
Value
model : A Gaussian Mixture Model generated from the given parameters
Examples
# Default 1-d gaussian mixture model
model <- gmm()
# 1-d Gaussian mixture model with 3 components
model <- gmm(nComp = 3)
# 3-d Gaussian mixture model with 3 components, with specified mu,sigma and weights
mu <- matrix(c(1,2,3,2,3,4,5,6,7),ncol=3)
sigma <- array(diag(3),c(3,3,3))
model <- gmm(nComp = 3, mu = mu, sigma=sigma, weights = c(0.2,0.4,0.4), d = 3)
Calculates the likelihood for a given dataset for a GMM
Description
Calculates the likelihood for a given dataset for a GMM
Usage
likelihoodgmm(model = NULL, X = NULL)
Arguments
model |
: The Gaussian Mixture Model |
X |
(n by d): The dataset of interest, where n is the number of samples and d is the dimension |
Value
P (n by k) : The likelihood of each data point under each of the k components
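As a rough illustration of an n by k likelihood matrix, the base-R sketch below evaluates each component density of a 1-d mixture at each data point. The helper name component_likelihood and its arguments are hypothetical, and the sketch leaves out the mixing weights; whether the package's values include them is not stated here.

```r
# Hypothetical helper: P[i, k] = density of component k evaluated at x[i] (1-d case).
component_likelihood <- function(x, mu, sd) {
  outer(x, seq_along(mu),
        function(xi, k) stats::dnorm(xi, mean = mu[k], sd = sd[k]))
}

x <- c(-2, 0, 2)
P <- component_likelihood(x, mu = c(-2, 2), sd = c(1, 1))
dim(P)   # 3 rows (data points) by 2 columns (components)
```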
Examples
# compute likelihood for a default 1-d gaussian mixture model
# and dataset generated from it
model <- gmm()
X <- rgmm(model)
p <- likelihoodgmm(model=model, X=X)
Returns a perturbed model of given GMM
Description
Returns a perturbed model of given GMM
Usage
perturbgmm(model = NULL)
Arguments
model |
: The base Gaussian Mixture Model |
Value
perturbedModel : Perturbed model with added noise to the supplied GMM
Examples
#Add noise to default 1-d gaussian mixture model
model <- gmm()
noisymodel <- perturbgmm(model)
Plots histogram for 1-d GMM given the dataset
Description
Plots histogram for 1-d GMM given the dataset
Usage
plotgmm(data, mu = NULL)
Arguments
data |
(n by 1): The dataset of interest, where n is the number of samples. |
mu |
: True mean of the GMM (optional) |
Examples
# Plot pdf histogram for a given dataset
model <- gmm()
X <- rgmm(model)
plotgmm(data=X)
# Plot pdf histogram for a given dataset, with lines that indicate the mean
model <- gmm()
mu <- model$mu
X <- rgmm(model)
plotgmm(data=X, mu=mu)
Calculates the posterior probability for a given dataset for a GMM
Description
Calculates the posterior probability for a given dataset for a GMM
Usage
posteriorgmm(model = NULL, X = NULL)
Arguments
model |
: The Gaussian Mixture Model |
X |
(n by d): The dataset of interest, where n is the number of samples and d is the dimension |
Value
P (n by k) : The posterior probability of each data point belonging to each of the k components
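The posterior in a mixture model follows from Bayes' rule: weight each component density by its mixing weight, then normalise each row. A minimal base-R sketch for the 1-d case is given below; the helper name posterior_1d and its arguments are hypothetical and do not mirror the package's model object.

```r
# Hypothetical helper: posterior probability of each component for each point (1-d case).
posterior_1d <- function(x, mu, sd, w) {
  # joint[i, k] = w[k] * density of component k at x[i]
  joint <- outer(x, seq_along(mu),
                 function(xi, k) w[k] * stats::dnorm(xi, mean = mu[k], sd = sd[k]))
  joint / rowSums(joint)          # normalise so that each row sums to 1
}

post <- posterior_1d(c(-2, 0, 2), mu = c(-2, 2), sd = c(1, 1), w = c(0.5, 0.5))
```

With these symmetric components, the point at 0 is split evenly between the two components, while the points at -2 and 2 are assigned almost entirely to the nearer one.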
Examples
# compute posterior probability for a default 1-d gaussian mixture model
# and dataset generated from it
model <- gmm()
X <- rgmm(model)
p <- posteriorgmm(model=model, X=X)
Generates dataset from Gaussian Mixture Model
Description
Generates dataset from Gaussian Mixture Model
Usage
rgmm(model = NULL, n = 100)
Arguments
model |
: Gaussian Mixture Model defined by gmm() |
n |
: number of samples desired |
Value
data (n by d): Random dataset generated from the given Gaussian Mixture Model
Note
Requires library mvtnorm
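Sampling from a mixture can be sketched in two steps: draw a component label for each sample according to the mixing weights, then draw from that component. The base-R version below covers only the 1-d case with rnorm, so it does not need mvtnorm; the helper name sample_gmm_1d and its arguments are hypothetical.

```r
# Hypothetical helper: draw n samples from a 1-d Gaussian mixture.
sample_gmm_1d <- function(n, mu, sd, w) {
  comp <- sample(seq_along(mu), size = n, replace = TRUE, prob = w)  # component labels
  matrix(stats::rnorm(n, mean = mu[comp], sd = sd[comp]), ncol = 1)  # n by 1 dataset
}

set.seed(1)
X <- sample_gmm_1d(2000, mu = c(-10, 10), sd = c(1, 1), w = c(0.3, 0.7))
mean(X > 0)   # share of samples from the second component, expected to be near 0.7
```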
Examples
#Generate 100 samples from default gaussian mixture model
model <- gmm()
X <- rgmm(model)
#Generate 300 samples from 3-d gaussian mixture model
model <- gmm(d=3)
X <- rgmm(model,n=300)
Score function for given GMM : calculates score function dlogp(x)/dx for a given Gaussian Mixture Model
Description
Score function for given GMM : calculates score function dlogp(x)/dx for a given Gaussian Mixture Model
Usage
scorefunctiongmm(model = NULL, X = NULL)
Arguments
model |
: The Gaussian Mixture Model |
X |
(n by d): The dataset of interest, where n is the number of samples and d is the dimension |
Value
y : The score computed by the given function
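For a 1-d mixture, the score dlogp(x)/dx is a density-weighted average of the component scores. The base-R sketch below shows that formula; the helper name gmm_score_1d and its arguments are hypothetical and do not reflect how the package stores its model.

```r
# Hypothetical helper: score of a 1-d Gaussian mixture.
# p(x) = sum_k w[k] * N(x | mu[k], sd[k]^2), so
# d/dx log p(x) = sum_k w[k] * N_k(x) * (-(x - mu[k]) / sd[k]^2) / p(x).
gmm_score_1d <- function(x, mu, sd, w) {
  dens <- outer(x, seq_along(mu),
                function(xi, k) w[k] * stats::dnorm(xi, mean = mu[k], sd = sd[k]))
  grad <- outer(x, seq_along(mu),
                function(xi, k) -(xi - mu[k]) / sd[k]^2)
  rowSums(dens * grad) / rowSums(dens)
}

s <- gmm_score_1d(c(-1, 0.5, 3), mu = c(-2, 2), sd = c(1, 1.5), w = c(0.4, 0.6))
```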
Examples
# Compute score for a given Gaussian mixture model and dataset
model <- gmm()
X <- rgmm(model)
score <- scorefunctiongmm(model=model, X=X)