Type: Package
Title: Bayesian Nonparametrics for Automatic Gating of Flow-Cytometry Data
Version: 0.13.5
Date: 2024-01-13
LinkingTo: Rcpp, RcppArmadillo
Depends: R (≥ 3.1), Rcpp (≥ 0.12.11), truncnorm
Imports: graphics, stats, grDevices, ellipse, fastcluster, ggplot2, pheatmap, reshape2, GGally
Suggests: foreach, parallel, doParallel, itertools, microbenchmark
Description: Dirichlet process mixtures of multivariate normal, skew-normal or skew t-distributions, oriented towards flow-cytometry data preprocessing applications. The method is detailed in Hejblum, Alkhassim, Gottardo, Caron & Thiebaut (2019) <doi:10.1214/18-AOAS1209>.
License: LGPL-3 | file LICENSE
BugReports: https://github.com/sistm/NPflow/issues
Encoding: UTF-8
RoxygenNote: 7.3.0
NeedsCompilation: yes
Packaged: 2024-01-13 08:40:49 UTC; boris
Author: Boris P Hejblum [aut, cre], Chariff Alkhassim [aut], Francois Caron [aut]
Maintainer: Boris P Hejblum <boris.hejblum@u-bordeaux.fr>
Repository: CRAN
Date/Publication: 2024-01-13 10:00:02 UTC
Bayesian Nonparametrics for Automatic Gating of Flow-Cytometry Data
Description
Dirichlet process mixtures of multivariate normal, skew-normal or skew t-distributions, oriented towards flow-cytometry data pre-processing applications.
Details
Package: NPflow
Type: Package
Version: 0.13.5
Date: 2024-01-13
License: LGPL-3
The main function in this package is DPMpost.
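A minimal sketch of a typical call (the simulated data and parameter values below are illustrative only; see the DPMpost help page and its Examples for complete settings):
library(NPflow)
# simulate a small 2-dimensional Gaussian toy dataset (columns are observations)
z <- cbind(matrix(rnorm(200, mean=-1), nrow=2), matrix(rnorm(200, mean=2), nrow=2))
# prior hyperparameters for the cluster locations and scales (same structure as in the Examples)
hyperG0 <- list("mu"=rep(0, 2), "kappa"=0.001, "nu"=4, "lambda"=diag(2))
# short Gaussian DPM run (N kept deliberately small here)
res <- DPMpost(data=z, hyperG0=hyperG0, N=100, distrib="gaussian", doPlot=FALSE)
s <- summary(res, burnin=50)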
Author(s)
Boris P. Hejblum, Chariff Alkhassim, Francois Caron — Maintainer: Boris P. Hejblum
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019). Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1):638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
See Also
Useful links:
Report bugs at https://github.com/sistm/NPflow/issues
Slice Sampling of the Dirichlet Process Mixture Model with a prior on alpha
Description
Slice Sampling of the Dirichlet Process Mixture Model with a prior on alpha
Usage
DPMGibbsN(
z,
hyperG0,
a = 1e-04,
b = 1e-04,
N,
doPlot = TRUE,
nbclust_init = 30,
plotevery = N/10,
diagVar = TRUE,
use_variance_hyperprior = TRUE,
verbose = TRUE,
...
)
Arguments
z: data matrix.
hyperG0: prior mixing distribution.
a: shape hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
b: scale hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
N: number of MCMC iterations.
doPlot: logical flag indicating whether to plot the MCMC iterations or not. Default is TRUE.
nbclust_init: number of clusters at initialization. Default is 30 (or less if there are fewer than 30 observations).
plotevery: an integer indicating the interval between plotted iterations when doPlot is TRUE. Default is N/10.
diagVar: logical flag indicating whether the variance of each cluster is estimated as a diagonal matrix, or as a full matrix. Default is TRUE (diagonal).
use_variance_hyperprior: logical flag indicating whether a hyperprior is added for the variance parameter. Default is TRUE.
verbose: logical flag indicating whether partition info is written in the console at each MCMC iteration.
...: additional arguments to be passed on to the plotting function (e.g. gg.add, as used in the Examples).
Value
an object of class DPMclust with the following attributes:
mcmc_partitions: a list of length N containing the partition of the observations at each MCMC iteration.
alpha: a vector of length N containing the sampled values of the concentration parameter at each MCMC iteration.
listU_mu: a list of length N containing the mean parameters of the mixture components at each MCMC iteration.
listU_Sigma: a list of length N containing the covariance parameters of the mixture components at each MCMC iteration.
U_SS_list: a list of length N containing the sufficient statistics of the mixture components at each MCMC iteration.
weights_list: a list of length N containing the mixture weights at each MCMC iteration.
logposterior_list: a list of length N containing the log-posterior values at each MCMC iteration.
data: the data matrix.
nb_mcmcit: the number of MCMC iterations.
clust_distrib: the parametric distribution of the mixture components ("gaussian").
hyperG0: the prior on the cluster location.
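A sketch of how these attributes are typically accessed (assuming MCMCsample is the object returned by DPMGibbsN, as in the Examples below; the burn-in value is purely illustrative):
# partition sampled at the last MCMC iteration
last_partition <- MCMCsample$mcmc_partitions[[MCMCsample$nb_mcmcit]]
# trace of the concentration parameter alpha
alpha_chain <- MCMCsample$alpha
# posterior summary and point estimate of the partition, discarding a burn-in
s <- summary(MCMCsample, burnin=100)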
Author(s)
Boris Hejblum
Examples
rm(list=ls())
#Number of data
n <- 500
d <- 4
#n <- 2000
set.seed(1234)
#set.seed(123)
#set.seed(4321)
# Sample data
m <- matrix(nrow=d, ncol=4, c(-1, 1, 1.5, 2, 2, -2, -1.5, -2))
p <- c(0.2, 0.1, 0.4, 0.3) # cluster frequencies
sdev <- array(dim=c(d,d,4))
sdev[, ,1] <- 0.3*diag(d)
sdev[, ,2] <- c(0.1, 0.3)*diag(d)
sdev[, ,3] <- matrix(nrow=d, ncol=d, 0.15)
diag(sdev[, ,3]) <- 0.3
sdev[, ,4] <- 0.3*diag(d)
c <- rep(0,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
# Set parameters of G0
hyperG0 <- list()
hyperG0[["mu"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["nu"]] <- d+2
hyperG0[["lambda"]] <- diag(d)/10
# hyperprior on the Scale parameter of DPM
a <- 0.0001
b <- 0.0001
# Number of iterations
N <- 30
# do some plots
doPlot <- TRUE
nbclust_init <- 30
## Data
########
library(ggplot2)
p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
+ geom_point()
+ ggtitle("Toy example Data"))
p
## alpha priors plots
#####################
prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b),
"distribution" =factor(rep("prior",5000),
levels=c("prior", "posterior")))
p <- (ggplot(prioralpha, aes(x=alpha))
+ geom_histogram(aes(y=..density..),
colour="black", fill="white", bins=30)
+ geom_density(alpha=.6, fill="red", color=NA)
+ ggtitle(paste("Prior distribution on alpha: Gamma(", a,
",", b, ")\n", sep=""))
+ theme_bw()
)
p
if(interactive()){
# Gibbs sampler for Dirichlet Process Mixtures
##############################################
MCMCsample <- DPMGibbsN(z, hyperG0, a, b, N=500, doPlot, nbclust_init, plotevery=100,
gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))),
diagVar=FALSE)
plot_ConvDPM(MCMCsample, from=2)
s <- summary(MCMCsample, burnin = 200, thin=2, posterior_approx=FALSE,
lossFn = "MBinderN")
F <- FmeasureC(pred=s$point_estim$c_est, ref=c)
postalpha <- data.frame("alpha"=MCMCsample$alpha[50:500],
"distribution" = factor(rep("posterior",500-49),
levels=c("prior", "posterior")))
p <- (ggplot(postalpha, aes(x=alpha))
+ geom_histogram(aes(y=..density..), binwidth=.1,
colour="black", fill="white")
+ geom_density(alpha=.2, fill="blue")
+ ggtitle("Posterior distribution of alpha\n")
# Ignore NA values for mean
# Overlay with transparent density plot
+ geom_vline(aes(xintercept=mean(alpha, na.rm=TRUE)),
color="red", linetype="dashed", size=1)
)
p
p <- (ggplot(drop=FALSE, alpha=.6)
+ geom_density(aes(x=alpha, fill=distribution),
color=NA, alpha=.6,
data=prioralpha)
#+ geom_density(aes(x=alpha, fill=distribution),
# color=NA, alpha=.6,
# data=postalpha)
+ ggtitle("Prior and posterior distributions of alpha\n")
+ scale_fill_discrete(drop=FALSE)
+ theme_bw()
+xlim(0,10)
+ylim(0, 1.3)
)
p
}
# k-means comparison
####################
plot(x=z[1,], y=z[2,], col=kmeans(t(z), centers=4)$cluster,
xlab = "d = 1", ylab= "d = 2", main="k-means with K=4 clusters")
KM <- kmeans(t(z), centers=4)
dataKM <- data.frame("X"=z[1,], "Y"=z[2,],
"Cluster"=as.character(KM$cluster))
dataCenters <- data.frame("X"=KM$centers[,1],
"Y"=KM$centers[,2],
"Cluster"=rownames(KM$centers))
p <- (ggplot(dataKM)
+ geom_point(aes(x=X, y=Y, col=Cluster))
+ geom_point(aes(x=X, y=Y, fill=Cluster, order=Cluster),
data=dataCenters, shape=22, size=5)
+ scale_colour_discrete(name="Cluster")
+ ggtitle("K-means with K=4 clusters\n"))
p
Slice Sampling of Dirichlet Process Mixture of Gaussian distributions
Description
Slice Sampling of Dirichlet Process Mixture of Gaussian distributions
Usage
DPMGibbsN_SeqPrior(
z,
prior_inform,
hyperG0,
N,
nbclust_init,
add.vagueprior = TRUE,
weightnoninfo = NULL,
doPlot = TRUE,
plotevery = N/10,
diagVar = TRUE,
verbose = TRUE,
...
)
Arguments
z: data matrix.
prior_inform: an informative prior, such as the posterior approximation computed by summary.DPMMclust (passed as s$param_posterior in the Examples).
hyperG0: a non-informative prior component for the mixing distribution. Only used if add.vagueprior is TRUE.
N: number of MCMC iterations.
nbclust_init: number of clusters at initialization. Default is 30 (or less if there are fewer than 30 observations).
add.vagueprior: logical flag indicating whether a non-informative component should be added to the informative prior. Default is TRUE.
weightnoninfo: a real between 0 and 1 giving the weight of the non-informative component in the prior.
doPlot: logical flag indicating whether to plot the MCMC iterations or not. Default is TRUE.
plotevery: an integer indicating the interval between plotted iterations when doPlot is TRUE. Default is N/10.
diagVar: logical flag indicating whether the variance of each cluster is estimated as a diagonal matrix, or as a full matrix. Default is TRUE (diagonal).
verbose: logical flag indicating whether partition info is written in the console at each MCMC iteration.
...: additional arguments to be passed on to the plotting function (e.g. gg.add, as used in the Examples).
Value
an object of class DPMclust with the following attributes:
mcmc_partitions: a list of length N containing the partition of the observations at each MCMC iteration.
alpha: a vector of length N containing the sampled values of the concentration parameter at each MCMC iteration.
listU_mu: a list of length N containing the mean parameters of the mixture components at each MCMC iteration.
listU_Sigma: a list of length N containing the covariance parameters of the mixture components at each MCMC iteration.
U_SS_list: a list of length N containing the sufficient statistics of the mixture components at each MCMC iteration.
weights_list: a list of length N containing the mixture weights at each MCMC iteration.
logposterior_list: a list of length N containing the log-posterior values at each MCMC iteration.
data: the data matrix.
nb_mcmcit: the number of MCMC iterations.
clust_distrib: the parametric distribution of the mixture components ("gaussian").
hyperG0: the prior on the cluster location.
Author(s)
Boris Hejblum, Chariff Alkhassim
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019). Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1):638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
See Also
postProcess.DPMMclust
DPMGibbsN
Examples
rm(list=ls())
library(NPflow)
#Number of data
n <- 1500
# Sample data
#m <- matrix(nrow=2, ncol=4, c(-1, 1, 1.5, 2, 2, -2, 0.5, -2))
m <- matrix(nrow=2, ncol=4, c(-.8, .7, .5, .7, .5, -.7, -.5, -.7))
p <- c(0.2, 0.1, 0.4, 0.3) # cluster frequencies
sdev <- array(dim=c(2,2,4))
sdev[, ,1] <- matrix(nrow=2, ncol=2, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=2, ncol=2, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=2, ncol=2, c(0.3, 0.15, 0.15, 0.3))
sdev[, ,4] <- .3*diag(2)
c <- rep(0,n)
z <- matrix(0, nrow=2, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(2, mean = 0, sd = 1), nrow=2, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
d<-2
# Set parameters of G0
hyperG0 <- list()
hyperG0[["mu"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["nu"]] <- d+2
hyperG0[["lambda"]] <- diag(d)/10
# hyperprior on the Scale parameter of DPM
a <- 0.0001
b <- 0.0001
# Number of iterations
N <- 30
# do some plots
doPlot <- TRUE
nbclust_init <- 20
## Data
########
library(ggplot2)
p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
+ geom_point()
+ ggtitle("Toy example Data"))
p
if(interactive()){
# Gibbs sampler for Dirichlet Process Mixtures
##############################################
MCMCsample <- DPMGibbsN(z, hyperG0, a, b, N=1500, doPlot, nbclust_init, plotevery=200,
gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))),
diagVar=FALSE)
s <- summary(MCMCsample, posterior_approx=TRUE, burnin = 1000, thin=5)
F1 <- FmeasureC(pred=s$point_estim$c_est, ref=c)
F1
MCMCsample2 <- DPMGibbsN_SeqPrior(z, prior_inform=s$param_posterior,
hyperG0, N=1500,
add.vagueprior = TRUE,
doPlot=TRUE, plotevery=100,
nbclust_init=nbclust_init,
gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))),
diagVar=FALSE)
s2 <- summary(MCMCsample2, burnin = 500, thin=5)
F2 <- FmeasureC(pred=s2$point_estim$c_est, ref=c)
F2
}
Slice Sampling of the Dirichlet Process Mixture Model with a prior on alpha
Description
Slice Sampling of the Dirichlet Process Mixture Model with a prior on alpha
Usage
DPMGibbsN_parallel(
Ncpus,
type_connec,
z,
hyperG0,
a = 1e-04,
b = 1e-04,
N,
doPlot = TRUE,
nbclust_init = 30,
plotevery = N/10,
diagVar = TRUE,
use_variance_hyperprior = TRUE,
verbose = TRUE,
monitorfile = "",
...
)
Arguments
Ncpus: the number of processors available.
type_connec: the type of connection between the processors. Supported cluster types are "SOCK" and "FORK".
z: data matrix.
hyperG0: prior mixing distribution.
a: shape hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
b: scale hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
N: number of MCMC iterations.
doPlot: logical flag indicating whether to plot the MCMC iterations or not. Default is TRUE.
nbclust_init: number of clusters at initialization. Default is 30 (or less if there are fewer than 30 observations).
plotevery: an integer indicating the interval between plotted iterations when doPlot is TRUE. Default is N/10.
diagVar: logical flag indicating whether the variance of each cluster is estimated as a diagonal matrix, or as a full matrix. Default is TRUE (diagonal).
use_variance_hyperprior: logical flag indicating whether a hyperprior is added for the variance parameter. Default is TRUE.
verbose: logical flag indicating whether partition info is written in the console at each MCMC iteration.
monitorfile: a writable connection or a character string naming a file to write into, to monitor the progress of the analysis. Default is "" (no monitoring).
...: additional arguments to be passed on to the plotting function (e.g. gg.add, as used in the Examples).
Value
an object of class DPMclust with the following attributes:
mcmc_partitions: a list of length N containing the partition of the observations at each MCMC iteration.
alpha: a vector of length N containing the sampled values of the concentration parameter at each MCMC iteration.
listU_mu: a list of length N containing the mean parameters of the mixture components at each MCMC iteration.
listU_Sigma: a list of length N containing the covariance parameters of the mixture components at each MCMC iteration.
U_SS_list: a list of length N containing the sufficient statistics of the mixture components at each MCMC iteration.
weights_list: a list of length N containing the mixture weights at each MCMC iteration.
logposterior_list: a list of length N containing the log-posterior values at each MCMC iteration.
data: the data matrix.
nb_mcmcit: the number of MCMC iterations.
clust_distrib: the parametric distribution of the mixture components ("gaussian").
hyperG0: the prior on the cluster location.
Author(s)
Boris Hejblum
See Also
Examples
# Scaling up: ----
rm(list=ls())
#Number of data
n <- 2000
set.seed(1234)
# Sample data
d <- 3
nclust <- 5
m <- matrix(nrow=d, ncol=nclust, runif(d*nclust)*8)
# p: cluster probabilities
p <- runif(nclust)
p <- p/sum(p)
# Covariance matrix of the clusters
sdev <- array(dim=c(d, d, nclust))
for (j in 1:nclust){
sdev[, ,j] <- matrix(NA, nrow=d, ncol=d)
diag(sdev[, ,j]) <- abs(rnorm(n=d, mean=0.3, sd=0.1))
sdev[, ,j][lower.tri(sdev[, ,j], diag = FALSE)] <- rnorm(n=d*(d-1)/2,
mean=0, sd=0.05)
sdev[, ,j][upper.tri(sdev[, ,j], diag = FALSE)] <- (sdev[, ,j][
lower.tri(sdev[, ,j], diag = FALSE)])
}
c <- rep(0,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
# hyperprior on the Scale parameter of DPM
a <- 0.001
b <- 0.001
# Number of iterations
N <- 25
# do some plots
doPlot <- TRUE
# Set parameters of G0
hyperG0 <- list()
hyperG0[["mu"]] <- rep(0, d)
hyperG0[["kappa"]] <- 0.01
hyperG0[["nu"]] <- d + 2
hyperG0[["lambda"]] <- diag(d)/10
nbclust_init <- 30
if(interactive()){
library(doParallel)
MCMCsample <- DPMGibbsN_parallel(Ncpus=2, type_connec="FORK", z, hyperG0, a, b,
N=1000, doPlot=FALSE, nbclust_init=30,
plotevery=100, gg.add=list(ggplot2::theme_bw(),
ggplot2::guides(shape =
ggplot2::guide_legend(override.aes = list(fill="grey45")))),
diagVar=FALSE)
}
Slice Sampling of Dirichlet Process Mixture of skew normal distributions
Description
Slice Sampling of Dirichlet Process Mixture of skew normal distributions
Usage
DPMGibbsSkewN(
z,
hyperG0,
a = 1e-04,
b = 1e-04,
N,
doPlot = TRUE,
nbclust_init = 30,
plotevery = N/10,
diagVar = TRUE,
use_variance_hyperprior = TRUE,
verbose = TRUE,
...
)
Arguments
z: data matrix.
hyperG0: prior mixing distribution.
a: shape hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
b: scale hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
N: number of MCMC iterations.
doPlot: logical flag indicating whether to plot the MCMC iterations or not. Default is TRUE.
nbclust_init: number of clusters at initialization. Default is 30 (or less if there are fewer than 30 observations).
plotevery: an integer indicating the interval between plotted iterations when doPlot is TRUE. Default is N/10.
diagVar: logical flag indicating whether the variance of each cluster is a diagonal matrix. Default is TRUE.
use_variance_hyperprior: logical flag indicating whether a hyperprior is added for the variance parameter. Default is TRUE.
verbose: logical flag indicating whether partition info is written in the console at each MCMC iteration.
...: additional arguments to be passed on to the plotting function (e.g. gg.add, as used in the Examples).
Value
an object of class DPMclust with the following attributes:
mcmc_partitions: a list of length N containing the partition of the observations at each MCMC iteration.
alpha: a vector of length N containing the sampled values of the concentration parameter at each MCMC iteration.
U_SS_list: a list of length N containing the sufficient statistics of the mixture components at each MCMC iteration.
weights_list: a list of length N containing the mixture weights at each MCMC iteration.
logposterior_list: a list of length N containing the log-posterior values at each MCMC iteration.
data: the data matrix.
nb_mcmcit: the number of MCMC iterations.
clust_distrib: the parametric distribution of the mixture components ("skewnorm").
hyperG0: the prior on the cluster location.
Author(s)
Boris Hejblum
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019). Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1):638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
Examples
rm(list=ls())
#Number of data
n <- 1000
set.seed(123)
d <- 2
ncl <- 4
# Sample data
sdev <- array(dim=c(d,d,ncl))
#xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2))
xi <- matrix(nrow=d, ncol=ncl, c(-0.5, 0, 0.5, 0, 0.5, -1, -1, 1))
##xi <- matrix(nrow=d, ncol=ncl, c(-0.3, 0, 0.5, 0.5, 0.5, -1.2, -1, 1))
psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -1.2))
p <- c(0.2, 0.1, 0.4, 0.3) # cluster frequencies
sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3))
sdev[, ,4] <- .3*diag(2)
c <- rep(0,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
z[,k] <- xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0,
sd = 1), nrow=d, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
# Set parameters of G0
hyperG0 <- list()
hyperG0[["b_xi"]] <- rep(0,d)
hyperG0[["b_psi"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.0001
hyperG0[["D_xi"]] <- 100
hyperG0[["D_psi"]] <- 100
hyperG0[["nu"]] <- d + 1
hyperG0[["lambda"]] <- diag(d)
# hyperprior on the Scale parameter of DPM
a <- 0.0001
b <- 0.0001
# do some plots
doPlot <- TRUE
nbclust_init <- 30
## Data
########
library(ggplot2)
p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
+ geom_point()
+ ggtitle("Simple example in 2d data")
+xlab("D1")
+ylab("D2")
+theme_bw())
p
c2plot <- factor(c)
levels(c2plot) <- c("3", "2", "4", "1")
pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot)))
+ geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster))
+ ggtitle("Slightly overlapping skew-normal simulation\n")
+ xlab("D1")
+ ylab("D2")
+ theme_bw()
+ scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22))))
pp
## alpha priors plots
#####################
prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b),
"distribution" =factor(rep("prior",5000),
levels=c("prior", "posterior")))
p <- (ggplot(prioralpha, aes(x=alpha))
+ geom_histogram(aes(y=..density..),
colour="black", fill="white")
+ geom_density(alpha=.2, fill="red")
+ ggtitle(paste("Prior distribution on alpha: Gamma(", a,
",", b, ")\n", sep=""))
)
p
if(interactive()){
# Gibbs sampler for Dirichlet Process Mixtures
##############################################
MCMCsample_sn <- DPMGibbsSkewN(z, hyperG0, a, b, N=2500,
doPlot, nbclust_init, plotevery=200,
gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))),
diagVar=FALSE)
s <- summary(MCMCsample_sn, burnin = 2000, thin=10)
#cluster_est_binder(MCMCsample_sn$mcmc_partitions[1000:1500])
print(s)
plot(s)
#plot_ConvDPM(MCMCsample_sn, from=2)
# k-means
plot(x=z[1,], y=z[2,], col=kmeans(t(z), centers=4)$cluster,
xlab = "d = 1", ylab= "d = 2", main="k-means with K=4 clusters")
KM <- kmeans(t(z), centers=4)
KMclust <- factor(KM$cluster)
levels(KMclust) <- c("2", "4", "1", "3")
dataKM <- data.frame("X"=z[1,], "Y"=z[2,],
"Cluster"=as.character(KMclust))
dataCenters <- data.frame("X"=KM$centers[,1],
"Y"=KM$centers[,2],
"Cluster"=c("2", "4", "1", "3"))
p <- (ggplot(dataKM)
+ geom_point(aes(x=X, y=Y, col=Cluster))
+ geom_point(aes(x=X, y=Y, fill=Cluster, order=Cluster),
data=dataCenters, shape=22, size=5)
+ scale_colour_discrete(name="Cluster",
guide=guide_legend(override.aes=list(size=6, shape=22)))
+ ggtitle("K-means with K=4 clusters\n")
+ theme_bw()
)
p
postalpha <- data.frame("alpha"=MCMCsample_sn$alpha[501:1000],
"distribution" = factor(rep("posterior",1000-500),
levels=c("prior", "posterior")))
postK <- data.frame("K"=sapply(lapply(postalpha$alpha, "["),
function(x){sum(x/(x+0:(1000-1)))}))
p <- (ggplot(postalpha, aes(x=alpha))
+ geom_histogram(aes(y=..density..), binwidth=.1,
colour="black", fill="white")
+ geom_density(alpha=.2, fill="blue")
+ ggtitle("Posterior distribution of alpha\n")
# Ignore NA values for mean
# Overlay with transparent density plot
+ geom_vline(aes(xintercept=mean(alpha, na.rm=T)),
color="red", linetype="dashed", size=1)
)
p
p <- (ggplot(postK, aes(x=K))
+ geom_histogram(aes(y=..density..),
colour="black", fill="white")
+ geom_density(alpha=.2, fill="blue")
+ ggtitle("Posterior distribution of predicted K\n")
# Ignore NA values for mean
# Overlay with transparent density plot
+ geom_vline(aes(xintercept=mean(K, na.rm=T)),
color="red", linetype="dashed", size=1)
#+ scale_x_continuous(breaks=c(0:6)*2, minor_breaks=c(0:6)*2+1)
+ scale_x_continuous(breaks=c(1:12))
)
p
p <- (ggplot(drop=FALSE, alpha=.6)
+ geom_density(aes(x=alpha, fill=distribution),
color=NA, alpha=.6,
data=postalpha)
+ geom_density(aes(x=alpha, fill=distribution),
color=NA, alpha=.6,
data=prioralpha)
+ ggtitle("Prior and posterior distributions of alpha\n")
+ scale_fill_discrete(drop=FALSE)
+ theme_bw()
+ xlim(0,100)
)
p
#Skew Normal
n=100000
xi <- 0
d <- 0.995
alpha <- d/sqrt(1-d^2)
z <- rtruncnorm(n,a=0, b=Inf)
e <- rnorm(n, mean = 0, sd = 1)
x <- d*z + sqrt(1-d^2)*e
o <- 1
y <- xi+o*x
nu=1.3
w <- rgamma(n,scale=nu/2, shape=nu/2)
yy <- xi+o*x/w
snd <- data.frame("Y"=y,"YY"=yy)
p <- (ggplot(snd)+geom_density(aes(x=Y), fill="blue", alpha=.2)
+ theme_bw()
+ ylab("Density")
+ ggtitle("Y~SN(0,1,10)\n")
+ xlim(-1,6)
+ ylim(0,0.8)
)
p
p <- (ggplot(snd)+geom_density(aes(x=YY), fill="blue", alpha=.2)
+ theme_bw()
+ ylab("Density")
+ ggtitle("Y~ST(0,1,10,1.3)\n")
+ xlim(-2,40)
+ ylim(0,0.8)
)
p
p <- (ggplot(snd)
+ geom_density(aes(x=Y, fill="blue"), alpha=.2)
+ geom_density(aes(x=YY, fill="red"), alpha=.2)
+ theme_bw()
+theme(legend.text = element_text(size = 13), legend.position="bottom")
+ ylab("Density")
+ xlim(-2,40)
+ ylim(0,0.8)
+ scale_fill_manual(name="", labels=c("Y~SN(0,1,10) ", "Y~ST(0,1,10,1.3)"),
guide="legend", values=c("blue", "red"))
)
p
#Variations
n=100000
xi <- -1
d <- 0.995
alpha <- d/sqrt(1-d^2)
z <- rtruncnorm(n,a=0, b=Inf)
e <- rnorm(n, mean = 0, sd = 1)
x <- d*z + sqrt(1-d^2)*e
snd <- data.frame("X"=x)
p <- (ggplot(snd)+geom_density(aes(x=X), fill="blue", alpha=.2)
+ theme_bw()
+ ylab("Density")
+ ggtitle("X~SN(10)\n")
+ xlim(-1.5,4)
+ ylim(0,1.6)
)
p
o <- 0.5
y <- xi+o*x
snd <- data.frame("Y"=y)
p <- (ggplot(snd)+geom_density(aes(x=Y), fill="blue", alpha=.2)
+ theme_bw()
+ ylab("Density")
+ ggtitle("Y~SN(-1,1,10)\n")
+ xlim(-1.5,4)
+ ylim(0,1.6)
)
p
#Simple toy example
###################
n <- 500
set.seed(12345)
d <- 2
ncl <- 4
# Sample data
sdev <- array(dim=c(d,d,ncl))
xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2))
psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -1.2))
p <- c(0.2, 0.1, 0.4, 0.3) # cluster frequencies
sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3))
sdev[, ,4] <- .3*diag(2)
# Set parameters of G0
hyperG0 <- list()
hyperG0[["b_xi"]] <- rep(0,d)
hyperG0[["b_psi"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.0001
hyperG0[["D_xi"]] <- 100
hyperG0[["D_psi"]] <- 100
hyperG0[["nu"]] <- d + 1
hyperG0[["lambda"]] <- diag(d)
c <- rep(0,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
z[,k] <- xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0,
sd = 1), nrow=d, ncol=1)
cat(k, "/", n, " observations simulated\n", sep="")
}
MCMCsample_sn_sep <- DPMGibbsSkewN(z, hyperG0, a, b, N=600,
doPlot, nbclust_init, plotevery=100,
gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))),
diagVar=TRUE)
s <- summary(MCMCsample_sn_sep, burnin = 400)
}
Parallel Implementation of Slice Sampling of Dirichlet Process Mixture of skew normal distributions
Description
If the monitorfile argument is a character string naming a file that does not yet exist, the file is created. One line is written to it at each MCMC iteration.
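For example, a minimal sketch of a monitored run (assuming z and hyperG0 are defined as in the Examples below; "progress.txt" is just an illustrative file name):
if(interactive()){
  # one progress line is appended to "progress.txt" at each MCMC iteration
  res <- DPMGibbsSkewN_parallel(Ncpus=2, type_connec="SOCK", z, hyperG0,
                                a=0.0001, b=0.0001, N=1000,
                                monitorfile="progress.txt")
}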
Usage
DPMGibbsSkewN_parallel(
Ncpus,
type_connec,
z,
hyperG0,
a = 1e-04,
b = 1e-04,
N,
doPlot = FALSE,
nbclust_init = 30,
plotevery = N/10,
diagVar = TRUE,
use_variance_hyperprior = TRUE,
verbose = FALSE,
monitorfile = "",
...
)
Arguments
Ncpus: the number of processors available.
type_connec: the type of connection between the processors. Supported cluster types are "SOCK" and "FORK".
z: data matrix.
hyperG0: prior mixing distribution.
a: shape hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
b: scale hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
N: number of MCMC iterations.
doPlot: logical flag indicating whether to plot the MCMC iterations or not. Default is FALSE.
nbclust_init: number of clusters at initialization. Default is 30 (or less if there are fewer than 30 observations).
plotevery: an integer indicating the interval between plotted iterations when doPlot is TRUE. Default is N/10.
diagVar: logical flag indicating whether the variance of each cluster is estimated as a diagonal matrix, or as a full matrix. Default is TRUE (diagonal).
use_variance_hyperprior: logical flag indicating whether a hyperprior is added for the variance parameter. Default is TRUE.
verbose: logical flag indicating whether partition info is written in the console at each MCMC iteration.
monitorfile: a writable connection or a character string naming a file to write into, to monitor the progress of the analysis. Default is "" (no monitoring).
...: additional arguments to be passed on to the plotting function (e.g. gg.add, as used in the Examples).
Value
an object of class DPMclust with the following attributes:
mcmc_partitions: a list of length N containing the partition of the observations at each MCMC iteration.
alpha: a vector of length N containing the sampled values of the concentration parameter at each MCMC iteration.
U_SS_list: a list of length N containing the sufficient statistics of the mixture components at each MCMC iteration.
weights_list: a list of length N containing the mixture weights at each MCMC iteration.
logposterior_list: a list of length N containing the log-posterior values at each MCMC iteration.
data: the data matrix.
nb_mcmcit: the number of MCMC iterations.
clust_distrib: the parametric distribution of the mixture components ("skewnorm").
hyperG0: the prior on the cluster location.
Author(s)
Boris Hejblum
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019). Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1):638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
Examples
rm(list=ls())
#Number of data
n <- 2000
set.seed(1234)
d <- 4
ncl <- 5
# Sample data
sdev <- array(dim=c(d,d,ncl))
xi <- matrix(nrow=d, ncol=ncl, c(runif(n=d*ncl,min=0,max=3)))
psi <- matrix(nrow=d, ncol=ncl, c(runif(n=d*ncl,min=-1,max=1)))
p <- runif(n=ncl)
p <- p/sum(p)
sdev0 <- diag(runif(n=d, min=0.05, max=0.6))
for (j in 1:ncl){
sdev[, ,j] <- invwishrnd(n = d+2, lambda = sdev0)
}
c <- rep(0,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
z[,k] <- xi[, c[k]] + psi[, c[k]]*abs(rnorm(1)) + sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0,
sd = 1), nrow=d, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
# Set parameters of G0
hyperG0 <- list()
hyperG0[["b_xi"]] <- rep(0,d)
hyperG0[["b_psi"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["D_xi"]] <- 100
hyperG0[["D_psi"]] <- 100
hyperG0[["nu"]] <- d + 1
hyperG0[["lambda"]] <- diag(d)/10
# hyperprior on the Scale parameter of DPM
a <- 0.0001
b <- 0.0001
# do some plots
doPlot <- TRUE
nbclust_init <- 30
z <- z*200
## Data
########
library(ggplot2)
p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
+ geom_point()
+ ggtitle("Simple example in 2d data")
+xlab("D1")
+ylab("D2")
+theme_bw())
p
## alpha priors plots
#####################
prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b),
"distribution" =factor(rep("prior",5000),
levels=c("prior", "posterior")))
p <- (ggplot(prioralpha, aes(x=alpha))
+ geom_histogram(aes(y=..density..),
colour="black", fill="white")
+ geom_density(alpha=.2, fill="red")
+ ggtitle(paste("Prior distribution on alpha: Gamma(", a,
",", b, ")\n", sep=""))
)
p
# Gibbs sampler for Dirichlet Process Mixtures
##############################################
if(interactive()){
MCMCsample_sn_par <- DPMGibbsSkewN_parallel(Ncpus=parallel::detectCores()-1,
type_connec="SOCK", z, hyperG0,
a, b, N=5000, doPlot, nbclust_init,
plotevery=25, gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))))
plot_ConvDPM(MCMCsample_sn_par, from=2)
}
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
Description
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
Usage
DPMGibbsSkewT(
z,
hyperG0,
a = 1e-04,
b = 1e-04,
N,
doPlot = TRUE,
nbclust_init = 30,
plotevery = N/10,
diagVar = TRUE,
use_variance_hyperprior = TRUE,
verbose = TRUE,
...
)
Arguments
z: data matrix.
hyperG0: parameters of the prior mixing distribution, provided in a list (see the Examples for the expected elements).
a: shape hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
b: scale hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
N: number of MCMC iterations.
doPlot: logical flag indicating whether to plot the MCMC iterations or not. Default is TRUE.
nbclust_init: number of clusters at initialization. Default is 30 (or less if there are fewer than 30 observations).
plotevery: an integer indicating the interval between plotted iterations when doPlot is TRUE. Default is N/10.
diagVar: logical flag indicating whether the variance of each cluster is estimated as a diagonal matrix, or as a full matrix. Default is TRUE (diagonal).
use_variance_hyperprior: logical flag indicating whether a hyperprior is added for the variance parameter. Default is TRUE.
verbose: logical flag indicating whether partition info is written in the console at each MCMC iteration.
...: additional arguments to be passed on to the plotting function (e.g. gg.add, as used in the Examples).
Value
an object of class DPMclust with the following attributes:
mcmc_partitions: a list of length N containing the partition of the observations at each MCMC iteration.
alpha: a vector of length N containing the sampled values of the concentration parameter at each MCMC iteration.
U_SS_list: a list of length N containing the sufficient statistics of the mixture components at each MCMC iteration.
weights_list: a list of length N containing the mixture weights at each MCMC iteration.
logposterior_list: a list of length N containing the log-posterior values at each MCMC iteration.
data: the data matrix.
nb_mcmcit: the number of MCMC iterations.
clust_distrib: the parametric distribution of the mixture components ("skewt").
hyperG0: the prior on the cluster location.
Author(s)
Boris Hejblum
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019). Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1):638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
Fruhwirth-Schnatter S, Pyne S, Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions, Biostatistics, 2010.
Examples
rm(list=ls())
#Number of data
n <- 2000
set.seed(4321)
d <- 2
ncl <- 4
# Sample data
library(truncnorm)
sdev <- array(dim=c(d,d,ncl))
#xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3))
#xi <- matrix(nrow=d, ncol=ncl, c(-0.5, 0, 0.5, 0, 0.5, -1, -1, 1))
xi <- matrix(nrow=d, ncol=ncl, c(-0.2, 0.5, 2.4, 0.4, 0.6, -1.3, -0.9, -2.7))
psi <- matrix(nrow=d, ncol=4, c(0.3, -0.7, -0.8, 0, 0.3, -0.7, 0.2, 0.9))
nu <- c(100,25,8,5)
p <- c(0.15, 0.05, 0.5, 0.3) # cluster frequencies
sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3))
sdev[, ,4] <- .3*diag(2)
c <- rep(0,n)
w <- rep(1,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2)
z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) +
(sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
# Set parameters of G0
hyperG0 <- list()
hyperG0[["b_xi"]] <- rowMeans(z)
hyperG0[["b_psi"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["D_xi"]] <- 100
hyperG0[["D_psi"]] <- 100
hyperG0[["nu"]] <- d+1
hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3
# hyperprior on the Scale parameter of DPM
a <- 0.0001
b <- 0.0001
## Data
########
library(ggplot2)
p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
+ geom_point()
#+ ggtitle("Simple example in 2d data")
+xlab("D1")
+ylab("D2")
+theme_bw())
p #pdf(height=8.5, width=8.5)
c2plot <- factor(c)
levels(c2plot) <- c("4", "1", "3", "2")
pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot)))
+ geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster))
#+ ggtitle("Slightly overlapping skew-normal simulation\n")
+ xlab("D1")
+ ylab("D2")
+ theme_bw()
+ scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22))))
pp #pdf(height=7, width=7.5)
## alpha priors plots
#####################
prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b),
"distribution" =factor(rep("prior",5000),
levels=c("prior", "posterior")))
p <- (ggplot(prioralpha, aes(x=alpha))
+ geom_histogram(aes(y=..density..),
colour="black", fill="white")
+ geom_density(alpha=.2, fill="red")
+ ggtitle(paste("Prior distribution on alpha: Gamma(", a,
",", b, ")\n", sep=""))
)
p
if(interactive()){
# Gibbs sampler for Dirichlet Process Mixtures
##############################################
MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=1500,
doPlot=TRUE, nbclust_init=30, plotevery=100,
diagVar=FALSE)
s <- summary(MCMCsample_st, burnin = 1000, thin=10, lossFn = "Binder")
print(s)
plot(s, hm=TRUE) #pdf(height=8.5, width=10.5) #png(height=700, width=720)
plot_ConvDPM(MCMCsample_st, from=2)
#cluster_est_binder(MCMCsample_st$mcmc_partitions[900:1000])
}
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
Description
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
Usage
DPMGibbsSkewT_SeqPrior(
z,
prior_inform,
hyperG0,
N,
nbclust_init,
add.vagueprior = TRUE,
weightnoninfo = NULL,
doPlot = TRUE,
plotevery = N/10,
diagVar = TRUE,
verbose = TRUE,
...
)
Arguments
z: data matrix.
prior_inform: an informative prior, such as the posterior approximation computed by summary.DPMMclust (passed as s$param_posterior in the Examples).
hyperG0: prior mixing distribution.
N: number of MCMC iterations.
nbclust_init: number of clusters at initialization. Default is 30 (or less if there are fewer than 30 observations).
add.vagueprior: logical flag indicating whether a non-informative component should be added to the informative prior. Default is TRUE.
weightnoninfo: a real between 0 and 1 giving the weight of the non-informative component in the prior.
doPlot: logical flag indicating whether to plot the MCMC iterations or not. Default is TRUE.
plotevery: an integer indicating the interval between plotted iterations when doPlot is TRUE. Default is N/10.
diagVar: logical flag indicating whether the variance of each cluster is estimated as a diagonal matrix, or as a full matrix. Default is TRUE (diagonal).
verbose: logical flag indicating whether partition info is written in the console at each MCMC iteration.
...: additional arguments to be passed on to the plotting function (e.g. gg.add, as used in the Examples).
Value
an object of class DPMclust with the following attributes:
mcmc_partitions: a list of length N containing the partition of the observations at each MCMC iteration.
alpha: a vector of length N containing the sampled values of the concentration parameter at each MCMC iteration.
U_SS_list: a list of length N containing the sufficient statistics of the mixture components at each MCMC iteration.
weights_list: a list of length N containing the mixture weights at each MCMC iteration.
logposterior_list: a list of length N containing the log-posterior values at each MCMC iteration.
data: the data matrix.
nb_mcmcit: the number of MCMC iterations.
clust_distrib: the parametric distribution of the mixture components ("skewt").
hyperG0: the prior on the cluster location.
Author(s)
Boris Hejblum
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019). Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1):638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
Examples
rm(list=ls())
#Number of data
n <- 2000
set.seed(123)
d <- 2
ncl <- 4
# Sample data
sdev <- array(dim=c(d,d,ncl))
xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2))
psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8))
nu <- c(100,15,8,5)
p <- c(0.15, 0.05, 0.5, 0.3) # cluster frequencies
sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3))
sdev[, ,4] <- .3*diag(2)
c <- rep(0,n)
w <- rep(1,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2)
z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) +
(sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
# Set parameters of G0
hyperG0 <- list()
hyperG0[["b_xi"]] <- rowMeans(z)
hyperG0[["b_psi"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["D_xi"]] <- 100
hyperG0[["D_psi"]] <- 100
hyperG0[["nu"]] <- d+1
hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3
# hyperprior on the Scale parameter of DPM
a <- 0.0001
b <- 0.0001
# do some plots
nbclust_init <- 30
## Plot Data
library(ggplot2)
q <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
+ geom_point()
+ ggtitle("Simple example in 2d data")
+xlab("D1")
+ylab("D2")
+theme_bw())
q
if(interactive()){
MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000,
doPlot=TRUE, plotevery=250,
nbclust_init, diagVar=FALSE,
gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))))
s <- summary(MCMCsample_st, burnin = 1500, thin=2, posterior_approx=TRUE)
F <- FmeasureC(pred=s$point_estim$c_est, ref=c)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2)
z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) +
(sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)
cat(k, "/", n, " observations simulated\n", sep="")
}
MCMCsample_st2 <- DPMGibbsSkewT_SeqPrior(z, prior_inform=s$param_posterior,
hyperG0, N=2000,
doPlot=TRUE, plotevery=100,
nbclust_init, diagVar=FALSE,
gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))))
s2 <- summary(MCMCsample_st2, burnin = 1500, thin=5)
F2 <- FmeasureC(pred=s2$point_estim$c_est, ref=c)
# MCMCsample_st2_par <- DPMGibbsSkewT_SeqPrior_parallel(Ncpus= 2, type_connec="SOCK",
# z, prior_inform=s$param_posterior,
# hyperG0, N=2000,
# doPlot=TRUE, plotevery=50,
# nbclust_init, diagVar=FALSE,
# gg.add=list(theme_bw(),
# guides(shape=guide_legend(override.aes = list(fill="grey45"))))
}
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
Description
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
Usage
DPMGibbsSkewT_SeqPrior_parallel(
Ncpus,
type_connec,
z,
prior_inform,
hyperG0,
N,
nbclust_init,
add.vagueprior = TRUE,
weightnoninfo = NULL,
doPlot = FALSE,
plotevery = N/10,
diagVar = TRUE,
verbose = TRUE,
monitorfile = "",
...
)
Arguments
Ncpus: the number of processors available.
type_connec: the type of connection between the processors. Supported cluster types are "SOCK" and "FORK".
z: data matrix.
prior_inform: an informative prior, such as the posterior approximation computed by summary.DPMMclust (passed as s$param_posterior in the Examples).
hyperG0: prior mixing distribution.
N: number of MCMC iterations.
nbclust_init: number of clusters at initialization. Default is 30 (or less if there are fewer than 30 observations).
add.vagueprior: logical flag indicating whether a non-informative component should be added to the informative prior. Default is TRUE.
weightnoninfo: a real between 0 and 1 giving the weight of the non-informative component in the prior.
doPlot: logical flag indicating whether to plot the MCMC iterations or not. Default is FALSE.
plotevery: an integer indicating the interval between plotted iterations when doPlot is TRUE. Default is N/10.
diagVar: logical flag indicating whether the variance of each cluster is estimated as a diagonal matrix, or as a full matrix. Default is TRUE (diagonal).
verbose: logical flag indicating whether partition info is written in the console at each MCMC iteration.
monitorfile: a writable connection or a character string naming a file to write into, to monitor the progress of the analysis. Default is "" (no monitoring).
...: additional arguments to be passed on to the plotting function (e.g. gg.add, as used in the Examples).
Value
an object of class DPMclust with the following attributes:
mcmc_partitions: a list of length N containing the partition of the observations at each MCMC iteration.
alpha: a vector of length N containing the sampled values of the concentration parameter at each MCMC iteration.
U_SS_list: a list of length N containing the sufficient statistics of the mixture components at each MCMC iteration.
weights_list: a list of length N containing the mixture weights at each MCMC iteration.
logposterior_list: a list of length N containing the log-posterior values at each MCMC iteration.
data: the data matrix.
nb_mcmcit: the number of MCMC iterations.
clust_distrib: the parametric distribution of the mixture components ("skewt").
hyperG0: the prior on the cluster location.
Author(s)
Boris Hejblum
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019). Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1):638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
Examples
rm(list=ls())
#Number of data
n <- 2000
set.seed(123)
d <- 2
ncl <- 4
# Sample data
sdev <- array(dim=c(d,d,ncl))
#xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3))
#psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8))
xi <- matrix(nrow=d, ncol=ncl, c(-0.2, 0.5, 2.4, 0.4, 0.6, -1.3, -0.9, -2.7))
psi <- matrix(nrow=d, ncol=4, c(0.3, -0.7, -0.8, 0, 0.3, -0.7, 0.2, 0.9))
nu <- c(100,15,8,5)
p <- c(0.15, 0.05, 0.5, 0.3) # cluster frequencies
sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3))
sdev[, ,4] <- .3*diag(2)
c <- rep(0,n)
w <- rep(1,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2)
z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) +
(sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
# Set parameters of G0
hyperG0 <- list()
hyperG0[["b_xi"]] <- rowMeans(z)
hyperG0[["b_psi"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["D_xi"]] <- 100
hyperG0[["D_psi"]] <- 100
hyperG0[["nu"]] <- d+1
hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3
# hyperprior on the Scale parameter of DPM
a <- 0.0001
b <- 0.0001
# do some plots
nbclust_init <- 30
## Plot Data
library(ggplot2)
q <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
+ geom_point()
+ ggtitle("Simple example in 2d data")
+xlab("D1")
+ylab("D2")
+theme_bw())
q
if(interactive()){
MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000,
doPlot=TRUE, plotevery=250,
nbclust_init,
gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))),
diagVar=FALSE)
s <- summary(MCMCsample_st, burnin = 1500, thin=5, posterior_approx=TRUE)
F <- FmeasureC(pred=s$point_estim$c_est, ref=c)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2)
z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) +
(sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
MCMCsample_st2 <- DPMGibbsSkewT_SeqPrior_parallel(Ncpus=2, type_connec="SOCK",
z, prior_inform=s$param_posterior,
hyperG0, N=3000,
doPlot=TRUE, plotevery=100,
nbclust_init, diagVar=FALSE, verbose=FALSE,
gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))))
s2 <- summary(MCMCsample_st2, burnin = 2000, thin=5)
F2 <- FmeasureC(pred=s2$point_estim$c_est, ref=c)
}
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
Description
Slice Sampling of Dirichlet Process Mixture of skew Student's t-distributions
Usage
DPMGibbsSkewT_parallel(
Ncpus,
type_connec,
z,
hyperG0,
a = 1e-04,
b = 1e-04,
N,
doPlot = FALSE,
nbclust_init = 30,
plotevery = N/10,
diagVar = TRUE,
use_variance_hyperprior = TRUE,
verbose = FALSE,
monitorfile = "",
...
)
Arguments
Ncpus: the number of processors available.
type_connec: the type of connection between the processors. Supported cluster types are "SOCK" and "FORK".
z: data matrix.
hyperG0: prior mixing distribution.
a: shape hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
b: scale hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
N: number of MCMC iterations.
doPlot: logical flag indicating whether to plot the MCMC iterations or not. Default is FALSE.
nbclust_init: number of clusters at initialization. Default is 30 (or less if there are fewer than 30 observations).
plotevery: an integer indicating the interval between plotted iterations when doPlot is TRUE. Default is N/10.
diagVar: logical flag indicating whether the variance of each cluster is estimated as a diagonal matrix, or as a full matrix. Default is TRUE (diagonal).
use_variance_hyperprior: logical flag indicating whether a hyperprior is added for the variance parameter. Default is TRUE.
verbose: logical flag indicating whether partition info is written in the console at each MCMC iteration.
monitorfile: a writable connection or a character string naming a file to write into, to monitor the progress of the analysis. Default is "" (no monitoring).
...: additional arguments to be passed on to the plotting function (e.g. gg.add, as used in the Examples).
Value
an object of class DPMclust with the following attributes:
mcmc_partitions: a list of length N containing the partition of the observations at each MCMC iteration.
alpha: a vector of length N containing the sampled values of the concentration parameter at each MCMC iteration.
U_SS_list: a list of length N containing the sufficient statistics of the mixture components at each MCMC iteration.
weights_list: a list of length N containing the mixture weights at each MCMC iteration.
logposterior_list: a list of length N containing the log-posterior values at each MCMC iteration.
data: the data matrix.
nb_mcmcit: the number of MCMC iterations.
clust_distrib: the parametric distribution of the mixture components ("skewt").
hyperG0: the prior on the cluster location.
Author(s)
Boris Hejblum
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019). Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1):638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
Examples
rm(list=ls())
#Number of data
n <- 2000
set.seed(123)
#set.seed(4321)
d <- 2
ncl <- 4
# Sample data
sdev <- array(dim=c(d,d,ncl))
xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1, 1.5, 1, 1.5, -2, -2, -2))
psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8))
p <- c(0.2, 0.1, 0.4, 0.3) # cluster frequencies
sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3))
sdev[, ,4] <- .3*diag(2)
c <- rep(0,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
z[,k] <- (xi[, c[k]]
+ psi[, c[k]]*abs(rnorm(1))
+ sdev[, , c[k]]%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1))
#cat(k, "/", n, " observations simulated\n", sep="")
}
# Set parameters of G0
hyperG0 <- list()
hyperG0[["b_xi"]] <- rep(0,d)
hyperG0[["b_psi"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["D_xi"]] <- 100
hyperG0[["D_psi"]] <- 100
hyperG0[["nu"]] <- d + 1
hyperG0[["lambda"]] <- diag(d)
# hyperprior on the Scale parameter of DPM
a <- 0.0001
b <- 0.0001
# do some plots
doPlot <- TRUE
nbclust_init <- 30
## Data
########
library(ggplot2)
p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
+ geom_point()
+ ggtitle("Simple example in 2d data")
+xlab("D1")
+ylab("D2")
+theme_bw())
p
c2plot <- factor(c)
levels(c2plot) <- c("3", "2", "4", "1")
pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot)))
+ geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster))
+ ggtitle("Slightly overlapping skew-normal simulation\n")
+ xlab("D1")
+ ylab("D2")
+ theme_bw()
+ scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22))))
pp
## alpha priors plots
#####################
prioralpha <- data.frame("alpha"=rgamma(n=5000, shape=a, scale=1/b),
"distribution" =factor(rep("prior",5000),
levels=c("prior", "posterior")))
p <- (ggplot(prioralpha, aes(x=alpha))
+ geom_histogram(aes(y=..density..),
colour="black", fill="white")
+ geom_density(alpha=.2, fill="red")
+ ggtitle(paste("Prior distribution on alpha: Gamma(", a,
",", b, ")\n", sep=""))
)
p
if(interactive()){
# Gibbs sampler for Dirichlet Process Mixtures
##############################################
MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000,
doPlot, nbclust_init, plotevery=100, gg.add=list(theme_bw(),
guides(shape=guide_legend(override.aes = list(fill="grey45")))),
diagVar=FALSE)
s <- summary(MCMCsample_st, burnin = 350)
print(s)
plot(s)
plot_ConvDPM(MCMCsample_st, from=2)
cluster_est_binder(MCMCsample_st$mcmc_partitions[1500:2000])
}
Posterior estimation for Dirichlet process mixture of multivariate (potentially skew) distributions models
Description
Partially collapse slice Gibbs sampling for Dirichlet process mixture of multivariate normal, skew normal or skew t distributions.
Usage
DPMpost(
data,
hyperG0,
a = 1e-04,
b = 1e-04,
N,
doPlot = TRUE,
nbclust_init = 30,
plotevery = floor(N/10),
diagVar = TRUE,
verbose = TRUE,
distrib = c("gaussian", "skewnorm", "skewt"),
ncores = 1,
type_connec = "SOCK",
informPrior = NULL,
...
)
Arguments
data: data matrix.
hyperG0: prior mixing distribution.
a: shape hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
b: scale hyperparameter of the Gamma prior on the concentration parameter of the Dirichlet Process. Default is 0.0001.
N: number of MCMC iterations.
doPlot: logical flag indicating whether to plot the MCMC iterations or not. Default is TRUE.
nbclust_init: number of clusters at initialization. Default is 30 (or less if there are fewer than 30 observations).
plotevery: an integer indicating the interval between plotted iterations when doPlot is TRUE. Default is floor(N/10).
diagVar: logical flag indicating whether the variance of each cluster is estimated as a diagonal matrix, or as a full matrix. Default is TRUE (diagonal).
verbose: logical flag indicating whether partition info is written in the console at each MCMC iteration.
distrib: the distribution used for the clustering. Current possibilities are "gaussian", "skewnorm" and "skewt".
ncores: number of cores to use. Default is 1.
type_connec: the type of connection between the processors when ncores is more than 1. Supported cluster types are "SOCK" and "FORK". Default is "SOCK".
informPrior: an optional informative prior, such as the approximation computed by summary.DPMMclust.
...: additional arguments to be passed on to the underlying sampling function (e.g. gg.add, as used in the Examples).
Details
This function is a wrapper around the following functions: DPMGibbsN, DPMGibbsN_parallel, DPMGibbsN_SeqPrior, DPMGibbsSkewN, DPMGibbsSkewN_parallel, DPMGibbsSkewT, DPMGibbsSkewT_parallel, DPMGibbsSkewT_SeqPrior, DPMGibbsSkewT_SeqPrior_parallel.
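For instance (a sketch, assuming a data matrix z and a matching hyperG0 as in the Examples below), a skew t model run on 2 cores would be requested as:
MCMCsample <- DPMpost(data=z, hyperG0=hyperG0, N=1000,
                      distrib="skewt", ncores=2, type_connec="SOCK")
which presumably relies on DPMGibbsSkewT_parallel internally.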
Value
an object of class DPMclust with the following attributes:
mcmc_partitions: a list of length N containing the partition of the observations at each MCMC iteration.
alpha: a vector of length N containing the sampled values of the concentration parameter at each MCMC iteration.
U_SS_list: a list of length N containing the sufficient statistics of the mixture components at each MCMC iteration.
weights_list: a list of length N containing the mixture weights at each MCMC iteration.
logposterior_list: a list of length N containing the log-posterior values at each MCMC iteration.
data: the data matrix.
nb_mcmcit: the number of MCMC iterations.
clust_distrib: the parametric distribution of the mixture components.
hyperG0: the prior on the cluster location.
Author(s)
Boris Hejblum
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019). Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1):638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
See Also
Examples
#rm(list=ls())
set.seed(123)
# Example in 2 dimensions with skew t-distributions
# Generate data:
n <- 2000 # number of data points
d <- 2 # dimensions
ncl <- 4 # number of true clusters
sdev <- array(dim=c(d,d,ncl))
xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3))
psi <- matrix(nrow=d, ncol=4, c(0.3, -0.7, -0.8, 0, 0.3, -0.7, 0.2, 0.9))
nu <- c(100,25,8,5)
proba <- c(0.15, 0.05, 0.5, 0.3) # cluster frequencies
sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.2))
sdev[, ,4] <- .3*diag(2)
c <- rep(0,n)
w <- rep(1,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=proba)!=0)
w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2)
z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) +
(sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)
}
# Define hyperprior
hyperG0 <- list()
hyperG0[["b_xi"]] <- rowMeans(z)
hyperG0[["b_psi"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["D_xi"]] <- 100
hyperG0[["D_psi"]] <- 100
hyperG0[["nu"]] <- d+1
hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3
if(interactive()){
# Plot data
cytoScatter(z)
# Estimate posterior
MCMCsample_st <- DPMpost(data=z, hyperG0=hyperG0, N=2000,
distrib="skewt",
gg.add=list(ggplot2::theme_bw(),
ggplot2::guides(shape=ggplot2::guide_legend(override.aes = list(fill="grey45"))))
)
s <- summary(MCMCsample_st, burnin = 1600, thin=5, lossFn = "Binder")
s
plot(s)
#plot(s, hm=TRUE) # this can take a few sec...
# more data plotting:
library(ggplot2)
p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
+ geom_point()
+ ggtitle("Unsupervised data")
+ xlab("D1")
+ ylab("D2")
+ theme_bw()
)
p
c2plot <- factor(c)
levels(c2plot) <- c("4", "1", "3", "2")
pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot)))
+ geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster))
+ ggtitle("True clusters")
+ xlab("D1")
+ ylab("D2")
+ theme_bw()
+ scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22)))
)
pp
}
# Example in 2 dimensions with Gaussian distributions
set.seed(1234)
# Generate data
n <- 2000 # number of data points
d <- 2 # dimensions
ncl <- 4 # number of true clusters
m <- matrix(nrow=2, ncol=4, c(-1, 1, 1.5, 2, 2, -2, -1.5, -2)) # cluster means
sdev <- array(dim=c(2, 2, 4)) # cluster standard-deviations
sdev[, ,1] <- matrix(nrow=2, ncol=2, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=2, ncol=2, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=2, ncol=2, c(0.3, 0.15, 0.15, 0.3))
sdev[, ,4] <- .3*diag(2)
proba <- c(0.15, 0.05, 0.5, 0.3) # cluster frequencies
c <- rep(0,n)
z <- matrix(0, nrow=2, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=proba)!=0)
z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(2, mean = 0, sd = 1), nrow=2, ncol=1)
}
# Define hyperprior
hyperG0 <- list()
hyperG0[["mu"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["nu"]] <- d+2
hyperG0[["lambda"]] <- diag(d)
if(interactive()){
# Plot data
cytoScatter(z)
# Estimate posterior
MCMCsample_n <- DPMpost(data=z, hyperG0=hyperG0, N=2000,
distrib="gaussian", diagVar=FALSE,
gg.add=list(ggplot2::theme_bw(),
ggplot2::guides(shape=ggplot2::guide_legend(override.aes = list(fill="grey45"))))
)
s <- summary(MCMCsample_n, burnin = 1500, thin=5, lossFn = "Binder")
s
plot(s)
#plot(s, hm=TRUE) # this can take a few sec...
# more data plotting:
library(ggplot2)
p <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,]), aes(x=X, y=Y))
+ geom_point()
+ ggtitle("Unsupervised data")
+ xlab("D1")
+ ylab("D2")
+ theme_bw()
)
p
c2plot <- factor(c)
levels(c2plot) <- c("4", "1", "3", "2")
pp <- (ggplot(data.frame("X"=z[1,], "Y"=z[2,], "Cluster"=as.character(c2plot)))
+ geom_point(aes(x=X, y=Y, colour=Cluster, fill=Cluster))
#+ ggtitle("Slightly overlapping skew-normal simulation\n")
+ xlab("D1")
+ ylab("D2")
+ theme_bw()
+ scale_colour_discrete(guide=guide_legend(override.aes = list(size = 6, shape=22)))
+ ggtitle("True clusters")
)
pp
}
Compute a limited F-measure
Description
A limited version of F-measure that only takes into account small clusters
Usage
Flimited(n_small_clst, pred, ref)
Arguments
n_small_clst |
an integer giving the size limit below which clusters are considered small |
pred |
vector of a predicted partition |
ref |
vector of a reference partition |
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
Examples
pred <- c(rep(1, 5),rep(2, 8),rep(3,10))
ref <- c(rep(1, 5),rep(c(2,3), 4),rep(c(3,2),5))
FmeasureC(pred, ref)
Flimited(6, pred, ref)
C++ implementation of the F-measure computation
Description
C++ implementation of the F-measure computation
Usage
FmeasureC(pred, ref)
Arguments
pred |
vector of a predicted partition |
ref |
vector of a reference partition |
Examples
pred <- c(1,1,2,3,2,3)
ref <- c(2,2,1,1,1,3)
FmeasureC(pred, ref)
C++ implementation of the F-measure computation without the reference class 0
Description
Following Aghaeepour et al. in FlowCAP 1, the reference class labeled "0" is ignored when computing the F-measure
Usage
FmeasureC_no0(pred, ref)
Arguments
pred |
vector of a predicted partition |
ref |
vector of a reference partition |
References
N Aghaeepour, G Finak, H Hoos, TR Mosmann, RR Brinkman, R Gottardo, RH Scheuermann, Critical assessment of automated flow cytometry data analysis techniques, Nature Methods, 10(3):228-38, 2013.
Examples
library(NPflow)
pred <- c(1,1,2,3,2,3)
ref <- c(2,2,0,0,0,3)
FmeasureC(pred, ref)
FmeasureC_no0(pred, ref)
Multiple cost computations with the F-measure as the loss function
Description
C++ implementation of multiple cost computations with the F-measure as the loss function using the Armadillo library
Usage
Fmeasure_costC(c)
Arguments
c |
a matrix where each column is one MCMC partition |
Value
a list with the following elements:
Fmeas: |
matrix of total F-measures between each pair of MCMC partitions in c |
cost: |
vector of costs (based on the F-measure) associated with each MCMC partition in c |
Examples
library(NPflow)
c <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3),c(2,2,1,1,1,1))
#Fmeasure_costC(sapply(c, "["))
if(interactive()){
c2 <- list()
for(i in 1:100){
c2 <- c(c2, list(rmultinom(n=1, size=2000, prob=rexp(n=2000))))
}
Fmeasure_costC(sapply(c2, "["))
}
EM MAP for mixture of sNiW
Description
Maximum A Posteriori (MAP) estimation of a mixture of structured Normal inverse Wishart (sNiW) distributed observations with an EM algorithm
Usage
MAP_sNiW_mmEM(
xi_list,
psi_list,
S_list,
hyperG0,
init = NULL,
K,
maxit = 100,
tol = 0.1,
doPlot = TRUE,
verbose = TRUE
)
MAP_sNiW_mmEM_weighted(
xi_list,
psi_list,
S_list,
obsweight_list,
hyperG0,
K,
maxit = 100,
tol = 0.1,
doPlot = TRUE,
verbose = TRUE
)
MAP_sNiW_mmEM_vague(
xi_list,
psi_list,
S_list,
hyperG0,
K = 10,
maxit = 100,
tol = 0.1,
doPlot = TRUE,
verbose = TRUE
)
Arguments
xi_list |
a list of length |
psi_list |
a list of length |
S_list |
a list of length |
hyperG0 |
prior mixing distribution used if |
init |
a list for initializing the algorithm with the following elements: |
K |
integer giving the number of mixture components. |
maxit |
integer giving the maximum number of iteration for the EM algorithm.
Default is |
tol |
real number giving the tolerance for the stopping of the EM algorithm.
Default is |
doPlot |
a logical flag indicating whether the algorithm progression should be plotted.
Default is |
verbose |
logical flag indicating whether algorithm progress information should be written in the console. Default is |
obsweight_list |
a list of length |
Details
MAP_sNiW_mmEM provides a MAP estimate for mixtures of structured Normal inverse Wishart distributed observations.
MAP_sNiW_mmEM_vague provides an estimate incorporating a vague component in the mixture.
MAP_sNiW_mmEM_weighted provides a weighted version of the algorithm.
Author(s)
Boris Hejblum, Chariff Alkhassim
Examples
set.seed(1234)
hyperG0 <- list()
hyperG0$b_xi <- c(0.3, -1.5)
hyperG0$b_psi <- c(0, 0)
hyperG0$kappa <- 0.001
hyperG0$D_xi <- 100
hyperG0$D_psi <- 100
hyperG0$nu <- 20
hyperG0$lambda <- diag(c(0.25,0.35))
hyperG0 <- list()
hyperG0$b_xi <- c(1, -1.5)
hyperG0$b_psi <- c(0, 0)
hyperG0$kappa <- 0.1
hyperG0$D_xi <- 1
hyperG0$D_psi <- 1
hyperG0$nu <- 2
hyperG0$lambda <- diag(c(0.25,0.35))
xi_list <- list()
psi_list <- list()
S_list <- list()
w_list <- list()
for(k in 1:200){
NNiW <- rNNiW(hyperG0, diagVar=FALSE)
xi_list[[k]] <- NNiW[["xi"]]
psi_list[[k]] <- NNiW[["psi"]]
S_list[[k]] <- NNiW[["S"]]
w_list [[k]] <- 0.75
}
hyperG02 <- list()
hyperG02$b_xi <- c(-1, 2)
hyperG02$b_psi <- c(-0.1, 0.5)
hyperG02$kappa <- 0.1
hyperG02$D_xi <- 1
hyperG02$D_psi <- 1
hyperG02$nu <- 4
hyperG02$lambda <- 0.5*diag(2)
for(k in 201:400){
NNiW <- rNNiW(hyperG02, diagVar=FALSE)
xi_list[[k]] <- NNiW[["xi"]]
psi_list[[k]] <- NNiW[["psi"]]
S_list[[k]] <- NNiW[["S"]]
w_list [[k]] <- 0.25
}
map <- MAP_sNiW_mmEM(xi_list, psi_list, S_list, hyperG0, K=2, tol=0.1)
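The example above builds a list of observation weights (w_list) that is not used further. As a rough sketch (not part of the original manual), and assuming the weighted variant accepts those weights through its obsweight_list argument as shown in the Usage section, they could be used as follows:
if(interactive()){
  # weighted MAP estimation, reusing the lists built above
  map_w <- MAP_sNiW_mmEM_weighted(xi_list, psi_list, S_list,
                                  obsweight_list = w_list,
                                  hyperG0 = hyperG0, K = 2, tol = 0.1)
}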
EM MLE for mixture of NiW
Description
Maximum likelihood estimation of mixture of Normal inverse Wishart distributed observations with an EM algorithm
Usage
MLE_NiW_mmEM(
mu_list,
S_list,
hyperG0,
K,
maxit = 100,
tol = 0.1,
doPlot = TRUE
)
Arguments
mu_list |
a list of length |
S_list |
a list of length |
hyperG0 |
prior mixing distribution used for randomly initializing the algorithm. |
K |
integer giving the number of mixture components. |
maxit |
integer giving the maximum number of iteration for the EM algorithm.
Default is |
tol |
real number giving the tolerance for the stopping of the EM algorithm.
Default is |
doPlot |
a logical flag indicating whether the algorithm progression should be plotted. Default is |
Examples
set.seed(123)
U_mu <- list()
U_Sigma <- list()
U_nu<-list()
U_kappa<-list()
d <- 2
hyperG0 <- list()
hyperG0[["mu"]] <- rep(1,d)
hyperG0[["kappa"]] <- 0.01
hyperG0[["nu"]] <- d+1
hyperG0[["lambda"]] <- diag(d)
for(k in 1:200){
NiW <- rNiW(hyperG0, diagVar=FALSE)
U_mu[[k]] <-NiW[["mu"]]
U_Sigma[[k]] <-NiW[["S"]]
}
hyperG02 <- list()
hyperG02[["mu"]] <- rep(2,d)
hyperG02[["kappa"]] <- 1
hyperG02[["nu"]] <- d+10
hyperG02[["lambda"]] <- diag(d)/10
for(k in 201:400){
NiW <- rNiW(hyperG02, diagVar=FALSE)
U_mu[[k]] <-NiW[["mu"]]
U_Sigma[[k]] <-NiW[["S"]]
}
mle <- MLE_NiW_mmEM( U_mu, U_Sigma, hyperG0, K=2)
hyperG0[["mu"]]
hyperG02[["mu"]]
mle$U_mu
hyperG0[["lambda"]]
hyperG02[["lambda"]]
mle$U_lambda
hyperG0[["nu"]]
hyperG02[["nu"]]
mle$U_nu
hyperG0[["kappa"]]
hyperG02[["kappa"]]
mle$U_kappa
MLE for Gamma distribution
Description
Maximum likelihood estimation of the distribution parameters from Gamma distributed observations
Usage
MLE_gamma(g)
Arguments
g |
a list of Gamma distributed observations. |
Examples
g_list <- list()
for(i in 1:1000){
g_list <- c(g_list, rgamma(1, shape=100, rate=5))
}
mle <- MLE_gamma(g_list)
mle
MLE for sNiW distributed observations
Description
Maximum likelihood estimation of structured Normal inverse Wishart (sNiW) distributed observations
Usage
MLE_sNiW(xi_list, psi_list, S_list, doPlot = TRUE)
Arguments
xi_list |
a list of length |
psi_list |
a list of length |
S_list |
a list of length |
doPlot |
a logical flag indicating whether the algorithm progression should be plotted.
Default is |
Author(s)
Boris Hejblum, Chariff Alkhassim
Examples
hyperG0 <- list()
hyperG0$b_xi <- c(0.3, -1.5)
hyperG0$b_psi <- c(0, 0)
hyperG0$kappa <- 0.001
hyperG0$D_xi <- 100
hyperG0$D_psi <- 100
hyperG0$nu <- 35
hyperG0$lambda <- diag(c(0.25,0.35))
xi_list <- list()
psi_list <- list()
S_list <- list()
for(k in 1:1000){
NNiW <- rNNiW(hyperG0, diagVar=FALSE)
xi_list[[k]] <- NNiW[["xi"]]
psi_list[[k]] <- NNiW[["psi"]]
S_list[[k]] <- NNiW[["S"]]
}
mle <- MLE_sNiW(xi_list, psi_list, S_list)
mle
EM MLE for mixture of sNiW
Description
Maximum likelihood estimation of a mixture of structured Normal inverse Wishart (sNiW) distributed observations with an EM algorithm
Usage
MLE_sNiW_mmEM(
xi_list,
psi_list,
S_list,
hyperG0,
K,
init = NULL,
maxit = 100,
tol = 0.1,
doPlot = TRUE,
verbose = TRUE
)
Arguments
xi_list |
a list of length |
psi_list |
a list of length |
S_list |
a list of length |
hyperG0 |
prior mixing distribution used if |
K |
integer giving the number of mixture components. |
init |
a list for initializing the algorithm with the following elements: |
maxit |
integer giving the maximum number of iteration for the EM algorithm.
Default is |
tol |
real number giving the tolerance for the stopping of the EM algorithm.
Default is |
doPlot |
a logical flag indicating whether the algorithm progression should be plotted. Default is |
verbose |
logical flag indicating whether algorithm progress information should be written in the console. Default is |
Author(s)
Boris Hejblum, Chariff Alkhassim
Examples
set.seed(1234)
hyperG0 <- list()
hyperG0$b_xi <- c(0.3, -1.5)
hyperG0$b_psi <- c(0, 0)
hyperG0$kappa <- 0.001
hyperG0$D_xi <- 100
hyperG0$D_psi <- 100
hyperG0$nu <- 3
hyperG0$lambda <- diag(c(0.25,0.35))
xi_list <- list()
psi_list <- list()
S_list <- list()
for(k in 1:200){
NNiW <- rNNiW(hyperG0, diagVar=FALSE)
xi_list[[k]] <- NNiW[["xi"]]
psi_list[[k]] <- NNiW[["psi"]]
S_list[[k]] <- NNiW[["S"]]
}
hyperG02 <- list()
hyperG02$b_xi <- c(-1, 2)
hyperG02$b_psi <- c(-0.1, 0.5)
hyperG02$kappa <- 0.001
hyperG02$D_xi <- 10
hyperG02$D_psi <- 10
hyperG02$nu <- 3
hyperG02$lambda <- 0.5*diag(2)
for(k in 201:400){
NNiW <- rNNiW(hyperG02, diagVar=FALSE)
xi_list[[k]] <- NNiW[["xi"]]
psi_list[[k]] <- NNiW[["psi"]]
S_list[[k]] <- NNiW[["S"]]
}
mle <- MLE_sNiW_mmEM(xi_list, psi_list, S_list, hyperG0, K=2)
C++ implementation of similarity matrix computation using pre-computed distances
Description
C++ implementation of similarity matrix computation using pre-computed distances
Usage
NuMatParC(c, d)
Arguments
c |
an MCMC partition of length |
d |
a symmetric |
Author(s)
Boris Hejblum, Chariff Alkhassim
Examples
c <- c(1,1,2,3,2,3)
d <- matrix(runif(length(c)^2),length(c))
NuMatParC(c,d)
Burning MCMC iterations from a Dirichlet Process Mixture Model.
Description
Utility function for burning MCMC iterations from a DPMMclust
object.
Usage
burn.DPMMclust(x, burnin = 0, thin = 1)
Arguments
x |
a |
burnin |
the number of MCMC iterations to burn (default is |
thin |
the spacing at which MCMC iterations are kept.
Default is |
Value
a DPMMclust
object minus the burnt iterations
Author(s)
Boris Hejblum
See Also
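A minimal usage sketch (not part of the original manual), assuming MCMCsample_st is a DPMMclust object such as those returned by DPMpost() or DPMGibbsSkewT() in the examples elsewhere in this manual:
if(interactive()){
  # keep only every 2nd partition after discarding the first 1000 iterations
  MCMCsample_trimmed <- burn.DPMMclust(MCMCsample_st, burnin = 1000, thin = 2)
}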
Point estimate of the partition using the F-measure as the cost function.
Description
Get a point estimate of the partition using the F-measure as the cost function.
Usage
cluster_est_Fmeasure(c, logposterior)
Arguments
c |
a list of vectors of length |
logposterior |
a vector of logposterior corresponding to each
partition from |
Value
a list
:
c_est: |
a vector of length |
cost: |
a vector of length |
similarity: |
matrix of size |
opt_ind: |
the index of the optimal partition among the MCMC iterations. |
Author(s)
Francois Caron, Boris Hejblum
See Also
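A minimal sketch (not from the original manual) with three toy MCMC partitions; the log-posterior values below are made up for illustration, and the function is assumed to be callable as documented in the Usage section:
c_list <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3), c(2,2,1,1,1,1))
logpost <- c(-45.2, -42.7, -48.9) # hypothetical log-posterior values
res <- cluster_est_Fmeasure(c_list, logposterior = logpost)
res$c_est    # point estimate of the partition
res$opt_ind  # index of the selected MCMC partition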
Point estimate of the partition using a modified Binder loss function
Description
Get a point estimate of the partition using a modified Binder loss function for Gaussian components
Usage
cluster_est_Mbinder_norm(c, Mu, Sigma, lambda = 0, a = 1, b = a, logposterior)
Arguments
c |
a list of vectors of length |
Mu |
is a list of length |
Sigma |
is a list of length |
lambda |
is a nonnegative tuning parameter allowing further control over the distance function. Default is 0. |
a |
nonnegative constant seen as the unit cost for pairwise misclassification. Default is 1. |
b |
nonnegative constant seen as the unit cost for the other kind of pairwise misclassification. Default is 1. |
logposterior |
vector of logposterior corresponding to each
partition from |
Details
Note that the current implementation only allows Gaussian components.
The modified Binder loss function takes into account the distance between mixture components using the Bhattacharyya distance.
Value
a list
:
c_est: |
a vector of length |
cost: |
a vector of length |
similarity: |
matrix of size |
opt_ind: |
the index of the optimal partition among the MCMC iterations. |
Author(s)
Chariff Alkhassim
References
JW Lau, PJ Green, Bayesian Model-Based Clustering Procedures, Journal of Computational and Graphical Statistics, 16(3):526-558, 2007.
DA Binder, Bayesian cluster analysis, Biometrika 65(1):31-38, 1978.
See Also
similarityMat
similarityMatC
similarityMat_nocostC
Point estimate of the partition for the Binder loss function
Description
Get a point estimate of the partition using the Binder loss function.
Usage
cluster_est_binder(c, logposterior)
Arguments
c |
a list of vectors of length |
logposterior |
vector of logposterior corresponding to each
partition from |
Value
a list
:
c_est: |
a vector of length n. Point estimate of the partition |
cost: |
a vector of length N. cost[j] is the cost associated to partition c[[j]] |
similarity: |
matrix of size n x n. Similarity matrix (see similarityMat) |
opt_ind: |
the index of the optimal partition among the MCMC iterations. |
Author(s)
Francois Caron, Boris Hejblum
References
F Caron, YW Teh, TB Murphy, Bayesian nonparametric Plackett-Luce models for the analysis of preferences for college degree programmes, Annals of Applied Statistics, 8(2):1145-1181, 2014.
DB Dahl, Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, Bayesian Inference for Gene Expression and Proteomics, K-A Do, P Muller, M Vannucci (Eds.), Cambridge University Press, 2006.
See Also
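As a sketch (not from the original manual), the Binder point estimate can be obtained from a list of sampled partitions together with their (here made-up) log-posterior values:
c_list <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3), c(2,2,1,1,1,1))
res_binder <- cluster_est_binder(c_list, logposterior = c(-45.2, -42.7, -48.9))
res_binder$c_est       # point estimate of the partition
res_binder$similarity  # posterior similarity matrix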
Gets a point estimate of the partition using posterior expected adjusted Rand index (PEAR)
Description
Gets a point estimate of the partition using posterior expected adjusted Rand index (PEAR)
Usage
cluster_est_pear(c)
Arguments
c |
a list of vectors of length |
Value
a list
:
c_est: |
a vector of length |
pear: |
a vector of length |
similarity: |
matrix of size |
opt_ind: |
the index of the optimal partition among the MCMC iterations. |
Author(s)
Chariff Alkhassim
References
A. Fritsch, K. Ickstadt. Improved Criteria for Clustering Based on the Posterior Similarity Matrix, in Bayesian Analysis, Vol.4 : p.367-392 (2009)
See Also
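A minimal sketch (not from the original manual), reusing the kind of toy partition list shown for the other point-estimate functions and assuming the function is callable as documented:
c_list <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3), c(2,2,1,1,1,1))
res_pear <- cluster_est_pear(c_list)
res_pear$c_est
res_pear$opt_ind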
Scatterplot of flow cytometry data
Description
Scatterplot of flow cytometry data
Usage
cytoScatter(
cytomatrix,
dims2plot = c(1, 2),
gating = NULL,
scale_log = FALSE,
xlim = NULL,
ylim = NULL,
gg.add = list(theme())
)
Arguments
cytomatrix |
a |
dims2plot |
a vector of length at least 2, indicating the dimensions to be plotted.
Default is |
gating |
an optional vector of length |
scale_log |
a logical flag indicating whether the data should be plotted on the log scale.
Default is |
xlim |
a vector of length 2 to specify the x-axis limits. Only used if |
ylim |
a vector of length 2 to specify the y-axis limits. Only used if |
gg.add |
A list of instructions to add to the |
Examples
rm(list=ls())
#Number of data
n <- 500
#n <- 2000
set.seed(1234)
#set.seed(123)
#set.seed(4321)
# Sample data
m <- matrix(nrow=2, ncol=4, c(-1, 1, 1.5, 2, 2, -2, -1.5, -2))
p <- c(0.2, 0.1, 0.4, 0.3) # cluster frequencies
sdev <- array(dim=c(2,2,4))
sdev[, ,1] <- matrix(nrow=2, ncol=2, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=2, ncol=2, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=2, ncol=2, c(0.3, 0.15, 0.15, 0.3))
sdev[, ,4] <- .3*diag(2)
c <- rep(0,n)
z <- matrix(0, nrow=2, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
z[,k] <- m[, c[k]] + sdev[, , c[k]]%*%matrix(rnorm(2, mean = 0, sd = 1), nrow=2, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
cytoScatter(z)
Loss of a partition point estimate compared to a gold standard
Description
Evaluate the loss of a point estimate of the partition compared to a gold standard according to a given loss function
Usage
evalClustLoss(c, gs, lossFn = "F-measure", a = 1, b = 1)
Arguments
c |
vector of length |
gs |
vector of length |
lossFn |
character string specifying the loss function to be used. Either "F-measure" or "Binder" (see Details). Default is "F-measure". |
a |
only relevant if |
b |
only relevant if |
Details
The cost of a point estimate partition is calculated using either a pairwise coincidence loss function (Binder), or 1-Fmeasure (F-measure).
Value
the cost of the point estimate c with regard to the gold standard gs for a given loss function.
Author(s)
Boris Hejblum
References
J.W. Lau & P.J. Green. Bayesian Model-Based Clustering Procedures, Journal of Computational and Graphical Statistics, 16(3): 526-558, 2007.
D. B. Dahl. Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model, in Bayesian Inference for Gene Expression and Proteomics, K.-A. Do, P. Muller, M. Vannucci (Eds.), Cambridge University Press, 2006.
See Also
similarityMat
, cluster_est_binder
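A minimal sketch (not from the original manual) comparing a candidate partition to a gold standard under both loss functions, assuming the interface shown in the Usage section:
c_hat <- c(1,1,2,2,3,3)  # candidate partition
gold  <- c(1,1,1,2,3,3)  # gold standard partition
evalClustLoss(c = c_hat, gs = gold, lossFn = "F-measure")
evalClustLoss(c = c_hat, gs = gold, lossFn = "Binder", a = 1, b = 1)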
Sample from an inverse-Wishart distribution
Description
For internal use only.
Usage
invwishrnd(n, lambda)
Arguments
n |
degrees of freedom |
lambda |
scale parameter |
Multivariate log gamma function
Description
Multivariate log gamma function
Usage
lgamma_mv(x, p)
Arguments
x |
strictly positive real number |
p |
integer |
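As a quick sanity-check sketch (not part of the original manual), assuming lgamma_mv is callable as documented and implements the standard multivariate gamma function: for p = 1 it should reduce to the usual log gamma function.
lgamma_mv(4.5, p = 1)
lgamma(4.5)          # should match the line above
lgamma_mv(4.5, p = 3)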
multivariate Normal inverse Wishart probability density function for multiple inputs
Description
multivariate Normal inverse Wishart probability density function for multiple inputs
Usage
mmNiWpdf(mu, Sigma, U_mu0, U_kappa0, U_nu0, U_lambda0, Log = TRUE)
Arguments
mu |
data matrix of dimension |
Sigma |
list of length |
U_mu0 |
mean vectors matrix of dimension |
U_kappa0 |
vector of length |
U_nu0 |
vector of length |
U_lambda0 |
list of length |
Log |
logical flag for returning the log of the probability density
function. Default is |
Value
matrix of densities of dimension K x n
C++ implementation of multivariate Normal inverse Wishart probability density function for multiple inputs
Description
C++ implementation of multivariate Normal inverse Wishart probability density function for multiple inputs
Usage
mmNiWpdfC(Mu, Sigma, U_Mu0, U_Kappa0, U_Nu0, U_Sigma0, Log = TRUE)
Arguments
Mu |
data matrix of dimension |
Sigma |
list of length |
U_Mu0 |
mean vectors matrix of dimension |
U_Kappa0 |
vector of length |
U_Nu0 |
vector of length |
U_Sigma0 |
list of length |
Log |
logical flag for returning the log of the probability density
function. Default is |
Value
matrix of densities of dimension K x n
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
Probability density function of multiple structured Normal inverse Wishart
Description
Probability density function of structured Normal inverse Wishart (sNiW) for multiple inputs, on the log scale.
Usage
mmsNiWlogpdf(U_xi, U_psi, U_Sigma, U_xi0, U_psi0, U_B0, U_Sigma0, U_df0)
Arguments
U_xi |
a list of length n of observed mean vectors, each of dimension p |
U_psi |
a list of length n of observed skew vectors of dimension p |
U_Sigma |
a list of length n of observed covariance matrices, each of dimension p x p |
U_xi0 |
a list of length K of mean vector parameters for sNiW, each of dimension p |
U_psi0 |
a list of length K of mean vector parameters for sNiW, each of dimension p |
U_B0 |
a list of length K of structuring matrix parameters for sNiW, each of dimension 2 x 2 |
U_Sigma0 |
a list of length K of covariance matrix parameters for sNiW, each of dimension p x p |
U_df0 |
a list of length K of degrees of freedom parameters for sNiW (each a positive scalar) |
Examples
hyperG0 <- list()
hyperG0$b_xi <- c(-1.6983129, -0.4819131)
hyperG0$b_psi <- c(-0.0641866, -0.7606068)
hyperG0$kappa <- 0.001
hyperG0$D_xi <- 16.951313
hyperG0$D_psi <- 1.255192
hyperG0$nu <- 27.67656
hyperG0$lambda <- matrix(c(2.3397761, -0.3975259,-0.3975259, 1.9601773), ncol=2)
xi_list <- list()
psi_list <- list()
S_list <- list()
for(k in 1:1000){
NNiW <- rNNiW(hyperG0, diagVar=FALSE)
xi_list[[k]] <- NNiW[["xi"]]
psi_list[[k]] <- NNiW[["psi"]]
S_list[[k]] <- NNiW[["S"]]
}
mmsNiWlogpdf(U_xi=xi_list, U_psi=psi_list, U_Sigma=S_list,
U_xi0=list(hyperG0$b_xi), U_psi0=list(hyperG0$b_psi) ,
U_B0=list(diag(c(hyperG0$D_xi, hyperG0$D_psi))) ,
U_Sigma0=list(hyperG0$lambda), U_df0=list(hyperG0$nu))
hyperG0 <- list()
hyperG0$b_xi <- c(-1.6983129)
hyperG0$b_psi <- c(-0.0641866)
hyperG0$kappa <- 0.001
hyperG0$D_xi <- 16.951313
hyperG0$D_psi <- 1.255192
hyperG0$nu <- 27.67656
hyperG0$lambda <- matrix(c(2.3397761), ncol=1)
xi_list <- list()
psi_list <- list()
S_list <- list()
for(k in 1:1000){
NNiW <- rNNiW(hyperG0, diagVar=FALSE)
xi_list[[k]] <- NNiW[["xi"]]
psi_list[[k]] <- NNiW[["psi"]]
S_list[[k]] <- NNiW[["S"]]
}
mmsNiWlogpdf(U_xi=xi_list, U_psi=psi_list, U_Sigma=S_list,
U_xi0=list(hyperG0$b_xi), U_psi0=list(hyperG0$b_psi) ,
U_B0=list(diag(c(hyperG0$D_xi, hyperG0$D_psi))) ,
U_Sigma0=list(hyperG0$lambda), U_df0=list(hyperG0$nu))
C++ implementation of multivariate structured Normal inverse Wishart probability density function for multiple inputs
Description
C++ implementation of multivariate structured Normal inverse Wishart probability density function for multiple inputs
Usage
mmsNiWpdfC(xi, psi, Sigma, U_xi0, U_psi0, U_B0, U_Sigma0, U_df0, Log = TRUE)
Arguments
xi |
data matrix of dimensions |
psi |
data matrix of dimensions |
Sigma |
list of length |
U_xi0 |
mean vectors matrix of dimension |
U_psi0 |
skew parameter vectors matrix of dimension |
U_B0 |
list of length |
U_Sigma0 |
list of length |
U_df0 |
vector of length |
Log |
logical flag for returning the log of the probability density
function. Default is |
Value
matrix of densities of dimension K x n
References
Hejblum BP, Alkhassim C, Gottardo R, Caron F and Thiebaut R (2019) Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data. The Annals of Applied Statistics, 13(1): 638-660. <doi:10.1214/18-AOAS1209> <arXiv:1702.04407>
C++ implementation of multivariate Normal probability density function for multiple inputs
Description
C++ implementation of multivariate Normal probability density function for multiple inputs
Usage
mmvnpdfC(x, mean, varcovM, Log = TRUE)
Arguments
x |
data matrix of dimension |
mean |
mean vectors matrix of dimension |
varcovM |
list of length |
Log |
logical flag for returning the log of the probability density
function. Default is |
Value
matrix of densities of dimension K x n
.
Examples
if(require(microbenchmark)){
library(microbenchmark)
microbenchmark(mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE),
mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE),
mmvnpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), Log=FALSE),
times=1000L)
microbenchmark(mvnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=c(-0.2, 0.3),
varcovM=matrix(c(2, 0.2, 0.2, 2), ncol=2), Log=FALSE),
mvnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), mean=c(-0.2, 0.3),
varcovM=matrix(c(2, 0.2, 0.2, 2), ncol=2), Log=FALSE),
mmvnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1),
mean=matrix(c(-0.2, 0.3), nrow=2, ncol=1),
varcovM=list(matrix(c(2, 0.2, 0.2, 2), ncol=2)), Log=FALSE),
times=1000L)
microbenchmark(mvnpdf(x=matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2),
mean=list(c(0,0),c(-1,-1), c(1.5,1.5)),
varcovM=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE),
mmvnpdfC(matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2),
mean=matrix(c(0,0,-1,-1, 1.5,1.5), nrow=2, ncol=3),
varcovM=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE),
times=1000L)
}else{
cat("package 'microbenchmark' not available\n")
}
C++ implementation of multivariate skew Normal probability density function for multiple inputs
Description
C++ implementation of multivariate skew Normal probability density function for multiple inputs
Usage
mmvsnpdfC(x, xi, psi, sigma, Log = TRUE)
Arguments
x |
data matrix of dimension |
xi |
mean vectors matrix of dimension |
psi |
skew parameter vectors matrix of dimension |
sigma |
list of length K of variance-covariance matrices,
each of dimensions |
Log |
logical flag for returning the log of the probability density
function. Default is |
Value
matrix of densities of dimension K x n
.
Author(s)
Boris Hejblum
Examples
mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)), Log=FALSE
)
mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2))
)
if(require(microbenchmark)){
library(microbenchmark)
microbenchmark(mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=c(0, 0), psi=c(1, 1),
sigma=diag(2), Log=FALSE),
mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1), xi=matrix(c(0, 0)),
psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)), Log=FALSE),
times=1000L
)
microbenchmark(mvsnpdf(x=matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2),
xi=list(c(0,0),c(-1,-1), c(1.5,1.5)),
psi=list(c(0.1,0.1),c(-0.1,-1), c(0.5,-1.5)),
sigma=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE),
mmvsnpdfC(matrix(c(rep(1.96,2),rep(0,2)), nrow=2, ncol=2),
xi=matrix(c(0,0,-1,-1, 1.5,1.5), nrow=2, ncol=3),
psi=matrix(c(0.1,0.1,-0.1,-1, 0.5,-1.5), nrow=2, ncol=3),
sigma=list(diag(2),10*diag(2), 20*diag(2)), Log=FALSE),
times=1000L)
}else{
cat("package 'microbenchmark' not available\n")
}
C++ implementation of multivariate skew t probability density function for multiple inputs
Description
C++ implementation of multivariate skew t probability density function for multiple inputs
Usage
mmvstpdfC(x, xi, psi, sigma, df, Log = TRUE)
Arguments
x |
data matrix of dimension |
xi |
mean vectors matrix of dimension |
psi |
skew parameter vectors matrix of dimension |
sigma |
list of length |
df |
vector of length K of degree of freedom parameters. |
Log |
logical flag for returning the log of the probability density
function. Default is |
Value
matrix of densities of dimension K x n
.
Author(s)
Boris Hejblum
Examples
mmvstpdfC(x = matrix(c(3.399890,-5.936962), ncol=1), xi=matrix(c(0.2528859,-2.4234067)),
psi=matrix(c(11.20536,-12.51052), ncol=1),
sigma=list(matrix(c(0.2134011, -0.0382573, -0.0382573, 0.2660086), ncol=2)),
df=c(7.784106)
)
mvstpdf(x = matrix(c(3.399890,-5.936962), ncol=1), xi=c(0.2528859,-2.4234067),
psi=c(11.20536,-12.51052),
sigma=matrix(c(0.2134011, -0.0382573, -0.0382573, 0.2660086), ncol=2),
df=c(7.784106)
)
#skew-normal limit
mmvsnpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2))
)
mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=c(0, 0), psi=c(1, 1), sigma=diag(2),
df=100000000
)
mmvstpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1), sigma=list(diag(2)),
df=100000000
)
#non-skewed limit
mmvtpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1),
mean=matrix(c(0, 0)), varcovM=list(diag(2)),
df=10
)
mmvstpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=matrix(c(0, 0)), psi=matrix(c(0, 0),ncol=1), sigma=list(diag(2)),
df=10
)
if(require(microbenchmark)){
library(microbenchmark)
microbenchmark(mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=c(0, 0), psi=c(1, 1),
sigma=diag(2), df=10),
mmvstpdfC(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=matrix(c(0, 0)), psi=matrix(c(1, 1),ncol=1),
sigma=list(diag(2)), df=10),
times=1000L)
}else{
cat("package 'microbenchmark' not available\n")
}
C++ implementation of multivariate Student's t probability density function for multiple inputs
Description
C++ implementation of multivariate Student's t probability density function for multiple inputs
Usage
mmvtpdfC(x, mean, varcovM, df, Log = TRUE)
Arguments
x |
data matrix of dimension |
mean |
mean vectors matrix of dimension |
varcovM |
list of length |
df |
vector of length |
Log |
logical flag for returning the log of the probability density
function. Default is |
Value
matrix of densities of dimension K x n
.
Author(s)
Boris Hejblum
Examples
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE)
mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10000000, Log=FALSE)
mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=10000000, Log=FALSE)
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1))
mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10000000)
mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=10000000)
mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10)
mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)), df=10)
if(require(microbenchmark)){
library(microbenchmark)
microbenchmark(mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=1, Log=FALSE),
mmvtpdfC(x=matrix(1.96), mean=matrix(0), varcovM=list(diag(1)),
df=c(1), Log=FALSE),
times=10000L)
}else{
cat("package 'microbenchmark' not available\n")
}
C++ implementation of multivariate Normal likelihood function for multiple inputs
Description
C++ implementation of multivariate Normal likelihood function for multiple inputs
Usage
mvnlikC(x, c, clustval, mu, sigma, loglik = TRUE)
Arguments
x |
data matrix of dimension p x n, p being the dimension of the data and n the number of data points |
c |
integer vector of cluster allocations with values from 1 to K |
clustval |
vector of unique values from c in the order corresponding to
the storage of cluster parameters in |
mu |
mean vectors matrix of dimension p x K, K being the number of clusters |
sigma |
list of length |
loglik |
logical flag for returning the log-likelihood instead of the likelihood.
Default is |
Value
a list:
"indiv": |
vector of likelihood of length n; |
"clust": |
vector of likelihood of length K; |
"total": |
total (log)-likelihood; |
Author(s)
Boris Hejblum
multivariate-Normal probability density function
Description
multivariate-Normal probability density function
Usage
mvnpdf(x, mean, varcovM, Log = TRUE)
Arguments
x |
p x n data matrix with n the number of observations and p the number of dimensions |
mean |
mean vector or list of mean vectors (either a vector, a matrix or a list) |
varcovM |
variance-covariance matrix or list of variance-covariance matrices (either a matrix or a list) |
Log |
logical flag for returning the log of the probability density
function. Default is |
Author(s)
Boris P. Hejblum
See Also
Examples
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE)
dnorm(1.96)
mvnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1),
mean=c(0, 0), varcovM=diag(2), Log=FALSE
)
C++ implementation of multivariate normal probability density function for multiple inputs
Description
Based on the implementation from Nino Hardt and Dicko Ahmadou https://gallery.rcpp.org/articles/dmvnorm_arma/ (accessed in August 2014)
Usage
mvnpdfC(x, mean, varcovM, Log = TRUE)
Arguments
x |
data matrix |
mean |
mean vector |
varcovM |
variance covariance matrix |
Log |
logical flag for returning the log of the probability density
function. Default is |
Value
vector of densities
Author(s)
Boris P. Hejblum
Examples
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE)
mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE)
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1))
mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1))
if(require(microbenchmark)){
library(microbenchmark)
microbenchmark(dnorm(1.96),
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE),
mvnpdfC(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE),
times=10000L)
}else{
cat("package 'microbenchmark' not available\n")
}
C++ implementation of multivariate skew normal likelihood function for multiple inputs
Description
C++ implementation of multivariate skew normal likelihood function for multiple inputs
Usage
mvsnlikC(x, c, clustval, xi, psi, sigma, loglik = TRUE)
Arguments
x |
data matrix of dimension p x n, p being the dimension of the data and n the number of data points |
c |
integer vector of cluster allocations with values from 1 to K |
clustval |
vector of unique values from c in the order corresponding to
the storage of cluster parameters in |
xi |
mean vectors matrix of dimension p x K, K being the number of clusters |
psi |
skew parameter vectors matrix of dimension |
sigma |
list of length |
loglik |
logical flag for returning the log-likelihood instead of the likelihood.
Default is |
Value
a list:
"indiv": |
vector of likelihood of length n; |
"clust": |
vector of likelihood of length K; |
"total": |
total (log)-likelihood; |
Author(s)
Boris Hejblum
multivariate Skew-Normal probability density function
Description
multivariate Skew-Normal probability density function
Usage
mvsnpdf(x, xi, sigma, psi, Log = TRUE)
Arguments
x |
p x n data matrix with n the number of observations and p the number of dimensions |
xi |
mean vector or list of mean vectors (either a vector, a matrix or a list) |
sigma |
variance-covariance matrix or list of variance-covariance matrices (either a matrix or a list) |
psi |
skew parameter vector or list of skew parameter vectors (either a vector, a matrix or a list) |
Log |
logical flag for returning the log of the probability density
function. Default is |
See Also
Examples
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1), Log=FALSE)
dnorm(1.96)
mvsnpdf(x=matrix(rep(1.96,1), nrow=1, ncol=1),
xi=c(0), psi=c(0), sigma=diag(1),
Log=FALSE
)
mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=c(0, 0), psi=c(1, 1), sigma=diag(2)
)
N <- 50000 # number of simulated draws
Yn <- rnorm(n=N, mean=0, sd=1)
Z <- rtruncnorm(n=N, a=0, b=Inf, mean=0, sd=1)
eps <- rnorm(n=N, mean=0, sd=1)
psi <- 10
Ysn <- psi*Z + eps
nu <- 1.5
W <- rgamma(n=N, shape=nu/2, rate=nu/2)
Yst=Ysn/sqrt(W)
library(reshape2)
library(ggplot2)
data2plot <- melt(cbind.data.frame(Ysn, Yst))
#pdf(file="ExSNST.pdf", height=5, width=4)
p <- (ggplot(data=data2plot)
+ geom_density(aes(x=value, fill=variable, alpha=variable), col="black")#, lwd=1.1)
+ theme_bw()
+ xlim(-15,100)
+ theme(legend.position="bottom")
+ scale_fill_manual(values=alpha(c("#F8766D", "#00B0F6"),c(0.2,0.45)),
name =" ",
labels=c("Y~SN(0,1,10) ", "Y~ST(0,1,10,1.5)")
)
+ scale_alpha_manual(guide=FALSE, values=c(0.25, 0.45))
+ xlab("Y")
+ ylim(0,0.08)
+ ylab("Density")
+ guides(fill = guide_legend(override.aes = list(colour = NULL)))
+ theme(legend.key = element_rect(colour = "black"))
)
p
#dev.off()
C++ implementation of multivariate skew t likelihood function for multiple inputs
Description
C++ implementation of multivariate skew t likelihood function for multiple inputs
Usage
mvstlikC(x, c, clustval, xi, psi, sigma, df, loglik = TRUE)
Arguments
x |
data matrix of dimension p x n, p being the dimension of the data and n the number of data points |
c |
integer vector of cluster allocations with values from 1 to K |
clustval |
vector of unique values from c in the order corresponding to
the storage of cluster parameters in |
xi |
mean vectors matrix of dimension p x K, K being the number of clusters |
psi |
skew parameter vectors matrix of dimension |
sigma |
list of length |
df |
vector of length |
loglik |
logical flag for returning the log-likelihood instead of the likelihood.
Default is |
Value
a list:
"indiv": |
vector of likelihood of length n; |
"clust": |
vector of likelihood of length K; |
"total": |
total (log)-likelihood; |
Author(s)
Boris Hejblum
multivariate skew-t probability density function
Description
multivariate skew-t probability density function
Usage
mvstpdf(x, xi, sigma, psi, df, Log = TRUE)
Arguments
x |
p x n data matrix with n the number of observations and p the number of dimensions |
xi |
mean vector or list of mean vectors (either a vector, a matrix or a list) |
sigma |
variance-covariance matrix or list of variance-covariance matrices (either a matrix or a list) |
psi |
skew parameter vector or list of skew parameter vectors (either a vector, a matrix or a list) |
df |
a numeric vector or a list of the degrees of freedom (either a vector or a list) |
Log |
logical flag for returning the log of the probability density
function. Default is |
See Also
mvtpdf
, mvsnpdf
, mmvstpdfC
, mvstlikC
Examples
mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=c(0, 0), psi=c(1, 1), sigma=diag(2),
df=100000000, Log=FALSE
)
mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=c(0, 0), psi=c(1, 1), sigma=diag(2),
Log=FALSE
)
mvstpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=c(0, 0), psi=c(1, 1), sigma=diag(2),
df=100000000
)
mvsnpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1),
xi=c(0, 0), psi=c(1, 1), sigma=diag(2)
)
multivariate Student's t-distribution probability density function
Description
multivariate Student's t-distribution probability density function
Usage
mvtpdf(x, mean, varcovM, df, Log = TRUE)
Arguments
x |
p x n data matrix with n the number of observations and p the number of dimensions |
mean |
mean vector or list of mean vectors (either a vector, a matrix or a list) |
varcovM |
variance-covariance matrix or list of variance-covariance matrices (either a matrix or a list) |
df |
a numeric vector or a list of the degrees of freedom (either a vector or a list) |
Log |
logical flag for returning the log of the probability density
function. Default is |
Examples
mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10000000)
mvnpdf(x=matrix(1.96), mean=0, varcovM=diag(1))
mvtpdf(x=matrix(1.96), mean=0, varcovM=diag(1), df=10)
mvtpdf(x=matrix(rep(1.96,2), nrow=2, ncol=1),
mean=c(0, 0), varcovM=diag(2), df=10
)
Convergence diagnostic plots
Description
Convergence diagnostic plots
Usage
plot_ConvDPM(
MCMCsample,
from = 1,
to = length(MCMCsample$logposterior_list),
shift = 0,
thin = 1,
...
)
Arguments
MCMCsample |
a |
from |
the MCMC iteration from which the plot should start.
Default is |
to |
the MCMC iteration up until which the plot should stop.
Default is |
shift |
a number of initial iterations not to be displayed. Default is |
thin |
integer giving the spacing at which MCMC iterations are kept.
Default is |
... |
further arguments passed to or from other methods |
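A usage sketch (not part of the original manual), assuming MCMCsample_st is a DPMMclust object fitted with DPMpost() or one of the DPMGibbs samplers, as in the examples elsewhere in this manual:
if(interactive()){
  # trace plots of the log-posterior, keeping every 5th retained iteration
  plot_ConvDPM(MCMCsample_st, from = 2, thin = 5)
}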
Plot of a Dirichlet process mixture of Gaussian distribution partition
Description
Plot of a Dirichlet process mixture of Gaussian distribution partition
Usage
plot_DPM(
z,
U_mu = NULL,
U_Sigma = NULL,
m,
c,
i,
alpha = "?",
U_SS = NULL,
dims2plot = 1:nrow(z),
ellipses = ifelse(length(dims2plot) < 3, TRUE, FALSE),
gg.add = list(theme())
)
Arguments
z |
data matrix |
U_mu |
either a list or a matrix containing the current estimates of mean vectors
of length |
U_Sigma |
either a list or an array containing the |
m |
vector of length |
c |
allocation vector of length |
i |
current MCMC iteration number. |
alpha |
current value of the DP concentration parameter. |
U_SS |
a list containing |
dims2plot |
index vector, subset of |
ellipses |
a logical flag indicating whether ellipses should be drawn around clusters. Default
is |
gg.add |
a list of instructions to add to the |
Author(s)
Boris Hejblum
Plot of a Dirichlet process mixture of skew normal distribution partition
Description
Plot of a Dirichlet process mixture of skew normal distribution partition
Usage
plot_DPMsn(
z,
c,
i = "",
alpha = "?",
U_SS,
dims2plot = 1:nrow(z),
ellipses = ifelse(length(dims2plot) < 3, TRUE, FALSE),
gg.add = list(theme()),
nbsim_dens = 1000
)
Arguments
z |
data matrix |
c |
allocation vector of length |
i |
current MCMC iteration number. |
alpha |
current value of the DP concentration parameter. |
U_SS |
a list containing |
dims2plot |
index vector, subset of |
ellipses |
a logical flag indicating whether ellipses should be drawn around clusters. Default
is |
gg.add |
A list of instructions to add to the |
nbsim_dens |
number of simulated points used for computing clusters density contours in 2D
plots. Default is |
Author(s)
Boris Hejblum
Plot of a Dirichlet process mixture of skew t-distribution partition
Description
Plot of a Dirichlet process mixture of skew t-distribution partition
Usage
plot_DPMst(
z,
c,
i = "",
alpha = "?",
U_SS,
dims2plot = 1:nrow(z),
ellipses = ifelse(length(dims2plot) < 3, TRUE, FALSE),
gg.add = list(theme()),
nbsim_dens = 1000,
nice = FALSE
)
Arguments
z |
data matrix |
c |
allocation vector of length |
i |
current MCMC iteration number. |
alpha |
current value of the DP concentration parameter. |
U_SS |
a list containing |
dims2plot |
index vector, subset of |
ellipses |
a logical flag indicating whether ellipses should be drawn around clusters. Default
is |
gg.add |
A list of instructions to add to the |
nbsim_dens |
number of simulated points used for computing clusters density contours in 2D
plots. Default is |
nice |
logical flag changing the plot looks. Default is |
Author(s)
Boris Hejblum
Post-processing Dirichlet Process Mixture Models results to get a mixture distribution of the posterior locations
Description
Post-processing Dirichlet Process Mixture Models results to get a mixture distribution of the posterior locations
Usage
postProcess.DPMMclust(
x,
burnin = 0,
thin = 1,
gs = NULL,
lossFn = "F-measure",
K = 10,
...
)
Arguments
x |
a |
burnin |
integer giving the number of MCMC iterations to burn (default is half) |
thin |
integer giving the spacing at which MCMC iterations are kept.
Default is |
gs |
optional vector of length |
lossFn |
character string specifying the loss function to be used. Either "F-measure" or "Binder" (see Details). Default is "F-measure". |
K |
integer giving the number of mixture components. Default is |
... |
further arguments passed to or from other methods |
Details
The cost of a point estimate partition is calculated using either a pairwise coincidence loss function (Binder), or 1-Fmeasure (F-measure).
Value
a list
:
burnin: |
an integer passing along the |
thin: |
an integer passing along the |
lossFn: |
a character string passing along the |
point_estim: |
|
loss: |
|
index_estim: |
Author(s)
Boris Hejblum
See Also
similarityMat
summary.DPMMclust
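A usage sketch (not part of the original manual), assuming MCMCsample_st is a DPMMclust object such as the ones fitted in the DPMpost() examples:
if(interactive()){
  pp <- postProcess.DPMMclust(MCMCsample_st, burnin = 1600, thin = 5,
                              lossFn = "F-measure", K = 10)
  pp$point_estim  # point estimate information for the retained partitions
}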
Methods for a summary of a DPMMclust
object
Description
Methods for a summary of a DPMMclust
object
Usage
## S3 method for class 'summaryDPMMclust'
print(x, ...)
## S3 method for class 'summaryDPMMclust'
plot(
x,
hm = FALSE,
nbsim_densities = 5000,
hm_subsample = NULL,
hm_order_by_clust = TRUE,
gg.add = list(theme_bw()),
...
)
Arguments
x |
a |
... |
further arguments passed to or from other methods |
hm |
logical flag to plot the heatmap of the similarity matrix.
Default is |
nbsim_densities |
the number of simulated observations to be used to plot the density lines of the clusters in the point estimate partition plot |
hm_subsample |
an integer designating the number of observations to use when plotting the heatmap.
Used only if |
hm_order_by_clust |
logical flag indicating whether observations should be ordered according to
the point estimate first. Used only if |
gg.add |
a list of instructions to add to the |
Author(s)
Boris Hejblum
Construction of an Empirical based prior
Description
Construction of an Empirical based prior
Usage
priormix(sDPMclust, nu0add = 5)
Arguments
sDPMclust |
an object of class |
nu0add |
an additional integer value added to the hyperprior parameter nu (increase it to avoid sampling non positive definite matrices) |
See Also
summary.DPMMclust
Examples
rm(list=ls())
#Number of data
n <- 2000
set.seed(123)
#set.seed(4321)
d <- 2
ncl <- 4
# Sample data
sdev <- array(dim=c(d,d,ncl))
xi <- matrix(nrow=d, ncol=ncl, c(-1.5, 1.5, 1.5, 1.5, 2, -2.5, -2.5, -3))
#xi <- matrix(nrow=d, ncol=ncl, c(-0.5, 0, 0.5, 0, 0.5, -1, -1, 1))
psi <- matrix(nrow=d, ncol=4, c(0.4, -0.6, 0.8, 0, 0.3, -0.7, -0.3, -0.8))
nu <- c(100,15,8,5)
p <- c(0.15, 0.05, 0.5, 0.3) # cluster frequencies
sdev[, ,1] <- matrix(nrow=d, ncol=d, c(0.3, 0, 0, 0.3))
sdev[, ,2] <- matrix(nrow=d, ncol=d, c(0.1, 0, 0, 0.3))
sdev[, ,3] <- matrix(nrow=d, ncol=d, c(0.3, 0.15, 0.15, 0.3))
sdev[, ,4] <- .3*diag(2)
c <- rep(0,n)
w <- rep(1,n)
z <- matrix(0, nrow=d, ncol=n)
for(k in 1:n){
c[k] = which(rmultinom(n=1, size=1, prob=p)!=0)
w[k] <- rgamma(1, shape=nu[c[k]]/2, rate=nu[c[k]]/2)
z[,k] <- xi[, c[k]] + psi[, c[k]]*rtruncnorm(n=1, a=0, b=Inf, mean=0, sd=1/sqrt(w[k])) +
(sdev[, , c[k]]/sqrt(w[k]))%*%matrix(rnorm(d, mean = 0, sd = 1), nrow=d, ncol=1)
#cat(k, "/", n, " observations simulated\n", sep="")
}
# Set parameters of G0
hyperG0 <- list()
hyperG0[["b_xi"]] <- rowMeans(z)
hyperG0[["b_psi"]] <- rep(0,d)
hyperG0[["kappa"]] <- 0.001
hyperG0[["D_xi"]] <- 100
hyperG0[["D_psi"]] <- 100
hyperG0[["nu"]] <- d+1
hyperG0[["lambda"]] <- diag(apply(z,MARGIN=1, FUN=var))/3
# hyperprior on the Scale parameter of DPM
a <- 0.0001
b <- 0.0001
nbclust_init <- 30
if(interactive()){
MCMCsample_st <- DPMGibbsSkewT(z, hyperG0, a, b, N=2000, doPlot=FALSE,
nbclust_init, diagVar=FALSE)
s <- summary(MCMCsample_st, burnin = 1500, thin=5, posterior_approx=TRUE)
pmix <- priormix(s)
}
Generating cluster data from the Chinese Restaurant Process
Description
Generating cluster data from the Chinese Restaurant Process
Usage
rCRP(n = 1000, alpha = 2, hyperG0, verbose = TRUE)
Arguments
n |
number of observations |
alpha |
concentration parameter |
hyperG0 |
base distribution hyperparameter |
verbose |
logical flag indicating whether info is written in the console. |
Examples
rm(list=ls())
d=2
hyperG0 <- list()
hyperG0[["NNiW"]] <- list()
hyperG0[["NNiW"]][["b_xi"]] <- rep(0,d)
hyperG0[["NNiW"]][["b_psi"]] <- rep(0,d)
hyperG0[["NNiW"]][["D_xi"]] <- 100
hyperG0[["NNiW"]][["D_psi"]] <- 8
hyperG0[["NNiW"]][["nu"]] <- d+1
hyperG0[["NNiW"]][["lambda"]] <- diag(c(1,1))
hyperG0[["scale"]] <- list()
set.seed(4321)
N <- 200
alph <- runif(n=1,0.2,2)
GvHD_sims <- rCRP(n=2*N, alpha=alph, hyperG0=hyperG0)
library(ggplot2)
q <- (ggplot(data=cbind.data.frame("D1"=GvHD_sims$data[1,],
"D2"=GvHD_sims$data[2,],
"Cluster"=GvHD_sims$cluster),
aes(x=D1, y=D2))
+ geom_point(aes(colour=Cluster), alpha=0.6)
+ theme_bw()
)
q
#q + stat_density2d(alpha=0.15, geom="polygon")
if(interactive()){
MCMCy1 <- DPMGibbsSkewT(z=GvHD_sims$data[,1:N],
hyperG0$NNiW, a=0.0001, b=0.0001, N=5000,
doPlot=TRUE, nbclust_init=64, plotevery=500,
gg.add=list(theme_bw()), diagVar=FALSE)
s1 <- summary(MCMCy1, burnin=4000, thin=5,
posterior_approx=TRUE)
F1 <- FmeasureC(ref=GvHD_sims$cluster[1:N], pred=s1$point_estim$c_est)
# s <- summary(MCMCy1, burnin=4000, thin=5,
# posterior_approx=TRUE, K=1)
# s2 <- summary(MCMCy1, burnin=4000, thin=5,
# posterior_approx=TRUE, K=2)
# MCMCy2_seqPost<- DPMGibbsSkewT(z=GvHD_sims$data[,(N+1):(2*N)],
# hyperG0=s1$param_post$parameters,
# a=s1$param_post$alpha_param$shape,
# b=s1$param_post$alpha_param$rate,
# N=5000, doPlot=TRUE, nbclust_init=64, plotevery=500,
# gg.add=list(theme_bw()), diagVar=FALSE)
MCMCy2_seqPost <- DPMGibbsSkewT_SeqPrior(z=GvHD_sims$data[,(N+1):(2*N)],
prior=s1$param_post, hyperG0=hyperG0$NNiW, N=1000,
doPlot=TRUE, nbclust_init=10, plotevery=100,
gg.add=list(theme_bw()), diagVar=FALSE)
s2_seqPost <- summary(MCMCy2_seqPost, burnin=600, thin=2)
F2_seqPost <- FmeasureC(ref=GvHD_sims$cluster[(N+1):(2*N)], pred=s2_seqPost$point_estim$c_est)
MCMCy2 <- DPMGibbsSkewT(z=GvHD_sims$data[,(N+1):(2*N)],
hyperG0$NNiW, a=0.0001, b=0.0001, N=5000,
doPlot=TRUE, nbclust_init=64, plotevery=500,
gg.add=list(theme_bw()), diagVar=FALSE)
s2 <- summary(MCMCy2, burnin=4000, thin=5)
F2 <- FmeasureC(ref=GvHD_sims$cluster[(N+1):(2*N)], pred=s2$point_estim$c_est)
MCMCtot <- DPMGibbsSkewT(z=GvHD_sims$data,
hyperG0$NNiW, a=0.0001, b=0.0001, N=5000,
doPlot=TRUE, nbclust_init=10, plotevery=500,
gg.add=list(theme_bw()), diagVar=FALSE)
stot <- summary(MCMCtot, burnin=4000, thin=5)
F2tot <- FmeasureC(ref=GvHD_sims$cluster[(N+1):(2*N)], pred=stot$point_estim$c_est[(N+1):(2*N)])
c(F1, F2, F2_seqPost, F2tot)
}
Sample from a normal inverse Wishart distribution
whose parameters are given by the structure SufStat
Description
For internal use only.
Usage
rNNiW(SufStat, diagVar)
Sample from a Normal inverse-Wishart distribution whose parameters are given by the structure hyper
Description
For internal use only.
Usage
rNiW(hyper, diagVar)
C++ implementation of the multinomial sampling from a matrix of column vectors, each containing the sampling probabilities for their respective draw
Description
C++ implementation of the multinomial sampling from a matrix of column vectors, each containing the sampling probabilities for their respective draw
Usage
sampleClassC(probMat, Log = FALSE)
Arguments
probMat |
a numeric matrix of dim |
Log |
a logical flag indicating whether the provided |
Value
a vector of integer of length n
containing the multinomial draws for each
observation, i.e. the class allocation.
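A minimal sketch (not part of the original manual), assuming sampleClassC is callable as documented: each column of probMat gives the class probabilities for one observation, so the call below should return one class label per column.
probMat <- matrix(c(0.70, 0.20, 0.10,
                    0.10, 0.80, 0.10,
                    0.30, 0.30, 0.40,
                    0.25, 0.25, 0.50),
                  nrow = 3)   # 3 classes, 4 observations
sampleClassC(probMat)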
Sampler for the concentration parameter of a Dirichlet process
Description
Sampler updating the concentration parameter of a Dirichlet process given
the number of observations and a Gamma(a
, b
) prior, following the augmentation
strategy of West, and of Escobar and West.
Usage
sample_alpha(alpha_old, n, K, a = 1e-04, b = 1e-04)
Arguments
alpha_old |
the current value of alpha |
n |
the number of data points |
K |
current number of clusters |
a |
shape hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process.
Default is |
b |
scale hyperparameter of the Gamma prior
on the concentration parameter of the Dirichlet Process.
Default is |
Details
A Gamma prior is used.
References
M West, Hyperparameter estimation in Dirichlet process mixture models, Technical Report, Duke University, 1992.
MD Escobar, M West, Bayesian Density Estimation and Inference Using Mixtures Journal of the American Statistical Association, 90(430):577-588, 1995.
Examples
#Test with a fixed K
####################
alpha_init <- 1000
N <- 10000
#n=500
n=10000
K <- 80
a <- 0.0001
b <- a
alphas <- numeric(N)
alphas[1] <- alpha_init
for (i in 2:N){
alphas[i] <- sample_alpha(alpha_old = alphas[i-1], n=n, K=K, a=a, b=b)
}
postalphas <- alphas[floor(N/2):N]
alphaMMSE <- mean(postalphas)
alphaMAP <- density(postalphas)$x[which.max(density(postalphas)$y)]
expK <- sum(alphaMMSE/(alphaMMSE+0:(n-1)))
round(expK)
prioralpha <- data.frame("alpha"=rgamma(n=5000, a,1/b),
"distribution" =factor(rep("prior",5000),
levels=c("prior", "posterior")))
library(ggplot2)
p <- (ggplot(prioralpha, aes(x=alpha))
+ geom_histogram(aes(y=..density..),
colour="black", fill="white")
+ geom_density(alpha=.2, fill="red")
+ ggtitle(paste("Prior distribution on alpha: Gamma(", a,
",", b, ")\n", sep=""))
)
p
postalpha.df <- data.frame("alpha"=postalphas,
"distribution" = factor(rep("posterior",length(postalphas)),
levels=c("prior", "posterior")))
p <- (ggplot(postalpha.df, aes(x=alpha))
+ geom_histogram(aes(y=..density..), binwidth=.1,
colour="black", fill="white")
+ geom_density(alpha=.2, fill="blue")
+ ggtitle("Posterior distribution of alpha\n")
# Ignore NA values for mean
# Overlay with transparent density plot
+ geom_vline(aes(xintercept=mean(alpha, na.rm=TRUE)),
color="red", linetype="dashed", size=1)
)
p
Computes the co-clustering (or similarity) matrix
Description
Computes the co-clustering (or similarity) matrix
Usage
similarityMat(c, step = 1)
Arguments
c |
a list of vectors of length |
step |
provide co-clustering every |
Value
A matrix of size n x n
whose term [i,j]
is the proportion of MCMC iterations where observation i
and
observation j
are allocated to the same cluster.
Author(s)
Boris Hejblum
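A minimal sketch (not from the original manual), mirroring the similarityMatC example below but using the list interface documented here:
c_list <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3), c(2,2,1,1,1,1))
similarityMat(c_list, step = 1)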
C++ implementation of the similarity matrix computation
Description
C++ implementation of the similarity matrix computation from MCMC partitions
Usage
similarityMatC(cc)
Arguments
cc |
a matrix whose columns each represent an MCMC partition |
Examples
c <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3),c(2,2,1,1,1,1))
similarityMatC(sapply(c, "["))
c2 <- list()
for(i in 1:10){
c2 <- c(c2, list(rmultinom(n=1, size=200, prob=rexp(n=200))))
}
similarityMatC(sapply(c2, "["))
C++ implementation of the similarity matrix computation without the cost computation
Description
C++ implementation of the similarity matrix computation from MCMC partitions, without the cost computation
Usage
similarityMat_nocostC(cc)
Arguments
cc |
a matrix whose columns each represent an MCMC partition |
Examples
c <- list(c(1,1,2,3,2,3), c(1,1,1,2,3,3),c(2,2,1,1,1,1))
similarityMat_nocostC(sapply(c, "["))
c2 <- list()
for(i in 1:10){
c2 <- c(c2, list(rmultinom(n=1, size=1000, prob=rexp(n=1000))))
}
c3 <- sapply(c2, "[")
if(require(microbenchmark)){
library(microbenchmark)
microbenchmark(similarityMat(c3), similarityMat_nocostC(c3), times=2L)
}else{
cat("package 'microbenchmark' not available\n")
}
Summarizing Dirichlet Process Mixture Models
Description
Summary methods for DPMMclust
objects.
Usage
## S3 method for class 'DPMMclust'
summary(
object,
burnin = 0,
thin = 1,
gs = NULL,
lossFn = "Binder",
posterior_approx = FALSE,
...
)
Arguments
object |
a |
burnin |
integer giving the number of MCMC iterations to burn (default is half) |
thin |
integer giving the spacing at which MCMC iterations are kept.
Default is |
gs |
optional vector of length |
lossFn |
character string specifying the loss function to be used. Either "F-measure" or "Binder" (see Details). Default is "Binder". |
posterior_approx |
logical flag whether a parametric approximation of the posterior should be
computed. Default is |
... |
further arguments passed to or from other methods |
Details
The cost of a point estimate partition is calculated using either a pairwise coincidence loss function (Binder), or 1-Fmeasure (F-measure).
The number of retained sampled partitions is m = (N - burnin)/thin
Value
a list
containing the following elements:
nb_mcmcit: |
an integer giving the value of m, the number of retained sampled partitions, i.e. (N - burnin)/thin |
burnin: |
an integer passing along the burnin argument |
thin: |
an integer passing along the thin argument |
lossFn: |
a character string passing along the lossFn argument |
clust_distrib: |
a character string passing along the clust_distrib argument |
point_estim: |
a list containing: c_est, a vector of length n containing the point estimated clustering for each observation; cost, a vector of length m containing the cost of each sampled partition; Fmeas, the m x m matrix of total F-measures for each pair of sampled partitions (if lossFn is 'F-measure'); opt_ind, the index of the point estimate partition among the m sampled |
loss: |
the loss for the point estimate. NA if lossFn is not 'Binder' |
param_posterior: |
a list containing the parametric approximation of the posterior, suitable to be plugged in as prior for a new MCMC algorithm run |
mcmc_partitions: |
a list containing the m sampled partitions |
alpha: |
a vector of length m with the values of the alpha DP parameter |
index_estim: |
the index of the point estimate partition among the m sampled |
hyperG0: |
a list passing along the prior, i.e. the hyperG0 argument |
logposterior_list: |
a list of length m containing the logposterior and its decomposition, for each sampled partition |
U_SS_list: |
a list of length m containing the lists of sufficient statistics for all the mixture components, for each sampled partition |
data: |
a d x n matrix containing the clustered data |
Author(s)
Boris Hejblum
See Also
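A usage sketch (not part of the original manual), mirroring the calls already shown in the DPMpost() examples; with N = 2000 MCMC iterations, burnin = 1600 and thin = 5, the number of retained partitions m should be (2000 - 1600)/5 = 80.
if(interactive()){
  s <- summary(MCMCsample_st, burnin = 1600, thin = 5, lossFn = "Binder")
  s$nb_mcmcit   # expected to be 80 here
  plot(s)
}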
C++ implementation of residual trace computation step used when sampling the scale
Description
C++ implementation of residual trace computation step used when sampling the scale
Usage
traceEpsC(eps, sigma)
Arguments
eps |
a numeric matrix where each column contains the centered and unskewed observations |
sigma |
a numeric covariance matrix |
Value
the computed trace
Return updated sufficient statistics S with new data matrix z
Description
For internal use only.
Usage
update_SS(z, S, hyperprior = NULL)
Arguments
z |
data matrix |
S |
previous sufficient statistics |
hyperprior |
Default is |
Return updated sufficient statistics S with new data matrix z
Description
For internal use only.
Usage
update_SSsn(z, S, ltn, hyperprior = NULL)
Arguments
z |
data matrix |
S |
previous sufficient statistics |
ltn |
random effects |
hyperprior |
Default is |
Return updated sufficient statistics S for skew t-distribution with data matrix z
Description
For internal use only.
Usage
update_SSst(z, S, ltn, scale, df, hyperprior = NULL)
Arguments
z |
data matrix |
S |
previous sufficient statistics |
ltn |
random effects |
scale |
|
df |
skew t degrees of freedom |
hyperprior |
Default is |
C++ implementation of the conversion of a partition vector into a co-clustering matrix
Description
C++ implementation of the conversion of a partition vector into a co-clustering matrix
Usage
vclust2mcoclustC(c)
Arguments
c |
is an MCMC partition |
Author(s)
Chariff Alkhassim
Examples
cc <- c(1,1,2,3,2,3)
vclust2mcoclustC(cc)
Sample from a Wishart distribution
Description
For internal use only.
Usage
wishrnd(n, Sigma)
Arguments
n |
degrees of freedom |
Sigma |
scale parameter (matrix) |