Statistics and clustering

This vignette is adapted from the official Armadillo documentation.

Mean

The mean function computes the mean of a vector, matrix, or cube. For a vector argument, the mean is calculated using all the elements of the vector. For a matrix argument, the mean is calculated for each column by default (dim = 0), or for each row (dim = 1). For a cube argument, the mean is calculated along the specified dimension (same as for matrices plus dim = 2 for slices).

Usage:

mean(V)

mean(M)
mean(M, dim)

mean(Q)
mean(Q, dim)

Examples

[[cpp11::register]] list mean1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);
  
  // create a cube with 3 copies of B + random noise
  cube C(B.n_rows, B.n_cols, 3);
  C.slice(0) = B + 0.1 * randn<mat>(B.n_rows, B.n_cols);
  C.slice(1) = B + 0.2 * randn<mat>(B.n_rows, B.n_cols);
  C.slice(2) = B + 0.3 * randn<mat>(B.n_rows, B.n_cols);

  vec D = mean(A).t();
  vec E = mean(A, 1);
  vec F = mean(mean(B, 1), 1);

  writable::list res(3);
  res[0] = as_doubles(D);
  res[1] = as_doubles(E);
  res[2] = as_doubles(F);

  return res;
}

Median

The median function computes the median of a vector or matrix. For a vector argument, the median is calculated using all the elements of the vector. For a matrix argument, the median is calculated for each column by default (dim = 0), or for each row (dim = 1).

Usage:

median(V)

median(M)
median(M, dim)

Examples

[[cpp11::register]] list median1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  vec C = median(A).t();
  vec D = median(A, 1);
  vec E = median(median(B, 1), 1);

  writable::list res(3);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);
  res[2] = as_doubles(E);

  return res;
}

Standard deviation

The stddev function computes the standard deviation of a vector or matrix. For a vector argument, the standard deviation is calculated using all the elements of the vector. For a matrix argument, the standard deviation is calculated for each column by default (dim = 0), or for each row (dim = 1).

The norm_type argument is optional; by default norm_type = 0 is used. The norm_type argument controls the type of normalization used, with N denoting the number of observations:

Usage:

stddev(V)
stddev(V, norm_type)

stddev(M)
stddev(M, norm_type)
stddev(M, norm_type, dim)

Examples

[[cpp11::register]] list stddev1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  vec C = stddev(A).t();
  vec D = stddev(A, 1).t();
  vec E = stddev(A, 1, 1);

  writable::list res(3);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);
  res[2] = as_doubles(E);

  return res;
}

Variance

The var function computes the variance of a vector or matrix. For a vector argument, the variance is calculated using all the elements of the vector. For a matrix argument, the variance is calculated for each column by default (dim = 0), or for each row (dim = 1).

The norm_type argument is optional; by default norm_type = 0 is used. The norm_type argument controls the type of normalization used, with N denoting the number of observations:

Usage:

var(V)
var(V, norm_type)

var(M)
var(M, norm_type)
var(M, norm_type, dim)

Examples

[[cpp11::register]] list var1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  vec C = var(A).t();
  vec D = var(A, 1).t();
  vec E = var(A, 1, 1);

  writable::list res(3);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);
  res[2] = as_doubles(E);

  return res;
}

Range

The range function computes the range of a vector or matrix. For a vector argument, the range is calculated using all the elements of the vector. For a matrix argument, the range is calculated for each column by default (dim = 0), or for each row (dim = 1).

Usage:

range(V)

range(M)
range(M, dim)

Examples

[[cpp11::register]] list range1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  vec C = range(A).t();
  vec D = range(A, 1);

  writable::list res(2);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);

  return res;
}

Covariance

The cov function computes the covariance between two matrices or vectors. If each row of X and Y is an observation and each column is a variable, the (i,j)-th entry of cov(X,Y) is the covariance between the i-th variable in X and the j-th variable in Y.

For two matrix arguments X and Y, the cov(X,Y) function computes the covariance between the two matrices. For vector arguments, the type of vector is ignored and each element in the vector is treated as an observation. The cov(X) function is equivalent to cov(X, X).

The norm_type argument is optional; by default norm_type = 0 is used. The norm_type argument controls the type of normalization used, with N denoting the number of observations:

Usage:

cov(X, Y, norm_type)

cov(X)
cov(X, norm_type)

Examples

[[cpp11::register]] list cov1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  mat C = cov(A, B);
  mat D = cov(A, B, 1);

  writable::list res(2);
  res[0] = as_doubles_matrix(C);
  res[1] = as_doubles_matrix(D);

  return res;
}

Correlation

The cor function computes the correlation coefficient between two matrices or vectors. If each row of X and Y is an observation and each column is a variable, the (i,j)-th entry of cor(X,Y) is the correlation coefficient between the i-th variable in X and the j-th variable in Y.

For two matrix arguments X and Y, the cor(X,Y) function computes the correlation coefficient between the two matrices. For vector arguments, the type of vector is ignored and each element in the vector is treated as an observation.

The norm_type argument is optional; by default norm_type = 0 is used. The norm_type argument controls the type of normalization used, with N denoting the number of observations:

Usage:

cor(X, Y)
cor(X, Y, norm_type)

cor(X)
cor(X, norm_type)

Examples

[[cpp11::register]] list cor1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  mat C = cor(A, B);
  mat D = cor(A, B, 1);

  writable::list res(2);
  res[0] = as_doubles_matrix(C);
  res[1] = as_doubles_matrix(D);

  return res;
}

Histogram

The hist function computes a histogram of counts for a vector or matrix. For a vector argument, the histogram is calculated using all the elements of the vector. For a matrix argument, the histogram is calculated for each column by default (dim = 0), or for each row (dim = 1).

The bin centers can be automatically determined from the data, with the number of bins specified via n_bins (the default is 10). The range of the bins is determined by the range of the data. The bin centers can be explicitly specified in the centers vector, which must contain monotonically increasing values.

Usage:

hist(V)
hist(V, n_bins)
hist(V, centers)

hist(M, centers)
hist(M, centers, dim)

Examples

[[cpp11::register]] list hist1_(const int& n) {
  vec A = randu<vec>(n);

  uvec h1 = hist(A, 11);
  uvec h2 = hist(A, linspace<vec>(-2, 2, 11));

  writable::list res(2);
  res[0] = as_integers(h1);
  res[1] = as_integers(h2);

  return res;
}

Histogram of counts with user specified edges

The histc function computes a histogram of counts for a vector or matrix. For a vector argument, the histogram is calculated using all the elements of the vector. For a matrix argument, the histogram is calculated for each column by default (dim = 0), or for each row (dim = 1).

The bin edges have to be specified and contain monotonically increasing values.

Usage:

histc(V)
histc(V, edges)

hist(M, edges)
hist(M, edges, dim)

Examples

[[cpp11::register]] integers histc1_(const int& n) {
  vec A = randu<vec>(n);

  uvec h = histc(A, linspace<vec>(-2,2,11));

  return as_integers(h);
}

Quantiles of a dataset

The quantile function computes the quantiles corresponding to the cumulative probability values for a vector or matrix. For a vector argument, the quantiles are calculated using all the elements of the vector. For a matrix argument, the quantiles are calculated for each column by default (dim = 0), or for each row (dim = 1).

The probabilities have to be specified as a second argument P.

The algorithm for calculating the quantiles is based on Definition 5 in: Rob J. Hyndman and Yanan Fan. Sample Quantiles in Statistical Packages. The American Statistician, 50(4), 361-365, 1996. DOI: 10.2307/2684934

Usage:

quantile(V, P)

quantile(M, P)
quantile(M, P, dim)

Examples

[[cpp11::register]] doubles quantile1_(const int& n) {
  vec A = randu<vec>(n);

  vec P = {0.0, 0.25, 0.50, 0.75, 1.0};
  vec Q = quantile(A, P);

  return as_doubles(Q);
}

Principal component analysis (PCA)

TODO: This needs a custom method.

Probability density function of normal distribution

The normpdf function computes the probability density function of a normal distribution for a scalar, vector, or matrix. For each scalar x in X, the probability density function is computed according to a Gaussian (normal) distribution using the corresponding mean value m in M and the corresponding standard deviation value s in S.

\[ y = \frac{1}{s \sqrt{2\pi}} \exp\left[-\frac{(x - m)^2}{2s^2}\right] \]

Caveat

To reduce the incidence of numerical underflows, consider using log_normpdf().

Examples

[[cpp11::register]] list normpdf1_(const int& n) {
  vec X = randu<vec>(n);
  vec M = randu<vec>(n);
  vec S = randu<vec>(n);

  vec P1 = normpdf(X);
  vec P2 = normpdf(X, M, S);
  vec P3 = normpdf(1.23, M, S);
  vec P4 = normpdf(X, 4.56, 7.89);
  double P5 = normpdf(1.23, 4.56, 7.89);

  writable::list res(5);

  res[0] = as_doubles(P1);
  res[1] = as_doubles(P2);
  res[2] = as_doubles(P3);
  res[3] = as_doubles(P4);
  res[4] = as_doubles({P5});

  return res;
}

Probability density function of log-normal distribution

The log_normpdf function computes the probability density function of a log-normal distribution for a scalar, vector, or matrix. For each scalar x in X, the probability density function is computed according to a log-normal distribution using the corresponding mean value m in M and the corresponding standard deviation value s in S.

\[ y = \log\left[\frac{1}{x s \sqrt{2\pi}} \exp\left[-\frac{(\log(x) - m)^2}{2s^2}\right]\right] \]

Examples

[[cpp11::register]] list lognormpdf1_(const int& n) {
  vec X = randu<vec>(n);
  vec M = randu<vec>(n);
  vec S = randu<vec>(n);

  vec P1 = log_normpdf(X);
  vec P2 = log_normpdf(X, M, S);
  vec P3 = log_normpdf(1.23, M, S);
  vec P4 = log_normpdf(X, 4.56, 7.89);
  double P5 = log_normpdf(1.23, 4.56, 7.89);

  writable::list res(5);

  res[0] = as_doubles(P1);
  res[1] = as_doubles(P2);
  res[2] = as_doubles(P3);
  res[3] = as_doubles(P4);
  res[4] = as_doubles({P5});

  return res;
}

Cumulative distribution function of normal distribution

For each scalar x in X, compute its cumulative distribution function according to a Gaussian (normal) distribution using the corresponding mean value m in M and the corresponding standard deviation value s in S.

Examples

[[cpp11::register]] list normcdf1_(const int& n) {
  vec X = randu<vec>(n);
  vec M = randu<vec>(n);
  vec S = randu<vec>(n);

  vec P1 = normcdf(X);
  vec P2 = normcdf(X, M, S);
  vec P3 = normcdf(1.23, M, S);
  vec P4 = normcdf(X, 4.56, 7.89);
  double P5 = normcdf(1.23, 4.56, 7.89);

  writable::list res(5);

  res[0] = as_doubles(P1);
  res[1] = as_doubles(P2);
  res[2] = as_doubles(P3);
  res[3] = as_doubles(P4);
  res[4] = as_doubles({P5});

  return res;
}

Random vectors from multivariate normal distribution

Generate a matrix with random column vectors from a multivariate Gaussian (normal) distribution with parameters M and C.

Usage:

X = mvnrnd(M, C)
mvnrnd(X, M, C)
mvnrnd(X. M. C. N)

The first form returns an error if the generation fails. The second and third form reset X and return a boolean set to false without error if the generation fails.

Examples

[[cpp11::register]] doubles_matrix<> mvnrnd1_(const int& n, const int&m) {
  vec M = randu<vec>(n);

  mat B = randu<mat>(n, n);
  mat C = B.t() * B;

  mat X = mvnrnd(M, C, m);

  return as_doubles_matrix(X);
}

Random numbers from chi-squared distribution

Generate a random scalar, vector, or matrix with elements sampled from a chi-squared distribution with the degrees of freedom specified by parameter DF or DF_scalar.

Usage:

v = chi2rnd(DF)
X = chi2rnd(DF)

double s = chi2rnd<double>(DF_scalar) // float also works
vec v = chi2rnd<vec>(DF_scalar, n_elem)
mat X = chi2rnd<mat>(DF_scalar, n_rows, n_cols)
mat Y = chi2rnd<mat>(DF_scalar, size(X))

Examples

[[cpp11::register]] list chi2rnd1_(const int& n, const int& m) {
  mat X = chi2rnd(2, n, m);
  mat Y = randi<mat>(n, m, distr_param(1, 10));
  mat Z = chi2rnd(Y);

  writable::list res(2);
  res[0] = as_doubles_matrix(X);
  res[1] = as_doubles_matrix(Z);

  return res; 
}

Random matrix from Wishart distribution

Generate a random matrix sampled from the Wishart distribution with parameters S and df.

Usage:

W = wishrnd(S, df)
wishrnd(W, S, df)

The first form returns an error if the generation fails. The second form resets W and returns a boolean set to false without error if the generation fails.

Examples

[[cpp11::register]] doubles_matrix<> wishrnd1_(const int& n) {
  mat X = randu<mat>(n, n);
  mat S = X.t() * X;

  mat W = wishrnd(S, 6.7);

  return as_doubles_matrix(W);
}

Random matrix from inverse Wishart distribution

Generate a random matrix sampled from the inverse Wishart distribution with parameters T and df.

Usage:

W = iwishrnd(T, df)
iwishrnd(W, T, df)

The first form returns an error if the generation fails. The second form resets W and returns a boolean set to false without error if the generation fails.

Examples

[[cpp11::register]] doubles_matrix<> iwishrnd1_(const int& n, const double& d) {
  mat X = randu<mat>(n, n);
  mat T = X.t() * X;

  mat W = iwishrnd(T, d);

  return as_doubles_matrix(W);
}

Cluster data into disjoint sets

Cluster given data into k disjoint sets.

Usage:

kmeans(means, data, k, seed_mode, n_iter, print_mode)

Caveats

Examples

[[cpp11::register]] list kmeans1_(const int& n, const int& d) {
  mat data(d, n, fill::randu);

  mat means;

  bool status = kmeans(means, data, 2, random_subset, 10, true);

  if (status == false) {
    stop("clustering failed");
  }

  writable::list res(2);

  res[0] = writable::logicals({status});
  res[1] = as_doubles_matrix(means);

  return res;
}

Probabilistic clustering and likelihood calculation via mixture of Gaussians

gmm_diag and gmm_full are classes for multi-variate probabilistic clustering and likelihood calculation via Gaussian Mixture Models (GMMs). The implementation details are available in Sanderson and Curtin (2017).

The distribution of data is modelled as:

\[ p(x) = \sum_{g=0}^{n_{\text{gaus}}-1} h_g N(x \mid m_g, C_g) \]

where:

Both gmm_diag and gmm_full include tailored k-means and Expectation Maximisation algorithms for learning model parameters from training data.

For an instance of gmm_diag or gmm_full named as M, the member functions and variables are:

Caveats

Examples

[[cpp11::register]] list gmm1_(const int& n, const int& d) {
  // create synthetic data with 2 Gaussians

  mat data(d, n, fill::zeros);

  vec mean0 = linspace<vec>(1, d, d);
  vec mean1 = mean0 + 2;

  int i = 0;

  while (i < n) {
    if (i < n) {
      data.col(i) = mean0 + randn<vec>(d);
      ++i;
    }
    if (i < n) {
      data.col(i) = mean0 + randn<vec>(d);
      ++i;
    }
    if (i < n) {
      data.col(i) = mean1 + randn<vec>(d);
      ++i;
    }
  }

  // model the data as a diagonal GMM with 2 Gaussians

  gmm_diag model;

  bool status = model.learn(data, 2, maha_dist, random_subset, 10, 5, 1e-5,
    true);

  if (status == false) {
    stop("learning failed");
  }

  model.means.print("means:");

  double scalar_likelihood = model.log_p(data.col(0));
  rowvec set_likelihood = model.log_p(data.cols(0, 9));

  double overall_likelihood = model.avg_log_p(data);

  uword gaus_id = model.assign(data.col(0), eucl_dist);
  urowvec gaus_ids = model.assign(data.cols(0, 9), prob_dist);

  rowvec hist1 = model.norm_hist(data, eucl_dist);
  urowvec hist2 = model.raw_hist(data, prob_dist);

  writable::list res(9);

  res[0] = writable::logicals({status});
  res[1] = as_doubles_matrix(model.means);
  res[2] = as_doubles({scalar_likelihood});
  res[3] = as_doubles(set_likelihood.t());
  res[4] = as_doubles({overall_likelihood});
  res[5] = as_integers(gaus_id);
  res[6] = as_integers(gaus_ids.t());
  res[7] = as_doubles(hist1.t());
  res[8] = as_integers(hist2.t());

  return res;
}

Sanderson, Conrad, and Ryan Curtin. 2017. “An Open Source C++ Implementation of Multi-Threaded Gaussian Mixture Models, K-Means and Expectation Maximisation.” In 2017 11th International Conference on Signal Processing and Communication Systems (ICSPCS), 1–8. https://doi.org/10.1109/ICSPCS.2017.8270510.

mirror server hosted at Truenetwork, Russian Federation.