Type: | Package |
Title: | Projection Pursuit for Cluster Identification |
Version: | 0.1.5 |
Author: | David Hofmeyr <dhofmeyr@sun.ac.za> [aut, cre] Nicos Pavlidis <n.pavlidis@lancaster.ac.uk> [aut] |
Maintainer: | David Hofmeyr <dhofmeyr@sun.ac.za> |
Description: | Implements recently developed projection pursuit algorithms for finding optimal linear cluster separators. The clustering algorithms use optimal hyperplane separators based on minimum density, Pavlidis et al. (2016) http://jmlr.org/papers/volume17/15-307/15-307.pdf; minimum normalised cut, Hofmeyr (2017) <doi:10.1109/TPAMI.2016.2609929>; and maximum variance ratio clusterability, Hofmeyr and Pavlidis (2015) <doi:10.1109/SSCI.2015.116>. |
Depends: | R (≥ 2.10.0), rARPACK |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | yes |
RoxygenNote: | 7.0.2 |
NeedsCompilation: | no |
Packaged: | 2020-03-06 09:20:06 UTC; dhofmeyr |
Repository: | CRAN |
Date/Publication: | 2020-03-06 10:10:01 UTC |
Projection Pursuit for Cluster Identification
Description
This package provides implementations of three recently developed projection pursuit methods for clustering. These methods optimise measures of clusterability of the univariate (projected) dataset that are motivated by three well established approaches to clustering; namely density clustering, centroid based clustering and clustering by graph cuts. The resulting partitions are formed by hyperplanes orthogonal to the optimal projection vectors, and multiple such partitioning hyperplanes are combined in a hierarchical model to generate complete clustering solutions. Model visualisations through low dimensional projections of the data/clusters are provided through multiple plotting functions, which facilitate model validation. Simple model modification functions then allow for pseudo-interactive clustering.
The three main clustering algorithms are implemented in the functions mddc, mcdc and ncutdc. Each takes two mandatory arguments, the data matrix (X) and the number of clusters (K). Numerous optional arguments allow the user to modify the specifics of optimisation, etc. If the correct number of clusters is not known, an approximate number can be used and the resulting solutions visualised using the functions tree_plot (provides a visualisation of the entire model) and node_plot (provides a more detailed visualisation of a single node in the hierarchical model). Both of these functions can be accessed using the base plot function applied to the output of one of mddc, mcdc or ncutdc. Nodes can then be removed using the function tree_prune, or added using the function tree_split, depending on the apparent validity of the existing clustering model.
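A minimal sketch of this workflow, assuming an n x d data matrix X and a rough guess K for the number of clusters (the node argument passed to node_plot, tree_prune and tree_split below, and the node numbering convention, are illustrative assumptions; see each function's help page):

```r
## cluster the data into (approximately) K clusters using, e.g., minimum
## normalised cut hyperplanes; mddc and mcdc are called in the same way
sol <- ncutdc(X, K)

## visualise the entire hierarchical model, then a single node in detail
plot(sol)           # dispatches to tree_plot
node_plot(sol, 1)   # assumed: 1 identifies the root node

## pseudo-interactive modification, based on the apparent validity of splits
sol <- tree_prune(sol, 2)   # remove the subtree below node 2
sol <- tree_split(sol, 3)   # add a binary split below leaf node 3
```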
Details
Package: PPCI
Type: Package
Title: Projection Pursuit for Cluster Identification
Version: 0.1.5
Depends: R (>= 2.10.0), rARPACK
License: GPL-3
LazyData: yes
Author(s)
David Hofmeyr[aut, cre] and Nicos Pavlidis[aut]
Maintainer: David Hofmeyr <dhofmeyr@sun.ac.za>
References
Pavlidis N.G., Hofmeyr D.P., Tasoulis S.K. (2016) Minimum Density Hyperplanes. Journal of Machine Learning Research, 17(156), 1–33.
Hofmeyr, D., Pavlidis, N. (2015) Maximum Clusterability Divisive Clustering. Computational Intelligence, 2015 IEEE Symposium Series on, pp. 780–786.
Hofmeyr, D. (2017) Clustering by Minimum Cut Hyperplanes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8), 1547 – 1560.
See Also
A MATLAB toolbox with similar functionality to this package is available at https://github.com/nicospavlidis/opc. Outputs may differ slightly due to differences between R and MATLAB's base optimisation software.
Add Nodes To a Plot of a Hierarchical Clustering Model
Description
Used to add nodes to a plot region created for visualising hierarchical clustering models arising from functions mcdc, mddc and ncutdc. Not intended for use outside function tree_plot.
Discrimination of Cancerous and Non-Cancerous Breast Masses
Description
This data set contains case information from 699 patients from Wisconsin General Hospital who received examination for mammographic masses, which were classified as either benign or malignant.
Usage
breastcancer
Format
A list with entries $x (a 699x9 matrix with each row corresponding to an individual case) and $c (a vector of labels indicating whether the mass was diagnosed as benign (2) or malignant (4)).
Source
UCI Machine Learning Repository.
References
Lichman, M. (2013) UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. [https://archive.ics.uci.edu/ml]
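For instance, the data set can be inspected and partitioned with a single hyperplane (a brief sketch; the mdh and success_ratio functions are documented elsewhere in this manual):

```r
## load the data and check its structure
data(breastcancer)
dim(breastcancer$x)      # 699 cases by 9 features
table(breastcancer$c)    # counts of benign (2) and malignant (4) cases

## a single minimum density hyperplane to separate the two classes
sol <- mdh(breastcancer$x)
success_ratio(sol$cluster, breastcancer$c)
```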
External Cluster Validity Metrics
Description
Computes four popular external cluster validity metrics (adjusted Rand index, purity, V-measure and Normalised Mutual Information) through comparison of cluster assignments and true class labels.
Usage
cluster_performance(assigned, labels, beta)
Arguments
assigned |
a vector of cluster assignments made by a clustering algorithm. |
labels |
a vector of true class labels to be compared with assigned. |
beta |
(optional) positive numeric, used in the computation of V-measure. larger values apply higher weight to homogeneity over completeness measures. if omitted then beta = 1 (equal weight applied to both measures). |
Value
a vector containing the four evaluation metrics listed in the description.
References
Zhao Y., Karypis G. (2004) Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering. Machine Learning, 55(3), 311–331.
Strehl A., Ghosh J. (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617.
Rosenberg A., Hirschberg J. (2007) V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. EMNLP-CoNLL, 7, 410–420. Citeseer.
Hubert, L., Arabie, P. (1985) Comparing Partitions. Journal of Classification, 2(1), 193–218.
Examples
## load dermatology dataset
data(dermatology)
## obtain clustering solution using MCDC
sol <- mcdc(dermatology$x, 6)
## evaluate solution using external cluster validity measures
cluster_performance(sol$cluster, dermatology$c)
Erythemato-Squamous Disease Identification
Description
This data set contains clinical and histopathological information from 366 cases of 6 different skin disorders/diseases for the purpose of differential diagnosis of erythemato-squamous disease.
Usage
dermatology
Format
A list with entries $x (a 366x34 matrix with each row corresponding to an individual case) and $c (a vector of labels indicating the skin condition).
Source
UCI Machine Learning Repository.
References
Lichman, M. (2013) UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml
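This data set is used in several of the examples in this manual; a quick inspection (a sketch of typical usage):

```r
## load the data and check its structure
data(dermatology)
dim(dermatology$x)     # 366 cases by 34 features
table(dermatology$c)   # number of cases per skin condition

## the six conditions can be sought with, e.g., MCDC
sol <- mcdc(dermatology$x, 6)
cluster_performance(sol$cluster, dermatology$c)
```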
Gradient of the Variance Ratio Clusterability Across a Hyperplane
Description
Finds the gradient of the variance ratio across the best hyperplane orthogonal to a given projection vector. Used to obtain maximum clusterability hyperplanes using gradient based optimisation.
Usage
df_mc(v, X, P)
Arguments
v |
a numeric vector of length ncol(X) |
X |
a numeric matrix (num_data x num_dimensions) to be projected on v |
P |
a list of parameters including (at least) $nmin (positive integer minimum cluster size). |
Value
the (vector) gradient of the variance ratio across the best hyperplane orthogonal to v.
Gradient of the Integrated Density on a Hyperplane
Description
Finds the gradient of the integrated density of the best hyperplanes orthogonal to a given projection vector (assumes the data have zero mean vector). Used to obtain minimum density hyperplanes using gradient based optimisation.
Usage
df_md(v, X, P)
Arguments
v |
a numeric vector of length ncol(X) |
X |
a numeric matrix (num_data x num_dimensions) to be projected on v |
P |
a list of parameters including (at least) $h (positive numeric bandwidth value), $alpha (positive numeric constraint width), $C (positive numeric affecting the slope of the penalty), $COV (covariance matrix of X) |
Value
the (vector) gradient of the integrated density of the best hyperplane orthogonal to v.
Gradient of the Normalised Cut Across a Hyperplane
Description
Finds the gradient of the normalised cut across the best hyperplane orthogonal to a given projection vector. Used to obtain minimum normalised cut hyperplanes using gradient based optimisation.
Usage
df_ncut(v, X, P)
Arguments
v |
a numeric vector of length ncol(X) |
X |
a numeric matrix (num_data x num_dimensions) to be projected on v |
P |
a list of parameters including (at least) $s (positive numeric scaling parameter), $nmin (positive integer minimum cluster size). |
Value
the (vector) gradient of the normalised cut across the best hyperplane orthogonal to v.
Gradient of the Penalised Density at a Point
Description
Used within a bisection algorithm to find the location of the best hyperplane orthogonal to a given projection vector. Not intended for use outside of functions f_md and df_md.
Variance Ratio Clusterability Across a Hyperplane
Description
Finds the variance ratio clusterability measured across the best hyperplane orthogonal to a given projection vector. Used as projection index for projection pursuit to find maximum clusterability hyperplanes.
Usage
f_mc(v, X, P)
Arguments
v |
a numeric vector of length ncol(X) |
X |
a numeric matrix (num_data x num_dimensions) to be projected on v |
P |
a list of parameters including (at least) $nmin (positive integer minimum cluster size) |
Value
the value of the variance ratio across the best hyperplane orthogonal to v
Integrated Density on a Hyperplane
Description
Finds the integrated density of the best hyperplanes orthogonal to a given projection vector (assumes the data have zero mean vector). Used as projection index for projection pursuit to find minimum density hyperplanes.
Usage
f_md(v, X, P)
Arguments
v |
a numeric vector of length ncol(X) |
X |
a numeric matrix (num_data x num_dimensions) to be projected on v |
P |
a list of parameters including (at least) $h (positive numeric bandwidth value), $alpha (positive numeric constraint width), $C (positive numeric affecting the slope of the penalty) |
Value
the value of the integrated density of the best hyperplane orthogonal to v
Normalised Cut Across a Hyperplane
Description
Finds the normalised cut across the best hyperplane orthogonal to a given projection vector. Used as projection index for projection pursuit to find minimum normalised cut hyperplanes.
Usage
f_ncut(v, X, P)
Arguments
v |
a numeric vector of length ncol(X) |
X |
a numeric matrix (num_data x num_dimensions) to be projected on v |
P |
a list of parameters including (at least) $s (positive numeric scaling parameter), $nmin (positive integer minimum cluster size) |
Value
the value of the normalised cut across the best hyperplane orthogonal to v
Visualise a Hyperplane Separator for Clustering
Description
Provides a visualisation of the partition induced by a hyperplane separator generated by one of mch, mdh and ncuth.
Usage
hp_plot(sol, labels, transparency)
Arguments
sol |
a solution arising from one of the functions mch, mdh and ncuth. |
labels |
(optional) a vector of class labels. if provided then points in different classes are plotted in different colours. the external validity of the partition is also indicated when this can be computed using true class labels. |
transparency |
(optional) if omitted then points in the scatterplot are shown as solid. If set to a value in (0, 1) then points are shown with transparency. Lower values correspond with a greater degree of transparency. |
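A brief sketch of typical usage (the transparency value is illustrative):

```r
## find a minimum density hyperplane for the breast cancer data
data(breastcancer)
sol <- mdh(breastcancer$x)

## show the induced partition, colouring points by their true class and
## using mild transparency to reveal overplotting
hp_plot(sol, labels = breastcancer$c, transparency = 0.5)
```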
Check if the Current Solution is a Valid Minimum Density Hyperplane
Description
Used within minimum density projection pursuit. Not intended for independent use outside of other functions.
Location of Optimal Variance Ratio Hyperplane
Description
The value of b which maximises the variance ratio across the hyperplane H(v, b) = [x : v'x = b]. If v is a locally optimal solution for the projection index f_mc then H(v, b) is a locally optimal hyperplane.
Usage
mc_b(v, X, P)
Arguments
v |
a numeric vector of length ncol(X) |
X |
a numeric matrix (num_data x num_dimensions) to be projected on v |
P |
a list of parameters including (at least) $nmin (positive integer minimum cluster size). |
Value
the value of b given in the description.
Divisive Clustering Using Maximum Clusterability
Description
Generates a binary partitioning tree by recursively partitioning a dataset using a hierarchical collection of hyperplanes with high variance ratio clusterability across them.
Usage
mcdc(X, K, v0, split.index, minsize, verb, labels, maxit, ftol)
Arguments
X |
a numeric matrix (num_data x num_dimensions); the dataset to be clustered. |
K |
the number of clusters to extract. |
split.index |
(optional) determines the order in which clusters are split (in decreasing order of split indices). can be a numeric valued function(v, X, P) of projection vector v, data matrix X and list of parameters P. can also be one of "size" (split the largest cluster), "fval" (split the cluster with the maximum variance ratio) or "Fdist" (split indices determined by the non-central F-distribution; see the SSCI paper for details. a slight difference from the paper is that when the number of data exceeds 2000 the cluster size is used instead, since the naive estimation of the model degrees of freedom has been found to be unreliable when the number of data is large). if omitted then "Fdist" is used. |
v0 |
(optional) initial projection direction(s). a function(X) of the data being split, which returns a matrix with ncol(X) rows. each column of the output of v0(X) is used as an initialisation for projection pursuit. the solution with the maximum variance ratio is used within the final model. initialisations are determined separately for each cluster being split. if omitted then a single initialisation is used; the vector joining the cluster means of a 2-means solution. |
minsize |
(optional) the minimum cluster size allowable. if omitted then minsize = 1. |
verb |
(optional) verbosity level of optimisation procedure. verb==0 produces no output. verb==1 produces plots illustrating the progress of projection pursuit via plots of the projected data. verb==2 adds to these plots additional information about the progress. verb==3 creates a folder in working directory and stores all plots for verb==2. if omitted then verb==0. |
labels |
(optional) vector of class labels. not used in the actual clustering procedure. only used for illustrative purposes for values of verb>0. |
maxit |
(optional) maximum number of iterations in optimisation. if omitted then maxit=50. |
ftol |
(optional) tolerance level for convergence of optimisation, based on relative function value improvements. if omitted then ftol = 1e-8. |
Value
a named list with class ppci_cluster_solution containing
$cluster |
cluster assignment vector. |
$model |
matrix containing the would-be location of each node (depth and position at depth) within a complete tree of appropriate depth. |
$nodes |
unnamed list each element of which is a named list containing details of the binary partitions at each node in the model. |
$data |
the data matrix being clustered. |
$method |
=="MCDC". used in plotting and model modification functions. |
$args |
named list of arguments passed to mcdc. |
References
Hofmeyr, D., Pavlidis, N. (2015) Maximum Clusterability Divisive Clustering. Computational Intelligence, 2015 IEEE Symposium Series on, pp. 780–786.
Examples
## load the dermatology dataset
data(dermatology)
## obtain a clustering solution using MCDC
sol <- mcdc(dermatology$x, 6)
## evaluate the performance of the solution using external cluster validity metrics
cluster_performance(sol$cluster, dermatology$c)
Maximum Clusterability Dimension Reduction
Description
Finds a linear projection of a data set using projection pursuit to maximise the variance ratio clusterability measured in each dimension separately.
Usage
mcdr(X, p, minsize, v0, verb, labels, maxit, ftol)
Arguments
X |
a numeric matrix (num_data x num_dimensions); the dataset. |
p |
an integer; the number of dimensions in the projection. |
v0 |
(optional) initial projection direction(s). a function(X) of the data, which returns a matrix with ncol(X) rows. each column of the output of v0(X) is used as an initialisation for projection pursuit. the solution with the maximum variance ratio is used within the final model. if omitted then a single initialisation is used for each column of the projection matrix; the first principal component within the null space of the other columns. |
verb |
(optional) verbosity level of optimisation procedure. verb==0 produces no output. verb==1 produces plots illustrating the progress of projection pursuit via plots of the projected data. verb==2 adds to these plots additional information about the progress. verb==3 creates a folder in working directory and stores all plots for verb==2. if omitted then verb==0. |
labels |
(optional) vector of class labels. not used in the actual projection pursuit. only used for illustrative purposes for values of verb>0. |
maxit |
(optional) maximum number of iterations in optimisation for each value of alpha. if omitted then maxit=15. |
ftol |
(optional) tolerance level for convergence of optimisation, based on relative function value improvements. if omitted then ftol = 1e-5. |
minsize |
(optional) the minimum number of data on each side of a hyperplane. if omitted then minsize = 1. |
Value
a named list with class ppci_projection_solution with the following components
$projection |
the num_dimensions x p projection matrix. |
$fitted |
the num_data x p projected data set. |
$data |
the input data matrix. |
$method |
=="MCDC". |
$params |
list of parameters used to find $projection. |
References
Hofmeyr, D., Pavlidis, N. (2015) Maximum Clusterability Divisive Clustering. Computational Intelligence, 2015 IEEE Symposium Series on, pp. 780–786.
Examples
### not run
run = FALSE
if(run){
## load optidigits dataset
data(optidigits)
## find nine dimensional projection (one fewer than
## the number of clusters, as is common in clustering)
sol <- mcdr(optidigits$x, 9)
## visualise the solution via the first 3 pairs of dimensions
plot(sol, pairs = 3, labels = optidigits$c)
## compare with PCA projection
pairs(optidigits$x%*%eigen(cov(optidigits$x))$vectors[,1:3], col = optidigits$c)
}
Maximum Clusterability Hyperplane
Description
Finds maximum clusterability hyperplane(s) for clustering.
Usage
mch(X, v0, minsize, verb, labels, maxit, ftol)
Arguments
X |
a numeric matrix (num_data x num_dimensions); the dataset to be clustered. |
v0 |
(optional) initial projection direction(s). a matrix with ncol(X) rows. each column of v0 is used as an initialisation for projection pursuit. if omitted then a single initialisation is used; the vector joining the cluster means from a 2-means solution. |
minsize |
(optional) the minimum cluster size allowable. if omitted then minsize = 1. |
verb |
(optional) verbosity level of optimisation procedure. verb==0 produces no output. verb==1 produces plots illustrating the progress of projection pursuit via plots of the projected data. verb==2 adds to these plots additional information about the progress. verb==3 creates a folder in working directory and stores all plots for verb==2. if omitted then verb==0. |
labels |
(optional) vector of class labels. not used in the actual clustering procedure. only used for illustrative purposes for values of verb>0. |
maxit |
(optional) maximum number of iterations in optimisation. if omitted then maxit=50. |
ftol |
(optional) tolerance level for convergence of optimisation, based on relative function value improvements. if omitted then ftol = 1e-8. |
Value
a named list with class ppci_hyperplane_solution with the following components
$cluster |
cluster assignment vector. |
$v |
the optimal projection vector. |
$b |
the value of b making H(v, b) the maximum clusterability hyperplane. |
$fitted |
data projected into two dimensional subspace defined by $v and the principal component in the null space of $v. |
$data |
the input data matrix. |
$fval |
the variance ratio clusterability across H(v, b). |
$method |
=="MCDC". |
$params |
list of parameters used to find H(v, b). |
$alternatives |
an unnamed list. If more than one initialisation is considered, the alternatives to the best are stored in this field. |
References
Hofmeyr, D., Pavlidis, N. (2015) Maximum Clusterability Divisive Clustering. Computational Intelligence, 2015 IEEE Symposium Series on, pp. 780–786.
Examples
## generate dataset with elongated clusters for which variance ratio in
## both dimensions is misleading for clustering
set.seed(1)
S <- matrix(c(1, .7, .7, 1), 2, 2)
X <- matrix(rnorm(2000), ncol = 2)%*%S
X[,1] <- X[,1] + rep(c(.8, -.8), each = 500)
X[,2] <- X[,2] + rep(c(-.8, .8), each = 500)
## find the optimal variance ratio hyperplane solution
sol <- mch(X)
## visualise the solution
plot(X, col = sol$cluster)
Maximum Clusterability Projection Pursuit
Description
A gateway function between functions mch and ppclust.optim, the latter calls R's base optimisation function. Not intended for independent use.
Location of Minimum Density Hyperplane
Description
The value of b which minimises the penalised integrated density on the hyperplane H(v, b) = [x : v'x = b] (assumes the data have zero mean vector). If v is a locally optimal solution for the projection index f_md then H(v, b) is a locally optimal hyperplane.
Usage
md_b(v, X, P)
Arguments
v |
a numeric vector of length ncol(X) |
X |
a numeric matrix (num_data x num_dimensions) to be projected on v |
P |
a list of parameters including (at least) $h (positive numeric bandwidth value), $alpha (positive numeric constraint width), $C (positive numeric affecting the slope of the penalty) |
Value
the value of b given in the description.
Relative Depth of a Hyperplane
Description
Finds the relative depth, min((M1-m)/m, (M2-m)/m), where M1 is the maximum density below the optimal hyperplane orthogonal to v and M2 the maximum above, while m is the integrated density on the optimal hyperplane orthogonal to v.
Usage
md_reldepth(v, X, P)
Arguments
v |
a numeric vector of length ncol(X) |
X |
a numeric matrix (num_data x num_dimensions) to be projected on v |
P |
a list of parameters including (at least) $h (positive numeric bandwidth value), $alpha (positive numeric constraint width), $C (positive numeric affecting the slope of the penalty) |
Value
the value of the relative depth of the best hyperplane orthogonal to v.
Divisive Clustering Using Minimum Density Hyperplanes
Description
Generates a binary partitioning tree by recursively partitioning a dataset using a hierarchical collection of hyperplanes with low empirical density integral.
Usage
mddc(X, K, minsize, split.index, v0, bandwidth,
alphamin, alphamax, verb, labels, maxit, ftol)
Arguments
X |
a numeric matrix (num_data x num_dimensions); the dataset to be clustered. |
K |
the number of clusters to extract. |
minsize |
(optional) minimum cluster size. if omitted then minsize = 1. |
split.index |
(optional) determines the order in which clusters are split (in decreasing order of split indices). can be a numeric valued function(v, X, P) of projection vector v, data matrix X and list of parameters P. can also be one of "size" (split the largest cluster), "fval" (split the cluster with the minimum density hyperplane) or "rdepth" (split the cluster with the maximum relative depth). if omitted then "size" is used. |
v0 |
(optional) initial projection direction(s). a function(X) of the data being split, which returns a matrix with ncol(X) rows. each column of the output of v0(X) is used as an initialisation for projection pursuit. the solution with the greatest relative depth is used within the final model. initialisations are determined separately for each cluster being split. if omitted then a single initialisation is used; the first principal component. |
bandwidth |
(optional) used to compute the bandwidth parameter (h) for the kernel density estimator. a numeric valued function(X) of the cluster being split. if omitted then bandwidth(X) = 0.9*eigen(cov(X))$values[1]^.5*nrow(X)^(-0.2). |
alphamin |
(optional) initial (scaled) bound on the distance of the optimal hyperplane from the mean of the data (or subset being split). if omitted then alphamin = 0. |
alphamax |
(optional) maximum (scaled) distance of the optimal hyperplane from the mean of the data (or subset being split). if omitted then alphamax = 1. |
verb |
(optional) verbosity level of optimisation procedure. verb==0 produces no output. verb==1 produces plots illustrating the progress of projection pursuit via plots of the projected data. verb==2 adds to these plots additional information about the progress. verb==3 creates a folder in working directory and stores all plots for verb==2. if omitted then verb==0. |
labels |
(optional) vector of class labels. not used in the actual clustering procedure. only used for illustrative purposes for values of verb>0. |
maxit |
(optional) maximum number of iterations in optimisation for each value of alpha. if omitted then maxit=15. |
ftol |
(optional) tolerance level for convergence of optimisation, based on relative function value improvements. if omitted then ftol = 1e-5. |
Value
a named list with class ppci_cluster_solution containing
$cluster |
cluster assignment vector. |
$model |
matrix containing the would-be location of each node (depth and position at depth) within a complete tree of appropriate depth. |
$nodes |
unnamed list each element of which is a named list containing details of the binary partitions at each node in the model. |
$data |
the data matrix being clustered. |
$method |
=="MDH". used in plotting and model modification functions. |
$args |
named list of arguments passed to mddc. |
References
Pavlidis N.G., Hofmeyr D.P., Tasoulis S.K. (2016) Minimum Density Hyperplanes. Journal of Machine Learning Research, 17(156), 1–33.
Examples
## load dermatology dataset
data(dermatology)
## obtain a clustering solution using minimum density hyperplanes
sol <- mddc(dermatology$x, 6)
## evaluate the solution using external cluster validity metrics
cluster_performance(sol$cluster, dermatology$c)
Minimum Density Dimension Reduction
Description
Finds a linear projection of a data set using projection pursuit to find the vector(s) orthogonal to minimum density hyperplanes.
Usage
mddr(X, p, minsize, v0, bandwidth,
alphamin, alphamax, verb, labels, maxit, ftol)
Arguments
X |
a numeric matrix (num_data x num_dimensions); the dataset. |
p |
an integer; the number of dimensions in the projection. |
v0 |
(optional) initial projection direction(s). a function(X) of the data, which returns a matrix with ncol(X) rows. each column of the output of v0(X) is used as an initialisation for projection pursuit. the solution with the minimum integrated density is used within the final model. if omitted then a single initialisation is used for each column of the projection matrix; the first principal component within the null space of the other columns. |
bandwidth |
(optional) used to compute the bandwidth parameter (h) for the kernel density estimator. a numeric valued function(X) of the cluster being split. if omitted then bandwidth(X) = 0.9*eigen(cov(X))$values[1]^.5*nrow(X)^(-0.2). |
alphamin |
(optional) initial (scaled) bound on the distance of the optimal hyperplane from the mean of the data. if omitted then alphamin = 0. |
alphamax |
(optional) maximum (scaled) distance of the optimal hyperplane from the mean of the data. if omitted then alphamax = 1. |
verb |
(optional) verbosity level of optimisation procedure. verb==0 produces no output. verb==1 produces plots illustrating the progress of projection pursuit via plots of the projected data. verb==2 adds to these plots additional information about the progress. verb==3 creates a folder in working directory and stores all plots for verb==2. if omitted then verb==0. |
labels |
(optional) vector of class labels. not used in the actual projection pursuit. only used for illustrative purposes for values of verb>0. |
maxit |
(optional) maximum number of iterations in optimisation for each value of alpha. if omitted then maxit=15. |
ftol |
(optional) tolerance level for convergence of optimisation, based on relative function value improvements. if omitted then ftol = 1e-5. |
minsize |
(optional) the minimum number of data on each side of a hyperplane. if omitted then minsize = 1. |
Value
a named list with class ppci_projection_solution with the following components
$projection |
the num_dimensions x p projection matrix. |
$fitted |
the num_data x p projected data set. |
$data |
the input data matrix. |
$method |
=="MDH". |
$params |
list of parameters used to find $projection. |
References
Pavlidis N.G., Hofmeyr D.P., Tasoulis S.K. (2016) Minimum Density Hyperplanes. Journal of Machine Learning Research, 17(156), 1–33.
Examples
### not run
run = FALSE
if(run){
## load optidigits dataset
data(optidigits)
## find nine dimensional projection (one fewer than
## the number of clusters, as is common in clustering)
sol <- mddr(optidigits$x, 9)
## visualise the solution via the first 3 pairs of dimensions
plot(sol, pairs = 3, labels = optidigits$c)
## compare with PCA projection
pairs(optidigits$x%*%eigen(cov(optidigits$x))$vectors[,1:3], col = optidigits$c)
}
Minimum Density Hyperplane
Description
Finds minimum density hyperplane(s) for clustering.
Usage
mdh(X, v0, minsize, bandwidth, alphamin, alphamax, verb, labels, maxit, ftol)
Arguments
X |
a numeric matrix (num_data x num_dimensions); the dataset to be clustered. |
v0 |
(optional) initial projection direction(s). a matrix with ncol(X) rows. each column of v0 is used as an initialisation for projection pursuit. if omitted then a single initialisation is used; the first principal component. |
minsize |
(optional) minimum cluster size. if omitted then minsize = 1. |
bandwidth |
(optional) positive numeric bandwidth parameter (h) for the kernel density estimator. if omitted then bandwidth = 0.9*eigen(cov(X))$values[1]^.5*nrow(X)^(-0.2). |
alphamin |
(optional) initial (scaled) bound on the distance of the optimal hyperplane from the mean of the data. if omitted then alphamin = 0. |
alphamax |
(optional) maximum/final (scaled) distance of the optimal hyperplane from the mean of the data. if omitted then alphamax = 1. |
verb |
(optional) verbosity level of optimisation procedure. verb==0 produces no output. verb==1 produces plots illustrating the progress of projection pursuit via plots of the projected data. verb==2 adds to these plots additional information about the progress. verb==3 creates a folder in working directory and stores all plots for verb==2. if omitted then verb==0. |
labels |
(optional) vector of class labels. not used in the actual clustering procedure. only used for illustrative purposes for values of verb>0. |
maxit |
(optional) maximum number of iterations in optimisation for each value of alpha. if omitted then maxit=15. |
ftol |
(optional) tolerance level for convergence of optimisation, based on relative function value improvements. if omitted then ftol = 1e-5. |
Value
a named list with class ppci_hyperplane_solution with the following components
$cluster |
cluster assignment vector. |
$v |
the optimal projection vector. |
$b |
the value of b making H(v, b) the minimum density hyperplane. |
$fitted |
data projected into two dimensional subspace defined by $v and the principal component in the null space of $v. |
$data |
the input data matrix. |
$rel.dep |
the relative depth of H(v, b). |
$fval |
the integrated density on H(v, b). |
$method |
=="MDH". |
$params |
list of parameters used to find H(v, b). |
$alternatives |
an unnamed list. If more than one initialisation is considered, the alternatives to the best are stored in this field. |
References
Pavlidis N.G., Hofmeyr D.P., Tasoulis S.K. (2016) Minimum Density Hyperplanes. Journal of Machine Learning Research, 17(156), 1–33.
Examples
## load breast cancer dataset
data(breastcancer)
## find minimum density hyperplane
sol <- mdh(breastcancer$x)
## visualise the solution
plot(sol)
## evaluate the quality of the partition
success_ratio(sol$cluster, breastcancer$c)
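The default bandwidth described under the h argument can be reproduced explicitly and rescaled, which is useful when the default produces an overly rough or overly smooth density estimate. A brief sketch (the scaling factor 1.5 is an arbitrary illustrative choice):

```r
## reproduce the default bandwidth rule, then widen it
data(breastcancer)
X <- breastcancer$x
h_default <- 0.9*eigen(cov(X))$values[1]^.5*nrow(X)^(-0.2)
sol_wide <- mdh(X, h = 1.5*h_default)
plot(sol_wide)
```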
Minimum Density Projection Pursuit
Description
A gateway function between mdh and ppclust.optim, the latter of which calls R's base optimisation function optim. Not intended for independent use.
Location of Minimum Normalised Cut Hyperplane
Description
The value of b which minimises the normalised cut across the hyperplane H(v, b) = [x : v'x = b]. If v is a locally optimal solution for the projection index f_ncut then H(v, b) is a locally optimal hyperplane.
Usage
ncut_b(v, X, P)
Arguments
v |
a numeric vector of length ncol(X) |
X |
a numeric matrix (num_data x num_dimensions) to be projected on v |
P |
a list of parameters including (at least) $s (positive numeric scaling parameter), $nmin (positive integer minimum cluster size). |
Value
the value of b given in the description.
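Although ncut_b is not intended for routine use, it can be instructive to evaluate it directly along a fixed projection. A hedged sketch, assuming ncut_b is accessible and that the default scaling rule from ncuth is a reasonable choice for P$s (the value of nmin here is arbitrary):

```r
## locate the minimum normalised cut along the first principal component
data(breastcancer)
X <- breastcancer$x
v <- eigen(cov(X))$vectors[,1]
P <- list(s = 100*eigen(cov(X))$values[1]^.5*nrow(X)^(-0.2), nmin = 1)
b <- ncut_b(v, X, P)
```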
Divisive Clustering Using Minimum Normalised Cut Hyperplanes
Description
Generates a binary partitioning tree by recursively partitioning a dataset using a hierarchical collection of hyperplanes with low normalised cut measured across them.
Usage
ncutdc(X, K, split.index, v0, s, minsize, verb, labels, maxit, ftol)
Arguments
X |
a numeric matrix (num_data x num_dimensions); the dataset to be clustered. |
K |
the number of clusters to extract. |
split.index |
(optional) determines the order in which clusters are split (in decreasing order of split indices). can be a numeric valued function(v, X, P) of projection vector v, data matrix X and list of parameters P. can also be one of "size" (split the largest cluster) or "fval" (split the cluster whose partitioning hyperplane has the minimum normalised cut). if omitted then "fval" is used. |
v0 |
(optional) initial projection direction(s). a function(X) of the data being split, which returns a matrix with ncol(X) rows. each column of the output of v0(X) is used as an initialisation for projection pursuit. the solution with the minimum normalised cut is used within the final model. initialisations are determined separately for each cluster being split. if omitted then a single initialisation is used; the first principal component. |
s |
(optional) used to compute the scaling parameter (sigma) for pairwise similarities. a numeric valued function(X) of the cluster being split. if omitted then s(X) = 100*eigen(cov(X))$values[1]^.5*nrow(X)^(-0.2). |
minsize |
(optional) the minimum cluster size allowable. if omitted then minsize = 1. |
verb |
(optional) verbosity level of the optimisation procedure. verb==0 produces no output. verb==1 illustrates the progress of projection pursuit via plots of the projected data. verb==2 adds to these plots additional information about the progress. verb==3 creates a folder in the working directory and stores all plots for verb==2. if omitted then verb==0. |
labels |
(optional) vector of class labels. not used in the actual clustering procedure. only used for illustrative purposes for values of verb>0. |
maxit |
(optional) maximum number of iterations in optimisation. if omitted then maxit=50. |
ftol |
(optional) tolerance level for convergence of optimisation, based on relative function value improvements. if omitted then ftol = 1e-8. |
Value
a named list with class ppci_cluster_solution containing
$cluster |
cluster assignment vector. |
$model |
matrix containing the would-be location of each node (depth and position at depth) within a complete tree of appropriate depth. |
$nodes |
unnamed list, each element of which is a named list containing details of the binary partition at the corresponding node in the model. |
$data |
the data matrix being clustered. |
$method |
=="NCutH". used in plotting and model modification functions. |
$args |
named list of arguments passed to ncutdc. |
References
Hofmeyr, D. (2017) Clustering by Minimum Cut Hyperplanes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8), 1547–1560.
Examples
## load dermatology dataset
data(dermatology)
## obtain clustering solution
sol <- ncutdc(dermatology$x, 6)
## evaluate solution using external cluster validity metrics
cluster_performance(sol$cluster, dermatology$c)
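The split.index argument accepts any numeric valued function(v, X, P), as described above. A sketch of a custom index (illustrative only; this particular choice favours splitting clusters with large variance along their optimal projection):

```r
## custom split.index: variance of the cluster along its projection vector
data(dermatology)
my_index <- function(v, X, P) var(as.numeric(X %*% v))
sol_custom <- ncutdc(dermatology$x, 6, split.index = my_index)
cluster_performance(sol_custom$cluster, dermatology$c)
```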
Minimum Normalised Cut Dimension Reduction
Description
Finds a linear projection of a data set using projection pursuit based on minimising the normalised cut measured in each dimension separately.
Usage
ncutdr(X, p, v0, s, minsize, verb, labels, maxit, ftol)
Arguments
X |
a numeric matrix (num_data x num_dimensions); the dataset. |
p |
an integer; the number of dimensions in the projection. |
v0 |
(optional) initial projection direction(s). a function(X) of the data, which returns a matrix with ncol(X) rows. each column of the output of v0(X) is used as an initialisation for projection pursuit. the solution with the minimum normalised cut is used within the final model. if omitted then a single initialisation is used for each column of the projection matrix; the first principal component within the null space of the other columns. |
s |
(optional) used to compute the scaling parameter (sigma) for pairwise similarities. a numeric valued function(X) of the data. if omitted then s(X) = 100*eigen(cov(X))$values[1]^.5*nrow(X)^(-0.2). |
minsize |
(optional) the minimum cluster size allowable when computing the normalised cut. if omitted then minsize = 1. |
verb |
(optional) verbosity level of the optimisation procedure. verb==0 produces no output. verb==1 illustrates the progress of projection pursuit via plots of the projected data. verb==2 adds to these plots additional information about the progress. verb==3 creates a folder in the working directory and stores all plots for verb==2. if omitted then verb==0. |
labels |
(optional) vector of class labels. not used in the actual projection pursuit. only used for illustrative purposes for values of verb>0. |
maxit |
(optional) maximum number of iterations in optimisation. if omitted then maxit=50. |
ftol |
(optional) tolerance level for convergence of optimisation, based on relative function value improvements. if omitted then ftol = 1e-8. |
Value
a named list with class ppci_projection_solution with the following components
$projection |
the num_dimensions x p projection matrix. |
$fitted |
the num_data x p projected data set. |
$data |
the input data matrix. |
$method |
=="NCutH". |
$params |
list of parameters used to find $projection. |
References
Hofmeyr, D. (2017) Clustering by Minimum Cut Hyperplanes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8), 1547–1560.
Examples
### not run
run = FALSE
if(run){
## load optidigits dataset
data(optidigits)
## find nine dimensional projection (one fewer than
## the number of clusters, as is common in clustering)
sol <- ncutdr(optidigits$x, 9)
## visualise the solution via the first 3 pairs of dimensions
plot(sol, pairs = 3, labels = optidigits$c)
## compare with PCA projection
pairs(optidigits$x%*%eigen(cov(optidigits$x))$vectors[,1:3], col = optidigits$c)
}
Minimum Normalised Cut Hyperplane
Description
Finds minimum normalised cut hyperplane(s) for clustering.
Usage
ncuth(X, v0, s, minsize, verb, labels, maxit, ftol)
Arguments
X |
a numeric matrix (num_data x num_dimensions); the dataset to be clustered. |
v0 |
(optional) initial projection direction(s). a matrix with ncol(X) rows. each column of v0 is used as an initialisation for projection pursuit. if omitted then a single initialisation is used; the first principal component. |
s |
(optional) positive numeric scaling parameter (sigma). if omitted then s = 100*eigen(cov(X))$values[1]^.5*nrow(X)^(-0.2). |
minsize |
(optional) the minimum cluster size allowable. if omitted then minsize = 1. |
verb |
(optional) verbosity level of the optimisation procedure. verb==0 produces no output. verb==1 illustrates the progress of projection pursuit via plots of the projected data. verb==2 adds to these plots additional information about the progress. verb==3 creates a folder in the working directory and stores all plots for verb==2. if omitted then verb==0. |
labels |
(optional) vector of class labels. not used in the actual clustering procedure. only used for illustrative purposes for values of verb>0. |
maxit |
(optional) maximum number of iterations in optimisation. if omitted then maxit=50. |
ftol |
(optional) tolerance level for convergence of optimisation, based on relative function value improvements. if omitted then ftol = 1e-8. |
Value
a named list with class ppci_hyperplane_solution with the following components
$cluster |
cluster assignment vector. |
$v |
the optimal projection vector. |
$b |
the value of b making H(v, b) the minimum normalised cut hyperplane. |
$fitted |
the data projected into the two-dimensional subspace defined by $v and the first principal component in the null space of $v. |
$data |
the input data matrix. |
$fval |
the normalised cut across H(v, b). |
$method |
=="NCutH". |
$params |
list of parameters used to find H(v, b). |
$alternatives |
an unnamed list. If more than one initialisation is considered, the alternatives to the best solution are stored in this field. |
References
Hofmeyr, D. (2017) Clustering by Minimum Cut Hyperplanes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8), 1547–1560.
Examples
## load breast cancer dataset
data(breastcancer)
## find minimum normalised cut hyperplane
sol <- ncuth(breastcancer$x)
## visualise the solution
plot(sol)
## evaluate the performance of the solution
success_ratio(sol$cluster, breastcancer$c)
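As described under the v0 argument, multiple initialisations can be supplied as the columns of a matrix; the solution with the minimum normalised cut is returned, with the remainder stored in $alternatives. A brief sketch using the first two principal components:

```r
## two initialisations: the first two principal components
data(breastcancer)
E <- eigen(cov(breastcancer$x))$vectors[,1:2]
sol2 <- ncuth(breastcancer$x, v0 = E)
plot(sol2)
```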
Minimum Normalised Cut Projection Pursuit
Description
A gateway function between ncuth and ppclust.optim, the latter of which calls R's base optimisation function optim. Not intended for independent use.
Visualise a Node in a Hierarchical Clustering Model
Description
Provides a visualisation of the partition at an internal node (or of the potential partition at a leaf node) via a two-dimensional projection of the data assigned to the node.
Usage
node_plot(sol, node, labels, transparency)
Arguments
sol |
a clustering solution arising from one of the functions mcdc, mddc and ncutdc. |
node |
the node to be visualised. can be either an integer specifying the node number in sol$nodes or a vector of length two specifying c(depth, position at depth) of the node. |
labels |
(optional) a vector of class labels. if provided then points in different classes are plotted in different colours. the external validity of each partition is also indicated when this can be computed using true class labels. |
transparency |
(optional) if omitted then points in the scatterplot are shown as solid. If set to a value in (0, 1) then points are shown with transparency. Lower values correspond to a greater degree of transparency. |
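A brief sketch of addressing a node both by its number and by its c(depth, position at depth) pair (the indexing follows the layout recorded in the fitted model; here c(1, 1) is assumed to address the root):

```r
## fit a model, then inspect the root node in two equivalent ways
data(dermatology)
sol <- ncutdc(dermatology$x, 6)
node_plot(sol, node = 1, labels = dermatology$c)
node_plot(sol, node = c(1, 1), labels = dermatology$c)
```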
Euclidean Norm of a Vector
Description
Computes the Euclidean (L2) norm of a vector.
Usage
norm_vec(v)
Arguments
v |
a numeric vector |
Value
the Euclidean norm of v.
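Assuming norm_vec is accessible, for example:

```r
norm_vec(c(3, 4))
## [1] 5
```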
Optical Recognition of Handwritten Digits
Description
This data set contains vectorised images of handwritten digits (0–9), compressed to 8x8 pixels.
Usage
optidigits
Format
A list with entries $x (a 5620x64 matrix with each row corresponding to an image) and $c (a vector of labels indicating the written digit).
Source
UCI Machine Learning Repository.
References
Lichman, M. (2013) UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml
Visualise Cluster Means from optidigits data set
Description
Provides a visualisation of the cluster means computed from the optidigits data set, recast as images. Cluster labels are aligned with the true labels using simulated annealing to maximise the trace of the confusion matrix (or of a subset of it if the number of clusters differs from the number of classes).
Usage
optidigits_mean_images(clusters)
Arguments
clusters |
a vector of cluster assignments. Must take values in 1:k, where k is the number of clusters. |
References
Lichman, M. (2013) UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml
Examples
### not run
run = FALSE
if(run){
## load optidigits dataset
data(optidigits)
## obtain a clustering solution using normalised cut hyperplanes
sol <- ncutdc(optidigits$x, 10)
## visualise the cluster means as images
optidigits_mean_images(sol$cluster)
}
Pen-based Recognition of Handwritten Digits
Description
This data set contains features derived from pen trajectories arising from handwritten digits (0–9) from 44 subjects.
Usage
pendigits
Format
A list with entries $x (a 10992x16 matrix with each row corresponding to a pen trajectory) and $c (a vector of labels indicating the written digit).
Source
UCI Machine Learning Repository.
References
Lichman, M. (2013) UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml
Visualise a Hierarchical Clustering Model, or a Node Within a Hierarchical Clustering Model
Description
Provides a visualisation of a hierarchical clustering model via two-dimensional projections of the data assigned to each node. Alternatively a more detailed visualisation of a single node in a hierarchical model.
Usage
## S3 method for class 'ppci_cluster_solution'
plot(x, node = NULL, labels = NULL, node.numbers = NULL, transparency = NULL, ...)
Arguments
x |
a clustering solution arising from one of the functions mcdc, mddc and ncutdc. |
node |
(optional) integer corresponding to a node number (see node.numbers below). If provided then a detailed plot of the specified node is provided, instead of the entire cluster hierarchy. |
labels |
(optional) a vector of class labels. if provided then points in different classes are plotted in different colours. |
node.numbers |
(optional) logical. if TRUE then numbers are added to the plot to indicate the order in which nodes were added to the model. if omitted then node.numbers = TRUE. |
transparency |
(optional) if omitted then points in scatterplots are shown as solid. If set to a value in (0, 1) then points are shown with transparency. Lower values correspond to a greater degree of transparency. |
... |
additional graphical parameters. Currently none implemented. |
Value
None
See Also
mcdc, mddc, ncutdc, tree_plot, node_plot
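Because plot dispatches on the class of the fitted model, the hierarchy and individual nodes can be inspected directly from the clustering solution. A brief sketch:

```r
## full hierarchy, then a detailed view of a single node
data(dermatology)
sol <- ncutdc(dermatology$x, 6)
plot(sol, labels = dermatology$c)
plot(sol, node = 1, labels = dermatology$c)
```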
Visualise a Hyperplane Separator for Clustering
Description
Provides a visualisation of the partition induced by a hyperplane separator generated by one of mch, mdh and ncuth.
Usage
## S3 method for class 'ppci_hyperplane_solution'
plot(x, labels = NULL, transparency = NULL, ...)
Arguments
x |
a solution arising from one of the functions mch, mdh and ncuth. |
labels |
(optional) a vector of class labels. if provided then points in different classes are plotted in different colours. |
transparency |
(optional) if omitted then points in the scatterplot are shown as solid. If set to a value in (0, 1) then points are shown with transparency. Lower values correspond to a greater degree of transparency. |
... |
additional graphical parameters. Currently none implemented. |
Value
None
See Also
mch, mdh, ncuth, hp_plot
Visualise a Data Set Projected from Projection Pursuit
Description
Provides a visualisation of a data set using a low dimensional projection obtained using projection pursuit based on cluster based projection indices.
Usage
## S3 method for class 'ppci_projection_solution'
plot(x, labels = NULL, pairs = NULL, PCA = NULL, transparency = NULL, ...)
Arguments
x |
a solution arising from one of the functions mcdr, mddr and ncutdr. |
labels |
(optional) a vector of class labels. if provided then points in different classes are plotted in different colours. |
pairs |
(optional) if omitted then the first two dimensions are shown. If an integer > 1 then all pairs from the first 1:pairs dimensions are shown. |
PCA |
(optional) logical. If TRUE then an additional principal component rotation is applied. This can make more clusters visible within a low dimensional projection. |
transparency |
(optional) if omitted then points in scatterplots are shown as solid. If set to a value in (0, 1) then points are shown with transparency. Lower values correspond to a greater degree of transparency. |
... |
additional graphical parameters. Currently none implemented. |
Value
None
See Also
mcdr, mddr, ncutdr
Optimisation Call for Projection Pursuit Algorithms
Description
Provides a gateway to R's base optimisation function optim, with additional options as required by projection pursuit algorithms within the package. Not intended for independent use.
Determine the Largest Number of Nodes at Any Depth in a Clustering Hierarchy
Description
Used to specify the geometry of the plot region in the function tree_plot. Not intended for independent use.
Evaluate External Validity of a Binary Partition
Description
Computes the success ratio of a binary partition by comparing the solution with true class labels.
Usage
success_ratio(assigned, labels)
Arguments
assigned |
a vector of cluster assignments made by a clustering algorithm. |
labels |
a vector of true class labels to be compared with assigned. |
Value
the success ratio of the cluster assignment solution.
References
Hofmeyr, D. (2017) Clustering by Minimum Cut Hyperplanes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8), 1547–1560.
Examples
## load optidigits dataset
data(optidigits)
## generate a binary partition using minimum normalised cut hyperplane
sol <- ncuth(optidigits$x)
## evaluate using success ratio
success_ratio(sol$cluster, optidigits$c)
Visualise a Hierarchical Clustering Model
Description
Provides a visualisation of a hierarchical clustering model via two-dimensional projections of the data assigned to each node.
Usage
tree_plot(sol, labels, node.numbers, transparency)
Arguments
sol |
a clustering solution arising from one of the functions mcdc, mddc and ncutdc. |
labels |
(optional) a vector of class labels. if provided then points in different classes are plotted in different colours. |
node.numbers |
(optional) logical. if TRUE then numbers are added to the plot to indicate the order in which nodes were added to the model. if omitted then node.numbers = TRUE. |
transparency |
(optional) if omitted then points in scatterplots are shown as solid. If set to a value in (0, 1) then points are shown with transparency. Lower values correspond to a greater degree of transparency. |
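A brief sketch combining class labels with partial transparency (the transparency value is an arbitrary illustrative choice):

```r
data(dermatology)
sol <- ncutdc(dermatology$x, 6)
tree_plot(sol, labels = dermatology$c, transparency = 0.5)
```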
Prune a Hierarchical Clustering Model
Description
Removes the subtree rooted at the specified node from a hierarchical clustering model generated by one of mcdc, mddc and ncutdc.
Usage
tree_prune(sol, node)
Arguments
sol |
a clustering solution arising from one of the functions mcdc, mddc and ncutdc. |
node |
the node at which to prune the hierarchy. can be either an integer specifying the node number in sol$nodes or a vector of length two specifying c(depth, position at depth) of the node. |
Value
a list with the same components as sol.
Examples
## load the optidigits dataset
data(optidigits)
## cluster using minimum normalised cut hyperplanes,
## assuming no domain knowledge, begin with 12 clusters
sol <- ncutdc(optidigits$x, 12)
## the node numbered 11 has been split,
## yet it appears there may not be multiple clusters present.
## inspect this node more closely
plot(sol, node = 11)
## remove this node from the model
sol_new <- tree_prune(sol, 11)
## compare the solutions using external cluster validity metrics
cluster_performance(sol$cluster, optidigits$c)
cluster_performance(sol_new$cluster, optidigits$c)
Split a Leaf in a Hierarchical Clustering Model
Description
Adds an additional binary partition to an existing hierarchical clustering model produced by one of mcdc, mddc and ncutdc.
Usage
tree_split(sol, node, ...)
Arguments
sol |
a clustering solution arising from one of the functions mcdc, mddc and ncutdc. |
node |
the node to be further partitioned. can be either an integer specifying the node number in sol$nodes or a vector of length two specifying c(depth, position at depth) of the node. |
... |
any modifications to parameters used in optimisation. these should have the same names and types as the corresponding arguments for the method used to construct sol. |
Value
a list with the same components as sol. the $args field will reflect any changes included in ... above.
Examples
## load the optidigits dataset
data(optidigits)
## cluster using minimum normalised cut hyperplanes,
## assuming no domain knowledge, begin with 8 clusters
sol <- ncutdc(optidigits$x, 8)
## visualise solution
plot(sol)
## node 13 shows evidence of multiple clusters. Inspect this node more closely
plot(sol, node = 13)
## split node 13
sol_new <- tree_split(sol, 13)
## compare the solutions using external cluster validity metrics
cluster_performance(sol$cluster, optidigits$c)
cluster_performance(sol_new$cluster, optidigits$c)
Face Recognition
Description
This data set contains vectorised images of the faces of 10 different human subjects with different poses and lighting conditions. The images were compressed to 30x20 pixels.
Usage
yale
Format
A list with entries $x (a 2000x600 matrix with each row corresponding to an image) and $y (a vector of labels indicating the human subject).
Source
Yale Faces Database B. Compressed images (30x40) available from https://cervisia.org/machine_learning_data.php/. Further compression was performed by the package developers. In addition, only the first 200 images of each subject are included.
References
Georghiades, A.S., Belhumeur, P.N. and Kriegman, D.J. (2001) From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 643–660.