Help for package HVT

Type:

Package

Date:

2025-07-03

Title:

Constructing Hierarchical Voronoi Tessellations and Overlay Heatmaps for Data Analysis

Version:

25.2.5

Description:

Facilitates building topology preserving maps for data analysis.

License:

Apache License 2.0

Encoding:

UTF-8

Imports:

MASS, grDevices, splancs, stats, dplyr, NbClust, purrr, magrittr, ggplot2, tidyr, scales, cluster, reshape2, FNN, Rtsne, umap, plyr, markovchain, methods, deldir, gridExtra

Depends:

R (≥ 4.0.0)

BugReports:

https://github.com/Mu-Sigma/HVT/issues

URL:

https://github.com/Mu-Sigma/HVT

RoxygenNote:

7.3.2

Suggests:

knitr, rmarkdown, testthat, geozoo, plotly, rlang, DT, patchwork, sp, Hmisc, data.table, gtable, htmlwidgets, skimr, tibble, devtools, gifski, tidyverse, DataExplorer, htmltools, corrplot, kableExtra, polyclip, conf.design

VignetteBuilder:

knitr

NeedsCompilation:

Packaged:

2025-07-03 17:46:54 UTC; vishwavani

Maintainer:

Zubin Dowlaty <zubin.dowlaty@mu-sigma.com>

Repository:

CRAN

Date/Publication:

2025-07-04 07:50:05 UTC

Collate:

'Add_boundary_points.R' 'Corrected_Tessellations.R' 'Transform_Coordinates.R' 'ScaleMat.R' 'DelaunayInfo.R' 'Delete_Outpoints.R' 'VQ_codebookSplit.R' 'clusterPlot.R' 'clustHVT.R' 'diagPlot.R' 'diagSuggestion.R' 'displayTable.R' 'edaPlots.R' 'getCellId.R' 'getCentroids.R' 'getCentroids_for_opti.R' 'getOptimalCentroids.R' 'getTransitionProbability.R' 'global.R' 'hvq.R' 'madPlot.R' 'msm_plots.R' 'msm.R' 'plotAnimatedFlowmap.R' 'plotHVT.R' 'plotModelDiagnostics.R' 'plotNovelCells.R' 'plotQuantErrorHistogram.R' 'plotStateTransition.R' 'plotZscore.R' 'reconcileTransitionProbability.R' 'removeNovelty.R' 'scoreHVT.R' 'scoreLayeredHVT.R' 'summary.R' 'trainHVT.R'

Author:

Zubin Dowlaty [aut, cre], Mu Sigma, Inc. [cph]

VQ_codebookSplit

Description

Vector Quantization by codebook split method

Usage

VQ_codebookSplit(dataset, quant.err = 0.5, epsilon = NULL)

Arguments

dataset

Matrix. A matrix of multivariate data. Each row corresponds to an observation, and each column corresponds to a variable. Missing values are not accepted.

quant.err

Numeric. The quantization error for the algorithm.

epsilon

Numeric. The value to offset the codebooks during the codebook split. Default is NULL, in which case the value is set to quant.err parameter.

Details

Performs vector quantization by the codebook split method. Initially, the entire dataset is treated as one cluster whose codebook is the mean of the cluster. The quantization criterion is checked and, if it is not met, the codebook is split so that the new codebooks are (codebook + epsilon) and (codebook - epsilon). The observations are reassigned to these new codebooks based on the nearest-neighbour condition, and the means are recomputed for the new clusters. This is repeated iteratively until all the clusters meet the quantization criterion.
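
The split-and-reassign loop described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the function name codebook_split_sketch, the L1 distance, the mean per-cluster error as the stopping criterion, and the max_iter guard are all assumptions made for the example.

```r
# Illustrative sketch only (hypothetical helper, not part of HVT).
# Assumptions: L1 distance, mean error per cluster as the criterion.
codebook_split_sketch <- function(x, quant_err = 0.5, epsilon = quant_err,
                                  max_iter = 50) {
  x <- as.matrix(x)
  # L1 distance of every row of m to a single centre
  l1 <- function(m, center) rowSums(abs(sweep(m, 2, center)))

  # Start with one cluster whose codebook is the overall mean
  codebooks <- matrix(colMeans(x), nrow = 1)
  cl <- rep(1L, nrow(x))
  err <- mean(l1(x, codebooks[1, ]))

  iter <- 0
  while (any(err > quant_err) && iter < max_iter) {
    iter <- iter + 1
    bad <- which(err > quant_err)
    # Replace each failing codebook by (codebook + epsilon) and (codebook - epsilon)
    cand <- rbind(codebooks[-bad, , drop = FALSE],
                  codebooks[bad, , drop = FALSE] + epsilon,
                  codebooks[bad, , drop = FALSE] - epsilon)
    # Reassign observations to the nearest candidate codebook
    d <- vapply(seq_len(nrow(cand)), function(j) l1(x, cand[j, ]),
                numeric(nrow(x)))
    d <- matrix(d, nrow = nrow(x))
    cl <- max.col(-d, ties.method = "first")
    # Drop empty clusters, then recompute means and errors
    used <- sort(unique(cl))
    codebooks <- do.call(rbind, lapply(used, function(j) {
      colMeans(x[cl == j, , drop = FALSE])
    }))
    cl <- match(cl, used)
    err <- vapply(seq_along(used), function(j) {
      mean(l1(x[cl == j, , drop = FALSE], codebooks[j, ]))
    }, numeric(1))
  }
  list(codebooks = codebooks, cluster = cl, error = err)
}

# Toy data: two tight groups far apart
toy <- cbind(c(0, 0.1, 0.2, 10, 10.1, 10.2),
             c(0, 0.1, 0.2, 10, 10.1, 10.2))
res <- codebook_split_sketch(toy, quant_err = 0.5)
res$codebooks
```

On this toy input a single split is enough, because each of the two resulting clusters already satisfies the threshold.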

Value

clusters

List. A list showing each ID assigned to a cluster.

nodes.clust

List. A list corresponding to nodes' details.

idnodes

List. A list of ID and segments similar to nodes.clust with additional columns for nodes ID.

error.quant

List. A list of quantization error for all levels and nodes.

plt.clust

List. A list of logical values indicating if the quantization error was met.

summary

Summary. Output table with summary.

Author(s)

Sangeet Moy Das <sangeet.das@mu-sigma.com>

Examples



data("iris", package = "datasets")
iris <- iris[, 1:2]

vqOutput <- VQ_codebookSplit(iris, quant.err = 0.5)

Performing Hierarchical Clustering Analysis

Description

This is the main function for hierarchical clustering analysis. It determines the optimal number of clusters, performs AGNES clustering, and plots the 2D cluster HVT plot.
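
The agglomerative step can be illustrated with base R alone. The sketch below is not what clustHVT runs internally: the centroids are simulated, and k is fixed by hand where the function would use the number suggested by NbClust.

```r
# Illustrative sketch: Ward clustering of simulated 2D centroids.
set.seed(1)
centroids <- rbind(matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
                   matrix(rnorm(20, mean = 5, sd = 0.2), ncol = 2))

# Same linkage as the default clustering_method
hc <- hclust(dist(centroids), method = "ward.D2")

k <- 2                      # assumed here; stands in for the suggested k
cluster_id <- cutree(hc, k = k)
table(cluster_id)
```

Calling plot(hc) on the result draws the corresponding dendrogram.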

Usage

clustHVT(
  data,
  trainHVT_results,
  scoreHVT_results,
  clustering_method = "ward.D2",
  indices,
  clusters_k = "champion",
  type = "default",
  domains.column
)

Arguments

data

Data frame. A data frame intended for performing hierarchical clustering analysis.

trainHVT_results

List. A list object which is obtained as a result of trainHVT function.

scoreHVT_results

List. A list object which is obtained as a result of scoreHVT function.

clustering_method

Character. The method used for clustering in both NbClust and hclust function. Defaults to ‘ward.D2’.

indices

Character. The indices used for determining the optimal number of clusters in NbClust function. By default it uses 20 different indices.

clusters_k

Character. A parameter that specifies the number of clusters for the provided data. The options are “champion”, “challenger”, or any integer between 1 and 20. Selecting “champion” will use the highest number of clusters recommended by the ‘NbClust’ function, while “challenger” will use the second-highest recommendation. If a numerical value from 1 to 20 is provided, that exact number will be used as the number of clusters.

type

Character. The type of output required. Default is 'default'. Other option is 'plot' which will return only the clustered heatmap.

domains.column

Character. A vector of cluster names for the clustered heatmap. Used only when type is 'plot'.

Value

A list object that contains the hierarchical clustering results.

[[1]]

Summary of k suggested by all indices with plots

[[2]]

A dendrogram plot with the selected number of clusters

[[3]]

A 2D Cluster HVT Plotly visualization that colors cells according to clusters derived from AGNES clustering results. It is interactive, allowing users to view cell contents by hovering over them

Author(s)

Vishwavani <vishwavani@mu-sigma.com>

Examples

data("EuStockMarkets")
dataset <- data.frame(t = as.numeric(time(EuStockMarkets)),
                     DAX = EuStockMarkets[, "DAX"],
                     SMI = EuStockMarkets[, "SMI"],
                     CAC = EuStockMarkets[, "CAC"],
                     FTSE = EuStockMarkets[, "FTSE"])
rownames(EuStockMarkets) <- dataset$t
hvt.results<- trainHVT(dataset[-1],n_cells = 30, depth = 1, quant.err = 0.1,
                      distance_metric = "L1_Norm", error_metric = "max",
                      normalize = TRUE,quant_method = "kmeans")
scoring <- scoreHVT(dataset, hvt.results, analysis.plots = TRUE, names.column = dataset[,1])
centroid_data <- scoring$centroidData
hclust_data_1 <- centroid_data[,2:3]
clust.results <- clustHVT(data = hclust_data_1, 
                         trainHVT_results = hvt.results,
                         scoreHVT_results = scoring, 
                         clusters_k = 'champion', indices = 'hartigan')

function for displaying table

Description

This is the main function for displaying data in table format

Usage

displayTable(data, scroll = TRUE, limit = 20)

Arguments

data

Data frame. The dataframe to be displayed in table format.

scroll

Logical. A value to have a scroll or not in the table. Default is TRUE.

limit

Numeric. A value to indicate how many rows to display. Default is 20.

Value

A table with proper formatting for html notebook

Author(s)

Vishwavani <vishwavani@mu-sigma.com>

Examples

data <- datasets::EuStockMarkets
dataset <- as.data.frame(data)
displayTable(dataset)

plots for data analysis

Description

This is the main function that provides exploratory data analysis plots

Usage

edaPlots(
  df,
  time_column,
  output_type = "summary",
  n_cols = -1,
  grey_bars = NULL
)

Arguments

df

Dataframe. A data frame object.

time_column

Character. The name of the time column in the data frame. Can be given only when the data is time series

output_type

Character. The name of the output to be displayed. Options are 'summary', 'histogram', 'boxplot', 'timeseries' & 'correlation'. Default value is summary.

n_cols

Numeric. A value to indicate how many columns to be included in the output.

grey_bars

List. A list of timestamps where each list contains two elements: start and end period, which will be highlighted in gray in the time series plot. Default value is NULL.

Value

Five objects which include time series plots, data distribution plots, box plots, correlation plot and a descriptive statistics table.

Author(s)

Vishwavani <vishwavani@mu-sigma.com>

Examples

dataset <- data.frame(date = as.numeric(time(EuStockMarkets)),
DAX = as.numeric(EuStockMarkets[, "DAX"]),
SMI = as.numeric(EuStockMarkets[, "SMI"]),
CAC = as.numeric(EuStockMarkets[, "CAC"]),
FTSE = as.numeric(EuStockMarkets[, "FTSE"]))
edaPlots(dataset)
edaPlots(dataset, time_column = 'date', output_type = 'timeseries', n_cols = 4)

getOptimalCentroids

Description

Get Optimal Centroids

Usage

getOptimalCentroids(
  x,
  iter.max,
  algorithm,
  n_cells,
  seed = 100,
  function_to_calculate_distance_metric,
  function_to_calculate_error_metric = c("mean", "max"),
  quant.err,
  distance_metric = "L1_Norm",
  quant_method = c("kmeans", "kmedoids"),
  ...
)

Arguments

x

Data Frame. A dataframe of multivariate data. Each row corresponds to an observation, and each column corresponds to a variable. Missing values are not accepted.

algorithm

String. The type of algorithm used for quantization. Available algorithms are "Hartigan-Wong", "Lloyd", "Forgy" and "MacQueen". Default is "Hartigan-Wong".

n_cells

Numeric. Indicating the number of nodes per hierarchy.

seed

Numeric. Random Seed.

function_to_calculate_distance_metric

Function. The function used to compute "L1_Norm" or "L2_Norm" distances. L1_Norm is selected by default.

function_to_calculate_error_metric

Character. The error metric can be "mean" or "max". "mean" is selected by default.

quant.err

Numeric. The quantization error for the algorithm.

distance_metric

Character. The distance metric used to calculate inter-point distance. It can be "L1_Norm" or "L2_Norm". L1_Norm is selected by default.

quant_method

Character. The quant_method can be "kmeans" or "kmedoids". "kmeans" is selected by default.

Details

The raw data is first scaled, and the scaled data is supplied as input to the vector quantization algorithm. Vector quantization uses a parameter called the quantization error, which acts as a threshold and determines the number of levels in the hierarchy. If there are 'n' levels in the hierarchy, then all the clusters formed up to this level have a quantization error equal to or greater than the threshold. The user defines the number of clusters in the first level of the hierarchy, and each first-level cluster is then subdivided into the same number of clusters. This process continues, with each group divided into smaller clusters, until the threshold quantization error is met. The output of this technique is hierarchically arranged vector quantized data.
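
A rough base-R sketch of how a quantization error can be compared against the threshold for one level of the hierarchy is shown below. It is an illustration under stated assumptions, not the package's code: the quantization error is taken here as the plain L1 distance of each observation to its centroid, and the names mean_qe, max_qe and needs_split are hypothetical.

```r
# Illustrative sketch: per-cell quantization error after one k-means pass.
set.seed(100)
x <- scale(as.matrix(iris[, 1:4]))     # scale the raw data first
n_cells <- 5
km <- kmeans(x, centers = n_cells, iter.max = 100,
             algorithm = "Hartigan-Wong")

# Assumed error definition: L1 distance of each row to its own centroid
l1 <- rowSums(abs(x - km$centers[km$cluster, ]))

mean_qe <- tapply(l1, km$cluster, mean)  # "mean" error metric per cell
max_qe  <- tapply(l1, km$cluster, max)   # "max" error metric per cell

quant_err <- 0.2
# Cells whose error still exceeds the threshold would be subdivided again
needs_split <- max_qe > quant_err
needs_split
```

Using the "max" metric is stricter than "mean", since a single distant observation is enough to keep a cell above the threshold.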

Value

values

List. A list showing observations assigned to a cluster.

maxQE

List. A list corresponding to maximum QE values for each cell.

meanQE

List. A list corresponding to mean QE values for each cell.

centers

List. A list of quantization error for all levels and nodes.

nsize

List. A list corresponding to number of observations in respective groups.

Author(s)

Shubhra Prakash <shubhra.prakash@mu-sigma.com>, Sangeet Moy Das <sangeet.das@mu-sigma.com>

Creating Transition Probability Matrix

Description

This is the main function to create a transition probability matrix. The transition probability matrix quantifies the likelihood of transitioning from one state to another. States: The table includes the current states and the possible next states. Probabilities: For each current state, it lists the probability of transitioning to each of the next possible states.
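
The idea can be shown on a short, made-up sequence of cell IDs using base R. This is a sketch of the concept only; the output layout of getTransitionProbability is not reproduced, and the variable names are illustrative.

```r
# Illustrative sketch: transition probabilities from an ordered state sequence.
cell_id <- c(1, 1, 2, 2, 2, 3, 1, 1, 3, 3)   # states ordered by time stamp
states  <- sort(unique(cell_id))

current <- factor(head(cell_id, -1), levels = states)
nxt     <- factor(tail(cell_id, -1), levels = states)

# Rows: current state, columns: next state
counts <- unclass(table(current, nxt))

# Including self transitions (a state followed by itself)
with_self <- counts / rowSums(counts)

# Excluding self transitions: zero the diagonal, then renormalise each row.
# A state that only ever transitions to itself would give a 0/0 row here.
counts_no_self <- counts
diag(counts_no_self) <- 0
without_self <- counts_no_self / rowSums(counts_no_self)

with_self
without_self
```

Each row of with_self sums to 1, so a row can be read directly as the distribution over next states given the current state.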

Usage

getTransitionProbability(
  df,
  cellid_column,
  time_column,
  type = "with_self_state"
)

Arguments

df

Data frame. The input data frame should contain two columns, cell ID from scoreHVT function and time stamp of that dataset.

cellid_column

Character. Name of the column containing cell IDs.

time_column

Character. Name of the column containing time stamps.

type

Character. A character value indicating the type of transition probability table to create. Accepted entries are "with_self_state" and "without_self_state".

Value

A data frame of transition probabilities.

Author(s)

PonAnuReka Seenivasan <ponanureka.s@mu-sigma.com>, Vishwavani <vishwavani@mu-sigma.com>

Examples

dataset <- data.frame(t = as.numeric(time(EuStockMarkets)),
                      DAX = EuStockMarkets[, "DAX"],
                      SMI = EuStockMarkets[, "SMI"],
                      CAC = EuStockMarkets[, "CAC"],
                      FTSE = EuStockMarkets[, "FTSE"])
hvt.results<- trainHVT(dataset[-1],n_cells = 60, depth = 1, quant.err = 0.1,
                       distance_metric = "L1_Norm", error_metric = "max",
                       normalize = TRUE,quant_method = "kmeans")
scoring <- scoreHVT(dataset, hvt.results)
cell_id <- scoring$scoredPredictedData$Cell.ID
time_stamp <- dataset$t
dataset <- data.frame(cell_id, time_stamp)
table <- getTransitionProbability(dataset, cellid_column = "cell_id",time_column = "time_stamp")

Performing Monte Carlo Simulations of Markov Chain

Description

This is the main function to perform Monte Carlo simulations of Markov Chain on the dynamic forecasting of HVT States of a time series dataset. It includes both ex-post and ex-ante analysis offering valuable insights into future trends while resolving state transition challenges through clustering and nearest-neighbor methods to enhance simulation accuracy.
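
The core of a Monte Carlo simulation of a Markov chain can be sketched in a few lines of base R. The transition matrix P and the helper simulate_chain below are made up for illustration; the clustering, nearest-neighbor handling and rescaling performed by msm are not shown.

```r
# Illustrative sketch: simulate many state paths from a transition matrix.
set.seed(42)
P <- matrix(c(0.5, 0.25, 0.25,
              0,   2/3,  1/3,
              0.5, 0,    0.5),
            nrow = 3, byrow = TRUE)

# Hypothetical helper: one path of n_ahead steps from initial_state
simulate_chain <- function(P, initial_state, n_ahead) {
  path <- integer(n_ahead)
  current <- initial_state
  for (t in seq_len(n_ahead)) {
    # Draw the next state from the row of the current state
    current <- sample.int(nrow(P), size = 1, prob = P[current, ])
    path[t] <- current
  }
  path
}

num_simulations <- 100
# One column per simulation, one row per time step
sims <- replicate(num_simulations,
                  simulate_chain(P, initial_state = 2, n_ahead = 10))

# Summarise the simulated states at each step, here with the median
median_path <- apply(sims, 1, median)
median_path
```

In an ex-post setting the summarised path would be compared with the observed states to obtain an error such as the MAE; in an ex-ante setting there is nothing to compare against and only the simulated paths are reported.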

Usage

msm(
  state_time_data,
  forecast_type = "ex-post",
  initial_state,
  n_ahead_ante,
  transition_probability_matrix,
  num_simulations = 100,
  trainHVT_results,
  scoreHVT_results,
  actual_data = NULL,
  raw_dataset,
  k = 5,
  handle_problematic_states = FALSE,
  n_nearest_neighbor = 1,
  show_simulation = TRUE,
  mae_metric = "median",
  time_column = NULL,
  plot_type = "static"
)

Arguments

state_time_data

DataFrame. A dataframe containing state transitions over time (cell ID and timestamp).

forecast_type

Character. A character to indicate the type of forecasting. Accepted values are "ex-post" or "ex-ante".

initial_state

Numeric. An integer indicating the state at t0.

n_ahead_ante

Numeric. A vector of n-ahead points to be predicted in ex-ante analysis.

transition_probability_matrix

DataFrame. A dataframe of transition probabilities, i.e. the output of the 'getTransitionProbability' function.

num_simulations

Integer. A number indicating the total number of simulations to run. Default is 100.

trainHVT_results

List. 'trainHVT' function output

scoreHVT_results

List. 'scoreHVT' function output

actual_data

Dataframe. A dataframe for the ex-post prediction period with the actual raw data values.

raw_dataset

DataFrame. A dataframe of the input raw dataset, from which the mean and standard deviation will be calculated to scale up the predicted values.

k

Integer. A number of optimal clusters when handling problematic states. Default is 5.

handle_problematic_states

Logical. To indicate whether to handle problematic states or not. Default is FALSE.

n_nearest_neighbor

Integer. A number of nearest neighbors to consider when handling problematic states. Default is 1.

show_simulation

Logical. To indicate whether to show the simulation lines in plots or not. Default is TRUE.

mae_metric

Character. A character to indicate which metric to calculate Mean Absolute Error. Accepted entries are "mean", "median", or "mode". Default is "median".

time_column

Character. The name of the column containing time data. Used for aligning and plotting the results.

plot_type

Character. A character to indicate what type of plot should be generated. Accepted entries are "static" (ggplot object) or "interactive" (plotly object). Default is "static".

Value

A list object that contains the forecasting plots and MAE values.

[[1]]

Simulation plots and MAE values for state and centroids plot

[[2]]

Summary Table, Dendrogram plot and Clustered Heatmap when handle_problematic_states is TRUE

Author(s)

Vishwavani <vishwavani@mu-sigma.com>

Examples

dataset <- data.frame(t = as.numeric(time(EuStockMarkets)),
DAX = EuStockMarkets[, "DAX"],
SMI = EuStockMarkets[, "SMI"],
CAC = EuStockMarkets[, "CAC"],
FTSE = EuStockMarkets[, "FTSE"])
hvt.results<- trainHVT(dataset[,-1],n_cells = 60, depth = 1, quant.err = 0.1,
                      distance_metric = "L1_Norm", error_metric = "max",
                      normalize = TRUE,quant_method = "kmeans")
scoring <- scoreHVT(dataset, hvt.results)
cell_id <- scoring$scoredPredictedData$Cell.ID
time_stamp <- dataset$t
temporal_data <- data.frame(cell_id, time_stamp)
table <- getTransitionProbability(temporal_data, 
cellid_column = "cell_id",time_column = "time_stamp")
colnames(temporal_data) <- c("Cell.ID","t")
ex_post_forecasting <- dataset[1800:1860,]
ex_post <- msm(state_time_data = temporal_data,
              forecast_type = "ex-post",
              transition_probability_matrix = table,
              initial_state = 2,
              num_simulations = 100,
              scoreHVT_results = scoring,
              trainHVT_results = hvt.results,
              actual_data = ex_post_forecasting,
              raw_dataset = dataset,
              mae_metric = "median",
             show_simulation = FALSE,
             time_column = 't')

Generating flow maps and animations based on transition probabilities

Description

This is the main function for generating flow maps and animations based on transition probabilities including self states and excluding self states. Flow maps are a type of data visualization used to represent the transition probability of different states. Animations are the gifs used to represent the movement of data through the cells.

Usage

plotAnimatedFlowmap(
  hvt_model_output,
  transition_probability_df,
  df,
  flow_map = "All",
  cellid_column,
  time_column
)

Arguments

hvt_model_output

List. Output from a trainHVT function.

transition_probability_df

List. Output from getTransitionProbability function

df

Data frame. The input dataframe should contain two columns, cell ID from scoreHVT function and time stamp of that dataset.

flow_map

Character. Type of flow map ('self_state', 'without_self_state', 'All' or NULL)

cellid_column

Character. Name of the column containing cell IDs.

time_column

Character. Name of the column containing time stamps

Value

A list of flow maps and animation gifs.

Author(s)

PonAnuReka Seenivasan <ponanureka.s@mu-sigma.com>, Vishwavani <vishwavani@mu-sigma.com>

Examples

dataset <- data.frame(date = as.numeric(time(EuStockMarkets)),
                      DAX = EuStockMarkets[, "DAX"],
                      SMI = EuStockMarkets[, "SMI"],
                      CAC = EuStockMarkets[, "CAC"],
                      FTSE = EuStockMarkets[, "FTSE"])
hvt.results<- trainHVT(dataset,n_cells = 60, depth = 1, quant.err = 0.1,
                       distance_metric = "L1_Norm", error_metric = "max",
                       normalize = TRUE,quant_method = "kmeans")
scoring <- scoreHVT(dataset, hvt.results)
cell_id <- scoring$scoredPredictedData$Cell.ID
time_stamp <- dataset$date
dataset <- data.frame(cell_id, time_stamp)
table <- getTransitionProbability(dataset, cellid_column = "cell_id",
time_column = "time_stamp")
plots <- plotAnimatedFlowmap(hvt_model_output = hvt.results,
transition_probability_df = table,df = dataset,
flow_map = 'All',cellid_column = "cell_id", time_column = "time_stamp")

Plot the hierarchical tessellations.

Description

This is the main plotting function to construct hierarchical Voronoi tessellations as 1D, 2D or interactive surface plots.

Usage

plotHVT(
  hvt.results,
  line.width = 0.5,
  color.vec = "black",
  centroid.size = 0.6,
  centroid.color = "black",
  child.level = 1,
  hmap.cols,
  separation_width = 7,
  layer_opacity = c(0.5, 0.75, 0.99),
  dim_size = 1000,
  plot.type = "2Dhvt",
  quant.error.hmap = NULL,
  cell_id = FALSE,
  cell_id_position = "bottom",
  cell_id_size = 2.6
)

Arguments

hvt.results

(1D/2DProj/2Dhvt/2Dheatmap/surface_plot) List. A list containing the output of trainHVT function which has the details of the tessellations to be plotted.

line.width

(2Dhvt/2Dheatmap) Numeric Vector. A vector indicating the line widths of the tessellation boundaries for each level.

color.vec

(2Dhvt/2Dheatmap) Vector. A vector indicating the colors of the boundaries of the tessellations at each level.

centroid.size

(2Dhvt/2Dheatmap) Numeric Vector. A vector indicating the size of centroids for each level.

centroid.color

(2Dhvt/2Dheatmap) Numeric Vector. A vector indicating the color of centroids for each level.

child.level

(2Dheatmap/surface_plot) Numeric. Indicating the level for which the plot should be displayed

hmap.cols

(2Dheatmap/surface_plot) Numeric or Character. The column number or column name from the dataset indicating the variables for which the heat map is to be plotted.

separation_width

(surface_plot) Numeric. An integer indicating the width between hierarchical levels in surface plot

layer_opacity

(surface_plot) Numeric. A vector indicating the opacity of each hierarchical levels in surface plot

dim_size

(surface_plot) Numeric. An integer controls the resolution or granularity of the 3D surface grid

plot.type

Character. An option to indicate which type of plot should be generated. Accepted entries are '1D', '2Dproj', '2Dhvt', '2Dheatmap' and 'surface_plot'. Default value is '2Dhvt'.

quant.error.hmap

(2Dheatmap) Numeric. A number representing the quantization error threshold to be highlighted in the heatmap. When a value is provided, it will emphasize cells with quantization errors equal to or less than the specified threshold, indicating that these cells cannot be further subdivided in the next depth layer. The default value is NULL, meaning all cells will be colored in the heatmap across various depths.

cell_id

(2Dhvt/2Dheatmap) Logical. A logical indicating whether the cell IDs should be displayed

cell_id_position

(2Dhvt/2Dheatmap) Character. A character indicating the position of the cell IDs. Accepted entries are 'top' , 'bottom', 'left' and 'right'.

cell_id_size

(2Dhvt/2Dheatmap) Numeric. A numeric vector indicating the size of the cell IDs for all levels.

Value

plot object containing the visualizations of reduced dimension(1D/2D) for the given dataset.

Author(s)

Shubhra Prakash <shubhra.prakash@mu-sigma.com>, Sangeet Moy Das <sangeet.das@mu-sigma.com>, Vishwavani <vishwavani@mu-sigma.com>

Examples

data("EuStockMarkets")
hvt.results <- trainHVT(EuStockMarkets, n_cells = 60, depth = 1, quant.err = 0.1, 
                       distance_metric = "L1_Norm", error_metric = "max",
                       normalize = TRUE,quant_method="kmeans")
                       
#change the 'plot.type' argument to '2Dproj' or '2Dhvt' to visualize respective plots.
plotHVT(hvt.results, plot.type='1D')

#change the 'plot.type' argument to 'surface_plot' to visualize the Interactive surface plot                   
plotHVT(hvt.results,child.level = 1, 
hmap.cols = "DAX", plot.type = '2Dheatmap')

Make the diagnostic plots for hierarchical voronoi tessellations

Description

This is the main function that generates diagnostic plots for hierarchical voronoi tessellations models and scoring.

Usage

plotModelDiagnostics(model_obj)

Arguments

model_obj

List. A list obtained from the trainHVT function or scoreHVT function

Value

For trainHVT, the Minimum Intra-DataPoint Distance Plot, Minimum Intra-Centroid Distance Plot, Mean Absolute Deviation Plot, and Distribution of Number of Observations in Cells are plotted for the training data, and the Mean Absolute Deviation Plot is plotted for the validation data. For scoreHVT, the Mean Absolute Deviation Plot for the training data and validation data is plotted.

Author(s)

Shubhra Prakash <shubhra.prakash@mu-sigma.com>

Examples

data("EuStockMarkets")
hvt.results <- trainHVT(EuStockMarkets, n_cells = 60, depth = 1, quant.err = 0.1, 
                       distance_metric = "L1_Norm", error_metric = "max",
                       normalize = TRUE, quant_method="kmeans", diagnose = TRUE, 
                       hvt_validation = TRUE)
plotModelDiagnostics(hvt.results)

Plot the identified outlier cells in the voronoi tessellation map.

Description

This is the main plotting function to construct hierarchical voronoi tessellations and highlight the outlier cells

Usage

plotNovelCells(
  plot.cells,
  hvt.map,
  line.width = c(0.6),
  color.vec = c("#141B41"),
  pch = 21,
  centroid.size = 0.5,
  title = NULL,
  maxDepth = 1
)

Arguments

plot.cells

Vector. A vector indicating the cells to be highlighted in the map

hvt.map

List. A list containing the output of trainHVT function which has the details of the tessellations to be plotted

line.width

Numeric Vector. A vector indicating the line widths of the tessellation boundaries for each level

color.vec

Vector. A vector indicating the colors of the boundaries of the tessellations at each level

pch

Numeric. Symbol of the centroids of the tessellations (parent levels) Default value is 21.

centroid.size

Numeric. Size of centroids of first level tessellations. Default value is 0.5

title

String. Set a title for the plot. (default = NULL)

maxDepth

Numeric. An integer indicating the number of levels. (default = 1)

Value

Returns a ggplot object containing hierarchical voronoi tessellation plot highlighting the outlier cells

Author(s)

Shantanu Vaidya <shantanu.vaidya@mu-sigma.com>

Examples

data("EuStockMarkets")
hvt.results <- trainHVT(EuStockMarkets, n_cells = 60, depth = 1, quant.err = 0.1, 
                       distance_metric = "L1_Norm", error_metric = "max",
                       normalize = TRUE,quant_method="kmeans")
#selected 55,58 are for demo purpose
plotNovelCells(c(55,58),hvt.results)

Make the quantization error plots for training and scoring.

Description

This is the function that produces histograms displaying the distribution of Quantization Error (QE) values for both train and test datasets, highlighting mean values with dashed lines for quick evaluation.

Usage

plotQuantErrorHistogram(hvt.results, hvt.scoring)

Arguments

hvt.results

List. A list of hvt.results obtained from the trainHVT function.

hvt.scoring

List. A list of hvt.scoring obtained from the scoreHVT function.

Value

Returns the ggplot object containing the quantization error distribution plots for the given HVT results of training and scoring

Author(s)

Shubhra Prakash <shubhra.prakash@mu-sigma.com>

Examples

data("EuStockMarkets")
dataset <- data.frame(date = as.numeric(time(EuStockMarkets)),
                     DAX = EuStockMarkets[, "DAX"],
                     SMI = EuStockMarkets[, "SMI"],
                     CAC = EuStockMarkets[, "CAC"],
                     FTSE = EuStockMarkets[, "FTSE"])
rownames(EuStockMarkets) <- dataset$date
#Split in train and test
train <- EuStockMarkets[1:1302, ]
test <- EuStockMarkets[1303:1860, ]

hvt.results<- trainHVT(train,n_cells = 60, depth = 1, quant.err = 0.1,
                      distance_metric = "L1_Norm", error_metric = "max",
                      normalize = TRUE, quant_method = "kmeans")
scoring <- scoreHVT(test, hvt.results)
plotQuantErrorHistogram(hvt.results, scoring)

Creating State Transition Plot

Description

This is the main function to create a state transition plot from a data frame. A state transition plot is a type of data visualization used to represent the changes or transitions in states over time for a given system. State refers to a particular condition or status of a cell at a specific point in time. Transition refers to the change of state for a cell from one condition to another over time.

Usage

plotStateTransition(
  df,
  sample_size = NULL,
  line_plot = NULL,
  cellid_column,
  time_column,
  v_intercept = NULL,
  time_periods = NULL
)

Arguments

df

Data frame. The Input data frame should contain two columns. Cell ID from scoreHVT function and time stamp of that dataset.

sample_size

Numeric. A value indicating the fraction of the data frame to visualize in the plot. Default value is 0.2

line_plot

Logical. A logical value indicating to create a line plot. Default value is NULL.

cellid_column

Character. Name of the column containing cell IDs.

time_column

Character. Name of the column containing time stamps.

v_intercept

Numeric. A numeric value indicating the time stamp to draw a vertical line on the plot.

time_periods

List. A list of vectors, each containing start and end times for highlighting time periods.

Value

A plotly object representing the state transition plot for the given data frame.

Author(s)

PonAnuReka Seenivasan <ponanureka.s@mu-sigma.com>, Vishwavani <vishwavani@mu-sigma.com>

Examples

dataset <- data.frame(date = as.numeric(time(EuStockMarkets)),
DAX = EuStockMarkets[, "DAX"],
SMI = EuStockMarkets[, "SMI"],
CAC = EuStockMarkets[, "CAC"],
FTSE = EuStockMarkets[, "FTSE"])

hvt.results<- trainHVT(dataset,n_cells = 60, depth = 1, quant.err = 0.1,
                       distance_metric = "L1_Norm", error_metric = "max",
                       normalize = TRUE,quant_method = "kmeans")
scoring <- scoreHVT(dataset, hvt.results)
cell_id <- scoring$scoredPredictedData$Cell.ID
time_stamp <- dataset$date
dataset <- data.frame(cell_id, time_stamp)
plotStateTransition(dataset, sample_size = 1, cellid_column = "cell_id",time_column = "time_stamp")

Plots of z scores

Description

This is the main function to plot the z scores against cell ids.
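
As a reminder of what is being plotted, the sketch below standardizes a small, made-up table of cell centroids and flags the cells that fall outside the default reference lines. It assumes the usual z score, (value - mean) / standard deviation per feature; whether plotZscore computes it in exactly this way is not stated here.

```r
# Illustrative sketch: z scores per feature across cells (made-up values).
centroids <- data.frame(Cell.ID = 1:6,
                        DAX = c(1500, 1600, 1700, 1800, 1900, 4000),
                        SMI = c(1700, 1750, 1800, 1850, 1900, 1950))

features <- setdiff(names(centroids), "Cell.ID")
# scale() centres each column and divides by its standard deviation
z <- as.data.frame(scale(centroids[features]))

reference_lines <- c(-1.65, 1.65)
outside <- z < reference_lines[1] | z > reference_lines[2]

# Cells with at least one feature outside the reference lines
flagged <- centroids$Cell.ID[rowSums(outside) > 0]
flagged
```

Here only the cell with the unusually large DAX value lies beyond the reference lines.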

Usage

plotZscore(
  data,
  cell_range = NULL,
  segment_size = 2,
  reference_lines = c(-1.65, 1.65)
)

Arguments

data

Data frame. A data frame of cell id and features.

cell_range

Vector. A numeric vector of cell id range for which the plot should be displayed. Default is NULL, which plots all the cells.

segment_size

Integer. A numeric value to indicate the size of the bars in the plot. Default is 2.

reference_lines

Vector. A numeric vector of confidence interval values for the reference lines in the plot. Default is c(-1.65, 1.65).

Value

A grid of plots of z score against cell id of the given features.

Author(s)

Vishwavani <vishwavani@mu-sigma.com>

Examples

data("EuStockMarkets")
dataset <- data.frame(t = as.numeric(time(EuStockMarkets)),
                     DAX = EuStockMarkets[, "DAX"],
                     SMI = EuStockMarkets[, "SMI"],
                     CAC = EuStockMarkets[, "CAC"],
                     FTSE = EuStockMarkets[, "FTSE"])
rownames(EuStockMarkets) <- dataset$t
hvt.results<- trainHVT(dataset[-1],n_cells = 60, depth = 1, quant.err = 0.1,
                      distance_metric = "L1_Norm", error_metric = "max",
                      normalize = TRUE,quant_method = "kmeans")
col_names <- c("Cell.ID","DAX","SMI","CAC","FTSE")
data <- dplyr::arrange(dplyr::select(hvt.results[[3]][["summary"]], dplyr::all_of(col_names)), Cell.ID)
data <- round(data, 2)
plotZscore(data)

Reconciliation of Transition Probability

Description

This is the main function for creating reconciliation plots and tables which helps in comparing the transition probabilities calculated manually and from markovchain function

Usage

reconcileTransitionProbability(
  df,
  hmap_type = NULL,
  cellid_column,
  time_column
)

Arguments

df

Data frame. The input data frame should contain two columns, cell ID from scoreHVT function and timestamp of that dataset.

hmap_type

Character. ('self_state', 'without_self_state', or 'All')

cellid_column

Character. Name of the column containing cell IDs.

time_column

Character. Name of the column containing timestamps

Value

A list of plotly heatmap objects and tables representing the transition probability heatmaps.

Author(s)

PonAnuReka Seenivasan <ponanureka.s@mu-sigma.com>, Vishwavani <vishwavani@mu-sigma.com>

Examples

dataset <- data.frame(date = as.numeric(time(EuStockMarkets)),
                      DAX = EuStockMarkets[, "DAX"],
                      SMI = EuStockMarkets[, "SMI"],
                      CAC = EuStockMarkets[, "CAC"],
                      FTSE = EuStockMarkets[, "FTSE"])
hvt.results<- trainHVT(dataset,n_cells = 60, depth = 1, quant.err = 0.1,
                       distance_metric = "L1_Norm", error_metric = "max",
                       normalize = TRUE,quant_method = "kmeans")
scoring <- scoreHVT(dataset, hvt.results)
cell_id <- scoring$scoredPredictedData$Cell.ID
time_stamp <- dataset$date
dataset <- data.frame(cell_id, time_stamp)
reconcileTransitionProbability(dataset, hmap_type = "All", 
cellid_column = "cell_id", time_column = "time_stamp")
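
As a rough illustration of what a manual transition probability calculation can look like, the following base R sketch counts consecutive cell-to-cell moves in a time-ordered sequence and normalizes each row. The data frame `scored` is hypothetical, and this is a sketch of the general technique, not the package's internal code.

```r
# Hypothetical scored output: one cell ID per timestamp (rows not yet ordered)
scored <- data.frame(
  time_stamp = c(5, 1, 3, 2, 4, 6),
  cell_id    = c(2, 1, 2, 1, 2, 1)
)

# Order by time so that consecutive rows are consecutive states
scored <- scored[order(scored$time_stamp), ]
states <- scored$cell_id

# Pair each state with the state that follows it
from <- head(states, -1)
to   <- tail(states, -1)

# Count transitions, then convert each row to probabilities
counts     <- table(from = from, to = to)
trans_prob <- prop.table(counts, margin = 1)
trans_prob
```

Each row of `trans_prob` sums to 1. The diagonal entries are the probabilities of remaining in the same cell, which is presumably the distinction drawn by the 'self_state' and 'without_self_state' options.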

Remove identified novelty cell(s)

Description

This function is used to remove the identified novelty cells.

Usage

removeNovelty(outlier_cells, hvt_results)

Arguments

outlier_cells

Vector. A vector with the cell numbers of the identified novelty cells.

hvt_results

List. A list having the results of the compressed map i.e. output of trainHVT function

Value

A list of two items

[[1]]

Dataframe of novelty cell(s)

[[2]]

Dataframe without the novelty cell(s) from the dataset used in model training

Author(s)

Shantanu Vaidya <shantanu.vaidya@mu-sigma.com>

Examples

data("EuStockMarkets")
hvt.results <- trainHVT(EuStockMarkets, n_cells = 60, depth = 1, quant.err = 0.1, 
                       distance_metric = "L1_Norm", error_metric = "max",
                       normalize = TRUE,quant_method="kmeans")
identified_Novelty_cells <- c(2, 10)
output_list <- removeNovelty(identified_Novelty_cells, hvt.results) 
data_with_novelty <- output_list[[1]]
data_without_novelty <- output_list[[2]]
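
The two-part return value can be pictured as a simple split of the training data by cell membership. The sketch below uses a hypothetical data frame `scored_train` with a `Cell.ID` column; the actual columns returned by removeNovelty may differ.

```r
# Hypothetical training data with the cell assigned to each row
scored_train <- data.frame(
  Cell.ID = c(1, 2, 2, 3, 10, 4),
  DAX     = c(0.1, 2.5, 2.7, -0.3, -3.1, 0.4)
)

# Cells flagged as novelty
novelty_cells <- c(2, 10)

# Rows belonging to a novelty cell versus all remaining rows
is_novel             <- scored_train$Cell.ID %in% novelty_cells
data_with_novelty    <- scored_train[is_novel, ]
data_without_novelty <- scored_train[!is_novel, ]
```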

Score which cell each point in the test dataset belongs to.

Description

This function scores each data point in the test dataset based on a trained hierarchical Voronoi tessellations model.

Usage

scoreHVT(
  dataset,
  hvt.results.model,
  child.level = 1,
  mad.threshold = 0.2,
  line.width = 0.6,
  color.vec = c("navyblue", "slateblue", "lavender"),
  normalize = TRUE,
  distance_metric = "L1_Norm",
  error_metric = "max",
  yVar = NULL,
  analysis.plots = FALSE,
  names.column = NULL
)

Arguments

dataset

Data frame. A data frame which is to be scored. Can have categorical columns if 'analysis.plots' are required.

hvt.results.model

List. A list obtained from the trainHVT function

child.level

Numeric. A number indicating the depth for which the heat map is to be plotted.

mad.threshold

Numeric. A numeric value indicating the permissible Mean Absolute Deviation.

line.width

Vector. A vector indicating the line widths of the tessellation boundaries for each layer.

color.vec

Vector. A vector indicating the colors of the tessellation boundaries at each layer.

normalize

Logical. A logical value indicating if the dataset should be normalized. When set to TRUE, the data (testing dataset) is standardized by ‘mean’ and ‘sd’ of the training dataset referred from the trainHVT(). When set to FALSE, the data is used as such without any changes.

distance_metric

Character. The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). L1_Norm is selected by default. The distance metric is used to calculate the distance between an n-dimensional point and the centroid. The distance metric can be different from the one used during training.

error_metric

Character. The error metric can be mean or max. max is selected by default. max will return the max of m values and mean will take mean of m values where each value is a distance between a point and centroid of the cell.

yVar

Character. A character or a vector representing the name of the dependent variable(s)

analysis.plots

Logical. A logical value indicating whether the scored plot should be plotted. If TRUE, the name of the identifier column (character column) should be supplied in the 'names.column' argument. The output will be a 2D plotly heatmap which gives information on the cell ID and the observations of a cell.

names.column

Character. A character or a vector representing the name of the identifier column/character column.

Value

Dataframe containing scored data, plots and summary

Author(s)

Shubhra Prakash <shubhra.prakash@mu-sigma.com>, Sangeet Moy Das <sangeet.das@mu-sigma.com>, Vishwavani <vishwavani@mu-sigma.com>

Examples

data("EuStockMarkets")
dataset <- data.frame(date = as.numeric(time(EuStockMarkets)),
                     DAX = EuStockMarkets[, "DAX"],
                     SMI = EuStockMarkets[, "SMI"],
                     CAC = EuStockMarkets[, "CAC"],
                     FTSE = EuStockMarkets[, "FTSE"])
rownames(EuStockMarkets) <- dataset$date
# Split in train and test
train <- EuStockMarkets[1:1302, ]
test <- EuStockMarkets[1303:1860, ]
#model training
hvt.results<- trainHVT(train,n_cells = 60, depth = 1, quant.err = 0.1,
                      distance_metric = "L1_Norm", error_metric = "max",
                      normalize = TRUE,quant_method = "kmeans")
scoring <- scoreHVT(test, hvt.results)
data_scored <- scoring$scoredPredictedData
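
To make the distance_metric and error_metric arguments concrete, the following base R sketch computes L1 and L2 distances from the points of one hypothetical cell to its centroid, then summarizes them with max and mean. It assumes unscaled distances; the package may normalize them differently (for example, by the number of dimensions), so treat the numbers as illustrative.

```r
# Hypothetical cell: three 2-D points and their centroid
pts <- matrix(c(1, 2,
                2, 4,
                3, 3), ncol = 2, byrow = TRUE)
centroid <- colMeans(pts)

# Differences between each point and the centroid
diffs <- sweep(pts, 2, centroid)

# L1 (Manhattan) and L2 (Euclidean) distance of each point to the centroid
l1 <- rowSums(abs(diffs))
l2 <- sqrt(rowSums(diffs^2))

# Error metric: summarize the m per-point distances of the cell
err_max  <- max(l1)
err_mean <- mean(l1)
```

With error_metric = "max" a single distant point determines the cell's error, whereas "mean" averages over all points in the cell.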

Score which cell and what layer each data point in the test dataset belongs to

Description

This function scores the cell and corresponding layer for each data point in a test dataset using three trained HVT models (Map A, Map B, Map C) and returns a data frame containing the scored layer output. The function incorporates the scored results from each map and merges them to provide a comprehensive result.

Usage

scoreLayeredHVT(
  data,
  hvt_mapA,
  hvt_mapB,
  hvt_mapC,
  mad.threshold = 0.2,
  normalize = TRUE,
  seed = 300,
  distance_metric = "L1_Norm",
  error_metric = "max",
  child.level = 1,
  yVar = NULL
)

Arguments

data

Data frame. A data frame containing the test dataset. The data frame should have all the variables (features) used for training.

hvt_mapA

List. The model (hvt.results.model) obtained by running trainHVT() on the training data.

hvt_mapB

List. The model (hvt.results.model) obtained by running trainHVT() on the data with novelty(s).

hvt_mapC

List. The model (hvt.results.model) obtained by running trainHVT() on the data without novelty(s).

mad.threshold

Numeric. A number indicating the permissible Mean Absolute Deviation

normalize

Logical. A logical value indicating if the dataset should be normalized. When set to TRUE, the data (testing dataset) is standardized by 'mean' and 'sd' of the training dataset referred from the trainHVT(). When set to FALSE, the data is used as such without any changes. (Default value is TRUE).

seed

Numeric. Random Seed.

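distance_metric

Character. The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). L1_Norm is selected by default. The distance metric is used to calculate the distance between an n-dimensional point and the centroid.

error_metric

Character. The error metric can be mean or max. max is selected by default. max returns the maximum of m values and mean takes the mean of m values, where each value is the distance between a point and the centroid of the cell.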

child.level

Numeric. A number indicating the level for which the heat map is to be plotted.

yVar

Character. A character or a vector representing the name of the dependent variable(s)

Value

Dataframe containing scored layer output

Author(s)

Shubhra Prakash <shubhra.prakash@mu-sigma.com>, Sangeet Moy Das <sangeet.das@mu-sigma.com>, Shantanu Vaidya <shantanu.vaidya@mu-sigma.com>, Somya Shambhawi <somya.shambhawi@mu-sigma.com>

Examples

data("EuStockMarkets")
dataset <- data.frame(date = as.numeric(time(EuStockMarkets)),
                     DAX = EuStockMarkets[, "DAX"],
                     SMI = EuStockMarkets[, "SMI"],
                     CAC = EuStockMarkets[, "CAC"],
                     FTSE = EuStockMarkets[, "FTSE"])
rownames(EuStockMarkets) <- dataset$date

train <- EuStockMarkets[1:1302, ]
test <- EuStockMarkets[1303:1860, ]

###MAP-A
hvt_mapA <- trainHVT(train, n_cells = 150, depth = 1, quant.err = 0.1,
                    distance_metric = "L1_Norm", error_metric = "max",
                    normalize = TRUE,quant_method = "kmeans")
                    
identified_Novelty_cells <- c(127,55,83,61,44,35,27,77)
output_list <- removeNovelty(identified_Novelty_cells, hvt_mapA)
data_with_novelty <- output_list[[1]] 
data_with_novelty <- data_with_novelty[, -c(1,2)]

### MAP-B
hvt_mapB <- trainHVT(data_with_novelty,n_cells = 10, depth = 1, quant.err = 0.1,
                    distance_metric = "L1_Norm", error_metric = "max",
                    normalize = TRUE,quant_method = "kmeans")
data_without_novelty <- output_list[[2]]

### MAP-C
hvt_mapC <- trainHVT(data_without_novelty,n_cells = 135,
                    depth = 1, quant.err = 0.1, distance_metric = "L1_Norm",
                    error_metric = "max", quant_method = "kmeans",
                    normalize = TRUE)
                    
##SCORE LAYERED
data_scored <- scoreLayeredHVT(test, hvt_mapA, hvt_mapB, hvt_mapC)
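
The normalize argument standardizes the test data with the mean and standard deviation of the training data. A minimal base R sketch of that idea follows; it illustrates the technique only and is not the package's internal code.

```r
data("EuStockMarkets")
train <- EuStockMarkets[1:1302, ]
test  <- EuStockMarkets[1303:1860, ]

# Statistics are taken from the training data only
train_mean <- colMeans(train)
train_sd   <- apply(train, 2, sd)

# Apply the training statistics to the test data
test_scaled <- scale(test, center = train_mean, scale = train_sd)

# The test columns are not expected to have mean 0 and sd 1,
# because the scaling parameters come from the training split
round(colMeans(test_scaled), 2)
```

Reusing the training statistics keeps test points on the same scale as the centroids they are compared against.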

Table for displaying summary

Description

This is the main function for displaying the summary from model training and scoring.

Usage

summary(data, limit = 20, scroll = TRUE)

Arguments

data

List. A listed object from trainHVT or scoreHVT

limit

Numeric. A value to indicate how many rows to display.

scroll

Logical. A value to indicate whether to display scroll bar or not. Default value is TRUE.

Value

A consolidated table of summary for training, scoring and forecasting

Author(s)

Vishwavani <vishwavani@mu-sigma.com>, Alimpan Dey <alimpan.dey@mu-sigma.com>

Examples

data <- datasets::EuStockMarkets
dataset <- as.data.frame(data)
#model training
hvt.results <- trainHVT(dataset, n_cells = 60, depth = 1, quant.err = 0.1,
                      distance_metric = "L1_Norm", error_metric = "max",
                      normalize = TRUE, quant_method = "kmeans", dim_reduction_method = 'sammon')
summary(data =  hvt.results)

Constructing Hierarchical Voronoi Tessellations

Description

This is the main function to construct hierarchical Voronoi tessellations. This is done using hierarchical vector quantization (hvq). The data is represented in 2D coordinates and the tessellations are plotted using these coordinates as centroids. For subsequent levels, a transformation is performed on the 2D coordinates to place all the points within their parent tile. Tessellations are plotted using these transformed points as centroids.

Usage

trainHVT(
  dataset,
  min_compression_perc = NA,
  n_cells = NA,
  depth = 1,
  quant.err = 0.2,
  normalize = FALSE,
  distance_metric = "L1_Norm",
  error_metric = "max",
  quant_method = "kmeans",
  scale_summary = NA,
  diagnose = FALSE,
  hvt_validation = FALSE,
  train_validation_split_ratio = 0.8,
  dim_reduction_method = "sammon",
  tsne_theta = 0.2,
  tsne_eta = 200,
  tsne_perplexity = 30,
  tsne_verbose = TRUE,
  tsne_max_iter = 500,
  umap_n_neighbors = 60,
  umap_n_components = 2,
  umap_min_dist = 0.1
)

Arguments

dataset

Data frame. A data frame with numeric columns (features) that will be used for training the model.

min_compression_perc

Numeric. An integer, indicating the minimum compression percentage to be achieved for the dataset. It indicates the desired level of reduction in dataset size compared to its original size.

n_cells

Numeric. An integer, indicating the number of cells per hierarchy (level).

depth

Numeric. An integer, indicating the number of levels. A depth of 1 means no hierarchy (single level), while higher values indicate multiple levels (hierarchy).

quant.err

Numeric. A number indicating the quantization error threshold. A cell will only break down into further cells if the quantization error of the cell is above the defined quantization error threshold.

normalize

Logical. A logical value indicating if the dataset should be normalized. When set to TRUE, scales the values of all features to have a mean of 0 and a standard deviation of 1 (Z-score).

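distance_metric

Character. The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). L1_Norm is selected by default. The distance metric is used to calculate the distance between an n-dimensional point and the centroid.

error_metric

Character. The error metric can be mean or max. max is selected by default. max returns the maximum of m values and mean takes the mean of m values, where each value is the distance between a point and the centroid of the cell.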

quant_method

Character. The quantization method can be kmeans or kmedoids. Kmeans uses means (centroids) as cluster centers while Kmedoids uses actual data points (medoids) as cluster centers. kmeans is selected by default.

scale_summary

List. A list with user-defined mean and standard deviation values for all the features in the dataset. Pass the scale summary when normalize is set to FALSE.

diagnose

Logical. A logical value indicating whether user wants to perform diagnostics on the model. Default value is FALSE.

hvt_validation

Logical. A logical value indicating whether user wants to holdout a validation set and find mean absolute deviation of the validation points from the centroid. Default value is FALSE.

train_validation_split_ratio

Numeric. A numeric value indicating train validation split ratio. This argument is only used when hvt_validation has been set to TRUE. Default value for the argument is 0.8.

dim_reduction_method

Character. The dim_reduction_method can be one of "tsne", "umap", or "sammon". Default value is "sammon".

tsne_theta

Numeric. The tsne_theta is only used when dim_reduction_method is set to "tsne". Default value is 0.2 and common values are between 0.2 and 0.5.

tsne_eta

Numeric. The tsne_eta is only used when dim_reduction_method is set to "tsne". Default value is 200.

tsne_perplexity

Numeric. The tsne_perplexity is only used when dim_reduction_method is set to "tsne". Default value is 30 and common values are between 30 and 50.

tsne_verbose

Logical. A logical value indicating whether the t-SNE algorithm should print detailed information about its progress to the console. Default value is TRUE.

tsne_max_iter

Numeric. The tsne_max_iter is only used when dim_reduction_method is set to "tsne". Default value is 500. More iterations can improve results but increase computation time.

umap_n_neighbors

Integer. The umap_n_neighbors is only used when dim_reduction_method is set to "umap". Default value is 60. Controls the balance between local and global structure in the data.

umap_n_components

Integer. The umap_n_components is only used when dim_reduction_method is set to "umap". Default value is 2. Indicates the number of dimensions for the embedding.

umap_min_dist

Numeric. The umap_min_dist is only used when dim_reduction_method is set to "umap". Default value is 0.1. Controls how tightly UMAP packs points together.

Value

A nested list that contains the hierarchical tessellation information. This list has to be given as the input argument to plot the tessellations.

[[1]]

A list containing information related to plotting tessellations. This information will include coordinates, boundaries, and other details necessary for visualizing the tessellations

[[2]]

A list containing information related to Sammon’s projection coordinates of the data points in the reduced-dimensional space.

[[3]]

A list containing detailed information about the hierarchical vector quantized data along with a summary section containing the number of points, the quantization error and the centroids for each cell.

[[4]]

A list that contains all the diagnostics information of the model when diagnose is set to TRUE. Otherwise NA.

[[5]]

A list that contains all the information required to generate a Mean Absolute Deviation (MAD) plot, if hvt_validation is set to TRUE. Otherwise NA.

[[6]]

A list containing detailed information about the hierarchical vector quantized data along with a summary section containing the number of points, the quantization error and the centroids for each cell, which is the output of 'hvq'.

[[7]]

model info: A list that contains the model-generated timestamp, the input parameters passed to the model, the validation results and the dimensionality reduction evaluation metrics table.

Author(s)

Shubhra Prakash <shubhra.prakash@mu-sigma.com>, Sangeet Moy Das <sangeet.das@mu-sigma.com>, Shantanu Vaidya <shantanu.vaidya@mu-sigma.com>, Bidesh Ghosh <bidesh.gosh@mu-sigma.com>, Alimpan Dey <alimpan.dey@mu-sigma.com>

Examples

data("EuStockMarkets")
hvt.results <- trainHVT(EuStockMarkets, n_cells = 60, depth = 1, quant.err = 0.1, 
                       distance_metric = "L1_Norm", error_metric = "max",
                       normalize = TRUE,quant_method="kmeans")
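
The role of quant.err can be illustrated outside the package with stats::kmeans. The sketch below computes a per-cell quantization error as the maximum L1 distance from a cell's points to its centroid, then checks which cells exceed the threshold. It assumes unscaled L1 distances and plain k-means; trainHVT may compute and scale the error differently, so this is an illustration of the idea only.

```r
set.seed(300)
data("EuStockMarkets")

# Standardize features, as normalize = TRUE would
x <- scale(EuStockMarkets)

# Single-level quantization into 60 cells (plain k-means, not the package's hvq)
km <- stats::kmeans(x, centers = 60, iter.max = 100)

# L1 distance from each point to the centroid of its assigned cell
d <- rowSums(abs(x - km$centers[km$cluster, ]))

# Per-cell quantization error with error_metric = "max"
qe <- tapply(d, km$cluster, max)

# Cells above the threshold would be candidates for further splitting
quant_err   <- 0.1
needs_split <- qe > quant_err
table(needs_split)
```

With depth = 1 no further splitting takes place, so the threshold mainly indicates how many cells still exceed the desired error at that level.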
