Type: Package
Title: Another Test of Association for Count Data
Version: 0.1.0
Date: 2025-12-20
Description: The Upsilon test assesses association among categorical variables against the null hypothesis of independence (Luo 2021 MS thesis; ProQuest Publication No. 28649813). While promoting dominant function patterns, it demotes non-dominant function patterns. It is robust to low expected counts, so a continuity correction such as Yates's seems unnecessary. Because it uses a common null population following a uniform distribution, contingency tables are comparable by statistical significance; this is not the case for most association tests, which define a varying null population by the tensor product of observed marginals. Although Pearson's chi-squared test, Fisher's exact test, and Woolf's G-test (related to mutual information) are useful in some contexts, the Upsilon test is appealing for ranking association patterns that do not necessarily follow the same marginal distributions, such as in count data from DNA sequencing, an important modern scientific domain.
Encoding: UTF-8
License: LGPL (≥ 3)
Imports: Rcpp (≥ 1.0.8), Rdpack, ggplot2 (≥ 3.4.0), reshape2, scales
RdMacros: Rdpack
LinkingTo: Rcpp
RoxygenNote: 7.3.3
Suggests: knitr, rmarkdown, testthat (≥ 3.0.0), DescTools, USP, metan, FunChisq, patchwork
VignetteBuilder: knitr
Config/testthat/edition: 3
NeedsCompilation: yes
Packaged: 2025-12-20 13:50:51 UTC; joesong
Author: Xuye Luo [aut], Joe Song [aut, cre]
Maintainer: Joe Song <joemsong@nmsu.edu>
Repository: CRAN
Date/Publication: 2026-01-06 11:40:07 UTC

Fast Zero-Tolerant Pearson's Chi-squared Test of Association

Description

Performs a fast zero-tolerant Pearson's chi-squared test (Pearson 1900) to evaluate association between observations from two categorical variables.

Usage

fast.chisq.test(x, y, log.p = FALSE)

Arguments

x

a vector to specify observations of the first categorical variable. The vector can be of numeric, character, or logical type. NA values must be removed or replaced before calling the function.

y

a vector to specify observations of the second categorical variable. Must not contain NA values and must be of the same length as x.

log.p

a logical. If TRUE, the natural logarithm of the p-value is calculated in closed form, which improves numerical precision when the p-value approaches zero. Defaults to FALSE.
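
To illustrate why a log-scale p-value helps, here is a minimal base R sketch. It calls stats::pchisq directly rather than any function from this package, and the statistic and degrees of freedom are made-up values chosen only so that the p-value underflows.

```r
# Hypothetical values, chosen only to make the p-value underflow.
stat <- 2000  # a very large chi-squared statistic
df <- 4       # degrees of freedom

# On the natural scale, the upper-tail p-value underflows to zero.
p <- pchisq(stat, df = df, lower.tail = FALSE)

# On the log scale, the same tail probability remains finite.
log_p <- pchisq(stat, df = df, lower.tail = FALSE, log.p = TRUE)

print(p)
print(log_p)
```

Two very strong patterns that both report a p-value of 0 on the natural scale can still be ranked by their log p-values.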

Value

A list with class "htest" containing the following components:

statistic

the value of chi-squared test statistic.

parameter

the degrees of freedom.

p.value

the p-value of the test.

estimate

Cramér's V statistic representing the effect size.

method

a character string indicating the method used.

data.name

a character string giving the names of input data.
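
For orientation on the estimate component, the textbook definition of Cramér's V can be sketched in base R as below. The helper cramers_v is hypothetical and not part of this package; it assumes a table without empty rows or columns, and how the package treats empty rows or columns when computing V is not shown here.

```r
# Hypothetical helper (not part of the package): textbook Cramér's V,
# assuming every row and column of the table has a positive total.
cramers_v <- function(tab) {
  tab <- as.matrix(tab)
  n <- sum(tab)
  # Expected counts under independence from the observed marginals
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

weather <- c("rainy", "sunny", "rainy", "sunny", "rainy")
mood <- c("wistful", "upbeat", "upbeat", "upbeat", "wistful")

v <- cramers_v(table(weather, mood))
print(v)
```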

Note

The test uses an internal hash table, instead of a matrix, to store the contingency table. Savings in both runtime and memory can be substantial if the contingency table is sparse and large. The test is implemented in C++, which gives an additional speedup over an R implementation.
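
The idea of storing only the observed (x, y) pairs can be sketched in plain R with an environment used as a hash map. This is an illustration only, not the package's C++ implementation; the helper count_pairs and the "|" key separator are assumptions made for this example, and the separator is assumed not to occur in the data.

```r
# Hypothetical helper (not the package's C++ code): count observed
# (x, y) pairs in a hash map so that empty cells use no memory.
count_pairs <- function(x, y) {
  stopifnot(length(x) == length(y))
  counts <- new.env(hash = TRUE)
  for (i in seq_along(x)) {
    # Assumes "|" does not occur in the values of x or y
    key <- paste(x[i], y[i], sep = "|")
    old <- if (exists(key, envir = counts, inherits = FALSE)) counts[[key]] else 0
    counts[[key]] <- old + 1
  }
  unlist(as.list(counts))
}

weather <- c("rainy", "sunny", "rainy", "sunny", "rainy")
mood <- c("wistful", "upbeat", "upbeat", "upbeat", "wistful")

pairs <- count_pairs(weather, mood)
print(pairs)
```

Only three of the four possible cells are stored here because the pair (sunny, wistful) never occurs; for large sparse tables the saving grows accordingly.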

References

Pearson K (1900). “X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302), 157–175. doi:10.1080/14786440009463897.

Examples

library("Upsilon")
weather <- c(
  "rainy", "sunny", "rainy", "sunny", "rainy"
)
mood <- c(
  "wistful", "upbeat", "upbeat", "upbeat", "wistful"
)

fast.chisq.test(weather, mood)

# The result is equivalent to: 
modified.chisq.test(table(weather, mood))

Fast Zero-Tolerant G-Test of Association

Description

Performs a fast zero-tolerant G-test (Woolf 1957) to evaluate association between observations from two categorical variables.

Usage

fast.gtest(x, y, log.p = FALSE)

Arguments

x

a vector to specify observations of the first categorical variable. The vector can be of numeric, character, or logical type. NA values must be removed or replaced before calling the function.

y

a vector to specify observations of the second categorical variable. Must not contain NA values and must be of the same length as x.

log.p

a logical. If TRUE, the natural logarithm of the p-value is calculated in closed form, which improves numerical precision when the p-value approaches zero. Defaults to FALSE.

Value

A list with class "htest" containing the following components:

statistic

the G-test statistic (Likelihood Ratio Chi-squared statistic).

parameter

the degrees of freedom.

p.value

the p-value of the test.

estimate

the mutual information between the two variables.

method

a character string indicating the method used.

data.name

a character string giving the names of the data.

Note

The test uses an internal hash table, instead of a matrix, to store the contingency table. Savings in both runtime and memory can be substantial if the contingency table is sparse and large. The test is implemented in C++, which gives an additional speedup over an R implementation.

References

Woolf B (1957). “The log likelihood ratio test (the G-test); methods and tables for tests of heterogeneity in contingency tables.” Annals of Human Genetics, 21(4), 397–409. doi:10.1111/j.1469-1809.1972.tb00293.x.

Examples

library("Upsilon")
weather <- c(
  "rainy", "sunny", "rainy", "sunny", "rainy"
)
mood <- c(
  "wistful", "upbeat", "upbeat", "upbeat", "wistful"
)

fast.gtest(weather, mood)

# The result is equivalent to: 
modified.gtest(table(weather, mood))

Fast Upsilon Test of Association between Two Categorical Variables

Description

Performs a fast Upsilon test (Luo 2021) to evaluate association between observations from two categorical variables.

Usage

fast.upsilon.test(x, y, log.p = FALSE)

Arguments

x

a vector to specify observations of the first categorical variable. The vector can be of numeric, character, or logical type. NA values must be removed or replaced before calling the function.

y

a vector to specify observations of the second categorical variable. Must not contain NA values and must be of the same length as x.

log.p

a logical. If TRUE, the natural logarithm of the p-value is calculated in closed form, which improves numerical precision when the p-value approaches zero. Defaults to FALSE.

Details

The Upsilon test is designed to promote dominant function patterns. In contrast to other tests of association, which favor all function patterns, it is unique in demoting non-dominant function patterns.

Null hypothesis (H_0): Row and column variables are statistically independent.

Null population: A discrete uniform distribution, where each entry in the table has the same probability.

Null distribution: Under the null hypothesis on the null population, the Upsilon test statistic asymptotically follows a chi-squared distribution with (r - 1)(c - 1) degrees of freedom, where r and c are the numbers of rows and columns of the contingency table formed from x and y.

See (Luo 2021) for full details of the Upsilon test.

Value

A list with class "htest" containing the following components:

statistic

the Upsilon test statistic.

parameter

the degrees of freedom.

p.value

the p-value of the test.

estimate

the effect size derived from the Upsilon statistic.

method

a character string indicating the method used.

data.name

a character string giving the name of input data.

Note

The test uses an internal hash table, instead of a matrix, to store the contingency table. Savings in both runtime and memory can be substantial if the contingency table is sparse and large. The test is implemented in C++, which gives an additional speedup over an R implementation.

References

Luo X (2021). The Upsilon Test for Association Between Categorical Variables. Master's thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, United States. Publication No. 28649813. ProQuest Dissertations and Theses Global.

Examples

library("Upsilon")

weather <- c(
  "rainy", "sunny", "rainy", "sunny", "rainy"
)
mood <- c(
  "wistful", "upbeat", "upbeat", "upbeat", "wistful"
)

fast.upsilon.test(weather, mood)

# The result is equivalent to: 
upsilon.test(table(weather, mood))

Zero-Tolerant Pearson's Chi-squared Statistic

Description

Calculates Pearson's chi-squared test statistic for contingency tables, ignoring entries with zero expected count.

Usage

modified.chisq.statistic(x)

Arguments

x

a matrix or data frame of floating or integer numbers to specify a contingency table. Entries must be non-negative.

Details

This test is useful if a p-value must be returned on a contingency table with valid non-negative counts, where the built-in R implementation of chisq.test could return NA as the p-value, regardless of a pattern being strong or weak. See Examples.

Unlike chisq.test, this function handles tables with empty rows or columns (where expected values are 0) by calculating the test statistic over entries with non-zero expected counts only. This prevents the result from becoming NA, while giving meaningful p-values.
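
The zero-tolerant idea described above can be sketched in base R as follows. The helper zero_tolerant_chisq is hypothetical and written only from the description on this page; the package's implementation may differ in details such as input checks.

```r
# Hypothetical helper (not the package's code): Pearson's chi-squared
# statistic summed only over cells whose expected count is positive.
zero_tolerant_chisq <- function(tab) {
  tab <- as.matrix(tab)
  # Expected counts under independence from the observed marginals
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  keep <- expected > 0  # drop cells in empty rows or columns
  sum((tab[keep] - expected[keep])^2 / expected[keep])
}

# A table with an empty third column
x <- matrix(c(0, 3, 0, 3, 0, 0), nrow = 2, byrow = TRUE)

stat <- zero_tolerant_chisq(x)
print(stat)
```

In this table each of the four retained cells has an expected count of 1.5, so the sketch sums four terms of 1.5 each.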

Value

The numeric value of the modified Pearson's chi-squared test statistic.

Note

This function only takes a contingency table as input. It does not support goodness-of-fit tests on vectors. It does not offer an option to apply Yates's continuity correction on 2 × 2 tables.

References

Luo X (2021). The Upsilon Test for Association Between Categorical Variables. Master's thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, United States. Publication No. 28649813. ProQuest Dissertations and Theses Global.

Examples

library("Upsilon")

# Create a table with empty rows or columns
x <- matrix(c(0, 3, 0, 3, 0, 0), nrow = 2, byrow = TRUE)
print(x)

# Standard chisq.test might warn or fail on a table with empty rows or columns
chisq.test(x) 

# Modified statistic handles it gracefully
modified.chisq.statistic(x)

Zero-Tolerant Pearson's Chi-squared Test for Contingency Tables

Description

Performs Pearson's chi-squared test (Pearson 1900) on contingency tables, slightly modified to handle rows or columns of all zeros.

Usage

modified.chisq.test(x, log.p = FALSE)

Arguments

x

a matrix or data frame of floating or integer numbers to specify a contingency table. Entries must be non-negative.

log.p

a logical. If TRUE, the natural logarithm of the p-value is calculated in closed form, which improves numerical precision when the p-value approaches zero. Defaults to FALSE.

Details

This test is useful if a p-value must be returned on a contingency table with valid non-negative counts, where the built-in R implementation of chisq.test could return NA as the p-value, regardless of a pattern being strong or weak. See Examples.

Unlike chisq.test, this function handles tables with empty rows or columns (where expected values are 0) by calculating the test statistic over entries with non-zero expected counts only. This prevents the result from becoming NA, while giving meaningful p-values.

Value

A list with class "htest" containing:

statistic

the chi-squared test statistic (calculated ignoring entries of 0-expected count).

parameter

the degrees of freedom.

p.value

the p-value of the test.

estimate

Cramér's V statistic.

observed

the observed counts.

expected

the expected counts under the null hypothesis.

Note

This function only takes a contingency table as input. It does not support goodness-of-fit tests on vectors. It does not offer an option to apply Yates's continuity correction on 2 × 2 tables.

References

Pearson K (1900). “X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302), 157–175. doi:10.1080/14786440009463897.

Examples

library("Upsilon")

# A table with a dominant function and an empty column
x <- matrix(
  c(0, 3, 0,
    3, 0, 0), 
   nrow = 2, byrow = TRUE)
print(x)

# Standard chisq.test fails or returns NA with a warning
chisq.test(x)

# Modified chi-squared test is significant:
modified.chisq.test(x)

Zero-Tolerant G-Test for Contingency Tables

Description

Performs G-test (Woolf 1957) on contingency tables, slightly modified to handle rows or columns of all zeros.

Usage

modified.gtest(x, log.p = FALSE)

Arguments

x

a matrix or data frame of floating or integer numbers to specify a contingency table. Entries must be non-negative.

log.p

a logical. If TRUE, the natural logarithm of the p-value is calculated in closed form, which improves numerical precision when the p-value approaches zero. Defaults to FALSE.

Details

This test is useful if a p-value must be returned on a contingency table with valid non-negative counts, where other implementations of G-test could return NA as the p-value, regardless of a pattern being strong or weak.

This function handles tables with empty rows or columns (where expected values are 0) by calculating the test statistic over non-zero entries only. This prevents the result from becoming NA, while giving meaningful p-values.
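
A minimal base R sketch of a G statistic that skips empty cells is given below. The helper zero_tolerant_g is hypothetical and not the package's code. The relation between G and mutual information used at the end (mutual information in nats equals G / (2n)) is the standard one; whether the package reports mutual information in nats or bits is not stated on this page.

```r
# Hypothetical helper (not the package's code): G statistic summed
# only over cells with a positive observed count.
zero_tolerant_g <- function(tab) {
  tab <- as.matrix(tab)
  # Expected counts under independence from the observed marginals
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  keep <- tab > 0  # a positive observed count implies a positive expected count
  2 * sum(tab[keep] * log(tab[keep] / expected[keep]))
}

# A table with an empty third column
x <- matrix(c(0, 3, 0, 3, 0, 0), nrow = 2, byrow = TRUE)

g <- zero_tolerant_g(x)
mi_nats <- g / (2 * sum(x))  # mutual information in nats

print(g)
print(mi_nats)
```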

Value

A list with class "htest" containing:

statistic

the G statistic (log-likelihood ratio).

parameter

the degrees of freedom.

p.value

the p-value of the test.

estimate

the value of mutual information.

method

a character string indicating the method used.

data.name

a character string, name of the input data.

observed

the observed counts.

expected

the expected counts under the null hypothesis.

References

Woolf B (1957). “The log likelihood ratio test (the G-test); methods and tables for tests of heterogeneity in contingency tables.” Annals of Human Genetics, 21(4), 397–409. doi:10.1111/j.1469-1809.1972.tb00293.x.

Examples

library("Upsilon")

# Create a sparse table with empty rows/cols
x <- matrix(
  c(0, 3, 0, 
    3, 0, 0), 
  nrow = 2, byrow = TRUE
)
print(x)
# Perform the modified G-test
modified.gtest(x)

Plot Matrix with Entries Represented by Balloons of Varying Sizes and Colors

Description

Creates a "balloon plot" to visualize numeric data in a matrix or contingency table.

Usage

plot_matrix(
  x,
  title = "Balloon plot",
  shape.color = c("tomato"),
  s.min = 1,
  s.max = 30,
  x.axis = NULL,
  y.axis = NULL,
  x.lab = "",
  y.lab = "",
  bg.color = "white",
  grid.color = "black",
  grid.width = 0.1,
  size.by = c("column", "row", "global", "none"),
  color.by = c("column", "row", "global", "none"),
  number.size = 6,
  shape.by = c("column", "row", ""),
  shapes = c(21, 22, 23, 24)
)

Arguments

x

a numeric matrix or table to be plotted.

title

a character string for the main title of the plot. Defaults to "Balloon plot".

shape.color

a character vector specifying one or more colors for entries (e.g., "tomato", "blue"). Defaults to "tomato".

s.min

a numeric value specifying the minimum size of the shapes. Defaults to 1.

s.max

a numeric value specifying the maximum size of the shapes. Defaults to 30.

x.axis

a character vector for custom x-axis labels. If NULL, column names of x are used. Set to "" to hide labels.

y.axis

a character vector for custom y-axis labels. If NULL, row names of x are used. Set to "" to hide labels.

x.lab

a character string for the x-axis title. Defaults to "".

y.lab

a character string for the y-axis title. Defaults to "".

bg.color

a character string for the background color of the tiles. Defaults to "white".

grid.color

a character string specifying color of grid lines (NA to remove).

grid.width

a numeric value to specify the width of grid lines.

size.by

a character string to specify how to scale the size of balloon: "column" (Default), "row", "global", or "none".

color.by

a character string to specify how to scale the color of balloon: "column" (Default), "row", "global", or "none".

number.size

a numeric value specifying the font size for text.

shape.by

a character string to specify how to choose the shape of balloon: "column" (Default), "row", or "" (none).

shapes

a vector to specify shape codes. Defaults to c(21, 22, 23, 24).

Details

Each entry in the matrix is represented by a shape, with size and color corresponding to the magnitude of the value in the entry. It offers an alternative to a heatmap for displaying count data.

Value

A ggplot object.

Examples

library(ggplot2)
mat <- matrix(c(10, 20, 30, 50, 80, 60, 40, 30), nrow = 2)
rownames(mat) <- c("Row1", "Row2")
colnames(mat) <- c("C1", "C2", "C3", "C4")

# Color by Row (Row 1 = red, Row 2 = blue)
plot_matrix(mat, color.by = "row", shape.color = c("tomato", "steelblue"))

# Color by Column (Rainbow colors)
plot_matrix(mat, color.by = "column", shape.color = c("red", "green", "blue", "orange"))

Recover Raw Data Vectors from Contingency Table

Description

Converts a contingency table (count data) back into two vectors of raw observations. This is useful when you have a summary table but need to run tests that require raw data vectors (like the functions in this package).

Usage

table.to.vectors(x)

Arguments

x

A numeric matrix or contingency table containing non-negative integer counts. Must not contain NA values.

Value

A list containing two integer vectors:

x_vector

A vector of row indices corresponding to the observations.

y_vector

A vector of column indices corresponding to the observations.
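
The conversion can be pictured with the base R sketch below. The helper table_to_vectors_sketch is hypothetical and only mimics the documented return value; the order in which the package emits observations may differ from the column-major order used here.

```r
# Hypothetical helper (not the package's code): repeat each cell's
# row and column index as many times as its count.
table_to_vectors_sketch <- function(tab) {
  tab <- as.matrix(tab)
  counts <- as.vector(tab)  # column-major order
  list(
    x_vector = rep(as.vector(row(tab)), times = counts),
    y_vector = rep(as.vector(col(tab)), times = counts)
  )
}

tab <- matrix(c(10, 5, 2, 8, 5, 10), nrow = 2, byrow = TRUE)
res <- table_to_vectors_sketch(tab)

# Cross-tabulating the recovered vectors reproduces the counts
recovered <- table(res$x_vector, res$y_vector)
print(recovered)
```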

Examples

library("Upsilon")

# Create a sample contingency table
# Rows = Variable A (levels 1,2), Cols = Variable B (levels 1,2,3)
tab <- matrix(c(10, 5, 2, 8, 5, 10), nrow = 2, byrow = TRUE)
print(tab)

# Recover the raw vectors
res <- table.to.vectors(tab)

# Check the result
length(res$x_vector) # Should be sum(tab) = 40
head(cbind(res$x_vector, res$y_vector))
table(res$x_vector, res$y_vector) # Should be the same as tab

Upsilon Goodness-of-Fit Test Statistic

Description

(FOR INTERNAL USE ONLY) Calculates the Upsilon statistic for a Goodness-of-Fit (GoF) test.

Usage

upsilon.gof.statistic(x, p = rep(1/length(x), length(x)), rescale.p = TRUE)

Arguments

x

a numeric vector or one-column matrix representing observed counts.

p

a numeric vector of probabilities of the same length as x. Defaults to a uniform distribution (1/length(x)).

rescale.p

a logical scalar. If TRUE (default), p is rescaled to sum to 1. If FALSE, and p does not sum to 1, an error is raised.

Details

This statistic measures the discrepancy between observed counts and expected probabilities.
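
As a rough illustration, the sketch below assumes the statistic is the sum of squared deviations from the expected counts divided by the average expected count, following the normalization described for the Upsilon goodness-of-fit test in this manual. The exact formula is an assumption, not taken from the package source, and upsilon_gof_sketch is a hypothetical helper.

```r
# Hypothetical helper. ASSUMPTION: the statistic is
#   sum((observed - expected)^2) / mean(expected),
# which may not match the package's exact definition.
upsilon_gof_sketch <- function(x, p = rep(1 / length(x), length(x))) {
  p <- p / sum(p)  # rescale p to sum to 1, as described for rescale.p = TRUE
  expected <- sum(x) * p
  sum((x - expected)^2) / mean(expected)
}

counts <- c(10, 20, 30)

# Uniform null: every expected count equals the average expected count,
# so this form coincides with Pearson's chi-squared statistic.
u_uniform <- upsilon_gof_sketch(counts)
pearson_uniform <- sum((counts - 20)^2 / 20)

# Null proportional to the observed counts: no discrepancy.
u_matched <- upsilon_gof_sketch(counts, p = c(1, 2, 3))

print(c(u_uniform, pearson_uniform, u_matched))
```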

Value

A numeric value of the Upsilon Goodness-of-Fit statistic.

Examples

library("Upsilon")
counts <- c(10, 20, 30)
upsilon.gof.statistic(counts)

Upsilon Goodness-of-Fit Test for Count Data

Description

(FOR INTERNAL USE ONLY) Performs the Upsilon Goodness-of-Fit test to determine if a sample of observed counts fits a specified probability distribution. The Upsilon statistic uses a specific normalization (dividing by the average expected count) which differs from the standard Pearson's Chi-squared test.

Usage

upsilon.gof.test(
  x,
  p = rep(1/length(x), length(x)),
  rescale.p = TRUE,
  log.p = FALSE
)

Arguments

x

A numeric vector representing observed counts. Must be non-negative.

p

A numeric vector of probabilities of the same length as x. Defaults to a uniform distribution (1/length(x)).

rescale.p

Logical. If TRUE (default), p is rescaled to sum to 1. If FALSE, p must sum to 1, otherwise an error is raised.

log.p

a logical. If TRUE, the natural logarithm of the p-value is calculated in closed form, which improves numerical precision when the p-value approaches zero. Defaults to FALSE.

Value

A list with class "htest" containing:

statistic

The Upsilon test statistic.

parameter

The degrees of freedom (k - 1).

p.value

The p-value of the test.

estimate

The effect size.

method

A character string indicating the method used.

data.name

A character string giving the name(s) of the data.

observed

The observed counts.

expected

The expected counts.

residuals

The Pearson residuals.

p.normalized

The probability vector used (after rescaling if applicable).

Examples

library("Upsilon")

# Test against uniform distribution
counts <- c(10, 20, 30)
upsilon.gof.test(counts)

Upsilon Test Statistic for Contingency Tables

Description

Calculates the Upsilon test statistic Υ.

Usage

upsilon.statistic(x)

Arguments

x

a matrix or data frame of floating or integer numbers to specify a contingency table. Entries must be non-negative.

Details

The Upsilon test is designed to promote dominant function patterns. In contrast to other tests of association, which favor all function patterns, it is unique in demoting non-dominant function patterns.

Null hypothesis (H_0): Row and column variables are statistically independent.

Null population: A discrete uniform distribution, where each entry in the table has the same probability.

Null distribution: The Upsilon test statistic asymptotically follows a chi-squared distribution with (nrow(x) - 1)(ncol(x) - 1) degrees of freedom, under the null hypothesis on the null population.

See (Luo 2021) for full details of the Upsilon test.
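
To make the contrast with Pearson's statistic concrete, the sketch below assumes the Upsilon statistic sums the squared deviations from the marginal-based expected counts and divides by the average expected count n / (rows × columns). This form is an assumption pieced together from the descriptions in this manual, not a confirmed formula; consult Luo (2021) for the actual definition. Both helpers are hypothetical, and pearson_sketch assumes no empty rows or columns.

```r
# Hypothetical helper. ASSUMPTION: squared deviations are divided by
# the average expected count n / (rows * columns).
upsilon_sketch <- function(tab) {
  tab <- as.matrix(tab)
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n
  sum((tab - expected)^2) / (n / length(tab))
}

# Hypothetical helper: standard Pearson statistic, where each squared
# deviation is divided by its own expected count.
pearson_sketch <- function(tab) {
  tab <- as.matrix(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

dominant <- matrix(c(2, 0, 0, 0, 2, 0, 0, 0, 2), nrow = 3, byrow = TRUE)
non_dominant <- matrix(c(4, 0, 0, 0, 1, 0, 0, 0, 1), nrow = 3, byrow = TRUE)

u_dom <- upsilon_sketch(dominant)
u_non <- upsilon_sketch(non_dominant)
p_dom <- pearson_sketch(dominant)
p_non <- pearson_sketch(non_dominant)

print(c(u_dom, u_non, p_dom, p_non))
```

Under this assumed form, the Pearson sketch gives 12 for both tables, while the Upsilon sketch gives 12 for the dominant pattern and 7.5 for the non-dominant one, which is the demotion the Details describe. The values returned by upsilon.statistic may differ if the package's definition differs.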

Value

The numeric value of the Upsilon test statistic Υ.

References

Luo X (2021). The Upsilon Test for Association Between Categorical Variables. Master's thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, United States. Publication No. 28649813. ProQuest Dissertations and Theses Global.

Examples

library("Upsilon")

# Create a contingency table
x <- matrix(c(
    0, 3, 0, 
    3, 0, 0), 
  nrow = 2, byrow = TRUE)
print(x)

# Calculate statistic
upsilon.statistic(x)

Upsilon Test of Association for Count Data

Description

Performs the Upsilon test to evaluate association among categorical variables represented by a contingency table.

Usage

upsilon.test(x, log.p = FALSE)

Arguments

x

a matrix or data frame of floating or integer numbers to specify a contingency table. Entries must be non-negative.

log.p

a logical. If TRUE, the natural logarithm of the p-value is calculated in closed form, which improves numerical precision when the p-value approaches zero. Defaults to FALSE.

Details

The Upsilon test is designed to promote dominant function patterns. In contrast to other tests of association, which favor all function patterns, it is unique in demoting non-dominant function patterns.

Null hypothesis (H_0): Row and column variables are statistically independent.

Null population: A discrete uniform distribution, where each entry in the table has the same probability.

Null distribution: The Upsilon test statistic asymptotically follows a chi-squared distribution with (nrow(x) - 1)(ncol(x) - 1) degrees of freedom, under the null hypothesis on the null population.

See (Luo 2021) for full details of the Upsilon test.

Value

A list with class "htest" containing:

statistic

the value of the Upsilon statistic.

parameter

the degrees of freedom.

p.value

the p-value.

estimate

the effect size.

method

a character string giving the test name.

data.name

a character string giving the name of input data.

observed

the observed counts, a matrix copy of the input data.

expected

the expected counts under the null hypothesis using the observed marginals.

References

Luo X (2021). The Upsilon Test for Association Between Categorical Variables. Master's thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, United States. Publication No. 28649813. ProQuest Dissertations and Theses Global.

Examples

library("Upsilon")

# A contingency table with independent row and column variables
x <- matrix(
  c(1, 1, 0, 
    1, 1, 0,
    1, 1, 0), 
  nrow = 3, byrow = TRUE
 )
 
print(x)

upsilon.test(x)

# A contingency table with a non-dominant function
x <- matrix(
  c(4, 0, 0, 
    0, 1, 0,
    0, 0, 1), 
  nrow = 3, byrow = TRUE
 )
 
print(x)

upsilon.test(x)

# A contingency table with a dominant function
x <- matrix(
  c(2, 0, 0, 
    0, 2, 0,
    0, 0, 2), 
  nrow = 3, byrow = TRUE)
  
print(x)

upsilon.test(x)

# Another contingency table with a dominant function
x <- matrix(
  c(3, 0, 0, 
    0, 3, 0,
    0, 0, 0), 
  nrow = 3, byrow = TRUE)

print(x)

upsilon.test(x)
