Help for package SLEMI

Title:

Statistical Learning Based Estimation of Mutual Information

Version:

1.0.2

Description:

The implementation of the algorithm for estimation of mutual information and channel capacity from experimental data by classification procedures (logistic regression). Technically, it allows to estimate information-theoretic measures between finite-state input and multivariate, continuous output. Method described in Jetka et al. (2019) <doi:10.1371/journal.pcbi.1007132>.

Depends:

R (≥ 3.6.0)

License:

GPL (≥ 3)

URL:

https://github.com/TJetka/SLEMI

BugReports:

https://github.com/TJetka/SLEMI/issues

Encoding:

UTF-8

LazyData:

true

Imports:

e1071, ggplot2, gridExtra, nnet, Hmisc, reshape2, stringr, doParallel, caret, corrplot, foreach, methods

Suggests:

knitr, rmarkdown, testthat (≥ 2.1.0), data.table, covr

VignetteBuilder:

knitr

RoxygenNote:

7.2.3

NeedsCompilation:

Packaged:

2023-11-19 18:32:06 UTC; tomasz_jetka

Author:

Tomasz Jetka [aut, cre], Karol Nienaltowski [ctb], Michal Komorowski [ctb]

Maintainer:

Tomasz Jetka <t.jetka@gmail.com>

Repository:

CRAN

Date/Publication:

2023-11-19 20:00:02 UTC

SLEMI: Statistical Learning Based Estimation of Mutual Information

Description

Author(s)

Maintainer: Tomasz Jetka t.jetka@gmail.com

Other contributors:

Karol Nienaltowski [contributor]
Michal Komorowski [contributor]

Removing NAs observations from a data frame

Description

Internal, auxiliary functions

Usage

aux_deleteNA_df(data)

Arguments

data

is a data.frame object

Value

a data.frame object with the same structure as data and no observation with missing (NA) values

Examples

df=data.frame(x=c(rnorm(10),NA,NA),y=c(NA,NA,rnorm(10)))
SLEMI:::aux_deleteNA_df(df)

Tuning GGplot Themes

Description

Internal, auxiliary functions

Usage

aux_theme_publ(base_size = 12, base_family = "sans", version = 1)

Arguments

base_size

integer that sets the default size of font used in the plot

base_family

character that indicates what type of font should be used

version

integer that changes the characteristic of the plot, values 1,2 and 3 accepted.

Details

This function changes the theme of plots created with the use of ggplot package

Examples

library(ggplot2)
ggplot(data=data.frame(x=1:10,y=rnorm(10)),aes(x=x,y=y))+
geom_point()+SLEMI:::aux_theme_publ(version=2)

Calculation of expression `x\cdot \log y`

Description

Internal, auxiliary functions

Usage

aux_x_log_y(x, y)

Arguments

x

is a numeric vector

y

is a numeric vector (the same length of x)

Value

Function calculates the value of expression x\cdot \log y element-wise in a numerically stable way. The result is a numeric vector of the same length as x. It is assumed that 0\cdot \log 0 = 0.

Examples

SLEMI:::aux_x_log_y(1,2)
SLEMI:::aux_x_log_y(0,0)
SLEMI:::aux_x_log_y(1000,100)

Main algorithm to calculate channel capacity by SLEMI approach

Description

Usage

capacity_logreg_algorithm(
  data,
  signal = "signal",
  response = "response",
  side_variables = NULL,
  formula_string = NULL,
  model_out = TRUE,
  cc_maxit = 100,
  lr_maxit = 1000,
  MaxNWts = 5000
)

Arguments

data

must be a data.frame object. Cannot contain NA values.

signal

is a character object with names of columns of dataRaw to be treated as channel's input.

response

is a character vector with names of columns of dataRaw to be treated as channel's output

side_variables

(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included

formula_string

(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users.

model_out

is the logical indicating if the calculated logistic regression model should be included in output list

cc_maxit

is the number of iteration of iterative optimisation of the algorithm to estimate channel capacity. Default is 100.

lr_maxit

is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000.

MaxNWts

is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000.

Value

a list with three elements:

output$cc - channel capacity in bits
output$p_opt - optimal probability distribution
output$regression - confusion matrix of logistic regression predictions
output$model - nnet object describing logistic regression model (if model_out=TRUE)

References

[1] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.

Examples

tempdata=data_example1 
outputCLR1=capacity_logreg_algorithm(data=tempdata, signal="signal",
response="response",cc_maxit=3,model_out=FALSE,
formula_string = "signal~response")

Estimate channel capacity between discrete input and continuous output

Description

The main wrapping function for basic usage of SLEMI package for estimation of channel capacity. Firstly, data is pre-processed (all arguments are checked, observation with NAs are removed, variables are scaled and centered (if scale=TRUE)). Then basic estimation is carried out and (if testing=TRUE) diagnostic tests are computed. If output directory path is given (output_path is not NULL), graphs visualising the data and the analysis are saved there, together with a compressed output object (as .rds file) with full estimation results.

Usage

capacity_logreg_main(
  dataRaw,
  signal = "input",
  response = NULL,
  output_path = NULL,
  side_variables = NULL,
  formula_string = NULL,
  cc_maxit = 100,
  lr_maxit = 1000,
  MaxNWts = 5000,
  testing = FALSE,
  model_out = TRUE,
  scale = TRUE,
  TestingSeed = 1234,
  testing_cores = 1,
  boot_num = 10,
  boot_prob = 0.8,
  sidevar_num = 10,
  traintest_num = 10,
  partition_trainfrac = 0.6,
  plot_width = 6,
  plot_height = 4,
  data_out = FALSE
)

Arguments

dataRaw

must be a data.frame object

signal

is a character object with names of columns of dataRaw to be treated as channel's input.

response

is a character vector with names of columns of dataRaw to be treated as channel's output

output_path

is the directory in which output will be saved

side_variables

(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included

formula_string

(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users.

cc_maxit

is the number of iteration of iterative optimisation of the algorithm to estimate channel capacity. Default is 100.

lr_maxit

is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000.

MaxNWts

is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000.

testing

is the logical indicating if the testing procedures should be executed

model_out

is the logical indicating if the calculated logistic regression model should be included in output list

scale

is a logical indicating if the response variables should be scaled and centered before fitting logistic regression

TestingSeed

is the seed for random number generator used in testing procedures

testing_cores

- number of cores to be used in parallel computing (via doParallel package)

boot_num

is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

boot_prob

is the proportion of initial size of data to be used in bootstrap

sidevar_num

is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

traintest_num

is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

partition_trainfrac

is the fraction of data to be used as a training dataset

plot_width

- basic dimensions (width) of plots, in inches

plot_height

- basic dimensions (height) of plots, in inches

data_out

is the logical indicating if the data should be included in output list

Details

In a typical experiment aimed to quantify information flow a given signaling system, input values x_1\leq x_2 \ldots... \leq x_m, ranging from 0 to saturation are considered. Then, for each input level, x_i, n_i observations are collected, which are represented as vectors

y^i_j \sim P(Y|X = x_i)

Within information theory the degree of information transmission is measured as the mutual information

MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(y|X = x_i)}{P(y)}dy,

where P(y) is the marginal distribution of the output. MI is expressed in bits and 2^{MI} can be interpreted as the number of inputs that the system can resolve on average.

The maximization of mutual information with respect to the input distribution, P(X), defines the information capacity, C. Formally,

C^* = max_{P(X)} MI(X,Y)

Information capacity is expressed in bits and 2^{C^*} can be interpreted as the maximal number of inputs that the system can effectively resolve.

In contrast to existing approaches, instead of estimating, possibly highly dimensional, conditional output distributions P(Y|X =x_i), we propose to estimate the discrete, conditional input distribution, P(x_i |Y = y), which is known to be a simpler problem. Estimation of the MI using estimates of P(x_i |Y = y), denoted here as \hat{P}(x_i|Y = y), is possible as the MI, can be alternatively written as

MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(x_i|Y = y)}{P(x_i)}dy

The expected value (as in above expression) with respect to distribution P(Y|X = x_i) can be approximated by the average with respect to data

MI(X,Y) \approx \sum_{i=1}^{m} P(x_i)\frac{1}{n_i} \sum_{j=1}^{n_i} P(y|X = x_i)log_2\frac{\hat{P}(x_i|Y = y^i_j)}{P(x_i)}dy

Here, we propose to use logistic regression as \hat{P}(x_i|Y = y^i_j). Specifically,

log\frac{P(x_i |Y = y)}{P(x_m|Y = y)} \approx \alpha_i +\beta_iy

Following this approach, channel capacity can be calculated by optimising MI with respect to the input distribution, P(X). However, this, potentially difficult problem, can be divided into two simpler maximization problems, for which explicit solutions exist. Therefore, channel capacity can be obtained from the two explicit solutions in an iterative procedure known as alternate maximization (similarly as in Blahut-Arimoto algorithm) [1].

Additional parameters: lr_maxit and maxNWts are the same as in definition of multinom function from nnet package. An alternative model formula (using formula_string arguments) should be provided if data are not suitable for description by logistic regression (recommended only for advanced users). Preliminary scaling of data (argument scale) should be used similarly as in other data-driven approaches, e.g. if response variables are comparable, scaling (scale=FALSE) can be omitted, while if they represent different phenomenon (varying by units and/or magnitude) scaling is recommended.

Value

a list with several elements:

output$regression - confusion matrix of logistic regression predictions
output$cc - channel capacity in bits
output$p_opt - optimal probability distribution
output$model - nnet object describing logistic regression model (if model_out=TRUE)
output$params - parameters used in algorithm
output$time - computation time of calculations
output$testing - a 2- or 4-element output list of testing procedures (if testing=TRUE)
output$testing_pv - one-sided p-values of testing procedures (if testing=TRUE)
output$data - raw data used in analysis

References

[1] Csiszar I, Tusnady G, Information geometry and alternating minimization procedures, Statistics & Decisions 1 Supplement 1 (1984), 205–237.

[2] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.

Examples

tempdata=data_example1
outputCLR1=capacity_logreg_main(dataRaw=tempdata,
signal="signal", response="response",cc_maxit = 10,
formula_string = "signal~response")

tempdata=data_example2
outputCLR2=capacity_logreg_main(dataRaw=tempdata,
signal="signal", response=c("X1","X2"),cc_maxit = 10,
formula_string = "signal~X1+X2") 

#For further details see vignette

Testing procedures for estimation of channel capacity

Description

Diagnostic procedures that allows to compute the uncertainty of estimation of channel capacity by SLEMI approach. Two main procedures are implemented: bootstrap, which execute estimation with using a fraction of data and overfitting test, which divides data into two parts: training and testing. Each of them is repeated specified number of times to obtain a distribution of our estimators. It is recommended to conduct estimation by calling capacity_logreg_main.R.

Usage

capacity_logreg_testing(
  data,
  signal = "signal",
  response = "response",
  side_variables = NULL,
  cc_maxit = 100,
  lr_maxit = 1000,
  MaxNWts = 5000,
  formula_string = NULL,
  TestingSeed = 1234,
  testing_cores = 1,
  boot_num = 10,
  boot_prob = 0.8,
  sidevar_num = 10,
  traintest_num = 10,
  partition_trainfrac = 0.6
)

Arguments

data

must be a data.frame object. Cannot contain NA values.

signal

is a character object with names of columns of dataRaw to be treated as channel's input.

response

is a character vector with names of columns of dataRaw to be treated as channel's output

side_variables

(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included

cc_maxit

is the number of iteration of iterative optimisation of the algorithm to estimate channel capacity. Default is 100.

lr_maxit

is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000.

MaxNWts

is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000.

formula_string

(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users.

TestingSeed

is the seed for random number generator used in testing procedures

testing_cores

- number of cores to be used in parallel computing (via doParallel package)

boot_num

is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

boot_prob

is the proportion of initial size of data to be used in bootstrap. Default is 0.8.

sidevar_num

is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

traintest_num

is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

partition_trainfrac

is the fraction of data to be used as a training dataset. Default is 0.6.

Details

If side variables are added within the analysis (side_variables is not NULL), two additional procedures are carried out: reshuffling test and reshuffling with bootstrap test, which are based on permutation of side variables values within the dataset. Additional parameters: lr_maxit and MaxNWts are the same as in definition of multinom function from nnet package. An alternative model formula (using formula_string arguments) should be provided if data are not suitable for description by logistic regression (recommended only for advanced users).

Value

a list with four elements:

output$bootstrap - confusion matrix of logistic regression predictions
output$resamplingMorph - channel capacity in bits
output$traintest - optimal probability distribution
output$bootResampMorph - nnet object describing logistic regression model (if model_out=TRUE)

Each of above is a list, where an element is an output of a single repetition of the channel capacity algorithm

References

Examples

## Please set boot_num and traintest_num with larger numbers 
## for a more reliable testing
tempdata=data_example1
outputCLR1_testing=capacity_logreg_testing(data=tempdata,
signal="signal", response="response",cc_maxit=10,
TestingSeed=11111, boot_num=1,boot_prob=0.8,testing_cores=1,
traintest_num=1,partition_trainfrac=0.6)

Plotting output of capacity estimation. Auxiliary functions.

Description

INPUT:

Usage

capacity_output_graph_boxplots(
  data,
  signal,
  response,
  path,
  height = 4,
  width = 6
)

capacity_output_graph_violinMean(
  data,
  signal,
  response,
  path,
  height = 4,
  width = 6
)

capacity_output_graph_boxplotsSideVar(
  data,
  signal,
  side_variables,
  path,
  height = 4,
  width = 6
)

capacity_output_graph_capacity(cc_output, path, height = 4, width = 6)

Arguments

data

is a data.frame object

signal

is a character object that indicates columns of data that should be treated as channel's input

response

is a character vector that indicates columns of data that should be treated as channel's output

path

character giving the directory, where graphs should be saved

height

integer indicating the height of a single plot

width

integer indicating the width of a single plot

side_variables

is a character vector that indicates side variables' columns of data

cc_output

a list that is the output of capacity_logreg_algorithm function

Exemplary data set I

Description

A dataset describing simple one dimensional input - one dimensional output channel with 500 observations per input. In addition, each observation is assigned to one of three types that occurs with propensities (0.6,0.3,0.1), respectively Conditional output distributions are Gaussian.

Usage

data_example1

Format

A data frame with 1500 rows and 3 variables:

signal: Label of input
response: The value of output
sideVar: Label of the type of given observation

Source

synthetic

Exemplary data set II

Description

A dataset describing a channel with 3 possible inputs and 3-dimensional output with 500 observations per input. Conditional output distributions are multivariate Gaussians.

Usage

data_example2

Format

A data frame with 1500 rows and 4 variables:

signal: Label of input
X1: The value of first dimension of output
X2: The value of second dimension of output
X3: The value of third dimension of output

Source

synthetic

Data from experiment with NFkB pathway

Description

In the paper describing methodological aspects of our algorithm we present the analysis of information transmission in NfkB pathway upon the stimulation of TNF-\alpha. Experimental data from this experiment in the form of single-cell time series are attached to the package as a data.frame object and can be accessed using 'data_nfkb' variable. Each row of ‘data_nfkb' represents a single observation of a cell. Column ’signal' indicates the level of TNF-\alpha stimulation for a given cell, while columns 'response_T', gives the normalised ratio of nuclear and cytoplasmic transcription factor as described in Supplementary Methods of the corresponding publication. In the CRAN version of the package we included only a subset of the data (5 time measurements). For the full datasets, please access GitHub pages.

Usage

data_nfkb

Format

A data frame with 15632 rows and 6 variables:

signal: Level of TNFa stimulation
response_0: The concentration of normalised NfkB transcription factor, measured at time 0
response_3: The concentration of normalised NfkB transcription factor, measured at time 3
response_21: The concentration of normalised NfkB transcription factor, measured at time 21
response_90: The concentration of normalised NfkB transcription factor, measured at time 90
response_120: The concentration of normalised NfkB transcription factor, measured at time 120

Details

For each concentration, there are at least 1000 single-cell observation (with the exception of 0.5ng stimulation, where the number of identified cells is almost 900)

Source

in-house experimental data

Formula generator for logistic regression algorithm

Description

Internal, auxiliary functions

Usage

func_formula_generator(
  signal = "signal",
  response = "response",
  side_variables = NULL
)

Arguments

signal

is a character object that indicates columns of data to be treated as channel's input

response

is a character vector that indicates columns of data to be treated as channel's output

side_variables

is a character vector that indicates side variables' columns of data

Value

A character object that includes a standard formula syntax to use in algorithm for capacity calculation

Examples

SLEMI:::func_formula_generator(signal="signal",response="response", side_variables=NULL)
SLEMI:::func_formula_generator(signal="inputX",response="responseY", side_variables="SV1")
SLEMI:::func_formula_generator(signal="signalX",response=c("r_1","r_2","r_5"), side_variables="SV")

Initial verification of input

Description

Internal, auxiliary functions

Usage

func_input_checks(data, signal, response, side_variables)

Arguments

data

is an input object that should be a data.frame

signal

is a character object that indicates input columns of data

response

is a character vector that indicates output's columns of data

side_variables

is a character vector that indicates side variables' columns of data

Value

If all initial data is valid, string "ok" is returned. Otherwise, error is given.

Examples

data=data_example1
SLEMI:::func_input_checks(data=data,signal="signal",
response="response",side_variables="sideVar")

# The following examples will give errors, because  the data has 
# inconsistent format for the analysis. Only to check the adequacy of
# initial checks. 
# data=as.matrix(data_example1)
# SLEMI:::func_input_checks(data=data,signal="signal",
# response="response",side_variables="sideVar")
# data=data_example1
# SLEMI:::func_input_checks(data=data,signal="input",
# response="response",side_variables="sideVar")

Iterative updating of prior probabilities in logistic regression estimator

Description

Internal, auxiliary functions

Usage

func_iterative_logreg_update(prob_lr, p0, cell_id, signal_levels, cc_maxit)

Arguments

prob_lr

is a matrix of class probabilities for each observation

p0

is a numeric vector of prior probabilities used for logistic regression estimation

cell_id

a list of logical vectors indicating class labels of each observation

signal_levels

is a vector of class labels

cc_maxit

is the number of iteration of procedure to be carried out

Value

A list with components

p_opt - a numeric vectors with estimated optimal input probabilities
MI_opt - a numerical value of estimated channel capacity

Initial verification and transformation of input variable

Description

Internal, auxiliary functions

Usage

func_signal_transform(data, signal)

Arguments

data

is a data.frame

signal

is a character that indicates columns of data that include the labels of input

Value

A data.frame that is a copy of data provided with signal column transformed to factor class. If signal has been numeric initially, additional column is created "signal_RAW" that is an exact copy of original column

Examples

data=data_example1
data1=SLEMI:::func_signal_transform(data,"signal")
data$signal=as.character(data$signal)
data2=SLEMI:::func_signal_transform(data,"signal")
data$signal=as.numeric(data$signal)
data3=SLEMI:::func_signal_transform(data,"signal")

Main algorithm to calculate mutual information by SLEMI approach

Description

Usage

mi_logreg_algorithm(
  data,
  signal = "signal",
  response = "response",
  side_variables = NULL,
  pinput = NULL,
  formula_string = NULL,
  lr_maxit = 1000,
  MaxNWts = 5000,
  model_out = TRUE
)

Arguments

data

must be a data.frame object. Cannot contain NA values.

signal

is a character object with names of columns of dataRaw to be treated as channel's input.

response

is a character vector with names of columns of dataRaw to be treated as channel's output

side_variables

(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included

pinput

is a numeric vector with prior probabilities of the input values. Uniform distribution is assumed as default (pinput=NULL).

formula_string

(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users.

lr_maxit

is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000.

MaxNWts

is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000.

model_out

is the logical indicating if the calculated logistic regression model should be included in output list

Value

a list with three elements:

output$mi - mutual information in bits
output$pinput - prior probabilities used in estimation
output$regression - confusion matrix of logistic regression model
output$model - nnet object describing logistic regression model (if model_out=TRUE)

References

Examples

## Estimate mutual information directly
temp_data=data_example1
output=mi_logreg_algorithm(data=data_example1,
                   signal = "signal",
                   response = "response")

Estimate mutual information between discrete input and continuous output

Description

The main wrapping function for basic usage of SLEMI package for estimation of mutual information. Firstly, data is pre-processed (all arguments are checked, observation with NAs are removed, variables are scaled and centered (if scale=TRUE)). Then basic estimation is carried out and (if testing=TRUE) diagnostic tests are computed. If output directory path is given (output_path is not NULL), graphs visualising the data and the analysis are saved there, together with a compressed output object (as .rds file) with full estimation results.

Usage

mi_logreg_main(
  dataRaw,
  signal = "input",
  response = NULL,
  output_path = NULL,
  side_variables = NULL,
  pinput = NULL,
  formula_string = NULL,
  lr_maxit = 1000,
  MaxNWts = 5000,
  testing = FALSE,
  model_out = TRUE,
  scale = TRUE,
  TestingSeed = 1234,
  testing_cores = 1,
  boot_num = 10,
  boot_prob = 0.8,
  sidevar_num = 10,
  traintest_num = 10,
  partition_trainfrac = 0.6,
  plot_width = 6,
  plot_height = 4,
  data_out = FALSE
)

Arguments

dataRaw

must be a data.frame object

signal

is a character object with names of columns of dataRaw to be treated as channel's input.

response

is a character vector with names of columns of dataRaw to be treated as channel's output

output_path

is the directory in which output will be saved

side_variables

(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included

pinput

is a numeric vector with prior probabilities of the input values. Uniform distribution is assumed as default (pinput=NULL).

formula_string

(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users.

lr_maxit

is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000.

MaxNWts

is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000.

testing

is the logical indicating if the testing procedures should be executed

model_out

is the logical indicating if the calculated logistic regression model should be included in output list

scale

is a logical indicating if the response variables should be scaled and centered before fitting logistic regression

TestingSeed

is the seed for random number generator used in testing procedures

testing_cores

- number of cores to be used in parallel computing (via doParallel package)

boot_num

is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

boot_prob

is the proportion of initial size of data to be used in bootstrap

sidevar_num

is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

traintest_num

is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

partition_trainfrac

is the fraction of data to be used as a training dataset

plot_width

- basic dimensions (width) of plots, in inches

plot_height

- basic dimensions (height) of plots, in inches

data_out

is the logical indicating if the data should be included in output list

Details

y^i_j \sim P(Y|X = x_i)

Within information theory the degree of information transmission is measured as the mutual information

MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(y|X = x_i)}{P(y)}dy,

where P(y) is the marginal distribution of the output. MI is expressed in bits and 2^{MI} can be interpreted as the number of inputs that the system can resolve on average.

In contrast to existing approaches, instead of estimating, possibly highly dimensional, conditional output distributions P(Y|X =x_i), we propose to estimate the discrete, conditional input distribution, P(x_i |Y = y), which is known to be a simpler problem. Estimation of the MI using estimates of P(x_i |Y = y), denoted here as \hat{P}(x_i|Y = y), is possible as the MI, can be alternatively written as

MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(x_i|Y = y)}{P(x_i)}dy

The expected value (as in above expression) with respect to distribution P(Y|X = x_i) can be approximated by the average with respect to data

MI(X,Y) \approx \sum_{i=1}^{m} P(x_i)\frac{1}{n_i} \sum_{j=1}^{n_i} P(y|X = x_i)log_2\frac{\hat{P}(x_i|Y = y^i_j)}{P(x_i)}dy

Here, we propose to use logistic regression as \hat{P}(x_i|Y = y^i_j). Specifically,

log\frac{P(x_i |Y = y)}{P(x_m|Y = y)} \approx \alpha_i +\beta_iy

Value

a list with several elements:

output$regression - confusion matrix of logistic regression predictions
output$mi - mutual information in bits
output$model - nnet object describing logistic regression model (if model_out=TRUE)
output$params - parameters used in algorithm
output$time - computation time of calculations
output$testing - a 2- or 4-element output list of testing procedures (if testing=TRUE)
output$testing_pv - one-sided p-values of testing procedures (if testing=TRUE)
output$data - raw data used in analysis

References

Examples

tempdata=data_example1
outputCLR1=mi_logreg_main(dataRaw=tempdata, signal="signal", response="response")

tempdata=data_example2
outputCLR2=mi_logreg_main(dataRaw=tempdata, signal="signal", response=c("X1","X2","X3")) 

#For further details see vignette

Testing procedures for estimation of mutual information

Description

Diagnostic procedures that allows to compute the uncertainty of estimation of mutual information by SLEMI approach. Two main procedures are implemented: bootstrap, which execute estimation with using a fraction of data and overfitting test, which divides data into two parts: training and testing. Each of them is repeated specified number of times to obtain a distribution of our estimators. It is recommended to call this function from mi_logreg_main.R.

Usage

mi_logreg_testing(
  data,
  signal = "signal",
  response = "response",
  side_variables = NULL,
  pinput = NULL,
  lr_maxit = 1000,
  MaxNWts = 5000,
  formula_string = NULL,
  TestingSeed = 1234,
  testing_cores = 1,
  boot_num = 10,
  boot_prob = 0.8,
  sidevar_num = 10,
  traintest_num = 10,
  partition_trainfrac = 0.6
)

Arguments

data

must be a data.frame object. Cannot contain NA values.

signal

is a character object with names of columns of dataRaw to be treated as channel's input.

response

is a character vector with names of columns of dataRaw to be treated as channel's output

side_variables

(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included

pinput

is a numeric vector with prior probabilities of the input values. Uniform distribution is assumed as default (pinput=NULL).

lr_maxit

is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000.

MaxNWts

is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000.

formula_string

(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users.

TestingSeed

is the seed for random number generator used in testing procedures

testing_cores

- number of cores to be used in parallel computing (via doParallel package)

boot_num

is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

boot_prob

is the proportion of initial size of data to be used in bootstrap

sidevar_num

is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

traintest_num

is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates.

partition_trainfrac

is the fraction of data to be used as a training dataset

Details

Value

a list with elements:

output$bootstrap - bootstrap test
output$traintest - overfitting test
output$reshuffling_sideVar - (if side_variables is not NULL) re-shuffling test
output$bootstrap_Reshuffling_sideVar - (if side_variables is not NULL) re-shuffling test with a bootstrap

Each of the above is a list, where an element is a standard output of a single mi_logreg_algorithm run.

References

Examples

## Compute  uncertainty of mutual information estimator using 1 core
## Set boot_num and traintest_num with larger numbers for more reliable testing
tempdata=data_example1
output=mi_logreg_testing(data=tempdata,
                   signal = "signal",
                   response = "response",
                   testing_cores = 1,boot_num=1,traintest_num=1)

Plotting output of capacity estimation and additional exploratory graphs.

Description

INPUT:

Usage

output_graphs_main(
  data,
  signal,
  response,
  side_variables,
  cc_output,
  output_path,
  height = 4,
  width = 6
)

Arguments

data

is a data.frame object

signal

is a character object that indicates columns of data that should be treated as channel's input

response

is a character vector that indicates columns of data that should be treated as channel's output

side_variables

is a character vector that indicates side variables' columns of data

cc_output

a list that is a standard output of capacity_logreg_algorithm function

output_path

character giving the directory, where graphs should be saved

height

integer indicating the height of a single plot

width

integer indicating the width of a single plot

Value

A list with ggplot or gtable object. Each plot is also saved in 'output_path' directory in separate pdfs files which include:

MainPlot.pdf - a simple summary plot with basic distribution visualization and capacity estimate
MainPlot_full.pdf - a comprehensive summary plot with distribution visualization and capacity estimate
capacity.pdf - a diagram presenting the capacity estimates
io_relation.pdf - a graph with input-output relation
kdensities.pdf - kernel density estimator of data distribution
histograms.pdf - histograms of data
boxplots.pdf - boxplots of data
violin.pdf - violin plots of data

Calculates Probability of pairwise discrimination

Description

Estimates probabilities of correct discrimination (PCDs) between each pair of input/signal values using a logistic regression model.

Usage

prob_discr_pairwise(
  dataRaw,
  signal = "input",
  response = NULL,
  side_variables = NULL,
  formula_string = NULL,
  output_path = NULL,
  scale = TRUE,
  lr_maxit = 1000,
  MaxNWts = 5000,
  diagnostics = TRUE
)

Arguments

dataRaw

must be a data.frame object

signal

is a character object with names of columns of dataRaw to be treated as channel's input.

response

is a character vector with names of columns of dataRaw to be treated as channel's output

side_variables

(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included

formula_string

(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users.

output_path

is a directory where a pie chart with calculated probabilities will be saved. If NULL, the graph will not be created.

scale

is a logical indicating if the response variables should be scaled and centered before fitting logistic regression

lr_maxit

is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000.

MaxNWts

is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000.

diagnostics

is a logical indicating if details of logistic regression fitting should be included in output list

Details

In order to estimate PCDs, for a given pair of input values x_i and x_j, we propose to fit a logistic regression model using response data corresponding to the two considered inputs, i.e. y^l_u, for l\in\{i,j\} and u ranging from 1 to n_l. To ensure that both inputs have equal contribution to the calculated discriminability, equal probabilities should be assigned, P(X) = (P(x_i),P(x_j))=(1/2,1/2). Once the regression model is fitted, probability of assigning a given cellular response, y, to the correct input value is estimated as

\max \{ \hat{P}_{lr}(x_i|Y=y;P(X)), \hat{P}_{lr}(x_j|Y=y;P(X))\}.

Note that P(x_j|Y=y)=1-P(x_i|Y=y) as well as \hat{P}_{lr}(x_j|Y=y;P(X))=1-\hat{P}_{lr}(x_i|Y=y;P(X)) The average of the above probabilities over all observations y^i_l yields PCDs

PCD_{x_i,x_j}=\frac{1}{2}\frac{1}{n_i}\sum_{l=1}^{n_i}\max\{ \hat{P}_{lr}(x_i|Y=y_i^l;P(X)),\hat{P}_{lr}(x_i^l|Y=y;P(X))\} +

\frac{1}{2} \frac{1}{n_j} \sum_{l=1}^{n_j} \max \{ \hat{P}_{lr}(x_i|Y=y_j^l;P(X)), \hat{P}_{lr}(x_j|Y=y_j^l;P(X))\}.

Value

a list with two elements:

output$prob_matr - a n\times n matrix, where n is the number of inputs, with probabilities of correct discrimination between pairs of input values.
output$diagnostics - (if diagnostics=TRUE) a list corresponding to logistic regression models fitted for each pair of input values. Each element consists of three sub-elements: 1) nnet_model - nnet object summarising logistic regression model; 2) prob_lr - probabilities of assignment obtained from logistic regression model; 3) confusion_matrix - confusion matrix of classificator.

References

Examples

## Calculate probabilities of discrimination for nfkb dataset
 it=21 # choose from 0, 3, 6, ..., 120 for measurements at other time points
 output=prob_discr_pairwise(dataRaw=data_nfkb[data_nfkb$signal%in%c("0ng","1ng","100ng"),],
                            signal = "signal",
                           response = paste0("response_",it))

Sampling procedures used for testing capacity algorithm

Description

Internal, auxiliary functions

Usage

sampling_bootstrap(data, prob = 1, dataDiv)

sampling_shuffle(data, side_variables)

sampling_partition(data, dataDiv, partition_trainfrac)

Arguments

data

is a data.frame to be resampled

prob

is numeric for the portion of data that should be sampled from the whole dataset (only in sampling_bootstrap)

dataDiv

a character indicating column of data, with respect to which, data should be split before bootstrap

side_variables

is a vector of characters indicating columns of data the will be reshuffled (only in sampling_shuffle)

partition_trainfrac

is a numeric for the portion of data that will be used as a training and testing datasets

Details

These function allow to re-sample, bootstrap and divide initial dataset

Value

Function sampling_bootstrap returns a data.frame with the same structure as initial data object, but with prob proportion of observations for each dataDiv level. Function sampling_shuffle returns a data.frame with the same structure as initial data object with shuffled values of columns given in side_variables argument. Function sampling_partition returns a list of two data.frame objects - train and test that has the same structure as initial data argument with partition_trainfrac and 1-partition_trainfrac observations, respectively.

Examples

data=data_example1
dataBootstrap = SLEMI:::sampling_bootstrap(data=data,prob=0.8,data$signal)
dataShuffle = SLEMI:::sampling_shuffle(data=data,"sideVar")
dataTrainTest = SLEMI:::sampling_partition(data=data,dataDiv=data$signal,partition_trainfrac=0.6)

SLEMI: Statistical Learning Based Estimation of Mutual Information

Description

Author(s)

See Also

Removing NAs observations from a data frame

Description

Usage

Arguments

Value

Examples

Tuning GGplot Themes

Description

Usage

Arguments

Details

Examples

Calculation of expression x\cdot \log y

Description

Usage

Arguments

Value

Examples

Main algorithm to calculate channel capacity by SLEMI approach

Description

Usage

Arguments

Value

References

Examples

Estimate channel capacity between discrete input and continuous output

Description

Usage

Arguments

Details

Value

References

Examples

Testing procedures for estimation of channel capacity

Description

Usage

Arguments

Details

Value

References

Examples

Plotting output of capacity estimation. Auxiliary functions.

Description

Usage

Arguments

Exemplary data set I

Description

Usage

Format

Source

Exemplary data set II

Description

Usage

Format

Source

Data from experiment with NFkB pathway

Description

Usage

Format

Details

Source

Formula generator for logistic regression algorithm

Description

Usage

Arguments

Value

Examples

Initial verification of input

Description

Usage

Arguments

Value

Examples

Iterative updating of prior probabilities in logistic regression estimator

Description

Usage

Calculation of expression `x\cdot \log y`