Title: | Statistical Learning Based Estimation of Mutual Information |
Version: | 1.0.2 |
Description: | The implementation of the algorithm for estimation of mutual information and channel capacity from experimental data by classification procedures (logistic regression). Technically, it allows to estimate information-theoretic measures between finite-state input and multivariate, continuous output. Method described in Jetka et al. (2019) <doi:10.1371/journal.pcbi.1007132>. |
Depends: | R (≥ 3.6.0) |
License: | GPL (≥ 3) |
URL: | https://github.com/TJetka/SLEMI |
BugReports: | https://github.com/TJetka/SLEMI/issues |
Encoding: | UTF-8 |
LazyData: | true |
Imports: | e1071, ggplot2, gridExtra, nnet, Hmisc, reshape2, stringr, doParallel, caret, corrplot, foreach, methods |
Suggests: | knitr, rmarkdown, testthat (≥ 2.1.0), data.table, covr |
VignetteBuilder: | knitr |
RoxygenNote: | 7.2.3 |
NeedsCompilation: | no |
Packaged: | 2023-11-19 18:32:06 UTC; tomasz_jetka |
Author: | Tomasz Jetka [aut, cre], Karol Nienaltowski [ctb], Michal Komorowski [ctb] |
Maintainer: | Tomasz Jetka <t.jetka@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2023-11-19 20:00:02 UTC |
SLEMI: Statistical Learning Based Estimation of Mutual Information
Description
The implementation of the algorithm for estimation of mutual information and channel capacity from experimental data by classification procedures (logistic regression). Technically, it allows to estimate information-theoretic measures between finite-state input and multivariate, continuous output. Method described in Jetka et al. (2019) doi:10.1371/journal.pcbi.1007132.
Author(s)
Maintainer: Tomasz Jetka t.jetka@gmail.com
Other contributors:
Karol Nienaltowski [contributor]
Michal Komorowski [contributor]
See Also
Useful links:
Removing NAs observations from a data frame
Description
Internal, auxiliary functions
Usage
aux_deleteNA_df(data)
Arguments
data |
is a data.frame object |
Value
a data.frame object with the same structure as data and no observation with missing (NA) values
Examples
df=data.frame(x=c(rnorm(10),NA,NA),y=c(NA,NA,rnorm(10)))
SLEMI:::aux_deleteNA_df(df)
Tuning GGplot Themes
Description
Internal, auxiliary functions
Usage
aux_theme_publ(base_size = 12, base_family = "sans", version = 1)
Arguments
base_size |
integer that sets the default size of font used in the plot |
base_family |
character that indicates what type of font should be used |
version |
integer that changes the characteristic of the plot, values 1,2 and 3 accepted. |
Details
This function changes the theme of plots created with the use of ggplot package
Examples
library(ggplot2)
ggplot(data=data.frame(x=1:10,y=rnorm(10)),aes(x=x,y=y))+
geom_point()+SLEMI:::aux_theme_publ(version=2)
Calculation of expression x\cdot \log y
Description
Internal, auxiliary functions
Usage
aux_x_log_y(x, y)
Arguments
x |
is a numeric vector |
y |
is a numeric vector (the same length of x) |
Value
Function calculates the value of expression x\cdot \log y
element-wise in a numerically stable way.
The result is a numeric vector of the same length as x. It is assumed that 0\cdot \log 0 = 0
.
Examples
SLEMI:::aux_x_log_y(1,2)
SLEMI:::aux_x_log_y(0,0)
SLEMI:::aux_x_log_y(1000,100)
Main algorithm to calculate channel capacity by SLEMI approach
Description
Additional parameters: lr_maxit and maxNWts are the same as in definition of multinom function from nnet package. An alternative model formula (using formula_string arguments) should be provided if data are not suitable for description by logistic regression (recommended only for advanced users). It is recommended to conduct estimation by calling capacity_logreg_main.R.
Usage
capacity_logreg_algorithm(
data,
signal = "signal",
response = "response",
side_variables = NULL,
formula_string = NULL,
model_out = TRUE,
cc_maxit = 100,
lr_maxit = 1000,
MaxNWts = 5000
)
Arguments
data |
must be a data.frame object. Cannot contain NA values. |
signal |
is a character object with names of columns of dataRaw to be treated as channel's input. |
response |
is a character vector with names of columns of dataRaw to be treated as channel's output |
side_variables |
(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included |
formula_string |
(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users. |
model_out |
is the logical indicating if the calculated logistic regression model should be included in output list |
cc_maxit |
is the number of iteration of iterative optimisation of the algorithm to estimate channel capacity. Default is 100. |
lr_maxit |
is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000. |
MaxNWts |
is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000. |
Value
a list with three elements:
output$cc - channel capacity in bits
output$p_opt - optimal probability distribution
output$regression - confusion matrix of logistic regression predictions
output$model - nnet object describing logistic regression model (if model_out=TRUE)
References
[1] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.
Examples
tempdata=data_example1
outputCLR1=capacity_logreg_algorithm(data=tempdata, signal="signal",
response="response",cc_maxit=3,model_out=FALSE,
formula_string = "signal~response")
Estimate channel capacity between discrete input and continuous output
Description
The main wrapping function for basic usage of SLEMI package for estimation of channel capacity. Firstly, data is pre-processed (all arguments are checked, observation with NAs are removed, variables are scaled and centered (if scale=TRUE)). Then basic estimation is carried out and (if testing=TRUE) diagnostic tests are computed. If output directory path is given (output_path is not NULL), graphs visualising the data and the analysis are saved there, together with a compressed output object (as .rds file) with full estimation results.
Usage
capacity_logreg_main(
dataRaw,
signal = "input",
response = NULL,
output_path = NULL,
side_variables = NULL,
formula_string = NULL,
cc_maxit = 100,
lr_maxit = 1000,
MaxNWts = 5000,
testing = FALSE,
model_out = TRUE,
scale = TRUE,
TestingSeed = 1234,
testing_cores = 1,
boot_num = 10,
boot_prob = 0.8,
sidevar_num = 10,
traintest_num = 10,
partition_trainfrac = 0.6,
plot_width = 6,
plot_height = 4,
data_out = FALSE
)
Arguments
dataRaw |
must be a data.frame object |
signal |
is a character object with names of columns of dataRaw to be treated as channel's input. |
response |
is a character vector with names of columns of dataRaw to be treated as channel's output |
output_path |
is the directory in which output will be saved |
side_variables |
(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included |
formula_string |
(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users. |
cc_maxit |
is the number of iteration of iterative optimisation of the algorithm to estimate channel capacity. Default is 100. |
lr_maxit |
is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000. |
MaxNWts |
is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000. |
testing |
is the logical indicating if the testing procedures should be executed |
model_out |
is the logical indicating if the calculated logistic regression model should be included in output list |
scale |
is a logical indicating if the response variables should be scaled and centered before fitting logistic regression |
TestingSeed |
is the seed for random number generator used in testing procedures |
testing_cores |
- number of cores to be used in parallel computing (via doParallel package) |
boot_num |
is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
boot_prob |
is the proportion of initial size of data to be used in bootstrap |
sidevar_num |
is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
traintest_num |
is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
partition_trainfrac |
is the fraction of data to be used as a training dataset |
plot_width |
- basic dimensions (width) of plots, in inches |
plot_height |
- basic dimensions (height) of plots, in inches |
data_out |
is the logical indicating if the data should be included in output list |
Details
In a typical experiment aimed to quantify information flow a given signaling system, input values x_1\leq x_2 \ldots... \leq x_m
, ranging from 0 to saturation are considered.
Then, for each input level, x_i
, n_i
observations are collected, which are represented as vectors
y^i_j \sim P(Y|X = x_i)
Within information theory the degree of information transmission is measured as the mutual information
MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(y|X = x_i)}{P(y)}dy,
where P(y)
is the marginal distribution of the output. MI is expressed in bits and 2^{MI}
can be interpreted as the number of
inputs that the system can resolve on average.
The maximization of mutual information with respect to the input distribution, P(X)
,
defines the information capacity, C. Formally,
C^* = max_{P(X)} MI(X,Y)
Information capacity is expressed in bits and 2^{C^*}
can be interpreted as the maximal number of inputs that the system can
effectively resolve.
In contrast to existing approaches, instead of estimating, possibly highly dimensional, conditional output distributions P(Y|X =x_i), we propose to estimate the discrete, conditional input distribution,
P(x_i |Y = y)
, which is known to be a simpler problem. Estimation of the MI using estimates of P(x_i |Y = y)
, denoted here as \hat{P}(x_i|Y = y)
, is possible as the MI, can be
alternatively written as
MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(x_i|Y = y)}{P(x_i)}dy
The expected value (as in above expression) with respect to distribution P(Y|X = x_i)
can be approximated by the average with respect to data
MI(X,Y) \approx \sum_{i=1}^{m} P(x_i)\frac{1}{n_i} \sum_{j=1}^{n_i} P(y|X = x_i)log_2\frac{\hat{P}(x_i|Y = y^i_j)}{P(x_i)}dy
Here, we propose to use logistic regression as \hat{P}(x_i|Y = y^i_j)
. Specifically,
log\frac{P(x_i |Y = y)}{P(x_m|Y = y)} \approx \alpha_i +\beta_iy
Following this approach, channel capacity can be calculated by optimising MI with respect to the input distribution, P(X)
.
However, this, potentially difficult problem, can be divided into two simpler maximization problems, for which explicit solutions exist.
Therefore, channel capacity can be obtained from the two explicit solutions in an iterative procedure known as alternate maximization (similarly as in Blahut-Arimoto algorithm) [1].
Additional parameters: lr_maxit and maxNWts are the same as in definition of multinom function from nnet package. An alternative model formula (using formula_string arguments) should be provided if data are not suitable for description by logistic regression (recommended only for advanced users). Preliminary scaling of data (argument scale) should be used similarly as in other data-driven approaches, e.g. if response variables are comparable, scaling (scale=FALSE) can be omitted, while if they represent different phenomenon (varying by units and/or magnitude) scaling is recommended.
Value
a list with several elements:
output$regression - confusion matrix of logistic regression predictions
output$cc - channel capacity in bits
output$p_opt - optimal probability distribution
output$model - nnet object describing logistic regression model (if model_out=TRUE)
output$params - parameters used in algorithm
output$time - computation time of calculations
output$testing - a 2- or 4-element output list of testing procedures (if testing=TRUE)
output$testing_pv - one-sided p-values of testing procedures (if testing=TRUE)
output$data - raw data used in analysis
References
[1] Csiszar I, Tusnady G, Information geometry and alternating minimization procedures, Statistics & Decisions 1 Supplement 1 (1984), 205–237.
[2] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.
Examples
tempdata=data_example1
outputCLR1=capacity_logreg_main(dataRaw=tempdata,
signal="signal", response="response",cc_maxit = 10,
formula_string = "signal~response")
tempdata=data_example2
outputCLR2=capacity_logreg_main(dataRaw=tempdata,
signal="signal", response=c("X1","X2"),cc_maxit = 10,
formula_string = "signal~X1+X2")
#For further details see vignette
Testing procedures for estimation of channel capacity
Description
Diagnostic procedures that allows to compute the uncertainty of estimation of channel capacity by SLEMI approach. Two main procedures are implemented: bootstrap, which execute estimation with using a fraction of data and overfitting test, which divides data into two parts: training and testing. Each of them is repeated specified number of times to obtain a distribution of our estimators. It is recommended to conduct estimation by calling capacity_logreg_main.R.
Usage
capacity_logreg_testing(
data,
signal = "signal",
response = "response",
side_variables = NULL,
cc_maxit = 100,
lr_maxit = 1000,
MaxNWts = 5000,
formula_string = NULL,
TestingSeed = 1234,
testing_cores = 1,
boot_num = 10,
boot_prob = 0.8,
sidevar_num = 10,
traintest_num = 10,
partition_trainfrac = 0.6
)
Arguments
data |
must be a data.frame object. Cannot contain NA values. |
signal |
is a character object with names of columns of dataRaw to be treated as channel's input. |
response |
is a character vector with names of columns of dataRaw to be treated as channel's output |
side_variables |
(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included |
cc_maxit |
is the number of iteration of iterative optimisation of the algorithm to estimate channel capacity. Default is 100. |
lr_maxit |
is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000. |
MaxNWts |
is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000. |
formula_string |
(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users. |
TestingSeed |
is the seed for random number generator used in testing procedures |
testing_cores |
- number of cores to be used in parallel computing (via doParallel package) |
boot_num |
is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
boot_prob |
is the proportion of initial size of data to be used in bootstrap. Default is 0.8. |
sidevar_num |
is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
traintest_num |
is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
partition_trainfrac |
is the fraction of data to be used as a training dataset. Default is 0.6. |
Details
If side variables are added within the analysis (side_variables is not NULL), two additional procedures are carried out: reshuffling test and reshuffling with bootstrap test, which are based on permutation of side variables values within the dataset. Additional parameters: lr_maxit and MaxNWts are the same as in definition of multinom function from nnet package. An alternative model formula (using formula_string arguments) should be provided if data are not suitable for description by logistic regression (recommended only for advanced users).
Value
a list with four elements:
output$bootstrap - confusion matrix of logistic regression predictions
output$resamplingMorph - channel capacity in bits
output$traintest - optimal probability distribution
output$bootResampMorph - nnet object describing logistic regression model (if model_out=TRUE)
Each of above is a list, where an element is an output of a single repetition of the channel capacity algorithm
References
[1] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.
Examples
## Please set boot_num and traintest_num with larger numbers
## for a more reliable testing
tempdata=data_example1
outputCLR1_testing=capacity_logreg_testing(data=tempdata,
signal="signal", response="response",cc_maxit=10,
TestingSeed=11111, boot_num=1,boot_prob=0.8,testing_cores=1,
traintest_num=1,partition_trainfrac=0.6)
Plotting output of capacity estimation. Auxiliary functions.
Description
INPUT:
Usage
capacity_output_graph_boxplots(
data,
signal,
response,
path,
height = 4,
width = 6
)
capacity_output_graph_violinMean(
data,
signal,
response,
path,
height = 4,
width = 6
)
capacity_output_graph_boxplotsSideVar(
data,
signal,
side_variables,
path,
height = 4,
width = 6
)
capacity_output_graph_capacity(cc_output, path, height = 4, width = 6)
Arguments
data |
is a data.frame object |
signal |
is a character object that indicates columns of data that should be treated as channel's input |
response |
is a character vector that indicates columns of data that should be treated as channel's output |
path |
character giving the directory, where graphs should be saved |
height |
integer indicating the height of a single plot |
width |
integer indicating the width of a single plot |
side_variables |
is a character vector that indicates side variables' columns of data |
cc_output |
a list that is the output of capacity_logreg_algorithm function |
Exemplary data set I
Description
A dataset describing simple one dimensional input - one dimensional output channel with 500 observations per input. In addition, each observation is assigned to one of three types that occurs with propensities (0.6,0.3,0.1), respectively Conditional output distributions are Gaussian.
Usage
data_example1
Format
A data frame with 1500 rows and 3 variables:
- signal
Label of input
- response
The value of output
- sideVar
Label of the type of given observation
Source
synthetic
Exemplary data set II
Description
A dataset describing a channel with 3 possible inputs and 3-dimensional output with 500 observations per input. Conditional output distributions are multivariate Gaussians.
Usage
data_example2
Format
A data frame with 1500 rows and 4 variables:
- signal
Label of input
- X1
The value of first dimension of output
- X2
The value of second dimension of output
- X3
The value of third dimension of output
Source
synthetic
Data from experiment with NFkB pathway
Description
In the paper describing methodological aspects of our algorithm we present the analysis of information transmission
in NfkB pathway upon the stimulation of TNF-\alpha
. Experimental data from this experiment in the form of single-cell
time series are attached to the package as a data.frame object and can be accessed using 'data_nfkb' variable.
Each row of ‘data_nfkb' represents a single observation of a cell. Column ’signal' indicates the level of TNF-\alpha
stimulation for a given cell, while columns 'response_T', gives the normalised ratio of nuclear and cytoplasmic transcription
factor as described in Supplementary Methods of the corresponding publication. In the CRAN version of the package
we included only a subset of the data (5 time measurements). For the full datasets, please access GitHub pages.
Usage
data_nfkb
Format
A data frame with 15632 rows and 6 variables:
- signal
Level of TNFa stimulation
- response_0
The concentration of normalised NfkB transcription factor, measured at time 0
- response_3
The concentration of normalised NfkB transcription factor, measured at time 3
- response_21
The concentration of normalised NfkB transcription factor, measured at time 21
- response_90
The concentration of normalised NfkB transcription factor, measured at time 90
- response_120
The concentration of normalised NfkB transcription factor, measured at time 120
#'
Details
For each concentration, there are at least 1000 single-cell observation (with the exception of 0.5ng stimulation, where the number of identified cells is almost 900)
Source
in-house experimental data
Formula generator for logistic regression algorithm
Description
Internal, auxiliary functions
Usage
func_formula_generator(
signal = "signal",
response = "response",
side_variables = NULL
)
Arguments
signal |
is a character object that indicates columns of data to be treated as channel's input |
response |
is a character vector that indicates columns of data to be treated as channel's output |
side_variables |
is a character vector that indicates side variables' columns of data |
Value
A character object that includes a standard formula syntax to use in algorithm for capacity calculation
Examples
SLEMI:::func_formula_generator(signal="signal",response="response", side_variables=NULL)
SLEMI:::func_formula_generator(signal="inputX",response="responseY", side_variables="SV1")
SLEMI:::func_formula_generator(signal="signalX",response=c("r_1","r_2","r_5"), side_variables="SV")
Initial verification of input
Description
Internal, auxiliary functions
Usage
func_input_checks(data, signal, response, side_variables)
Arguments
data |
is an input object that should be a data.frame |
signal |
is a character object that indicates input columns of data |
response |
is a character vector that indicates output's columns of data |
side_variables |
is a character vector that indicates side variables' columns of data |
Value
If all initial data is valid, string "ok" is returned. Otherwise, error is given.
Examples
data=data_example1
SLEMI:::func_input_checks(data=data,signal="signal",
response="response",side_variables="sideVar")
# The following examples will give errors, because the data has
# inconsistent format for the analysis. Only to check the adequacy of
# initial checks.
# data=as.matrix(data_example1)
# SLEMI:::func_input_checks(data=data,signal="signal",
# response="response",side_variables="sideVar")
# data=data_example1
# SLEMI:::func_input_checks(data=data,signal="input",
# response="response",side_variables="sideVar")
Iterative updating of prior probabilities in logistic regression estimator
Description
Internal, auxiliary functions
Usage
func_iterative_logreg_update(prob_lr, p0, cell_id, signal_levels, cc_maxit)
Arguments
prob_lr |
is a matrix of class probabilities for each observation |
p0 |
is a numeric vector of prior probabilities used for logistic regression estimation |
cell_id |
a list of logical vectors indicating class labels of each observation |
signal_levels |
is a vector of class labels |
cc_maxit |
is the number of iteration of procedure to be carried out |
Value
A list with components
p_opt - a numeric vectors with estimated optimal input probabilities
MI_opt - a numerical value of estimated channel capacity
Initial verification and transformation of input variable
Description
Internal, auxiliary functions
Usage
func_signal_transform(data, signal)
Arguments
data |
is a data.frame |
signal |
is a character that indicates columns of data that include the labels of input |
Value
A data.frame that is a copy of data provided with signal column transformed to factor class. If signal has been numeric initially, additional column is created "signal_RAW" that is an exact copy of original column
Examples
data=data_example1
data1=SLEMI:::func_signal_transform(data,"signal")
data$signal=as.character(data$signal)
data2=SLEMI:::func_signal_transform(data,"signal")
data$signal=as.numeric(data$signal)
data3=SLEMI:::func_signal_transform(data,"signal")
Main algorithm to calculate mutual information by SLEMI approach
Description
Additional parameters: lr_maxit and maxNWts are the same as in definition of multinom function from nnet package. An alternative model formula (using formula_string arguments) should be provided if data are not suitable for description by logistic regression (recommended only for advanced users). It is recommended to conduct estimation by calling mi_logreg_main.R.
Usage
mi_logreg_algorithm(
data,
signal = "signal",
response = "response",
side_variables = NULL,
pinput = NULL,
formula_string = NULL,
lr_maxit = 1000,
MaxNWts = 5000,
model_out = TRUE
)
Arguments
data |
must be a data.frame object. Cannot contain NA values. |
signal |
is a character object with names of columns of dataRaw to be treated as channel's input. |
response |
is a character vector with names of columns of dataRaw to be treated as channel's output |
side_variables |
(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included |
pinput |
is a numeric vector with prior probabilities of the input values. Uniform distribution is assumed as default (pinput=NULL). |
formula_string |
(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users. |
lr_maxit |
is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000. |
MaxNWts |
is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000. |
model_out |
is the logical indicating if the calculated logistic regression model should be included in output list |
Value
a list with three elements:
output$mi - mutual information in bits
output$pinput - prior probabilities used in estimation
output$regression - confusion matrix of logistic regression model
output$model - nnet object describing logistic regression model (if model_out=TRUE)
References
[1] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.
Examples
## Estimate mutual information directly
temp_data=data_example1
output=mi_logreg_algorithm(data=data_example1,
signal = "signal",
response = "response")
Estimate mutual information between discrete input and continuous output
Description
The main wrapping function for basic usage of SLEMI package for estimation of mutual information. Firstly, data is pre-processed (all arguments are checked, observation with NAs are removed, variables are scaled and centered (if scale=TRUE)). Then basic estimation is carried out and (if testing=TRUE) diagnostic tests are computed. If output directory path is given (output_path is not NULL), graphs visualising the data and the analysis are saved there, together with a compressed output object (as .rds file) with full estimation results.
Usage
mi_logreg_main(
dataRaw,
signal = "input",
response = NULL,
output_path = NULL,
side_variables = NULL,
pinput = NULL,
formula_string = NULL,
lr_maxit = 1000,
MaxNWts = 5000,
testing = FALSE,
model_out = TRUE,
scale = TRUE,
TestingSeed = 1234,
testing_cores = 1,
boot_num = 10,
boot_prob = 0.8,
sidevar_num = 10,
traintest_num = 10,
partition_trainfrac = 0.6,
plot_width = 6,
plot_height = 4,
data_out = FALSE
)
Arguments
dataRaw |
must be a data.frame object |
signal |
is a character object with names of columns of dataRaw to be treated as channel's input. |
response |
is a character vector with names of columns of dataRaw to be treated as channel's output |
output_path |
is the directory in which output will be saved |
side_variables |
(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included |
pinput |
is a numeric vector with prior probabilities of the input values. Uniform distribution is assumed as default (pinput=NULL). |
formula_string |
(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users. |
lr_maxit |
is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000. |
MaxNWts |
is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000. |
testing |
is the logical indicating if the testing procedures should be executed |
model_out |
is the logical indicating if the calculated logistic regression model should be included in output list |
scale |
is a logical indicating if the response variables should be scaled and centered before fitting logistic regression |
TestingSeed |
is the seed for random number generator used in testing procedures |
testing_cores |
- number of cores to be used in parallel computing (via doParallel package) |
boot_num |
is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
boot_prob |
is the proportion of initial size of data to be used in bootstrap |
sidevar_num |
is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
traintest_num |
is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
partition_trainfrac |
is the fraction of data to be used as a training dataset |
plot_width |
- basic dimensions (width) of plots, in inches |
plot_height |
- basic dimensions (height) of plots, in inches |
data_out |
is the logical indicating if the data should be included in output list |
Details
In a typical experiment aimed to quantify information flow a given signaling system, input values x_1\leq x_2 \ldots... \leq x_m
, ranging from 0 to saturation are considered.
Then, for each input level, x_i
, n_i
observations are collected, which are represented as vectors
y^i_j \sim P(Y|X = x_i)
Within information theory the degree of information transmission is measured as the mutual information
MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(y|X = x_i)}{P(y)}dy,
where P(y)
is the marginal distribution of the output. MI is expressed in bits and 2^{MI}
can be interpreted as the number of
inputs that the system can resolve on average.
In contrast to existing approaches, instead of estimating, possibly highly dimensional, conditional output distributions P(Y|X =x_i)
, we propose to estimate the discrete, conditional input distribution,
P(x_i |Y = y)
, which is known to be a simpler problem. Estimation of the MI using estimates of P(x_i |Y = y)
, denoted here as \hat{P}(x_i|Y = y)
, is possible as the MI, can be
alternatively written as
MI(X,Y) = \sum_{i=1}^{m} P(x_i)\int_{R^k} P(y|X = x_i)log_2\frac{P(x_i|Y = y)}{P(x_i)}dy
The expected value (as in above expression) with respect to distribution P(Y|X = x_i)
can be approximated by the average with respect to data
MI(X,Y) \approx \sum_{i=1}^{m} P(x_i)\frac{1}{n_i} \sum_{j=1}^{n_i} P(y|X = x_i)log_2\frac{\hat{P}(x_i|Y = y^i_j)}{P(x_i)}dy
Here, we propose to use logistic regression as \hat{P}(x_i|Y = y^i_j)
. Specifically,
log\frac{P(x_i |Y = y)}{P(x_m|Y = y)} \approx \alpha_i +\beta_iy
Additional parameters: lr_maxit and maxNWts are the same as in definition of multinom function from nnet package. An alternative model formula (using formula_string arguments) should be provided if data are not suitable for description by logistic regression (recommended only for advanced users). Preliminary scaling of data (argument scale) should be used similarly as in other data-driven approaches, e.g. if response variables are comparable, scaling (scale=FALSE) can be omitted, while if they represent different phenomenon (varying by units and/or magnitude) scaling is recommended.
Value
a list with several elements:
output$regression - confusion matrix of logistic regression predictions
output$mi - mutual information in bits
output$model - nnet object describing logistic regression model (if model_out=TRUE)
output$params - parameters used in algorithm
output$time - computation time of calculations
output$testing - a 2- or 4-element output list of testing procedures (if testing=TRUE)
output$testing_pv - one-sided p-values of testing procedures (if testing=TRUE)
output$data - raw data used in analysis
References
[1] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.
Examples
tempdata=data_example1
outputCLR1=mi_logreg_main(dataRaw=tempdata, signal="signal", response="response")
tempdata=data_example2
outputCLR2=mi_logreg_main(dataRaw=tempdata, signal="signal", response=c("X1","X2","X3"))
#For further details see vignette
Testing procedures for estimation of mutual information
Description
Diagnostic procedures that allows to compute the uncertainty of estimation of mutual information by SLEMI approach. Two main procedures are implemented: bootstrap, which execute estimation with using a fraction of data and overfitting test, which divides data into two parts: training and testing. Each of them is repeated specified number of times to obtain a distribution of our estimators. It is recommended to call this function from mi_logreg_main.R.
Usage
mi_logreg_testing(
data,
signal = "signal",
response = "response",
side_variables = NULL,
pinput = NULL,
lr_maxit = 1000,
MaxNWts = 5000,
formula_string = NULL,
TestingSeed = 1234,
testing_cores = 1,
boot_num = 10,
boot_prob = 0.8,
sidevar_num = 10,
traintest_num = 10,
partition_trainfrac = 0.6
)
Arguments
data |
must be a data.frame object. Cannot contain NA values. |
signal |
is a character object with names of columns of dataRaw to be treated as channel's input. |
response |
is a character vector with names of columns of dataRaw to be treated as channel's output |
side_variables |
(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included |
pinput |
is a numeric vector with prior probabilities of the input values. Uniform distribution is assumed as default (pinput=NULL). |
lr_maxit |
is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000. |
MaxNWts |
is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000. |
formula_string |
(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users. |
TestingSeed |
is the seed for random number generator used in testing procedures |
testing_cores |
- number of cores to be used in parallel computing (via doParallel package) |
boot_num |
is the number of bootstrap tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
boot_prob |
is the proportion of initial size of data to be used in bootstrap |
sidevar_num |
is the number of re-shuffling tests of side variables to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
traintest_num |
is the number of overfitting tests to be performed. Default is 10, but it is recommended to use at least 50 for reliable estimates. |
partition_trainfrac |
is the fraction of data to be used as a training dataset |
Details
If side variables are added within the analysis (side_variables is not NULL), two additional procedures are carried out: reshuffling test and reshuffling with bootstrap test, which are based on permutation of side variables values within the dataset. Additional parameters: lr_maxit and MaxNWts are the same as in definition of multinom function from nnet package. An alternative model formula (using formula_string arguments) should be provided if data are not suitable for description by logistic regression (recommended only for advanced users).
Value
a list with elements:
output$bootstrap - bootstrap test
output$traintest - overfitting test
output$reshuffling_sideVar - (if side_variables is not NULL) re-shuffling test
output$bootstrap_Reshuffling_sideVar - (if side_variables is not NULL) re-shuffling test with a bootstrap
Each of the above is a list, where an element is a standard output of a single mi_logreg_algorithm run.
References
[1] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.
Examples
## Compute uncertainty of mutual information estimator using 1 core
## Set boot_num and traintest_num with larger numbers for more reliable testing
tempdata=data_example1
output=mi_logreg_testing(data=tempdata,
signal = "signal",
response = "response",
testing_cores = 1,boot_num=1,traintest_num=1)
Plotting output of capacity estimation and additional exploratory graphs.
Description
INPUT:
Usage
output_graphs_main(
data,
signal,
response,
side_variables,
cc_output,
output_path,
height = 4,
width = 6
)
Arguments
data |
is a data.frame object |
signal |
is a character object that indicates columns of data that should be treated as channel's input |
response |
is a character vector that indicates columns of data that should be treated as channel's output |
side_variables |
is a character vector that indicates side variables' columns of data |
cc_output |
a list that is a standard output of capacity_logreg_algorithm function |
output_path |
character giving the directory, where graphs should be saved |
height |
integer indicating the height of a single plot |
width |
integer indicating the width of a single plot |
Value
A list with ggplot or gtable object. Each plot is also saved in 'output_path' directory in separate pdfs files which include:
MainPlot.pdf - a simple summary plot with basic distribution visualization and capacity estimate
MainPlot_full.pdf - a comprehensive summary plot with distribution visualization and capacity estimate
capacity.pdf - a diagram presenting the capacity estimates
io_relation.pdf - a graph with input-output relation
kdensities.pdf - kernel density estimator of data distribution
histograms.pdf - histograms of data
boxplots.pdf - boxplots of data
violin.pdf - violin plots of data
Calculates Probability of pairwise discrimination
Description
Estimates probabilities of correct discrimination (PCDs) between each pair of input/signal values using a logistic regression model.
Usage
prob_discr_pairwise(
dataRaw,
signal = "input",
response = NULL,
side_variables = NULL,
formula_string = NULL,
output_path = NULL,
scale = TRUE,
lr_maxit = 1000,
MaxNWts = 5000,
diagnostics = TRUE
)
Arguments
dataRaw |
must be a data.frame object |
signal |
is a character object with names of columns of dataRaw to be treated as channel's input. |
response |
is a character vector with names of columns of dataRaw to be treated as channel's output |
side_variables |
(optional) is a character vector that indicates side variables' columns of data, if NULL no side variables are included |
formula_string |
(optional) is a character object that includes a formula syntax to use in logistic regression model. If NULL, a standard additive model of response variables is assumed. Only for advanced users. |
output_path |
is a directory where a pie chart with calculated probabilities will be saved. If NULL, the graph will not be created. |
scale |
is a logical indicating if the response variables should be scaled and centered before fitting logistic regression |
lr_maxit |
is a maximum number of iteration of fitting algorithm of logistic regression. Default is 1000. |
MaxNWts |
is a maximum acceptable number of weights in logistic regression algorithm. Default is 5000. |
diagnostics |
is a logical indicating if details of logistic regression fitting should be included in output list |
Details
In order to estimate PCDs, for a given pair of input values x_i
and x_j
, we propose to fit a logistic regression model
using response data corresponding to the two considered inputs, i.e. y^l_u
, for l\in\{i,j\}
and u
ranging from
1 to n_l
.
To ensure that both inputs have equal contribution to the calculated discriminability, equal probabilities should be assigned,
P(X) = (P(x_i),P(x_j))=(1/2,1/2)
. Once the regression model is fitted, probability of assigning a given cellular response,
y
,
to the correct input value is estimated as
\max \{ \hat{P}_{lr}(x_i|Y=y;P(X)), \hat{P}_{lr}(x_j|Y=y;P(X))\}.
Note that P(x_j|Y=y)=1-P(x_i|Y=y)
as well as \hat{P}_{lr}(x_j|Y=y;P(X))=1-\hat{P}_{lr}(x_i|Y=y;P(X))
The average of the above probabilities over all observations y^i_l
yields PCDs
PCD_{x_i,x_j}=\frac{1}{2}\frac{1}{n_i}\sum_{l=1}^{n_i}\max\{ \hat{P}_{lr}(x_i|Y=y_i^l;P(X)),\hat{P}_{lr}(x_i^l|Y=y;P(X))\} +
\frac{1}{2} \frac{1}{n_j} \sum_{l=1}^{n_j} \max \{ \hat{P}_{lr}(x_i|Y=y_j^l;P(X)), \hat{P}_{lr}(x_j|Y=y_j^l;P(X))\}.
Additional parameters: lr_maxit and maxNWts are the same as in definition of multinom function from nnet package. An alternative model formula (using formula_string arguments) should be provided if data are not suitable for description by logistic regression (recommended only for advanced users). Preliminary scaling of data (argument scale) should be used similarly as in other data-driven approaches, e.g. if response variables are comparable, scaling (scale=FALSE) can be omitted, while if they represent different phenomenon (varying by units and/or magnitude) scaling is recommended.
Value
a list with two elements:
output$prob_matr - a
n\times n
matrix, wheren
is the number of inputs, with probabilities of correct discrimination between pairs of input values.output$diagnostics - (if diagnostics=TRUE) a list corresponding to logistic regression models fitted for each pair of input values. Each element consists of three sub-elements: 1) nnet_model - nnet object summarising logistic regression model; 2) prob_lr - probabilities of assignment obtained from logistic regression model; 3) confusion_matrix - confusion matrix of classificator.
References
[1] Jetka T, Nienaltowski K, Winarski T, Blonski S, Komorowski M, Information-theoretic analysis of multivariate single-cell signaling responses using SLEMI, PLoS Comput Biol, 15(7): e1007132, 2019, https://doi.org/10.1371/journal.pcbi.1007132.
Examples
## Calculate probabilities of discrimination for nfkb dataset
it=21 # choose from 0, 3, 6, ..., 120 for measurements at other time points
output=prob_discr_pairwise(dataRaw=data_nfkb[data_nfkb$signal%in%c("0ng","1ng","100ng"),],
signal = "signal",
response = paste0("response_",it))
Sampling procedures used for testing capacity algorithm
Description
Internal, auxiliary functions
Usage
sampling_bootstrap(data, prob = 1, dataDiv)
sampling_shuffle(data, side_variables)
sampling_partition(data, dataDiv, partition_trainfrac)
Arguments
data |
is a data.frame to be resampled |
prob |
is numeric for the portion of data that should be sampled from the whole dataset (only in sampling_bootstrap) |
dataDiv |
a character indicating column of data, with respect to which, data should be split before bootstrap |
side_variables |
is a vector of characters indicating columns of data the will be reshuffled (only in sampling_shuffle) |
partition_trainfrac |
is a numeric for the portion of data that will be used as a training and testing datasets |
Details
These function allow to re-sample, bootstrap and divide initial dataset
Value
Function sampling_bootstrap returns a data.frame with the same structure as initial data object, but with prob proportion of observations for each dataDiv level. Function sampling_shuffle returns a data.frame with the same structure as initial data object with shuffled values of columns given in side_variables argument. Function sampling_partition returns a list of two data.frame objects - train and test that has the same structure as initial data argument with partition_trainfrac and 1-partition_trainfrac observations, respectively.
Examples
data=data_example1
dataBootstrap = SLEMI:::sampling_bootstrap(data=data,prob=0.8,data$signal)
dataShuffle = SLEMI:::sampling_shuffle(data=data,"sideVar")
dataTrainTest = SLEMI:::sampling_partition(data=data,dataDiv=data$signal,partition_trainfrac=0.6)