Type: Package
Title: Decision Forest
Version: 0.4.2
Date: 2017-11-28
Author: Leihong Wu <leihong.wu@fda.hhs.gov>, Weida Tong <weida.tong@fda.hhs.gov>
Maintainer: Leihong Wu <leihong.wu@fda.hhs.gov>
Depends: R (≥ 3.0)
Imports: rpart, ggplot2, methods, stats
Description: Provides an R implementation of the Decision Forest algorithm, which combines the predictions of multiple independent decision tree models into a consensus decision. In particular, Decision Forest is a novel pattern-recognition method that can be used to analyze: (1) DNA microarray data; (2) Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS) data; and (3) Structure-Activity Relationship (SAR) data. Three fundamental functions are provided: (1) DF_train, (2) DF_pred, and (3) DF_CV. Run Dforest() to see more instructions. Weida Tong (2003) <doi:10.1021/ci020058s>.
License: GPL-2
LazyLoad: yes
LazyData: yes
Encoding: UTF-8
NeedsCompilation: no
RoxygenNote: 6.0.1
Packaged: 2017-11-28 19:17:32 UTC; Leihong.Wu
Repository: CRAN
Date/Publication: 2017-11-28 22:03:57 UTC
Construct Decision Tree model with pruning
Description
Construct Decision Tree model with pruning
Usage
Con_DT(X, Y, min_split = 10, cp = 0.01)
Arguments
X: dataset
Y: data labels
min_split: minimum number of samples in each leaf node
cp: pre-defined complexity parameter (cp) passed to the rpart program
Value
Decision tree model with pruning, implemented via rpart
See Also
rpart
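Examples
A minimal sketch (an added illustration, assuming Con_DT() accepts a data frame X and a label vector Y as described above):
## build a single pruned decision tree on the iris data
X = iris[,1:4]
Y = iris[,5]
tree_model = Con_DT(X, Y, min_split = 10, cp = 0.01)
print(tree_model)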
Decision Forest algorithm: Model training with Cross-validation
Description
Decision Forest algorithm: model training with cross-validation. The default is 5-fold cross-validation.
Usage
DF_CV(X, Y, stop_step = 10, CV_fold = 5, Max_tree = 20, min_split = 10,
cp = 0.1, Filter = F, p_val = 0.05, Method = "bACC", Quiet = T,
Grace_val = 0.05, imp_accu_val = 0.01, imp_accu_criteria = F)
Arguments
X: Training dataset
Y: Training data endpoint
stop_step: how many extra steps are processed when performance does not improve; 1 means one extra step
CV_fold: fold of cross-validation (default = 5)
Max_tree: maximum number of trees in the forest
min_split: minimum number of samples in each leaf node
cp: complexity parameter for pruning decision trees; default is 0.1
Filter: if TRUE, perform feature selection before training
p_val: p-value threshold of the t-test used in feature selection; default is 0.05
Method: metric used to evaluate the training process; "MIS": misclassification rate; "ACC": accuracy; "bACC" (default): balanced accuracy
Quiet: if TRUE (default), do not show any messages during the process
Grace_val: grace value in evaluation: the next model's performance (accuracy, bACC, MCC) may be worse than the previous model's by at most this threshold
imp_accu_val: improvement value in evaluation: adding a new tree should improve the overall model performance (accuracy, bACC, MCC) by at least this threshold
imp_accu_criteria: if TRUE, the model must show improvement in accumulated accuracy
Value
.$performance: overall training accuracy (cross-validation)
.$pred: detailed training predictions (cross-validation)
.$detail: detailed usage of decision tree features/models and their performance in all CV folds
.$Method: the evaluation method used in training (passed through)
.$cp: the cp value used in training the decision trees (passed through)
Examples
##data(iris)
X = iris[,1:4]
Y = iris[,5]
names(Y)=rownames(X)
random_seq=sample(nrow(X))
split_rate=3
split_sample = suppressWarnings(split(random_seq,1:split_rate))
Train_X = X[-random_seq[split_sample[[1]]],]
Train_Y = Y[-random_seq[split_sample[[1]]]]
CV_result = DF_CV(Train_X, Train_Y)
Output summary for Dforest cross-validation results
Description
Draw plot for Dforest Cross-validation results
Usage
DF_CVsummary(CV_result, plot = T)
Arguments
CV_result: cross-validation results returned by DF_CV()
plot: if TRUE (default), draw the plot
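Examples
A minimal sketch (an added illustration, assuming CV_result is the list returned by DF_CV()):
X = iris[,1:4]
Y = factor(iris[,5])
CV_result = DF_CV(X, Y)
DF_CVsummary(CV_result, plot = TRUE)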
Decision Forest algorithm: confidence level accumulated plot
Description
Draw accuracy curve according to the confidence level of predictions
Usage
DF_ConfPlot(Pred_result, Label, bin = 20, plot = T, smooth = F)
Arguments
Pred_result: predictions
Label: known labels for the test dataset
bin: number of bins in the confidence plot (default is 20)
plot: if TRUE, draw the plot; otherwise return the data matrix
smooth: if TRUE, fit the performance curve with a smoothing function (via ggplot2)
Value
ACC_Conf: data matrix ("ConfidenceLevel", "Accuracy", "Matched Samples") for the confidence plot, returned when plot is FALSE
ConfPlot: the confidence plot, drawn when plot is TRUE (requires ggplot2)
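Examples
A minimal sketch (an added illustration, assuming Pred_result is the prediction object returned by DF_pred()):
used_model = DF_train(iris[,1:4], factor(iris[,5]))
Pred_result = DF_pred(used_model, iris[,1:4], factor(iris[,5]))
DF_ConfPlot(Pred_result, factor(iris[,5]), bin = 20, plot = TRUE)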
Decision Forest algorithm: confidence level accumulated plot (accumulated version)
Description
Draw accuracy curve according to the confidence level of predictions
Usage
DF_ConfPlot_accu(Pred_result, Label, bin = 20, plot = T, smooth = F)
Arguments
Pred_result: predictions
Label: known labels for the test dataset
bin: number of bins in the confidence plot (default is 20)
plot: if TRUE, draw the plot; otherwise return the data matrix
smooth: if TRUE, fit the performance curve with a smoothing function (via ggplot2)
Value
ACC_Conf: data matrix ("ConfidenceLevel", "Accuracy", "Matched Samples") for the confidence plot, returned when plot is FALSE
ConfPlot: the confidence plot, drawn when plot is TRUE (requires ggplot2)
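Examples
A minimal sketch (an added illustration, mirroring the DF_ConfPlot() example above with the accumulated variant):
used_model = DF_train(iris[,1:4], factor(iris[,5]))
Pred_result = DF_pred(used_model, iris[,1:4], factor(iris[,5]))
DF_ConfPlot_accu(Pred_result, factor(iris[,5]), bin = 20, plot = TRUE)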
Output summary for Dforest training results
Description
Draw plot for Dforest training results
Usage
DF_Trainsummary(used_model, plot = T)
Arguments
used_model: training result returned by DF_train()
plot: if TRUE (default), draw the plot
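Examples
A minimal sketch (an added illustration, assuming used_model is the training result returned by DF_train()):
used_model = DF_train(iris[,1:4], factor(iris[,5]))
DF_Trainsummary(used_model, plot = TRUE)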
Performance evaluation from Decision Tree Predictions
Description
Performance evaluation from Decision Tree Predictions
Usage
DF_acc(pred, label)
Arguments
pred: predictions
label: known endpoint
Value
result$ACC: prediction accuracy
result$MIS: misclassification counts
result$MCC: Matthews correlation coefficient
result$bACC: balanced accuracy
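Examples
A minimal sketch (an added illustration, assuming DF_acc() accepts the prediction matrix returned by Pred_DT()):
tree = Con_DT(iris[,1:4], iris[,5])
pred = Pred_DT(tree, iris[,1:4])
result = DF_acc(pred, iris[,5])
result$ACC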
T-test for feature selection
Description
T-test for feature selection
Usage
DF_calp(X, Y)
Arguments
X: X variable matrix
Y: Y labels
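Examples
A minimal sketch for a two-class problem (an added illustration; the assumption that DF_calp() returns one t-test p-value per feature column is not stated in the manual):
X = iris[iris[,5]!="setosa",1:4]
Y = factor(iris[iris[,5]!="setosa",5])
p_values = DF_calp(X, Y)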
Decision Forest algorithm: Feature Selection in pre-processing
Description
Decision Forest algorithm: feature selection for two-class predictions, keeping statistically significant features that pass the t-test
Usage
DF_dataFs(X, Y, p_val = 0.05)
Arguments
X: Training dataset
Y: Training labels
p_val: p-value threshold of the t-test used in feature selection; default is 0.05
Value
Keep_feat: qualified features in data matrix after filtering
Examples
##data(iris)
X = iris[iris[,5]!="setosa",1:4]
Y = iris[iris[,5]!="setosa",5]
used_feat = DF_dataFs(X, Y)
Decision Forest algorithm: Data pre-processing
Description
Decision Forest algorithm: data pre-processing; removes all-zero columns/features and highly correlated features
Usage
DF_dataPre(X, thres = 0.95)
Arguments
X: Training dataset
thres: correlation coefficient threshold used to filter out highly correlated features; default is 0.95
Value
Keep_feat: qualified features in data matrix after filtering
Examples
##data(iris)
X = iris[,1:4]
Keep_feat = DF_dataPre(X)
Simple pre-defined pipeline for Decision forest
Description
This is a pre-defined Decision Forest pipeline script for easy use.
Usage
DF_easy(Train_X, Train_Y, Test_X, Test_Y, mode = "default")
Arguments
Train_X: Training dataset
Train_Y: Training data endpoint
Test_X: Testing dataset
Test_Y: Testing data endpoint
mode: pre-defined modeling mode (default: "default")
Value
data_matrix: training and testing results
Examples
# data(demo_simple)
X = iris[,1:4]
Y = iris[,5]
names(Y)=rownames(X)
random_seq=sample(nrow(X))
split_rate=3
split_sample = suppressWarnings(split(random_seq,1:split_rate))
Train_X = X[-random_seq[split_sample[[1]]],]
Train_Y = Y[-random_seq[split_sample[[1]]]]
Test_X = X[random_seq[split_sample[[1]]],]
Test_Y = Y[random_seq[split_sample[[1]]]]
Result = DF_easy(Train_X, Train_Y, Test_X, Test_Y)
Performance evaluation between two factors
Description
Performance evaluation between two factors
Usage
DF_perf(pred, label)
Arguments
pred: predictions
label: known endpoint
Value
result$ACC: prediction accuracy
result$MIS: misclassification counts
result$MCC: Matthews correlation coefficient
result$bACC: balanced accuracy
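Examples
A minimal sketch with two hand-made factors (an added illustration; the values are arbitrary):
pred  = factor(c("pos","pos","neg","pos","neg"))
label = factor(c("pos","neg","neg","pos","neg"))
result = DF_perf(pred, label)
result$MCC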
Decision Forest algorithm: Model prediction
Description
Decision Forest algorithm: model prediction with constructed DF models. DT_models is a list of decision tree models (rpart objects) generated by DF_train(). Note that DF_CV() is designed only for cross-validation and does not generate models.
Usage
DF_pred(DT_models, X, Y = NULL)
Arguments
DT_models: constructed DF models
X: Test dataset
Y: Test data endpoint (optional; default is NULL)
Value
.$accuracy: Overall test accuracy
.$predictions: Detailed test prediction
Examples
# data(demo_simple)
X = data_dili$X
Y = data_dili$Y
names(Y)=rownames(X)
random_seq=sample(nrow(X))
split_rate=3
split_sample = suppressWarnings(split(random_seq,1:split_rate))
Train_X = X[-random_seq[split_sample[[1]]],]
Train_Y = Y[-random_seq[split_sample[[1]]]]
Test_X = X[random_seq[split_sample[[1]]],]
Test_Y = Y[random_seq[split_sample[[1]]]]
used_model = DF_train(Train_X, Train_Y)
Pred_result = DF_pred(used_model,Test_X,Test_Y)
Decision Forest algorithm: Model training
Description
Decision Forest algorithm: Model training
Usage
DF_train(X, Y, stop_step = 5, Max_tree = 20, min_split = 10, cp = 0.1,
Filter = F, p_val = 0.05, Method = "bACC", Quiet = T,
Grace_val = 0.05, imp_accu_val = 0.01, imp_accu_criteria = F)
Arguments
X: Training dataset
Y: Training data endpoint
stop_step: how many extra steps are processed when performance does not improve; 1 means one extra step
Max_tree: maximum number of trees in the forest
min_split: minimum number of samples in each leaf node
cp: complexity parameter for pruning decision trees; default is 0.1
Filter: if TRUE, perform feature selection before training
p_val: p-value threshold of the t-test used in feature selection; default is 0.05
Method: metric used to evaluate the training process; "MIS": misclassification rate; "ACC": accuracy; "bACC" (default): balanced accuracy
Quiet: if TRUE (default), do not show any messages during the process
Grace_val: grace value in evaluation: the next model's performance (accuracy, bACC, MCC) may be worse than the previous model's by at most this threshold
imp_accu_val: improvement value in evaluation: adding a new tree should improve the overall model performance (accuracy, bACC, MCC) by at least this threshold
imp_accu_criteria: if TRUE, the model must show improvement in accumulated accuracy
Value
.$accuracy: overall training accuracy
.$pred: detailed training predictions (fitting)
.$detail: detailed usage of decision tree features/models and their performance
.$models: list of constructed decision tree models
.$Method: the evaluation method used in training (passed through)
.$cp: the cp value used in training the decision trees (passed through)
Examples
##data(iris)
X = iris[,1:4]
Y = iris[,5]
names(Y)=rownames(X)
used_model = DF_train(X,factor(Y))
Demo script to learn the Decision Forest package. Demo data are located in the data/ folder.
Description
Demo script to learn the Decision Forest package. Demo data are located in the data/ folder.
Usage
Dforest()
Author(s)
Leihong.Wu
Examples
Dforest()
Doing Prediction with Decision Tree model
Description
Doing Prediction with Decision Tree model
Usage
Pred_DT(model, X)
Arguments
model: Decision tree model
X: dataset
Value
Decision tree predictions; different endpoints are presented in multiple columns
Source
rpart
See Also
rpart
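Examples
A minimal sketch (an added illustration, chaining Con_DT() with Pred_DT() on the iris data):
tree = Con_DT(iris[,1:4], iris[,5])
pred = Pred_DT(tree, iris[,1:4])
head(pred)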
Performance evaluation from the results of other modeling algorithms
Description
Performance evaluation from the results of other modeling algorithms
Usage
cal_MCC(pred, label)
Arguments
pred: predictions
label: known endpoint
Value
result$ACC: prediction accuracy
result$MIS: misclassification counts
result$MCC: Matthews correlation coefficient
result$bACC: balanced accuracy
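Examples
A minimal sketch with predictions from any external model (an added illustration; the values are arbitrary):
pred  = factor(sample(c(0,1), 20, replace = TRUE))
label = factor(sample(c(0,1), 20, replace = TRUE))
result = cal_MCC(pred, label)
result$MCC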
QSAR dataset with DILI endpoint for demo
Description
This data set gives the DILI endpoint (Most-DILI-concern or No-DILI-concern) of various compounds, together with QSAR descriptors generated by MOLD2
Usage
data_dili
Format
A list containing two elements: X, a matrix of 958 observations and 777 variables; Y, the DILI endpoints of the 958 observations
Source
In-house data
References
Minjun Chen (2011). FDA-approved drug labeling for the study of drug-induced liver injury. Drug Discovery Today.
multiplot
Description
Multiple plot function
If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE), then plot 1 will go in the upper left, 2 will go in the upper right, and 3 will go all the way across the bottom.
Usage
multiplot(..., plotlist = NULL, cols = 1, layout = NULL)
Arguments
...: ggplot objects
plotlist: a list of ggplot objects
cols: number of columns in the layout
layout: a matrix specifying the layout; if present, 'cols' is ignored
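Examples
A minimal sketch (an added illustration, placing two ggplot objects side by side):
library(ggplot2)
p1 = ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point()
p2 = ggplot(iris, aes(Petal.Length, Petal.Width)) + geom_point()
multiplot(p1, p2, cols = 2)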