Title: | Model Selection Based on Machine Learning (ML) |
Version: | 1.0.0.1 |
Description: | Model evaluation based on a modified version of the recursive feature elimination algorithm. This package is designed to determine the optimal model(s) by leveraging all available features. |
License: | GPL (≥ 3) |
URL: | https://github.com/mommy003/MSML |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.1 |
Depends: | R (≥ 2.10) |
Imports: | r2redux, R2ROC |
LazyData: | true |
NeedsCompilation: | no |
Packaged: | 2024-03-04 05:04:48 UTC; cvasu |
Author: | Hong Lee [aut, cph], Moksedul Momin [aut, cre, cph] |
Maintainer: | Moksedul Momin <cvasu.momin@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-03-04 05:20:02 UTC |
3 sets of covariates for training data set
Description
A dataset containing N sets of covariates (N=3 in this example) that are held constant across all model configurations (see the 'model_configuration2' function) when using a training dataset. If constant covariates are not required, this file is unnecessary (see the 'model_configuration' function).
Usage
cov_train
Format
A data frame for training dataset:
- V1
covariate 1
- V2
covariate 2
- V3
covariate 3
3 sets of covariates for validation data set
Description
A dataset containing N sets of covariates (N=3 in this example) that are held constant across all model configurations (see the 'model_configuration2' function) when using a validation dataset. If constant covariates are not required, this file is unnecessary (see the 'model_configuration' function).
Usage
cov_valid
Format
A data frame for validation dataset:
- V1
covariate 1
- V2
covariate 2
- V3
covariate 3
7 sets of PRSs for test dataset and target phenotype
Description
A dataset containing 7 sets of polygenic risk scores (PRSs) and the target phenotype for the test dataset.
Usage
data_test
Format
A data frame for test dataset:
- V1
Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics)
- V2
Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics)
- V3
Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics)
- V4
Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics)
- V5
Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics)
- V6
Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics)
- V7
Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics)
- phenotype
Phenotypic values
7 sets of PRSs for training data set and target phenotype
Description
A dataset containing 7 sets of PRSs and the target phenotype for the training dataset.
Usage
data_train
Format
A data frame for training dataset:
- V1
Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics)
- V2
Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics)
- V3
Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics)
- V4
Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics)
- V5
Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics)
- V6
Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics)
- V7
Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics)
- phenotype
Phenotypic values
7 sets of PRSs for validation dataset and target phenotype
Description
A dataset containing 7 sets of PRSs and the target phenotype for the validation dataset.
Usage
data_valid
Format
A data frame for validation dataset:
- V1
Feature 1 (or PRS1 constructed using the first subset of SNPs from GWAS summary statistics)
- V2
Feature 2 (or PRS2 constructed using the second subset of SNPs from GWAS summary statistics)
- V3
Feature 3 (or PRS3 constructed using the third subset of SNPs from GWAS summary statistics)
- V4
Feature 4 (or PRS4 constructed using the fourth subset of SNPs from GWAS summary statistics)
- V5
Feature 5 (or PRS5 constructed using the fifth subset of SNPs from GWAS summary statistics)
- V6
Feature 6 (or PRS6 constructed using the sixth subset of SNPs from GWAS summary statistics)
- V7
Feature 7 (or PRS7 constructed using the seventh subset of SNPs from GWAS summary statistics)
- phenotype
Phenotypic values
model_configuration function
Description
This function generates predicted values for the validation dataset by applying optimal weights to features, which were estimated in the training dataset for each model configuration. The total number of model configurations is the sum of the binomial coefficients C(n, k) ('n choose k') for k = 1 to n, where 'n' denotes the total number of features and 'k' the number of features included in a model. For example, with n=7, the total number of model configurations is 127 (that is, 2^7 - 1).
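The count described above can be checked directly in base R. The following is a minimal sketch, not code from the package: the variable names are illustrative, and the order in which the subsets are enumerated here is an assumption that need not match the order used by model_configuration.

```r
# Count the model configurations for n features: sum over k = 1..n of C(n, k).
n <- 7
per_size <- choose(n, 1:n)   # number of models that use exactly k features
total <- sum(per_size)       # the same value as 2^n - 1

# Enumerate the feature subsets themselves (as column indices 1..n).
subsets <- unlist(
  lapply(1:n, function(k) combn(n, k, simplify = FALSE)),
  recursive = FALSE
)

print(per_size)
print(total)
print(length(subsets))
```

Each element of subsets is an integer vector naming the features used by one configuration, so its length equals the total number of configurations.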
Usage
model_configuration(data_train, data_valid, mv, model = "lm")
Arguments
data_train |
A data frame containing the training dataset, in matrix format |
data_valid |
A data frame containing the validation dataset, in matrix format |
mv |
The total number of columns in data_train/data_valid |
model |
This is the type of model (e.g. lm (default) or glm) |
Value
This function returns 'predict_validation', the predicted values for the validation dataset under every model configuration, together with 'total_model_configurations'.
Examples
data_train <- data_train
data_valid <- data_valid
mv <- 8
out <- model_configuration(data_train, data_valid, mv, model = "lm")
# This process produces predicted values for the validation dataset,
# corresponding to each model configuration trained on the training dataset.
# The outcome of this function yields variables named 'predict_validation'
# and 'total_model_configurations'.
# To print the outcomes, run out$predict_validation and out$total_model_configurations.
# For details, see https://github.com/mommy003/MSML.
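To make the idea of "one fitted model per feature subset" concrete, here is a minimal base-R sketch of the procedure described for this function. It is an illustration under stated assumptions, not the package's implementation: the toy data, the helper make_data, and the object predict_sketch are all hypothetical, and only three features are used to keep the output small.

```r
# Toy data standing in for data_train / data_valid (hypothetical values).
set.seed(1)
make_data <- function(rows, n_feat = 3) {
  x <- as.data.frame(matrix(rnorm(rows * n_feat), ncol = n_feat))  # columns V1..V3
  x$phenotype <- rowSums(x) + rnorm(rows)
  x
}
train <- make_data(200)
valid <- make_data(100)

# All non-empty subsets of the feature columns.
features <- setdiff(names(train), "phenotype")
subsets <- unlist(
  lapply(seq_along(features), function(k) combn(features, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each configuration on the training data, then predict the validation data.
pred <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "phenotype"), data = train)
  predict(fit, newdata = valid)
})

# First column holds the observed phenotype, followed by one column per model.
predict_sketch <- data.frame(phenotype = valid$phenotype, pred)
dim(predict_sketch)
```

With three features there are 2^3 - 1 = 7 configurations, so predict_sketch has one phenotype column plus seven prediction columns, mirroring the layout documented for predict_validation.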
model_configuration2 function
Description
This function is similar to the model_configuration function, with the added capability to keep a set of variables constant across models during training and prediction (see the cov_train and cov_valid datasets). Additionally, users can choose between linear and logistic regression models.
Usage
model_configuration2(
data_train,
data_valid,
mv,
cov_train,
cov_valid,
model = "lm"
)
Arguments
data_train |
A data frame containing the training dataset, in matrix format |
data_valid |
A data frame containing the validation dataset, in matrix format |
mv |
The total number of columns in data_train/data_valid |
cov_train |
A data frame containing the covariates for the training dataset, in matrix format |
cov_valid |
A data frame containing the covariates for the validation dataset, in matrix format |
model |
This is the type of model (e.g. lm (default) or glm (logistic regression)) |
Value
This function returns 'predict_validation', the predicted values for the validation dataset under every model configuration, together with 'total_model_configurations'.
Examples
data_train <- data_train
data_valid <- data_valid
mv <- 8
cov_train <- cov_train
cov_valid <- cov_valid
out <- model_configuration2(data_train, data_valid, mv, cov_train, cov_valid, model = "lm")
# This process produces predicted values for the validation dataset,
# corresponding to each model configuration trained on the training dataset.
# The outcome of this function yields variables named 'predict_validation'
# and 'total_model_configurations'.
# To print the outcomes, run out$predict_validation and out$total_model_configurations.
# For details, see https://github.com/mommy003/MSML.
# If a user intends to employ logistic regression without constant covariates,
# we advise preparing a covariate file where all values are set to 1.
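The advice about an all-ones covariate file can be sketched as follows. This is a minimal illustration with assumptions: the row counts n_train and n_valid are hypothetical placeholders (in practice they would be nrow(data_train) and nrow(data_valid)), and the choice of a single column named V1 is also an assumption, since the manual does not state how many columns such a file should have.

```r
# Hypothetical row counts; replace with nrow(data_train) and nrow(data_valid).
n_train <- 200
n_valid <- 100

# One covariate column in which every value is 1.
cov_train_ones <- data.frame(V1 = rep(1, n_train))
cov_valid_ones <- data.frame(V1 = rep(1, n_valid))

str(cov_train_ones)
str(cov_valid_ones)
```

These data frames would then take the place of cov_train and cov_valid in a call to model_configuration2 with model = "glm".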
model_evaluation function
Description
This function will identify the best model in the validation and test dataset.
Usage
model_evaluation(dat, mv, tn, prev, pthreshold = 0.05, method = "R2ROC")
Arguments
dat |
A data frame containing the predicted values for all model configurations, in matrix format |
mv |
The total number of columns in data_train/data_valid |
tn |
The total number of best models to be identified |
prev |
The prevalence of disease in the data |
pthreshold |
The p-value significance threshold used when comparing models (default 0.05) |
method |
The method used to evaluate models (e.g. R2ROC (default) or r2redux) |
Value
This function returns 'out_all', 'out_start', and 'out_selected', which hold the evaluation results for all models, for the top tn models, and for the best models, respectively.
Examples
dat <- predict_validation
mv <- 8
tn <- 15
prev <- 0.047
out <- model_evaluation(dat, mv, tn, prev)
# This process generates three outputs.
# out$out_all contains AUC, p values for AUC, R2, and p values for R2,
# respectively, for all models.
# out$out_start contains AUC, p values for AUC, R2, and p values for R2,
# respectively, for the top tn models.
# out$out_selected contains AUC, p values for AUC, R2, and p values for R2,
# respectively, for the best models. It also includes the selected features for those models.
# For details, see https://github.com/mommy003/MSML.
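As a rough intuition for what "top tn models" means, the sketch below ranks prediction columns by their squared correlation with the phenotype and keeps the best tn. This is a deliberate simplification and an assumption on our part: it uses only base R, does not call R2ROC or r2redux, and performs none of the p-value comparisons that model_evaluation reports. The toy data frame dat and its columns are hypothetical.

```r
set.seed(2)
# Toy stand-in for predict_validation: column 1 is the phenotype,
# the remaining columns are predictions from different model configurations.
y <- rnorm(150)
dat <- data.frame(
  V1 = y,
  V2 = y + rnorm(150, sd = 0.5),  # strongly predictive
  V3 = y + rnorm(150, sd = 1),    # moderately predictive
  V4 = rnorm(150)                 # unrelated to the phenotype
)

# Squared correlation between the phenotype and each prediction column.
r2 <- sapply(dat[-1], function(p) cor(dat$V1, p)^2)

# Keep the tn columns with the highest R2.
tn <- 2
top <- names(sort(r2, decreasing = TRUE))[seq_len(tn)]

print(round(r2, 3))
print(top)
```

The package additionally tests whether differences between models are significant at pthreshold, which this ranking alone does not capture.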
target phenotype and 127 sets of model configurations based on validation dataset
Description
A dataset containing the target phenotype and predicted values from 127 model configurations, based on the validation dataset.
Usage
predict_validation
Format
A data frame of predicted values for the target dataset from each model configuration:
- V1
Phenotypic values in target dataset
- V2
predicted values for target dataset from model configuration1
- V3
predicted values for target dataset from model configuration2
- V4
predicted values for target dataset from model configuration3
- V5
predicted values for target dataset from model configuration4
- V6
predicted values for target dataset from model configuration5
- V7
predicted values for target dataset from model configuration6
- V8
predicted values for target dataset from model configuration7
- V9
predicted values for target dataset from model configuration8
- V10
predicted values for target dataset from model configuration9
- V11
predicted values for target dataset from model configuration10
- V12
predicted values for target dataset from model configuration11
- V13
predicted values for target dataset from model configuration12
- V14
predicted values for target dataset from model configuration13
- V15
predicted values for target dataset from model configuration14
- V16
predicted values for target dataset from model configuration15
- V17
predicted values for target dataset from model configuration16
- V18
predicted values for target dataset from model configuration17
- V19
predicted values for target dataset from model configuration18
- V20
predicted values for target dataset from model configuration19
- V21
predicted values for target dataset from model configuration20
- V22
predicted values for target dataset from model configuration21
- V23
predicted values for target dataset from model configuration22
- V24
predicted values for target dataset from model configuration23
- V25
predicted values for target dataset from model configuration24
- V26
predicted values for target dataset from model configuration25
- V27
predicted values for target dataset from model configuration26
- V28
predicted values for target dataset from model configuration27
- V29
predicted values for target dataset from model configuration28
- V30
predicted values for target dataset from model configuration29
- V31
predicted values for target dataset from model configuration30
- V32
predicted values for target dataset from model configuration31
- V33
predicted values for target dataset from model configuration32
- V34
predicted values for target dataset from model configuration33
- V35
predicted values for target dataset from model configuration34
- V36
predicted values for target dataset from model configuration35
- V37
predicted values for target dataset from model configuration36
- V38
predicted values for target dataset from model configuration37
- V39
predicted values for target dataset from model configuration38
- V40
predicted values for target dataset from model configuration39
- V41
predicted values for target dataset from model configuration40
- V42
predicted values for target dataset from model configuration41
- V43
predicted values for target dataset from model configuration42
- V44
predicted values for target dataset from model configuration43
- V45
predicted values for target dataset from model configuration44
- V46
predicted values for target dataset from model configuration45
- V47
predicted values for target dataset from model configuration46
- V48
predicted values for target dataset from model configuration47
- V49
predicted values for target dataset from model configuration48
- V50
predicted values for target dataset from model configuration49
- V51
predicted values for target dataset from model configuration50
- V52
predicted values for target dataset from model configuration51
- V53
predicted values for target dataset from model configuration52
- V54
predicted values for target dataset from model configuration53
- V55
predicted values for target dataset from model configuration54
- V56
predicted values for target dataset from model configuration55
- V57
predicted values for target dataset from model configuration56
- V58
predicted values for target dataset from model configuration57
- V59
predicted values for target dataset from model configuration58
- V60
predicted values for target dataset from model configuration59
- V61
predicted values for target dataset from model configuration60
- V62
predicted values for target dataset from model configuration61
- V63
predicted values for target dataset from model configuration62
- V64
predicted values for target dataset from model configuration63
- V65
predicted values for target dataset from model configuration64
- V66
predicted values for target dataset from model configuration65
- V67
predicted values for target dataset from model configuration66
- V68
predicted values for target dataset from model configuration67
- V69
predicted values for target dataset from model configuration68
- V70
predicted values for target dataset from model configuration69
- V71
predicted values for target dataset from model configuration70
- V72
predicted values for target dataset from model configuration71
- V73
predicted values for target dataset from model configuration72
- V74
predicted values for target dataset from model configuration73
- V75
predicted values for target dataset from model configuration74
- V76
predicted values for target dataset from model configuration75
- V77
predicted values for target dataset from model configuration76
- V78
predicted values for target dataset from model configuration77
- V79
predicted values for target dataset from model configuration78
- V80
predicted values for target dataset from model configuration79
- V81
predicted values for target dataset from model configuration80
- V82
predicted values for target dataset from model configuration81
- V83
predicted values for target dataset from model configuration82
- V84
predicted values for target dataset from model configuration83
- V85
predicted values for target dataset from model configuration84
- V86
predicted values for target dataset from model configuration85
- V87
predicted values for target dataset from model configuration86
- V88
predicted values for target dataset from model configuration87
- V89
predicted values for target dataset from model configuration88
- V90
predicted values for target dataset from model configuration89
- V91
predicted values for target dataset from model configuration90
- V92
predicted values for target dataset from model configuration91
- V93
predicted values for target dataset from model configuration92
- V94
predicted values for target dataset from model configuration93
- V95
predicted values for target dataset from model configuration94
- V96
predicted values for target dataset from model configuration95
- V97
predicted values for target dataset from model configuration96
- V98
predicted values for target dataset from model configuration97
- V99
predicted values for target dataset from model configuration98
- V100
predicted values for target dataset from model configuration99
- V101
predicted values for target dataset from model configuration100
- V102
predicted values for target dataset from model configuration101
- V103
predicted values for target dataset from model configuration102
- V104
predicted values for target dataset from model configuration103
- V105
predicted values for target dataset from model configuration104
- V106
predicted values for target dataset from model configuration105
- V107
predicted values for target dataset from model configuration106
- V108
predicted values for target dataset from model configuration107
- V109
predicted values for target dataset from model configuration108
- V110
predicted values for target dataset from model configuration109
- V111
predicted values for target dataset from model configuration110
- V112
predicted values for target dataset from model configuration111
- V113
predicted values for target dataset from model configuration112
- V114
predicted values for target dataset from model configuration113
- V115
predicted values for target dataset from model configuration114
- V116
predicted values for target dataset from model configuration115
- V117
predicted values for target dataset from model configuration116
- V118
predicted values for target dataset from model configuration117
- V119
predicted values for target dataset from model configuration118
- V120
predicted values for target dataset from model configuration119
- V121
predicted values for target dataset from model configuration120
- V122
predicted values for target dataset from model configuration121
- V123
predicted values for target dataset from model configuration122
- V124
predicted values for target dataset from model configuration123
- V125
predicted values for target dataset from model configuration124
- V126
predicted values for target dataset from model configuration125
- V127
predicted values for target dataset from model configuration126
- V128
predicted values for target dataset from model configuration127