Type: | Package |
Title: | Gene Set Analysis Toolkit WebGestaltR |
Version: | 0.4.6 |
Date: | 2023-05-31 |
Description: | The web version WebGestalt https://www.webgestalt.org supports 12 organisms, 354 gene identifiers and 321,251 function categories. Users can upload the data and functional categories with their own gene identifiers. In addition to the Over-Representation Analysis, WebGestalt also supports Gene Set Enrichment Analysis and Network Topology Analysis. The user-friendly output report allows interactive and efficient exploration of enrichment results. The WebGestaltR package not only supports all of the above functions but can also be integrated into other pipelines or used to analyze multiple gene lists simultaneously. |
License: | LGPL-2 | LGPL-2.1 | LGPL-3 [expanded from: LGPL] |
URL: | https://github.com/bzhanglab/WebGestaltR |
LazyLoad: | yes |
Depends: | R (≥ 3.3) |
Imports: | methods, dplyr, doRNG, readr, parallel (≥ 3.3.2), doParallel (≥ 1.0.10), foreach (≥ 1.4.0), jsonlite, httr, rlang, svglite, igraph, whisker, apcluster, Rcpp |
NeedsCompilation: | yes |
LinkingTo: | Rcpp |
RoxygenNote: | 7.2.3 |
Packaged: | 2023-06-01 15:42:39 UTC; Yuxing |
Author: | Jing Wang [aut], Yuxing Liao [aut, cre], Eric Jaehnig [ctb], Zhiao Shi [ctb], Quanhu Sheng [ctb] |
Maintainer: | Yuxing Liao <yuxingliao@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2023-06-01 16:00:02 UTC |
WebGestaltR: The R interface for enrichment analysis with WebGestalt.
Description
Main function for enrichment analysis
Usage
WebGestaltR(
enrichMethod = "ORA",
organism = "hsapiens",
enrichDatabase = NULL,
enrichDatabaseFile = NULL,
enrichDatabaseType = NULL,
enrichDatabaseDescriptionFile = NULL,
interestGeneFile = NULL,
interestGene = NULL,
interestGeneType = NULL,
collapseMethod = "mean",
referenceGeneFile = NULL,
referenceGene = NULL,
referenceGeneType = NULL,
referenceSet = NULL,
minNum = 10,
maxNum = 500,
sigMethod = "fdr",
fdrMethod = "BH",
fdrThr = 0.05,
topThr = 10,
reportNum = 20,
perNum = 1000,
gseaP = 1,
isOutput = TRUE,
outputDirectory = getwd(),
projectName = NULL,
dagColor = "continuous",
saveRawGseaResult = FALSE,
gseaPlotFormat = c("png", "svg"),
setCoverNum = 10,
networkConstructionMethod = NULL,
neighborNum = 10,
highlightType = "Seeds",
highlightSeedNum = 10,
nThreads = 1,
cache = NULL,
hostName = "https://www.webgestalt.org/",
...
)
WebGestaltRBatch(
interestGeneFolder = NULL,
enrichMethod = "ORA",
isParallel = FALSE,
nThreads = 3,
...
)
Arguments
enrichMethod |
Enrichment methods: |
organism |
Currently, WebGestaltR supports 12 organisms. Users can use the function
|
enrichDatabase |
The functional categories for the enrichment analysis. Users can use
the function |
enrichDatabaseFile |
Users can provide one or more GMT files as the functional
category for enrichment analysis. The extension of the file should be |
enrichDatabaseType |
The ID type of the genes in the |
enrichDatabaseDescriptionFile |
Users can also provide description files for the custom
|
interestGeneFile |
If |
interestGene |
Users can also use an R object as the input. If |
interestGeneType |
The ID type of the interesting gene list. The supported ID types of
WebGestaltR for the selected organism can be found by the function |
collapseMethod |
The method to collapse duplicate IDs with scores. |
referenceGeneFile |
For the ORA method, the users need to upload the reference gene
list. The extension of the |
referenceGene |
For the ORA method, users can also use an R object as the reference
gene list. |
referenceGeneType |
The ID type of the reference gene list. The supported ID types
of WebGestaltR for the selected organism can be found by the function |
referenceSet |
Users can directly select the reference set from existing platforms in
WebGestaltR and do not need to provide the reference set through |
minNum |
WebGestaltR will exclude the categories with the number of annotated genes
less than |
maxNum |
WebGestaltR will exclude the categories with the number of annotated genes
larger than |
sigMethod |
Two methods of significance are available in WebGestaltR: |
fdrMethod |
For the ORA method, WebGestaltR supports five FDR methods: |
fdrThr |
The significant threshold for the |
topThr |
The threshold for the |
reportNum |
The number of enriched categories visualized in the final report. The default
is 20. |
perNum |
The number of permutations for the GSEA method. The default is 1000. |
gseaP |
The exponential scaling factor of the phenotype score. The default is 1. |
isOutput |
If |
outputDirectory |
The output directory for the results. |
projectName |
The name of the project. If |
dagColor |
If |
saveRawGseaResult |
Whether the raw result from GSEA is saved as a RDS file, which can be
used for plotting. Defaults to FALSE. |
gseaPlotFormat |
The graphic format of GSEA enrichment plots. Either |
setCoverNum |
The number of expected gene sets after set cover to reduce redundancy.
Fewer sets may be returned if the coverage reaches 100%. The default is 10. |
networkConstructionMethod |
Network construction method for NTA. Either
|
neighborNum |
The number of neighbors to include in NTA Network Expansion method. |
highlightType |
The type of nodes to highlight in the NTA Network Expansion method,
either |
highlightSeedNum |
The number of top input seeds to highlight in NTA Network Retrieval & Prioritization method. |
nThreads |
The number of cores to use for GSEA and set cover, and in batch function. |
cache |
A directory to save data cache for reuse. Defaults to |
hostName |
The server URL for accessing data. Mostly for development purposes. |
... |
In batch function, passes parameters to WebGestaltR function. Also handles backward compatibility for some parameters in old versions. |
interestGeneFolder |
Run WebGestaltR for gene list files in the folder. |
isParallel |
Whether jobs in the batch are run in parallel. |
Details
The WebGestaltR function can perform three enrichment analyses: ORA (Over-Representation Analysis), GSEA (Gene Set Enrichment Analysis) and NTA (Network Topology Analysis). Based on the user-uploaded gene list or gene list with scores, the WebGestaltR function first maps the gene list to entrez gene IDs and then summarizes the gene list based on GO (Gene Ontology) Slim. After performing the enrichment analysis, the WebGestaltR function also returns a user-friendly HTML report containing the GO Slim summary and the enrichment analysis result. If the functional categories have a DAG (directed acyclic graph) structure, or genes in the functional categories have a network structure, those relationships can also be visualized in the report.
Value
The WebGestaltR function returns a data frame containing the enrichment analysis
result and also outputs a user-friendly HTML report if isOutput is TRUE.
The columns in the data frame depend on the enrichMethod and are the following:
- geneSet
ID of the gene set.
- description
Description of the gene set if available.
- link
Link to the data source.
- size
The number of genes in the set after filtering by minNum and maxNum.
- overlap
The number of mapped input genes that are annotated in the gene set.
- expect
Expected number of input genes that are annotated in the gene set.
- enrichmentRatio
Enrichment ratio, overlap / expect.
- enrichmentScore
Enrichment score, the maximum running sum of scores for the ranked list.
- normalizedEnrichmentScore
Normalized enrichment score, normalized against the average enrichment score of all permutations.
- leadingEdgeNum
Number of genes/phosphosites in the leading edge.
- pValue
P-value from hypergeometric test for ORA. For GSEA, please refer to its original publication or online at https://software.broadinstitute.org/gsea/doc/GSEAUserGuideTEXT.htm.
- FDR
Corrected P-value for multiple testing with fdrMethod for ORA.
- overlapId
The gene/phosphosite IDs of overlap for ORA (entrez gene IDs or phosphosite sequence).
- leadingEdgeId
Genes/phosphosites in the leading edge in entrez gene ID or phosphosite sequence.
- userId
The gene/phosphosite IDs of overlap for ORA or leadingEdgeId for GSEA in user input IDs.
- plotPath
Path of the GSEA enrichment plot.
- database
Name of the source database if multiple enrichment databases are given.
- goId
In NTA, like geneSet, the enriched GO terms of genes in the returned subnetwork.
- interestGene
In NTA, the gene IDs in the subnetwork with 0/1 annotations indicating whether each is from user input.
The WebGestaltRBatch function returns a list of enrichment results.
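The ORA columns listed above can be reproduced by hand for a handful of gene sets. The following is a minimal sketch in base R, not the package's internal code. All counts are made up, and two things are assumptions for illustration: that expect equals the set size times the fraction of reference genes in the input list, and that sigMethod="top" ranks sets by FDR.

```r
# Hypothetical counts for three gene sets (illustrative only).
ora <- data.frame(
  geneSet = c("setA", "setB", "setC"),
  size    = c(40, 120, 15),   # annotated reference genes in the set
  overlap = c(8, 7, 1),       # input genes annotated in the set
  stringsAsFactors = FALSE
)
nReference <- 2000  # assumed number of annotated reference genes
nInterest  <- 100   # assumed number of annotated input genes

# Expected overlap under random sampling, and the enrichment ratio.
ora$expect <- ora$size * nInterest / nReference
ora$enrichmentRatio <- ora$overlap / ora$expect

# Upper-tail hypergeometric test: P(X >= overlap).
ora$pValue <- phyper(ora$overlap - 1, ora$size, nReference - ora$size,
                     nInterest, lower.tail = FALSE)

# Benjamini-Hochberg correction, corresponding to fdrMethod = "BH".
ora$FDR <- p.adjust(ora$pValue, method = "BH")

# Assumed selection rules: "fdr" keeps sets below fdrThr,
# "top" keeps the topThr best-ranked sets.
sigByFdr <- ora[ora$FDR < 0.05, ]
sigByTop <- head(ora[order(ora$FDR, ora$pValue), ], 2)
print(ora)
```

The hypergeometric call uses overlap - 1 because phyper with lower.tail = FALSE returns P(X > q), so shifting by one gives the probability of observing at least the given overlap.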
Examples
## Not run:
####### ORA example #########
geneFile <- system.file("extdata", "interestingGenes.txt", package="WebGestaltR")
refFile <- system.file("extdata", "referenceGenes.txt", package="WebGestaltR")
outputDirectory <- getwd()
enrichResult <- WebGestaltR(enrichMethod="ORA", organism="hsapiens",
enrichDatabase="pathway_KEGG", interestGeneFile=geneFile,
interestGeneType="genesymbol", referenceGeneFile=refFile,
referenceGeneType="genesymbol", isOutput=TRUE,
outputDirectory=outputDirectory, projectName=NULL)
####### GSEA example #########
rankFile <- system.file("extdata", "GeneRankList.rnk", package="WebGestaltR")
outputDirectory <- getwd()
enrichResult <- WebGestaltR(enrichMethod="GSEA", organism="hsapiens",
enrichDatabase="pathway_KEGG", interestGeneFile=rankFile,
interestGeneType="genesymbol", sigMethod="top", topThr=10, minNum=5,
outputDirectory=outputDirectory)
####### NTA example #########
enrichResult <- WebGestaltR(enrichMethod="NTA", organism="hsapiens",
enrichDatabase="network_PPI_BIOGRID", interestGeneFile=geneFile,
interestGeneType="genesymbol", sigMethod="top", topThr=10,
outputDirectory=getwd(), highlightSeedNum=10,
networkConstructionMethod="Network_Retrieval_Prioritization")
## End(Not run)
Affinity Propagation
Description
Use affinity propagation to cluster similar gene sets to reduce redundancy in report.
Usage
affinityPropagation(idsInSet, score)
Arguments
idsInSet |
A list of set names and their member IDs. |
score |
A vector of additive scores with the same length, used to assign input preference; a higher score has a larger weight, e.g. -logP. |
Value
A list of clusters and representatives for each cluster.
- clusters
A list of character vectors of set IDs in each cluster.
- representatives
A character vector of representatives for each cluster.
Author(s)
Zhiao Shi, Yuxing Liao
cacheUrl
Description
Get data from a URL or cache and optionally save in cache for reuse
Usage
cacheUrl(dataUrl, cache = NULL, query = NULL)
Arguments
dataUrl |
The URL of data |
cache |
The cache directory. Defaults to |
query |
The list of queries passed on to httr methods |
Value
response object from httr request
Create HTML Report for NTA
Description
Create HTML Report for NTA
Usage
createNtaReport(
networkName,
method,
sigMethod,
fdrThr,
topThr,
highlightType,
outputDirectory,
projectDir,
projectName,
hostName
)
createReport
Description
Generate HTML report for ORA and GSEA
Usage
createReport(
hostName,
outputDirectory,
organism = "hsapiens",
projectName,
enrichMethod,
geneSet,
geneSetDes,
geneSetDag,
geneSetNet,
interestingGeneMap,
referenceGeneList,
enrichedSig,
geneTables,
clusters,
background,
enrichDatabase = NULL,
enrichDatabaseFile = NULL,
enrichDatabaseType = NULL,
enrichDatabaseDescriptionFile = NULL,
interestGeneFile = NULL,
interestGene = NULL,
interestGeneType = NULL,
collapseMethod = "mean",
referenceGeneFile = NULL,
referenceGene = NULL,
referenceGeneType = NULL,
referenceSet = NULL,
minNum = 10,
maxNum = 500,
fdrMethod = "BH",
sigMethod = "fdr",
fdrThr = 0.05,
topThr = 10,
reportNum = 20,
perNum = 1000,
p = 1,
dagColor = "binary"
)
enrichResultSection
Description
Conditionally render template of main result section. Actual work is carried out in front end
Usage
enrichResultSection(
enrichMethod,
enrichedSig,
geneSet,
geneSetDes,
geneSetDag,
geneSetNet,
clusters
)
Expand enriched GO IDs to include ancestors up to the root
Description
Returns expanded nodes and DAG tree edges
Usage
expandDag(goTermList, dagEdgeList)
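As an illustration of what expanding enriched terms to their ancestors involves, here is a minimal base R sketch. It is not the package's implementation; the term IDs are hypothetical, and the parent/child column names are assumed to match the DAG edge list described for loadGeneSet.

```r
# Toy DAG edge list (hypothetical term IDs).
dagEdgeList <- data.frame(
  parent = c("root", "root", "T1", "T2", "T3", "T2"),
  child  = c("T1", "T2", "T3", "T3", "T4", "T5"),
  stringsAsFactors = FALSE
)
enriched <- c("T4")

# Walk upward from the enriched terms until no new ancestors appear.
nodes <- enriched
frontier <- enriched
while (length(frontier) > 0) {
  parents <- unique(dagEdgeList$parent[dagEdgeList$child %in% frontier])
  frontier <- setdiff(parents, nodes)
  nodes <- c(nodes, frontier)
}

# Keep only the edges that lead into a retained node.
edges <- dagEdgeList[dagEdgeList$child %in% nodes, ]
print(nodes)
print(edges)
```

Tracking visited nodes in nodes keeps the loop finite even when several paths lead to the same ancestor, as with T3 here.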
Fill relation data frame for GSEA input
Description
Fill 1 for gene in gene set
Usage
fillInputDataFrame(gmt, genes, geneSets)
Arguments
gmt |
A Data Frame with geneSet and gene columns from the GMT file |
genes |
A vector of genes |
geneSets |
A vector of gene sets |
Value
A Data Frame with the first column of gene and 1 or 0 for other columns of gene sets.
Author(s)
Yuxing Liao
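The shape of the returned relation data frame can be illustrated with a small base R sketch. This is an approximation written for this example rather than the function's actual code, and the gene and set names are hypothetical.

```r
# Long-format GMT data: one row per (gene set, gene) pair.
gmt <- data.frame(
  geneSet = c("setA", "setA", "setB", "setB"),
  gene    = c("g1", "g2", "g2", "g3"),
  stringsAsFactors = FALSE
)
genes <- c("g1", "g2", "g3", "g4")
geneSets <- c("setA", "setB")

# One column per gene set: 1 if the gene is a member, 0 otherwise.
membership <- vapply(geneSets, function(gs) {
  as.integer(genes %in% gmt$gene[gmt$geneSet == gs])
}, integer(length(genes)))

inputDf <- data.frame(gene = genes, membership, stringsAsFactors = FALSE)
print(inputDf)
```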
Check Format and Read Data
Description
Check Format and Read Data
Usage
formatCheck(dataType = "list", inputGeneFile = NULL, inputGene = NULL)
Arguments
dataType |
Type of data, either |
inputGeneFile |
The data file to be mapped. |
inputGene |
Or the input could be given as an R object.
GMT file should be read with |
Value
A list of data frame
GO Slim Summary
Description
Outputs a brief summary of input genes based on GO Slim data.
Usage
goSlimSummary(
organism = "hsapiens",
geneList,
outputFile,
outputType = "pdf",
isOutput = TRUE,
cache = NULL,
hostName = "https://www.webgestalt.org"
)
Arguments
organism |
Currently, WebGestaltR supports 12 organisms. Users can use the function
|
geneList |
A list of input genes. |
outputFile |
Output file name. |
outputType |
File format of the plot: |
isOutput |
Boolean indicating whether a plot is saved to |
cache |
A directory to save data cache for reuse. Defaults to |
hostName |
The server URL for accessing data. Mostly for development purposes. |
Value
A list of the summary result.
Permutation in GSEA algorithm
Description
Permutation in GSEA algorithm
Usage
gseaPermutation(inset_scores, outset_scores, expression_value)
Arguments
inset_scores |
Scaled score matrix for genes in sets |
outset_scores |
Normalized score matrix for genes not in sets |
expression_value |
Vector of gene rank scores |
Value
A vector of concatenated random minimum, maximum and best running sum scores for each set.
Author(s)
Yuxing Liao
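For orientation, the minimum, maximum and best running sum scores mentioned above can be computed for one observed (unpermuted) ranking as follows. This is a sketch of the standard weighted running-sum statistic in base R, written for this example under the assumption of 0/1 set membership; it is not the compiled routine the package uses, and the scores are made up.

```r
# Ranked scores (descending) and 0/1 membership for one gene set.
scores <- c(3.0, 2.0, 1.5, -0.5, -1.0, -2.5)
inSet  <- c(1, 0, 1, 0, 0, 1)
p <- 1  # exponential scaling factor of the score

hit  <- inSet * abs(scores)^p
hit  <- hit / sum(hit)                  # steps up at set members
miss <- (1 - inSet) / sum(1 - inSet)    # steps down at non-members
runningSum <- cumsum(hit - miss)

minSum  <- min(runningSum)
maxSum  <- max(runningSum)
# "Best" score: the running-sum value with the largest magnitude.
bestSum <- runningSum[which.max(abs(runningSum))]
print(c(min = minSum, max = maxSum, best = bestSum))
```

In a permutation test the same calculation would be repeated on shuffled rankings to build a null distribution for the best score.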
ID Mapping
Description
ID mapping utility with WebGestalt server.
Usage
idMapping(
organism = "hsapiens",
dataType = "list",
inputGeneFile = NULL,
inputGene = NULL,
sourceIdType,
targetIdType = NULL,
collapseMethod = "mean",
mappingOutput = FALSE,
outputFileName = "",
cache = NULL,
hostName = "https://www.webgestalt.org/"
)
idToSymbol(
organism = "hsapiens",
dataType = "list",
inputGeneFile = NULL,
inputGene = NULL,
sourceIdType = "ensembl_gene_id",
collapseMethod = "mean",
mappingOutput = FALSE,
outputFileName = NULL,
cache = NULL,
hostName = "https://www.webgestalt.org/"
)
Arguments
organism |
Currently, WebGestaltR supports 12 organisms. Users can use the function
|
dataType |
Type of data, either |
inputGeneFile |
The data file to be mapped. |
inputGene |
Or the input could be given as an R object.
GMT file should be read with |
sourceIdType |
The ID type of the data. |
targetIdType |
The ID type of the mapped data. |
collapseMethod |
The method to collapse duplicate IDs with scores. |
mappingOutput |
Boolean if the mapping output is written to file. |
outputFileName |
The output file name. |
cache |
A directory to save data cache for reuse. Defaults to |
hostName |
The server URL for accessing data. Mostly for development purposes. |
Value
A list of mapped and unmapped IDs.
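The effect of collapseMethod on scored input can be pictured with a short base R sketch. It only illustrates collapsing duplicate IDs by their mean, the default shown in the usage above, and is not the package's own code; the gene names and scores are made up.

```r
# Scored gene list in which two IDs appear more than once.
scored <- data.frame(
  gene  = c("TP53", "TP53", "EGFR", "MYC", "MYC"),
  score = c(2.0, 1.0, -0.5, 3.0, 1.0),
  stringsAsFactors = FALSE
)

# Collapse duplicates to a single row per gene using the mean score.
collapsed <- aggregate(score ~ gene, data = scored, FUN = mean)
print(collapsed)
```

Swapping FUN for another summary function gives the equivalent of a different collapse method.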
Jaccard Similarity
Description
Calculate Jaccard Similarity.
Usage
jaccardSim(idsInSet, score)
Arguments
idsInSet |
A list of set names and their member IDs. |
score |
A vector of additive scores with the same length, used to assign input preference; a higher score has a larger weight, e.g. -logP. |
Value
A list of similarity matrix sim.mat and input preference vector ip.vec.
Author(s)
Zhiao Shi, Yuxing Liao
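The similarity matrix can be illustrated with a minimal base R sketch of pairwise Jaccard indices. The helper name jaccard and the gene IDs are hypothetical, and the sketch leaves out the input preference vector that jaccardSim also returns.

```r
# Toy gene sets (hypothetical IDs).
idsInSet <- list(
  setA = c("g1", "g2", "g3", "g4"),
  setB = c("g3", "g4", "g5"),
  setC = c("g6", "g7")
)

# Jaccard index: size of the intersection over size of the union.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

n <- length(idsInSet)
simMat <- matrix(1, nrow = n, ncol = n,
                 dimnames = list(names(idsInSet), names(idsInSet)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    simMat[i, j] <- jaccard(idsInSet[[i]], idsInSet[[j]])
  }
}
print(simMat)
```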
keepRep
Description
Add representatives of redundancy-reduced clusters to topResult if they are missing.
Usage
keepRep(topResult, allResult, reps)
Modify the link to highlight the genes in the pathways
Description
Currently, only WikiPathways and KEGG pathway links need to be modified.
Usage
linkModification(enrichMethod, enrichPathwayLink, geneList, interestingGeneMap)
List WebGestalt Servers
Description
List available WebGestalt servers.
Usage
listArchiveUrl()
Value
A data frame of available servers.
List Gene Sets
Description
List available gene sets for the given organism on WebGestalt server.
Usage
listGeneSet(
organism = "hsapiens",
hostName = "https://www.webgestalt.org/",
cache = NULL
)
Arguments
organism |
Currently, WebGestaltR supports 12 organisms. Users can use the function
|
hostName |
The server URL for accessing data. Mostly for development purposes. |
cache |
A directory to save data cache for reuse. Defaults to |
Value
A data frame of available gene sets.
List ID Types
Description
List supported ID types for the given organism on WebGestalt server.
Usage
listIdType(
organism = "hsapiens",
hostName = "https://www.webgestalt.org/",
cache = NULL
)
Arguments
organism |
Currently, WebGestaltR supports 12 organisms. Users can use the function
|
hostName |
The server URL for accessing data. Mostly for development purposes. |
cache |
A directory to save data cache for reuse. Defaults to |
Value
A list of supported ID types.
List Organisms
Description
List supported organisms on WebGestalt server.
Usage
listOrganism(hostName = "https://www.webgestalt.org/", cache = NULL)
Arguments
hostName |
The server URL for accessing data. Mostly for development purposes. |
cache |
A directory to save data cache for reuse. Defaults to |
Value
A list of supported organisms.
List Reference Sets
Description
List available reference sets for the given organism on WebGestalt server.
Usage
listReferenceSet(
organism = "hsapiens",
hostName = "https://www.webgestalt.org/",
cache = NULL
)
Arguments
organism |
Currently, WebGestaltR supports 12 organisms. Users can use the function
|
hostName |
The server URL for accessing data. Mostly for development purposes. |
cache |
A directory to save data cache for reuse. Defaults to |
Value
A list of reference sets.
Load gene set data
Description
Load gene set data
Usage
loadGeneSet(
organism = "hsapiens",
enrichDatabase = NULL,
enrichDatabaseFile = NULL,
enrichDatabaseType = NULL,
enrichDatabaseDescriptionFile = NULL,
cache = NULL,
hostName = "https://www.webgestalt.org/"
)
Arguments
organism |
Currently, WebGestaltR supports 12 organisms. Users can use the function
|
enrichDatabase |
The functional categories for the enrichment analysis. Users can use
the function |
enrichDatabaseFile |
Users can provide one or more GMT files as the functional
category for enrichment analysis. The extension of the file should be |
enrichDatabaseType |
The ID type of the genes in the |
enrichDatabaseDescriptionFile |
Users can also provide description files for the custom
|
cache |
A directory to save data cache for reuse. Defaults to |
hostName |
The server URL for accessing data. Mostly for development purposes. |
Value
A list of geneSet, geneSetDes, geneSetDag, geneSetNet, standardId.
- geneSet
Gene set: A data frame with columns of "geneSet", "description", "genes"
- geneSetDes
Description: A data frame with two columns of gene set ID and description
- geneSetDag
DAG: An edge list data frame with two columns of parent and child, or a list of data frames if multiple databases are given.
- geneSetNet
Network: An edge list data frame with two columns connecting nodes, or a list of data frames if multiple databases are given.
- standardId
The standard ID of the gene set
Prepare input for standard GSEA
Description
A helper to read files for performing standard GSEA.
Usage
prepareGseaInput(rankFile, gmtFile)
Arguments
rankFile |
Path of the rnk file |
gmtFile |
Path of the GMT file |
Value
A data frame to be used in swGsea.
Prepare Input Matrix for GSEA
Description
Prepare Input Matrix for GSEA
Usage
prepareInputMatrixGsea(rank, gmt)
Arguments
rank |
A 2 column Data Frame of gene and score |
gmt |
3 column Data Frame of geneSet, description, and gene |
Value
A matrix used for input to swGsea.
Read GMT File
Description
Read GMT File
Usage
readGmt(gmtFile, cache = NULL)
Arguments
gmtFile |
The file path or URL of the GMT file. |
cache |
A directory to save data cache for reuse. Defaults to |
Value
A data frame with columns of "geneSet", "description", "gene".
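To show how a GMT file maps onto these three columns, here is a minimal base R sketch that parses two in-memory lines. It assumes the usual GMT layout (set name, description, then member genes, tab-separated) and is not the package's reader; the set names and URLs are placeholders.

```r
# Two GMT-style lines: set name, description, then member genes.
gmtLines <- c(
  "setA\thttp://example.org/setA\tg1\tg2\tg3",
  "setB\thttp://example.org/setB\tg3\tg4"
)

# Split each line on tabs and expand to one row per gene.
fields <- strsplit(gmtLines, "\t", fixed = TRUE)
gmt <- do.call(rbind, lapply(fields, function(x) {
  data.frame(geneSet = x[1], description = x[2], gene = x[-(1:2)],
             stringsAsFactors = FALSE)
}))
print(gmt)
```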
specificParameterSummaryGsea
Description
Render job summary section of GSEA specific parameters
Usage
specificParameterSummaryGsea(
organism,
interestingGeneMap,
geneSet,
minNum,
maxNum,
sigMethod,
fdrThr,
topThr,
perNum,
p,
enrichedSig,
reportNum,
repAdded
)
specificParameterSummaryOra
Description
Render job summary section of ORA specific parameters
Usage
specificParameterSummaryOra(
organism,
referenceGeneList,
geneSet,
referenceGeneFile,
referenceGene,
referenceGeneType,
referenceSet,
minNum,
maxNum,
sigMethod,
fdrThr,
topThr,
fdrMethod,
enrichedSig,
reportNum,
repAdded,
numAnnoRefUserId,
interestingGeneMap,
hostName
)
summaryDescription
Description
Render job summary section
Usage
summaryDescription(
projectName,
organism,
interestGeneFile,
interestGene,
interestGeneType,
enrichMethod,
enrichDatabase,
enrichDatabaseFile,
enrichDatabaseType,
enrichDatabaseDescriptionFile,
interestingGeneMap,
referenceGeneList,
referenceGeneFile,
referenceGene,
referenceGeneType,
referenceSet,
minNum,
maxNum,
sigMethod,
fdrThr,
topThr,
fdrMethod,
enrichedSig,
reportNum,
perNum,
p,
geneSet,
repAdded,
numAnnoRefUserId,
hostName
)
Site Weighted Gene Set Enrichment Analysis
Description
Performs site weighted gene set enrichment analysis or standard GSEA when
likelihood/weight columns in input_df are 1 or 0, p=1, q=1 and thresh_type="val".
Usage
swGsea(
input_df,
thresh_type = "percentile",
thresh = 0.9,
thresh_action = "exclude",
min_set_size = 10,
max_set_size = 500,
max_score = "max",
min_score = "min",
psuedocount = 0.001,
perms = 1000,
p = 1,
q = 1,
nThreads = 1,
rng_seed = 1,
fork = FALSE
)
Arguments
input_df |
A data frame in which first column is name of item of interest (gene, protein, phosphosite, etc.), the second is the correlation of that item of interest with the phenotype (typically log ratio of expression for phenotype vs. normal), and the remaining columns are the scores for the likelihood that the item belongs in each set (one column per set). |
thresh_type |
The type of |
thresh |
Depends on |
thresh_action |
Either "include", "exclude" (default), or "adjust"; this specifies how to treat each set if it doesn't contain a minimum number of items or contains all of the items; this option cannot be used with predefined lists of items in sets (if the number of items in a given set doesn't meet requirements, that set will be skipped). |
min_set_size , max_set_size |
The minimum/maximum number of items each set needs for the analysis to proceed. |
max_score , min_score |
An optional numeric vector of maximum/minimum boundaries used to clip scores for each set. |
psuedocount |
Pseudocount (pc) is used for rescaling set scores:
|
perms |
The number of permutations. |
p |
The exponential scaling factor of the phenotype score (second column in
|
q |
The exponential scaling factor of the likelihood score (weights). |
nThreads |
The number of threads to use in calculating permutations. |
rng_seed |
Random seed. |
fork |
A boolean. Whether to pass "fork" to |
Details
The formula for weighting is as follows
\frac{s_{j}^{q}|r_{j}|^{p}}{\sum s^{q}|r|^{p}}
Where r is log ratio score, s is likelihood score, j is the index of the gene.
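The weighting formula can be evaluated directly in base R. The following sketch uses made-up scores for five genes and only demonstrates the arithmetic, not the rest of the swGsea procedure.

```r
# Hypothetical log-ratio scores r and likelihood scores s for five genes.
r <- c(2.0, -1.5, 0.5, 1.0, -0.2)
s <- c(1, 1, 0, 0.5, 1)
p <- 1  # exponent on the phenotype score
q <- 1  # exponent on the likelihood score

# Numerator s_j^q * |r_j|^p, normalized by its sum over all genes.
raw <- s^q * abs(r)^p
weights <- raw / sum(raw)
print(weights)
```

A gene with likelihood 0 gets weight 0, and with s restricted to 0 or 1 and p = q = 1 the weights reduce to the absolute scores of set members divided by their total.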
Value
A list of Enrichment_Results, Items_in_Set and Running_Sums.
- Enrichment_Results
A data frame with row names of gene set and columns of "ES", "NES", "p_val", "fdr".
- Items_in_Set
A list of one-column data frames. Describes genes and their ranks in each set.
- Running_Sums
Running sum scores along genes sorted by ranked scores, with gene sets as columns.
Author(s)
Eric Jaehnig
Weighted Set Cover
Description
Solves the size-constrained weighted set cover problem: find the top N sets while maximizing the coverage of all elements.
Usage
weightedSetCover(idsInSet, costs, topN, nThreads = 4)
Arguments
idsInSet |
A list of set names and their member IDs. |
costs |
A vector of the same length to add weights for penalty, i.e. 1/-logP. |
topN |
The number of sets (or fewer if it completes early) to return. |
nThreads |
The number of processes to use. On Windows, it falls back to 1. |
Value
A list of topSets and coverage.
- topSets
A list of set IDs.
- coverage
The percentage of IDs covered in the top sets.
Author(s)
Zhiao Shi, Yuxing Liao
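One common way to approach this problem is a greedy heuristic that repeatedly picks the set with the lowest cost per newly covered ID. The base R sketch below illustrates that idea under stated assumptions; it may differ from the algorithm the package actually uses. The function name greedySetCover is hypothetical, costs is assumed to be a vector named by set, and coverage is returned as a fraction.

```r
# Greedy heuristic: pick the set with the lowest cost per newly covered ID
# until topN sets are chosen or every ID is covered.
greedySetCover <- function(idsInSet, costs, topN) {
  allIds <- unique(unlist(idsInSet))
  covered <- character(0)
  chosen <- character(0)
  remaining <- names(idsInSet)
  while (length(chosen) < topN && length(covered) < length(allIds) &&
         length(remaining) > 0) {
    gain <- vapply(idsInSet[remaining],
                   function(ids) length(setdiff(ids, covered)), integer(1))
    if (all(gain == 0)) break
    ratio <- costs[remaining] / gain   # Inf when a set adds nothing new
    best <- remaining[which.min(ratio)]
    chosen <- c(chosen, best)
    covered <- union(covered, idsInSet[[best]])
    remaining <- setdiff(remaining, best)
  }
  list(topSets = chosen, coverage = length(covered) / length(allIds))
}

# Toy input: costs derived from made-up p-values as 1 / -log10(P).
idsInSet <- list(setA = c("g1", "g2", "g3"), setB = c("g3", "g4"),
                 setC = c("g1", "g2"), setD = c("g5"))
pValues <- c(setA = 1e-6, setB = 1e-3, setC = 1e-4, setD = 0.01)
costs <- 1 / -log10(pValues)

res <- greedySetCover(idsInSet, costs, topN = 2)
print(res)
```

In this toy input setC is never chosen even though it has a lower cost than setB, because everything it contains is already covered by setA.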