Title: | Analyze Results Generated by the 'SqueezeMeta' Pipeline |
Version: | 1.7.2 |
Description: | 'SqueezeMeta' is a versatile pipeline for the automated analysis of metagenomics/metatranscriptomics data (https://github.com/jtamames/SqueezeMeta). This package provides functions loading 'SqueezeMeta' results into R, filtering them based on different criteria, and visualizing the results using basic plots. The 'SqueezeMeta' project (and any subsets of it generated by the different filtering functions) is parsed into a single object, whose different components (e.g. tables with the taxonomic or functional composition across samples, contig/gene abundance profiles) can be easily analyzed using other R packages such as 'vegan' or 'DESeq2'. The methods in this package are further described in Puente-Sánchez et al., (2020) <doi:10.1186/s12859-020-03703-2>. |
Author: | Fernando Puente-Sánchez [aut, cre], Natalia García-García [aut] |
Maintainer: | Fernando Puente-Sánchez <fernando.puente.sanchez@slu.se> |
Depends: | R (≥ 3.5.0) |
LazyDataCompression: | xz |
Imports: | zip, reshape2, ggplot2, pathview, data.table, Biostrings |
Suggests: | vegan, microeco, phyloseq |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.2 |
BugReports: | https://github.com/jtamames/SqueezeMeta/issues |
URL: | https://github.com/jtamames/SqueezeMeta |
NeedsCompilation: | no |
Packaged: | 2025-06-13 07:29:23 UTC; fer |
Repository: | CRAN |
Date/Publication: | 2025-06-14 06:30:02 UTC |
CheckM reference markers for Prokaryotes
Description
List of Universal Single Copy Genes for Bacteria and Archaea.
Usage
data(CheckMProkaryote)
Format
List containing vectors of PFAMs, each vector corresponding to a different set of collocated markers.
References
Parks, Imelfort, Skennerton, Hugenholtz & Tyson (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes Genome Res. 25:1043-1055. (PubMed).
See Also
USiCGs
, MGOGs
and MGKOs
for an alternative set of single copy genes, and for examples on how to generate copy numbers.
Hadza hunter-gatherer gut metagenomes
Description
Subset of two bins (and the associated contigs and genes) generated by running SqueezeMeta on two gut metagenomic samples obtained from two hunter-gatherers of the Hadza ethnic group.
Usage
data(Hadza)
Format
A SQM object; see loadSQM
.
Source
References
Rampelli et al., 2015. Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota. Curr. biol. 25:1682-93 (PubMed).
Examples
data(Hadza)
plotTaxonomy(Hadza, "genus", rescale=TRUE)
plotFunctions(Hadza, "COG")
Single Copy Phylogenetic Marker Genes from Sunagawa's group (KOs)
Description
Lists of Single Copy Phylogenetic Marker Genes. These are useful for transforming coverages or tpms into copy numbers. This is an alternative way of normalizing data in order to be able to compare functional profiles in samples with different sequencing depths.
Usage
data(MGKOs)
Format
Character vector with the KEGG identifiers for 10 Single Copy Phylogenetic Marker Genes.
References
Salazar, G et al. (2019). Gene Expression Changes and Community Turnover Differentially Shape the Global Ocean Metatranscriptome Cell 179:1068-1083. (PubMed).
See Also
MGOGs
for an equivalent list using OGs instead of KOs; USiCGs
for an alternative set of single copy genes, and for examples on how to generate copy numbers.
Single Copy Phylogenetic Marker Genes from Sunagawa's group (OGs)
Description
Lists of Single Copy Phylogenetic Marker Genes. These are useful for transforming coverages or tpms into copy numbers. This is an alternative way of normalizing data in order to be able to compare functional profiles in samples with different sequencing depths.
Usage
data(MGOGs)
Format
Character vector with the COG identifiers for 10 Single Copy Phylogenetic Marker Genes.
References
Salazar, G et al. (2019). Gene Expression Changes and Community Turnover Differentially Shape the Global Ocean Metatranscriptome Cell 179:1068-1083. (PubMed).
See Also
MGKOs
for an equivalent list using KOs instead of OGs; USiCGs
for an alternative set of single copy genes, and for examples on how to generate copy numbers.
RecA/RadA recombinase
Description
The recombination protein RecA/RadA is essential for the repair and
maintenance of DNA, and has homologs in every bacteria and archaea.
By dividing the coverage of functions by the coverage of RecA, abundances
can be transformed into copy numbers, which can be used to compare
functional profiles in samples with different sequencing depths.
RecA-derived copy numbers are available in the SQM object
(SQM$functions$<annotation_type>$copy_number
).
Usage
data(RecA)
Format
Character vector with the COG identifier for RecA/RadA.
Source
Examples
data(Hadza)
data(RecA)
### Let's calculate the average copy number of each function in our samples.
# We do it for COG annotations here, but we could also do it for KEGG or PFAMs.
COG.coverage = Hadza$functions$COG$cov
COG.copynumber = t(t(COG.coverage) / COG.coverage[RecA,]) # Sample-wise division by RecA coverage.
Convert a SQM object into a microtable object from the microeco package
Description
This function will convert the selected features from a SQM object into an object of the microtable
class from the microeco package. When possible, it will also include the taxonomy of the included features (for functional classifications, the taxonomy table will instead include the description of each feature ID). Optionally, it accepts a meta table that will be passed as provided to microtable$new
.
Usage
SQM_to_microeco(
SQM,
features = "genus",
count = "abund",
md = NULL,
nocds = "treat_separately",
no_partial_classifications = FALSE,
ignore_unclassified = FALSE,
ignore_unmapped = FALSE,
bin_tax_source = "SQM",
include_seqs = FALSE
)
Arguments
SQM |
A SQM, SQMbunch or SQMlite object. |
features |
character. Either |
count |
character. Either |
md |
data.frame. A optional data.frame containing metadata for the samples in the SQM object. |
nocds |
character. Either |
no_partial_classifications |
logical. When |
ignore_unclassified |
logical. When |
ignore_unmapped |
logical. Don't include unmapped reads in the output (default |
bin_tax_source |
character. Source of taxonomy when |
include_seqs |
logical. Whether to include sequences or not if creating a microtable from contigs (default |
Value
A microtable
.
See Also
SQM_to_phyloseq
for exporting a SQM/SQMlite/SQM object as a phyloseq object.
Convert a SQM object into a phyloseq object from the phyloseq package
Description
This function will convert the selected features from a SQM object into a phyloseq object from the phyloseq package. When possible, it will also include the taxonomy of the included features (for functional classifications, the taxonomy table will instead include the description of each feature ID). Optionally, it accepts a meta table that will be passed as provided to the phyloseq
object constructor.
Usage
SQM_to_phyloseq(
SQM,
features = "genus",
count = "abund",
md = NULL,
nocds = "treat_separately",
no_partial_classifications = FALSE,
ignore_unclassified = FALSE,
ignore_unmapped = FALSE,
bin_tax_source = "SQM",
include_seqs = FALSE
)
Arguments
SQM |
A SQM, SQMbunch or SQMlite object. |
features |
character. Either |
count |
character. Either |
md |
data.frame. A optional data.frame containing metadata for the samples in the SQM object. |
nocds |
character. Either |
no_partial_classifications |
logical. When |
ignore_unclassified |
logical. When |
ignore_unmapped |
logical. Don't include unmapped reads in the output (default |
bin_tax_source |
character. Source of taxonomy when |
include_seqs |
logical. Whether to include sequences or not if creating a microtable from ORFs or contigs (default |
Value
A phyloseq object.
See Also
SQM_to_microeco
for exporting a SQM/SQMlite/SQM object as a microtable object.
Universal Single-Copy Genes
Description
Lists of Universal Single Copy Genes for Bacteria and Archaea. These are useful for transforming coverages or tpms into copy numbers. This is an alternative way of normalizing data in order to be able to compare functional profiles in samples with different sequencing depths.
Usage
data(USiCGs)
Format
Character vector with the KEGG identifiers for 15 Universal Single Copy Genes.
Source
References
Carr, Shen-Orr & Borenstein (2013). Reconstructing the Genomic Content of Microbiome Taxa through Shotgun Metagenomic Deconvolution PLoS Comput. Biol. 9:e1003292. (PubMed).
See Also
MGOGs
and MGKOs
for an alternative set of single copy genes, and for examples on how to generate copy numbers.
Examples
data(Hadza)
data(USiCGs)
### Let's look at the Universal Single Copy Gene distribution in our samples.
KEGG.tpm = Hadza$functions$KEGG$tpm
all(USiCGs %in% rownames(KEGG.tpm)) # Are all the USiCGs present in our dataset?
# Plot a boxplot of USiCGs tpms and calculate median USiCGs tpm.
# This looks weird in the test dataset because it contains only a small subset of the metagenomes.
# In a set of complete metagenomes USiCGs should have fairly similar TPM averages
# and low dispersion across samples.
boxplot(t(KEGG.tpm[USiCGs,]), names=USiCGs, ylab="TPM", col="slateblue2")
### Now let's calculate the average copy numbers of each function.
# We do it for KEGG annotations here, but we could also do it for COGs or PFAMs.
USiCGs.cov = apply(Hadza$functions$KEGG$cov[USiCGs,], 2, median)
# Sample-wise division by the median USiCG coverage.
KEGG.copynumber = t(t(Hadza$functions$KEGG$cov) / USiCGs.cov)
Combine several SQM objects
Description
Combine an arbitrary number of SQM objects into a single SQM object (if the input objects contain the same samples, i.e. they come from the same SqueezeMeta run) or a single SQMbunch object. For combining results from sqm_reads.pl or sqm_longreads.pl please check combineSQMlite
. The parameters below (other than ...) will take only effect if the input objects contain the same samples. Otherwise the input objects will be taken as they are, with no recalculation of taxonomy, function or rescaling,
Usage
combineSQM(
...,
tax_source = "orfs",
trusted_functions_only = FALSE,
ignore_unclassified_functions = FALSE,
rescale_tpm = TRUE,
rescale_copy_number = TRUE,
recalculate_bin_stats = TRUE
)
Arguments
... |
an arbitrary number of SQM objects. Alternatively, a single list containing an arbitrary number of SQM objects. |
tax_source |
character. Features used for calculating aggregated abundances at the different taxonomic ranks. Either |
trusted_functions_only |
logical. If |
ignore_unclassified_functions |
logical. If |
rescale_tpm |
logical. If |
rescale_copy_number |
logical. If |
recalculate_bin_stats |
logical. If |
Value
A SQM or SQMbunch object
See Also
subsetFun
, subsetTax
, combineSQMlite
Examples
data(Hadza)
# Select Carbohydrate metabolism ORFs in Bacteroidota,
# and Amino acid metabolism ORFs in Proteobacteria
bact = subsetTax(Hadza, "phylum", "Bacteroidota")
bact.carb = subsetFun(bact, "Carbohydrate metabolism")
baci = subsetTax(Hadza, "phylum", "Bacillota")
baci.amins = subsetFun(baci, "Amino acid metabolism")
bact.carb_proteo.amins = combineSQM(bact.carb, baci.amins, rescale_copy_number=FALSE)
Combine several SQM or SQMlite objects
Description
Combine an arbitrary number of SQM or SQMlite objects into a single SQMlite object. This function accepts objects originating from different projects (i.e. different SqueezeMeta runs).
Usage
combineSQMlite(...)
Arguments
... |
an arbitrary number of SQM or SQMlite objects. Alternatively, a single list containing an arbitrary number of SQMlite objects. |
Value
A SQMlite object
See Also
Examples
## Not run:
data(Hadza)
# Load data coming from a different run
other = loadSQMlite("/path/to/other/project/tables") # e.g. if the project was run using sqm_reads
# (We could also use loadSQM to load the data as long as the data comes from a SqueezeMeta run)
combined = combineSQMlite(Hadza, other)
# Now we can plot together the samples from Hadza and the second project
plotTaxonomy(combined, 'family')
## End(Not run)
Create a bin from a vector of contigs
Description
Create a bin from a vector of contigs
Usage
create_bin(SQM, bin, contigs, delete_overlapping_bins = FALSE)
Arguments
SQM |
A SQM object. |
bin |
character. Name of the bin to be created. |
contigs |
character. Vector with the names of the contigs that will be included in the new bin. |
delete_overlapping_bins |
logical. If |
Value
SQM object with the new binning information, including recalculated bin statistics if possible.
See Also
find_redundant_contigs
, remove_contigs_from_bin
Export the bins of a SQM object
Description
Export the bins of a SQM object
Usage
exportBins(SQM, output_dir = "")
Arguments
SQM |
A SQM object. |
output_dir |
Existing output directory to which to write the bins. |
Export the contigs of a SQM object
Description
Export the contigs of a SQM object
Usage
exportContigs(SQM, output_name = "")
Arguments
SQM |
A SQM object. |
output_name |
A connection, or a character string naming the file to print to. If "" (the default), sequences will be printed to the standard output connection. |
Export the taxonomy of a SQM object into a Krona Chart
Description
Generate a krona chart containing the full taxonomy from a SQM object.
Usage
exportKrona(SQM, output_name = NA)
Arguments
SQM |
A SQM, SQMbunch or SQMlite object. |
output_name |
character. Name of the output file containing the Krona charts in html format (default |
Details
Original code was kindly provided by Giuseppe D'Auria (dauria_giu@gva.es).
Value
No return value, but a krona chart is produced in the current working directory.
See Also
plotTaxonomy
for plotting the most abundant taxa of a SQM object.
Examples
data(Hadza)
# Check that kronatools is present.
ecode = system('ktImportText', ignore.stdout = TRUE, ignore.stderr = TRUE)
# If so, run.
if(ecode==0) { exportKrona(Hadza, output_name = file.path(tempdir(), "krona.html")) }
Export the ORFs of a SQM object
Description
Export the ORFs of a SQM object
Usage
exportORFs(SQM, output_name = "")
Arguments
SQM |
A SQM object. |
output_name |
A connection, or a character string naming the file to print to. If "" (the default), sequences will be printed to the standard output connection. |
Export the functions of a SQM object into KEGG pathway maps
Description
This function is a wrapper for the pathview package (Luo et al., 2017. Nucleic acids research, 45:W501-W508). It will generate annotated KEGG pathway maps showing which reactions are present in the different samples. It will also generate legends with the color scales for each sample in separate png files.
Usage
exportPathway(
SQM,
pathway_id,
count = "copy_number",
samples = NULL,
split_samples = FALSE,
sample_colors = NULL,
log_scale = FALSE,
fold_change_groups = NULL,
fold_change_colors = NULL,
max_scale_value = NULL,
color_bins = 10,
rescale_percent = FALSE,
output_dir = ".",
output_suffix = "pathview"
)
Arguments
SQM |
A SQM, SQMbunch or SQMlite object. |
pathway_id |
character. The five-number KEGG pathway identifier. A list of all pathway identifiers can be found in https://www.genome.jp/kegg/pathway.html. |
count |
character. Either |
samples |
character. An optional vector with the names of the samples to export. If absent, all samples will be exported (default |
split_samples |
logical. Generate a different output file for each sample (default |
sample_colors |
character. An optional vector with the plotting colors for each sample (default |
log_scale |
logical. Use a base 10 logarithmic transformation for the color scale. Will have no effect if |
fold_change_groups |
list. An optional list containing two vectors of samples. If provided, the function will generate a single plot displaying the log2 fold-change between the median abundances of both groups of samples ( log(second group / first group) ) (default |
fold_change_colors |
character. An optional vector with the plotting colors of both groups in the fold-change plot. Will be ignored if |
max_scale_value |
numeric. Maximum value to include in the color scale. By default it is the maximum value in the selected samples (if plotting abundances in samples) or the maximum absolute log2 fold-change (if plotting fold changes) (default |
color_bins |
numeric. Number of bins used to generate the gradient in the color scale (default |
rescale_percent |
logical. Calculate percent counts over the number of reads in the input object, instead of over the total number of reads in the original project (default |
output_dir |
character. Directory in which to write the output files (default |
output_suffix |
character. Suffix to be added to the output files (default |
Value
A ggplot
if split_samples = FALSE
and the ggpattern package is installed, otherwise nothing. Additionally, Pathview figures will be written in the directory specified by output_dir
.
See Also
plotFunctions
for plotting the most functions taxa of a SQM object.
Examples
data(Hadza)
exportPathway(Hadza, "00910", count = 'copy_number',
output_dir = tempdir(),
output_suffix = "nitrogen_metabolism",
sample_colors = c("red", "blue"))
exportPathway(Hadza, "00250", count = 'tpm',
output_dir = tempdir(),
output_suffix = "ala_asp_glu_metabolism_FoldChange",
fold_change_groups = list(c("H1"), c("H12")), max_scale_value=2)
Export results in tabular format
Description
This function is a wrapper for R's write.table function.
Usage
exportTable(table, output_name)
Arguments
table |
vector, matrix or data.frame. The table to be written. |
output_name |
Either a character string naming a file or a connection open for writing. ‘""’ indicates output to the console. |
Examples
data(Hadza)
Hadza.iron = subsetFun(Hadza, "iron")
# Write the taxonomic distribution at the genus level of all the genes related to iron.
exportTable(Hadza.iron$taxa$genus$percent, file.path(tempdir(), "Hadza.ironGenes.genus.tsv"))
# Now write the distribution of the different iron-related COGs
# (Clusters of Orthologous Groups) across samples.
exportTable(Hadza.iron$functions$COG$tpm, file.path(tempdir(), "Hadza.ironGenes.COG.tsv"))
# Now write all the information contained in the ORF table.
exportTable(Hadza.iron$orfs$table, file.path(tempdir(), "Hadza.ironGenes.orftable.tsv"))
Find redundant contigs within a bin
Description
Find contigs with overlapping marker genes in a given bin, and suggest contigs to be removed in order to reduce contamination without affecting completeness. Note that this can give a quick idea of the contigs that are sources of contamination within a bin, but is not a replacement for proper bin refininement with other tools such as anvi\'o.
Usage
find_redundant_contigs(SQM, bin, minimum_overlap_for_removal = 1)
Arguments
SQM |
A SQM object. |
bin |
character. Name of the bin to be created. |
minimum_overlap_for_removal |
numeric. Fraction of marker genes in the contigs present in another contig needed to suggest it for removal. If set to |
Value
A character vector with the contigs deemed to be redundant. A heatmap showing how marker genes overlap over different contigs will also be produced.
See Also
create_bin
, remove_contigs_from_bin
Examples
data(Hadza)
bin_name = "Hadza2merged.concoct.28.fa.contigs"
# Get redundant contigs that could be removed from our bin
candidates_for_removal = find_redundant_contigs(Hadza, bin_name)
# We can now remove them from the bin
Hadza.new.1 = remove_contigs_from_bin(Hadza, bin_name, candidates_for_removal)
# Or we can create a new bin out of them
# which will also remove them from the original bin
Hadza.new.2 = create_bin(Hadza, "new_bin_name", candidates_for_removal)
Load a SqueezeMeta project into R
Description
This function takes the path to a project directory generated by SqueezeMeta (whose name is specified in the -p
parameter of the SqueezeMeta.pl script) and parses the results into a SQM object. Alternatively, it can load the project data from a zip file produced by sqm2zip.py
.
Usage
loadSQM(
project_path,
tax_mode = "prokfilter",
trusted_functions_only = FALSE,
single_copy_genes = "MGOGs",
load_sequences = TRUE,
engine = "data.table"
)
Arguments
project_path |
character, a vector of project directories generated by SqueezeMeta, and/or zip files generated by |
tax_mode |
character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use |
trusted_functions_only |
logical. If |
single_copy_genes |
character, source of single copy genes for copy number normalization, either |
load_sequences |
logical. If |
engine |
character. Engine used to load the ORFs and contigs tables. Either |
Value
SQM object containing the parsed project. If more than one path is provided in project_path
this function will return a SQMbunch object instead. The structure of this object is similar to that of a SQMlite object (see loadSQMlite
) but with an extra entry named projects
that contains one SQM object for input project. SQM and SQMbunch objects will otherwise behave similarly when used with the subset and plot functions from this package.
Prerequisites
Run SqueezeMeta! An example call for running it would be:
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl
-m coassembly -f fastq_dir -s samples_file -p project_dir
The SQM object structure
The SQM object is a nested list which contains the following information:
lvl1 | lvl2 | lvl3 | type | rows/names | columns | data |
$orfs | $table | dataframe | orfs | misc. data | misc. data | |
$abund | numeric matrix | orfs | samples | abundances (reads) | ||
$bases | numeric matrix | orfs | samples | abundances (bases) | ||
$cov | numeric matrix | orfs | samples | coverages | ||
$cpm | numeric matrix | orfs | samples | covs. / 10^6 reads | ||
$tpm | numeric matrix | orfs | samples | tpm | ||
$seqs | character vector | orfs | (n/a) | sequences | ||
$tax | character matrix | orfs | tax. ranks | taxonomy | ||
$tax16S | character vector | orfs | (n/a) | 16S rRNA taxonomy | ||
$markers | list | orfs | (n/a) | CheckM1 markers | ||
$contigs | $table | dataframe | contigs | misc. data | misc. data | |
$abund | numeric matrix | contigs | samples | abundances (reads) | ||
$bases | numeric matrix | contigs | samples | abundances (bases) | ||
$cov | numeric matrix | contigs | samples | coverages | ||
$cpm | numeric matrix | contigs | samples | covs. / 10^6 reads | ||
$tpm | numeric matrix | contigs | samples | tpm | ||
$seqs | character vector | contigs | (n/a) | sequences | ||
$tax | character matrix | contigs | tax. ranks | taxonomies | ||
$bins | character matrix | contigs | bin. methods | bins | ||
$bins | $table | dataframe | bins | misc. data | misc. data | |
$length | numeric vector | bins | (n/a) | length | ||
$abund | numeric matrix | bins | samples | abundances (reads) | ||
$percent | numeric matrix | bins | samples | abundances (reads) | ||
$bases | numeric matrix | bins | samples | abundances (bases) | ||
$cov | numeric matrix | bins | samples | coverages | ||
$cpm | numeric matrix | bins | samples | covs. / 10^6 reads | ||
$tax | character matrix | bins | tax. ranks | taxonomy | ||
$tax_gtdb | character matrix | bins | tax. ranks | GTDB taxonomy | ||
$taxa | $superkingdom | $abund | numeric matrix | superkingdoms | samples | abundances (reads) |
$percent | numeric matrix | superkingdoms | samples | percentages | ||
$phylum | $abund | numeric matrix | phyla | samples | abundances (reads) | |
$percent | numeric matrix | phyla | samples | percentages | ||
$class | $abund | numeric matrix | classes | samples | abundances (reads) | |
$percent | numeric matrix | classes | samples | percentages | ||
$order | $abund | numeric matrix | orders | samples | abundances (reads) | |
$percent | numeric matrix | orders | samples | percentages | ||
$family | $abund | numeric matrix | families | samples | abundances (reads) | |
$percent | numeric matrix | families | samples | percentages | ||
$genus | $abund | numeric matrix | genera | samples | abundances (reads) | |
$percent | numeric matrix | genera | samples | percentages | ||
$species | $abund | numeric matrix | species | samples | abundances (reads) | |
$percent | numeric matrix | species | samples | percentages | ||
$functions | $KEGG | $abund | numeric matrix | KEGG ids | samples | abundances (reads) |
$bases | numeric matrix | KEGG ids | samples | abundances (bases) | ||
$cov | numeric matrix | KEGG ids | samples | coverages | ||
$cpm | numeric matrix | KEGG ids | samples | covs. / 10^6 reads | ||
$tpm | numeric matrix | KEGG ids | samples | tpm | ||
$copy_number | numeric matrix | KEGG ids | samples | avg. copies | ||
$COG | $abund | numeric matrix | COG ids | samples | abundances (reads) | |
$bases | numeric matrix | COG ids | samples | abundances (bases) | ||
$cov | numeric matrix | COG ids | samples | coverages | ||
$cpm | numeric matrix | COG ids | samples | covs. / 10^6 reads | ||
$tpm | numeric matrix | COG ids | samples | tpm | ||
$copy_number | numeric matrix | COG ids | samples | avg. copies | ||
$PFAM | $abund | numeric matrix | PFAM ids | samples | abundances (reads) | |
$bases | numeric matrix | PFAM ids | samples | abundances (bases) | ||
$cov | numeric matrix | PFAM ids | samples | coverages | ||
$cpm | numeric matrix | PFAM ids | samples | covs. / 10^6 reads | ||
$tpm | numeric matrix | PFAM ids | samples | tpm | ||
$copy_number | numeric matrix | PFAM ids | samples | avg. copies | ||
$total_reads | numeric vector | samples | (n/a) | total reads | ||
$misc | $project_name | character vector | (empty) | (n/a) | project name | |
$samples | character vector | (empty) | (n/a) | samples | ||
$tax_names_long | $superkingdom | character vector | short names | (n/a) | full names | |
$phylum | character vector | short names | (n/a) | full names | ||
$class | character vector | short names | (n/a) | full names | ||
$order | character vector | short names | (n/a) | full names | ||
$family | character vector | short names | (n/a) | full names | ||
$genus | character vector | short names | (n/a) | full names | ||
$species | character vector | short names | (n/a) | full names | ||
$tax_names_short | character vector | full names | (n/a) | short names | ||
$KEGG_names | character vector | KEGG ids | (n/a) | KEGG names | ||
$KEGG_paths | character vector | KEGG ids | (n/a) | KEGG hiararchy | ||
$COG_names | character vector | COG ids | (n/a) | COG names | ||
$COG_paths | character vector | COG ids | (n/a) | COG hierarchy | ||
$ext_annot_sources | character vector | COG ids | (n/a) | external databases | ||
If external databases for functional classification were provided to SqueezeMeta via the -extdb
argument, the corresponding abundance (reads and bases), coverages, tpm and copy number profiles will be present in SQM$functions
(e.g. results for the CAZy database would be present in SQM$functions$CAZy
). Additionally, the extended names of the features present in the external database will be present in SQM$misc
(e.g. SQM$misc$CAZy_names
).
Examples
## Not run:
## (outside R)
## Run SqueezeMeta on the test data.
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Now go into R.
library(SQMtools)
Hadza = loadSQM("Hadza") # Where Hadza is the path to the SqueezeMeta output directory.
## End(Not run)
data(Hadza) # We will illustrate the structure of the SQM object on the test data
# Which are the ten most abundant KEGG IDs in our data?
topKEGG = names(sort(rowSums(Hadza$functions$KEGG$tpm), decreasing=TRUE))[1:11]
topKEGG = topKEGG[topKEGG!="Unclassified"]
# Which functions do those KEGG IDs represent?
Hadza$misc$KEGG_names[topKEGG]
# What is the relative abundance of the Negativicutes class across samples?
Hadza$taxa$class$percent["Negativicutes",]
# Which information is stored in the orf, contig and bin tables?
colnames(Hadza$orfs$table)
colnames(Hadza$contigs$table)
colnames(Hadza$bins$table)
# What is the GC content distribution of my metagenome?
boxplot(Hadza$contigs$table[,"GC perc"]) # Not weighted by contig length or abundance!
Load tables generated by sqm2tables.py
, sqmreads2tables.py
or combine-sqm-tables.py
into R.
Description
This function takes the path to the output directory generated by sqm2tables.py
, sqmreads2tables.py
or combine-sqm-tables.py
a SQMlite object.
The SQMlite object will contain taxonomic and functional profiles, but no detailed information on ORFs, contigs or bins. However, it will also have a much smaller memory footprint.
A SQMlite object can be used for plotting and exporting, but it can not be subsetted.
Usage
loadSQMlite(tables_path, tax_mode = "allfilter")
Arguments
tables_path |
character, tables directory generated by |
tax_mode |
character, which taxonomic classification should be loaded? SqueezeMeta applies the identity thresholds described in Luo et al., 2014. Use |
Value
SQMlite object containing the parsed tables.
The SQMlite object structure
The SQMlite object is a nested list which contains the following information:
lvl1 | lvl2 | lvl3 | type | rows/names | columns | data |
$taxa | $superkingdom | $abund | numeric matrix | superkingdoms | samples | abundances |
$percent | numeric matrix | superkingdoms | samples | percentages | ||
$phylum | $abund | numeric matrix | phyla | samples | abundances | |
$percent | numeric matrix | phyla | samples | percentages | ||
$class | $abund | numeric matrix | classes | samples | abundances | |
$percent | numeric matrix | classes | samples | percentages | ||
$order | $abund | numeric matrix | orders | samples | abundances | |
$percent | numeric matrix | orders | samples | percentages | ||
$family | $abund | numeric matrix | families | samples | abundances | |
$percent | numeric matrix | families | samples | percentages | ||
$genus | $abund | numeric matrix | genera | samples | abundances | |
$percent | numeric matrix | genera | samples | percentages | ||
$species | $abund | numeric matrix | species | samples | abundances | |
$percent | numeric matrix | species | samples | percentages | ||
$functions | $KEGG | $abund | numeric matrix | KEGG ids | samples | abundances (reads) |
$bases | numeric matrix | KEGG ids | samples | abundances (bases) | ||
$tpm | numeric matrix | KEGG ids | samples | tpm | ||
$copy_number | numeric matrix | KEGG ids | samples | avg. copies | ||
$COG | $abund | numeric matrix | COG ids | samples | abundances (reads) | |
$bases | numeric matrix | COG ids | samples | abundances (bases) | ||
$tpm | numeric matrix | COG ids | samples | tpm | ||
$copy_number | numeric matrix | COG ids | samples | avg. copies | ||
$PFAM | $abund | numeric matrix | PFAM ids | samples | abundances (reads) | |
$bases | numeric matrix | PFAM ids | samples | abundances (bases) | ||
$tpm | numeric matrix | PFAM ids | samples | tpm | ||
$copy_number | numeric matrix | PFAM ids | samples | avg. copies | ||
$total_reads | numeric vector | samples | (n/a) | total reads | ||
$misc | $project_name | character vector | (empty) | (n/a) | project name | |
$samples | character vector | (empty) | (n/a) | samples | ||
$tax_names_long | $superkingdom | character vector | short names | (n/a) | full names | |
$phylum | character vector | short names | (n/a) | full names | ||
$class | character vector | short names | (n/a) | full names | ||
$order | character vector | short names | (n/a) | full names | ||
$family | character vector | short names | (n/a) | full names | ||
$genus | character vector | short names | (n/a) | full names | ||
$species | character vector | short names | (n/a) | full names | ||
$tax_names_short | character vector | full names | (n/a) | short names | ||
$KEGG_names | character vector | KEGG ids | (n/a) | KEGG names | ||
$KEGG_paths | character vector | KEGG ids | (n/a) | KEGG hiararchy | ||
$COG_names | character vector | COG ids | (n/a) | COG names | ||
$COG_paths | character vector | COG ids | (n/a) | COG hierarchy | ||
$ext_annot_sources | character vector | (empty) | (n/a) | external databases | ||
If external databases for functional classification were provided to SqueezeMeta or SqueezeMeta_reads via the -extdb
argument, the corresponding abundance, tpm and copy number profiles will be present in SQM$functions
(e.g. results for the CAZy database would be present in SQM$functions$CAZy
). Additionally, the extended names of the features present in the external database will be present in SQM$misc
(e.g. SQM$misc$CAZy_names
). Note that results generated by SqueezeMeta_reads will contain only read abundances, but not bases, tpm or copy number estimations.
See Also
plotBars
and plotFunctions
will plot the most abundant taxa and functions in a SQMlite object. exportKrona
will generate Krona charts reporting the taxonomy in a SQMlite object.
Examples
## Not run:
## (outside R)
## Run SqueezeMeta on the test data.
/path/to/SqueezeMeta/scripts/SqueezeMeta.pl -p Hadza -f raw -m coassembly -s test.samples
## Generate the tabular outputs!
/path/to/SqueezeMeta/utils/sqm2tables.py Hadza Hadza/results/tables
## Now go into R.
library(SQMtools)
Hadza = loadSQMlite("Hadza/results/tables")
# Where Hadza is the path to the SqueezeMeta output directory.
# Note that this is not the whole SQM project, just the directory containing the tables.
# It would also work with tables generated by sqmreads2tables.py, or combine-sqm-tables.py
plotTaxonomy(Hadza)
plotFunctions(Hadza)
exportKrona(Hadza, 'myKronaTest.html')
## End(Not run)
Get the N most abundant rows (or columns) from a numeric table
Description
Return a subset of an input matrix or data frame, containing only the N most abundant rows (or columns), sorted. Alternatively, a custom set of rows can be returned.
Usage
mostAbundant(
data,
N = 10,
items = NULL,
others = FALSE,
rescale = FALSE,
bycol = FALSE
)
Arguments
data |
numeric matrix or data frame |
N |
integer Number of rows to return (default |
items |
Character vector. Custom row names to return. If provided, it will override |
others |
logical. If |
rescale |
logical. Scale result to percentages column-wise (default |
bycol |
logical. Operate on columns instead of rows (default |
Value
A matrix or data frame (same as input) with the selected rows (or columns).
Examples
data(Hadza)
Hadza.carb = subsetFun(Hadza, "Carbohydrate metabolism")
# Which are the 20 most abundant KEGG functions in the ORFs related to carbohydrate metabolism?
topCarb = mostAbundant(Hadza.carb$functions$KEGG$tpm, N=20)
# Now print them with nice names.
rownames(topCarb) = paste(rownames(topCarb),
Hadza.carb$misc$KEGG_names[rownames(topCarb)], sep="; ")
topCarb
# We can pass this to any R function.
heatmap(topCarb)
# But for convenience we provide wrappers for plotting ggplot2 heatmaps and barplots.
plotHeatmap(topCarb, label_y="TPM")
plotBars(topCarb, label_y="TPM")
Get the N most variable rows (or columns) from a numeric table
Description
Return a subset of an input matrix or data frame, containing only the N most variable rows (or columns), sorted. Variability is calculated as the Coefficient of Variation (sd/mean).
Usage
mostVariable(data, N = 10, bycol = FALSE)
Arguments
data |
numeric matrix or data frame |
N |
integer Number of rows to return (default |
bycol |
logical. Operate on columns instead of rows (default |
Value
A matrix or data frame (same as input) with the selected rows or columns.
Examples
data(Hadza)
Hadza.carb = subsetFun(Hadza, "Carbohydrate metabolism")
# Which are the 20 most variable KEGG functions in the ORFs related to carbohydrate metabolism?
topCarb = mostVariable(Hadza.carb$functions$KEGG$tpm, N=20)
# Now print them with nice names
rownames(topCarb) = paste(rownames(topCarb),
Hadza.carb$misc$KEGG_names[rownames(topCarb)], sep="; ")
topCarb
# We can pass this to any R function
heatmap(topCarb)
# But for convenience we provide wrappers for plotting ggplot2 heatmaps and barplots
plotHeatmap(topCarb, label_y="TPM")
plotBars(topCarb, label_y="TPM")
Plot a barplot using ggplot2
Description
Plot a ggplot2 barplot from a matrix or data frame. The data should be in tabular format (e.g. features in rows and samples in columns).
Usage
plotBars(
data,
label_x = "Samples",
label_y = "Abundances",
label_fill = "Features",
color = NULL,
base_size = 11,
max_scale_value = NULL,
metadata_groups = NULL
)
Arguments
data |
Numeric matrix or data frame. |
label_x |
character Label for the x axis (default |
label_y |
character Label for the y axis (default |
label_fill |
character Label for color categories (default |
color |
Vector with custom colors for the different features. If empty, the default ggplot2 palette will be used (default |
base_size |
numeric. Base font size (default |
max_scale_value |
numeric. Maximum value to include in the y axis. By default it is handled automatically by ggplot2 (default |
metadata_groups |
list. Split the plot into groups defined by the user: list('G1' = c('sample1', sample2'), 'G2' = c('sample3', 'sample4')) default |
Value
a ggplot2 plot object.
See Also
plotTaxonomy
for plotting the most abundant taxa of a SQM object; plotHeatmap
for plotting a heatmap with arbitrary data; mostAbundant
for selecting the most abundant rows in a dataframe or matrix.
Examples
data(Hadza)
sk = Hadza$taxa$superkingdom$abund
plotBars(sk, label_y = "Raw reads", label_fill = "Superkingdom")
Barplot of the most abundant bins in a SQM object
Description
This function selects the most abundant bins across all samples in a SQM object and represents their abundances in a barplot. Alternatively, a custom set of bins can be represented.
Usage
plotBins(
SQM,
count = "percent",
N = 15,
bins = NULL,
others = TRUE,
samples = NULL,
ignore_unmapped = FALSE,
ignore_nobin = FALSE,
rescale = FALSE,
color = NULL,
base_size = 11,
max_scale_value = NULL,
metadata_groups = NULL
)
Arguments
SQM |
A SQM object. |
count |
character. Either |
N |
integer Plot the |
bins |
character. Custom bins to plot. If provided, it will override |
others |
logical. Collapse the abundances of least abundant bins, and include the result in the plot (default |
samples |
character. Character vector with the names of the samples to include in the plot. Can also be used to plot the samples in a custom order. If not provided, all samples will be plotted (default |
ignore_unmapped |
logical. Don't include unmapped reads in the plot (default |
ignore_nobin |
logical. Don't include reads which are not in a bin in the plot (default |
rescale |
logical. Re-scale results to percentages (default |
color |
Vector with custom colors for the different features. If empty, we will use our own hand-picked pallete if N<=15, and the default ggplot2 palette otherwise (default |
base_size |
numeric. Base font size (default |
max_scale_value |
numeric. Maximum value to include in the y axis. By default it is handled automatically by ggplot2 (default |
metadata_groups |
list. Split the plot into groups defined by the user: list('G1' = c('sample1', sample2'), 'G2' = c('sample3', 'sample4')) default |
Value
a ggplot2 plot object.
See Also
plotTaxonomy
for plotting the most abundant taxa of a SQM object; plotBars
and plotHeatmap
for plotting barplots or heatmaps with arbitrary data.
Examples
data(Hadza)
# Bins distribution.
plotBins(Hadza)
Heatmap of the most abundant functions in a SQM object
Description
This function selects the most abundant functions across all samples in a SQM object and represents their abundances in a heatmap. Alternatively, a custom set of functions can be represented.
Usage
plotFunctions(
SQM,
fun_level = "KEGG",
count = "copy_number",
N = 25,
fun = NULL,
samples = NULL,
display_function_names = TRUE,
ignore_unmapped = TRUE,
ignore_unclassified = TRUE,
gradient_col = c("ghostwhite", "dodgerblue4"),
rescale_percent = FALSE,
base_size = 11,
metadata_groups = NULL
)
Arguments
SQM |
A SQM, SQMbunch or SQMlite object. |
fun_level |
character. Either |
count |
character. Either |
N |
integer Plot the |
fun |
character. Custom functions to plot. If provided, it will override |
samples |
character. Character vector with the names of the samples to include in the plot. Can also be used to plot the samples in a custom order. If not provided, all samples will be plotted (default |
display_function_names |
logical. Plot function names alongside function IDs, if available (default |
ignore_unmapped |
logical. Don't include unmapped reads in the plot (default |
ignore_unclassified |
logical. Don't include unclassified ORFs in the plot (default |
gradient_col |
A vector of two colors representing the low and high ends of the color gradient (default |
rescale_percent |
logical. Calculate percent counts over the number of reads in the input object, instead of over the total number of reads in the original project (default |
base_size |
numeric. Base font size (default |
metadata_groups |
list. Split the plot into groups defined by the user: list('G1' = c('sample1', sample2'), 'G2' = c('sample3', 'sample4')) default |
Value
a ggplot2 plot object.
See Also
plotTaxonomy
for plotting the most abundant taxa of a SQM object; plotBars
and plotHeatmap
for plotting barplots or heatmaps with arbitrary data.
Examples
data(Hadza)
plotFunctions(Hadza)
Plot a heatmap using ggplot2
Description
Plot a ggplot2 heatmap from a matrix or data frame. The data should be in tabular format (e.g. features in rows and samples in columns).
Usage
plotHeatmap(
data,
label_x = "Samples",
label_y = "Features",
label_fill = "Abundance",
gradient_col = c("ghostwhite", "dodgerblue4"),
base_size = 11,
metadata_groups = NULL
)
Arguments
data |
numeric matrix or data frame. |
label_x |
character Label for the x axis (default |
label_y |
character Label for the y axis (default |
label_fill |
character Label for color scale (default |
gradient_col |
A vector of two colors representing the low and high ends of the color gradient (default |
base_size |
numeric. Base font size (default |
metadata_groups |
list. Split the plot into groups defined by the user: list('G1' = c('sample1', sample2'), 'G2' = c('sample3', 'sample4')) default |
Value
A ggplot2 plot object.
See Also
plotFunctions
for plotting the top functional categories of a SQM object; plotBars
for plotting a barplot with arbitrary data; mostAbundant
for selecting the most abundant rows in a dataframe or matrix.
Examples
data(Hadza)
topPFAM = mostAbundant(Hadza$functions$PFAM$tpm)
topPFAM = topPFAM[rownames(topPFAM) != "Unclassified",] # Take out the Unclassified ORFs.
plotHeatmap(topPFAM, label_x = "Samples", label_y = "PFAMs", label_fill = "TPM")
Barplot of the most abundant taxa in a SQM object
Description
This function selects the most abundant taxa across all samples in a SQM object and represents their abundances in a barplot. Alternatively, a custom set of taxa can be represented.
Usage
plotTaxonomy(
SQM,
rank = "phylum",
count = "percent",
N = 15,
tax = NULL,
others = TRUE,
samples = NULL,
nocds = "treat_separately",
ignore_unmapped = FALSE,
ignore_unclassified = FALSE,
no_partial_classifications = FALSE,
rescale = FALSE,
color = NULL,
base_size = 11,
max_scale_value = NULL,
metadata_groups = NULL
)
Arguments
SQM |
A SQM, SQMbunch or a SQMlite object. |
rank |
Taxonomic rank to plot (default |
count |
character. Either |
N |
integer Plot the |
tax |
character. Custom taxa to plot. If provided, it will override |
others |
logical. Collapse the abundances of least abundant taxa, and include the result in the plot (default |
samples |
character. Character vector with the names of the samples to include in the plot. Can also be used to plot the samples in a custom order. If not provided, all samples will be plotted (default |
nocds |
character. Either |
ignore_unmapped |
logical. Don't include unmapped reads in the plot (default |
ignore_unclassified |
logical. Don't include unclassified reads in the plot (default |
no_partial_classifications |
logical. Treat reads not fully classified at the requested level (e.g. "Unclassified Bacteroidota" at the class level or below) as fully unclassified. This takes effect before |
rescale |
logical. Re-scale results to percentages (default |
color |
Vector with custom colors for the different features. If empty, we will use our own hand-picked pallete if N<=15, and the default ggplot2 palette otherwise (default |
base_size |
numeric. Base font size (default |
max_scale_value |
numeric. Maximum value to include in the y axis. By default it is handled automatically by ggplot2 (default |
metadata_groups |
list. Split the plot into groups defined by the user: list('G1' = c('sample1', sample2'), 'G2' = c('sample3', 'sample4')) default |
Value
a ggplot2 plot object.
See Also
plotFunctions
for plotting the most abundant functions of a SQM object; plotBars
and plotHeatmap
for plotting barplots or heatmaps with arbitrary data.
Examples
data(Hadza)
Hadza.amin = subsetFun(Hadza, "Amino acid metabolism")
# Taxonomic distribution of amino acid metabolism ORFs at the family level.
plotTaxonomy(Hadza.amin, "family")
Remove contigs from a given bin
Description
Remove contigs from a given bin
Usage
remove_contigs_from_bin(SQM, bin, contigs)
Arguments
SQM |
A SQM object. |
bin |
character. Name of the bin from which the contigs will be removed. |
contigs |
character. Vector with the names of the contigs that will be removed from the new bin. |
Value
SQM object with the new binning information, including recalculated bin statistics if possible.
See Also
find_redundant_contigs
, create_bin
Return a vector with the row-wise maxima of a matrix or dataframe.
Description
Return a vector with the row-wise maxima of a matrix or dataframe.
Usage
rowMaxs(table)
Arguments
table |
matrix or dataframe. |
Value
a vector with the row-wise maxima.
Return a vector with the row-wise minima of a matrix or dataframe.
Description
Return a vector with the row-wise minima of a matrix or dataframe.
Usage
rowMins(table)
Arguments
table |
matrix or dataframe. |
Value
a vector with the row-wise minima.
Print a named vector of sequences as a fasta-formatted string
Description
Print a named vector of sequences as a fasta-formatted string
Usage
seqvec2fasta(seqvec, output_name = "")
Arguments
seqvec |
vector. The vector to be written as a fasta string. |
output_name |
A connection, or a character string naming the file to print to. If "" (the default), sequences will be printed to the standard output connection. |
Examples
data(Hadza)
seqvec2fasta(Hadza$orfs$seqs[1:10])
Create a SQM object containing only the requested bins, and the contigs and ORFs contained in them.
Description
Create a SQM object containing only the requested bins, and the contigs and ORFs contained in them.
Usage
subsetBins(
SQM,
bins,
trusted_functions_only = FALSE,
ignore_unclassified_functions = FALSE,
rescale_tpm = TRUE,
rescale_copy_number = TRUE,
allow_empty = FALSE
)
Arguments
SQM |
SQM object to be subsetted. |
bins |
character. Vector of bins to be selected. |
trusted_functions_only |
logical. If |
ignore_unclassified_functions |
logical. If |
rescale_tpm |
logical. If |
rescale_copy_number |
logical. If |
allow_empty |
(internal use only). |
Value
SQM object containing only the requested bins.
See Also
Examples
data(Hadza)
# Which are the most complete bins?
topBinNames = rownames(Hadza$bins$table)[order(Hadza$bins$table[,"Completeness"],
decreasing=TRUE)][1:2]
# Subset with the most complete bin.
topBin = subsetBins(Hadza, topBinNames[1])
Select contigs
Description
Create a SQM object containing only the requested contigs, the ORFs contained in them and the bins that contain them.
Usage
subsetContigs(
SQM,
contigs,
trusted_functions_only = FALSE,
ignore_unclassified_functions = FALSE,
rescale_tpm = FALSE,
rescale_copy_number = FALSE,
recalculate_bin_stats = TRUE,
allow_empty = FALSE
)
Arguments
SQM |
SQM object to be subsetted. |
contigs |
character. Vector of contigs to be selected. |
trusted_functions_only |
logical. If |
ignore_unclassified_functions |
logical. If |
rescale_tpm |
logical. If |
rescale_copy_number |
logical. If |
recalculate_bin_stats |
logical. If |
allow_empty |
(internal use only). |
Value
SQM object containing only the selected contigs.
See Also
Examples
data(Hadza)
# Which contigs have a GC content below 40?
lowGCcontigNames = rownames(Hadza$contigs$table[Hadza$contigs$table[,"GC perc"]<40,])
lowGCcontigs = subsetContigs(Hadza, lowGCcontigNames)
hist(lowGCcontigs$contigs$table[,"GC perc"])
Filter results by function
Description
Create a SQM or SQMbunch object containing only the ORFs with a given function, and the contigs and bins that contain them.
Usage
subsetFun(
SQM,
fun,
columns = NULL,
ignore_case = TRUE,
fixed = FALSE,
trusted_functions_only = FALSE,
ignore_unclassified_functions = FALSE,
rescale_tpm = FALSE,
rescale_copy_number = FALSE,
recalculate_bin_stats = FALSE,
allow_empty = FALSE
)
Arguments
SQM |
SQM or SQMbunch object to be subsetted. |
fun |
character. Pattern to search for in the different functional classifications. |
columns |
character. Restrict the search to the provided column names from |
ignore_case |
logical Make pattern matching case-insensitive (default |
fixed |
logical. If |
trusted_functions_only |
logical. If |
ignore_unclassified_functions |
logical. If |
rescale_tpm |
logical. If |
rescale_copy_number |
logical. If |
recalculate_bin_stats |
logical. If |
allow_empty |
(internal use only). |
Value
SQM or SQMbunch object containing only the requested function.
See Also
subsetTax
, subsetORFs
, subsetSamples
, combineSQM
. The most abundant items of a particular table contained in a SQM object can be selected with mostAbundant
.
Examples
data(Hadza)
Hadza.iron = subsetFun(Hadza, "iron")
Hadza.carb = subsetFun(Hadza, "Carbohydrate metabolism")
# Search for multiple patterns using regular expressions
Hadza.twoKOs = subsetFun(Hadza, "K00812|K00813", fixed=FALSE)
Select ORFs
Description
Create a SQM object containing only the requested ORFs, and the contigs and bins that contain them. Internally, all the other subset functions in this package end up calling subsetORFs
to do the work for them.
Usage
subsetORFs(
SQM,
orfs,
tax_source = "orfs",
trusted_functions_only = FALSE,
ignore_unclassified_functions = FALSE,
rescale_tpm = FALSE,
rescale_copy_number = FALSE,
recalculate_bin_stats = TRUE,
contigs_override = NULL,
allow_empty = FALSE
)
Arguments
SQM |
SQM object to be subsetted. |
orfs |
character. Vector of ORFs to be selected. |
tax_source |
character. Features used for calculating aggregated abundances at the different taxonomic ranks. Either |
trusted_functions_only |
logical. If |
ignore_unclassified_functions |
logical. If |
rescale_tpm |
logical. If |
rescale_copy_number |
logical. If |
recalculate_bin_stats |
logical. If |
contigs_override |
character. Optional vector of contigs to be included in the subsetted object. |
allow_empty |
(internal use only). |
Value
SQM object containing the requested ORFs.
A note on contig/bins subsetting
While this function selects the contigs and bins that contain the desired orfs, it DOES NOT recalculate contig abundance and statistics based on the selected ORFs only. This means that the abundances presented in tables such as SQM$contig$abund
will still refer to the complete contigs, regardless of whether only a fraction of their ORFs are actually present in the returned SQM object. This is also true for the statistics presented in SQM$contigs$table
. Bin statistics may be recalculated if rescale_copy_number
is set to TRUE
, but recalculation will be based on contigs, not ORFs.
Examples
data(Hadza)
# Select the 100 most abundant ORFs in our dataset.
mostAbundantORFnames = names(sort(rowSums(Hadza$orfs$tpm), decreasing=TRUE))[1:100]
mostAbundantORFs = subsetORFs(Hadza, mostAbundantORFnames)
Select random ORFs
Description
Create a random subset of a SQM object.
Usage
subsetRand(SQM, N)
Arguments
SQM |
SQM object to be subsetted. |
N |
numeric. number of random ORFs to select. |
Value
SQM object containing a random subset of ORFs.
See Also
Filter results by sample
Description
Create a SQM object containing only samples specified by the user, and the ORFs, contigs, bins, taxa and functions present in those samples.
Usage
subsetSamples(SQM, samples, remove_missing = TRUE)
Arguments
SQM |
SQM object to be subsetted. |
samples |
character. Samples to be included in the subset. |
remove_missing |
bool. If |
Value
SQM object containing only the requested samples.
See Also
subsetTax
, subsetFun
, subsetORFs
, combineSQM
. The most abundant items of a particular table contained in a SQM object can be selected with mostAbundant
.
Filter results by taxonomy
Description
Create a SQM or SQMbunch object containing only the contigs with a given consensus taxonomy, the ORFs contained in them and the bins that contain them.
Usage
subsetTax(
SQM,
rank,
tax,
trusted_functions_only = FALSE,
ignore_unclassified_functions = FALSE,
rescale_tpm = TRUE,
rescale_copy_number = TRUE,
recalculate_bin_stats = FALSE,
allow_empty = FALSE
)
Arguments
SQM |
SQM object to be subsetted. |
rank |
character. The taxonomic rank from which to select the desired taxa ( |
tax |
character. A taxon or vector of taxa to be selected. |
trusted_functions_only |
logical. If |
ignore_unclassified_functions |
logical. If |
rescale_tpm |
logical. If |
rescale_copy_number |
logical. If |
recalculate_bin_stats |
logical. If |
allow_empty |
(internal use only). |
Value
SQM or SQMbunch object containing only the requested taxon.
See Also
subsetFun
, subsetContigs
, subsetSamples
, combineSQM
. The most abundant items of a particular table contained in a SQM object can be selected with mostAbundant
.
Examples
data(Hadza)
Hadza.Prevotella = subsetTax(Hadza, "genus", "Prevotella")
Hadza.Bacteroidota = subsetTax(Hadza, "phylum", "Bacteroidota")
summary method for class SQM
Description
Computes different statistics of the data contained in the SQM object.
Usage
## S3 method for class 'SQM'
summary(object, ...)
Arguments
object |
SQM object to be summarized. |
... |
Additional parameters (ignored). |
Value
A list of summary statistics.
summary method for class SQMbunch
Description
Computes different statistics of the data contained in the SQMbunch object.
Usage
## S3 method for class 'SQMbunch'
summary(object, ...)
Arguments
object |
SQMbunch object to be summarized. |
... |
Additional parameters (ignored). |
Value
A list of summary statistics.
summary method for class SQMlite
Description
Computes different statistics of the data contained in the SQMlite object.
Usage
## S3 method for class 'SQMlite'
summary(object, ...)
Arguments
object |
SQMlite object to be summarized. |
... |
Additional parameters (ignored). |
Value
A list of summary statistics.