Type: | Package |
Title: | Comprehensive Analysis of 'GENCODE' Annotations and Splice Site Motifs |
Version: | 1.0.2 |
Maintainer: | Monah Abou Alezz <aboualezz.monah@hsr.it> |
Description: | A comprehensive suite of helper functions designed to facilitate the analysis of genomic annotations from the 'GENCODE' database https://www.gencodegenes.org/, supporting both human and mouse genomes. This toolkit enables users to extract, filter, and analyze a wide range of annotation features including genes, transcripts, exons, and introns across different 'GENCODE' releases. It provides functionality for cross-version comparisons, allowing researchers to systematically track annotation updates, structural changes, and feature-level differences between releases. In addition, the package can generate high-quality FASTA files containing donor and acceptor splice site motifs, which are formatted for direct input into the 'MaxEntScan' tool (Yeo and Burge, 2004 <doi:10.1089/1066527041410418>), enabling accurate calculation of splice site strength scores. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
Language: | en-US |
RoxygenNote: | 7.3.2 |
URL: | https://github.com/monahton/GencoDymo2, https://monahton.github.io/GencoDymo2/ |
BugReports: | https://github.com/monahton/GencoDymo2/issues |
Imports: | Biostrings (≥ 2.72.1), BSgenome (≥ 1.66.3), data.table (≥ 1.17.0), dplyr (≥ 1.1.4), GenomicRanges (≥ 1.56.2), methods (≥ 4.4.0), plotrix (≥ 3.8.4), progress (≥ 1.2.3), RCurl (≥ 1.98.1.17), rtracklayer (≥ 1.64.0), tidyr (≥ 1.3.1), IRanges (≥ 2.38.1), tools (≥ 4.4.0), utils (≥ 4.4.0) |
Suggests: | devtools (≥ 2.4.5), BSgenome.Hsapiens.UCSC.hg38 (≥ 1.4.5), knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-07-14 11:58:19 UTC; monahton |
Author: | Monah Abou Alezz |
Repository: | CRAN |
Date/Publication: | 2025-07-14 12:40:02 UTC |
Assign intron donor and acceptor splice sites consensus
Description
This function takes a data frame of intron coordinates and a genome sequence (ideally human or mouse) and returns a data frame with two additional columns for the donor and acceptor splice site consensus sequences. It prepares the donor and acceptor sequences based on the provided intron coordinates and the specified genome (e.g., human hg38), making it useful for downstream analysis of splicing events.
Usage
assign_splice_sites(input, genome, verbose = TRUE)
Arguments
input |
A data frame containing intron coordinates. |
genome |
A |
verbose |
Logical. If TRUE, the function prints progress messages while preparing the splice site data. Default is TRUE. |
Value
A data frame containing the original intron data, with two additional columns:
-
donor_ss
: The donor splice site consensus sequence for each intron. -
acceptor_ss
: The acceptor splice site consensus sequence for each intron.
See Also
extract_introns
, find_cryptic_splice_sites
Examples
## Not run:
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) {
genome <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
introns_df <- extract_introns(gtf_v1)
result <- assign_splice_sites(introns_df, genome)
}
## End(Not run)
Calculate GC Content of Genomic Features
Description
Computes the GC content percentage for exons or introns using genome sequence data from a BSgenome object.
Usage
calculate_gc_content(input, genome, verbose = TRUE)
Arguments
input |
A data frame containing genomic features (exons/introns) with |
genome |
A BSgenome object representing the reference genome (e.g., |
verbose |
A logical indicating whether to print progress messages. Defaults to |
Value
The input data frame with an additional gc_content
column containing GC percentages for each feature.
Examples
## Not run:
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) {
genome <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
gtf_with_gc <- calculate_gc_content(gtf_v1, genome = genome, verbose = FALSE)
}
## End(Not run)
Classify Exons by Their Relative Transcript Position
Description
Categorizes each exon in a transcript as single, first, inner, or last based on its position. This classification helps in analyzing transcript structure and splicing patterns.
Usage
classify_exons(input, verbose = TRUE)
Arguments
input |
A character string specifying the path to a GTF/GFF3 file or a data frame containing GTF-formatted data. The file should follow the GENCODE format conventions. |
verbose |
A logical indicating whether to display progress messages and exon counts. Defaults to |
Details
The function processes the input GTF data to:
Load the GTF file (if a path is provided) or use the provided data frame.
Filter entries to include only exons.
For each transcript, count the total exons and determine each exon's position.
Classify exons based on their position and the total count.
If verbose = TRUE
, the function prints counts of each exon type.
Value
A data frame identical to the input but with an additional column EXON_CLASSIFICATION
. This column contains one of the following values for each exon:
-
"single_exon"
: The transcript contains only one exon. -
"first_exon"
: The first exon in a multi-exon transcript. -
"last_exon"
: The last exon in a multi-exon transcript. -
"inner_exon"
: Exons between the first and last in multi-exon transcripts.
Examples
# Example 1: Using the provided example GTF files
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- classify_exons(file_v1)
# Example 2: Using a pre-loaded data frame
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
classified_data_v1 <- classify_exons(gtf_v1)
Compare Annotation Counts Between Two GENCODE Releases
Description
Compares the number of annotated genomic elements (genes, transcripts, exons, introns) between two specified GENCODE releases. The function accepts input data as either file paths (in GTF/GFF format) or pre-loaded data frames. It computes the absolute difference (delta), the percentage change relative to a chosen baseline, and determines the direction of change (increase or decrease).
Usage
compare_release(input1, input2, type, gene_type = NULL, baseline = "count2")
Arguments
input1 |
A character string specifying the file path to a GTF/GFF file from the first GENCODE release, or a data frame containing the annotation data. |
input2 |
A character string specifying the file path to a GTF/GFF file from the second GENCODE release, or a data frame containing the annotation data. |
type |
A character string indicating the type of genomic element to compare. Valid options are |
gene_type |
An optional character string specifying a particular gene biotype to filter comparisons (e.g., |
baseline |
A character string defining the baseline for calculating percentage change. Options include:
|
Details
This function processes two GENCODE releases to compare annotation counts for a specified genomic element type. Key steps include:
Input Handling: If inputs are file paths, they are loaded into data frames using the
load_file
function. Data frames are used directly.Element Filtering: If
gene_type
is specified, annotations are filtered to include only that gene biotype.Count Calculation: The number of elements (genes, transcripts, etc.) of the specified type is counted in each release.
Delta and Percentage: The absolute difference (delta) and percentage change are calculated based on the chosen baseline.
Direction Determination: The direction of change is determined by comparing counts between the two releases.
The function provides both numerical results and a formatted console output highlighting key metrics.
Value
A list with the following elements:
-
delta
: The absolute difference in the number of annotations. -
percentage
: The percentage change relative to the selected baseline. -
direction
: A string indicating the direction of the change ("increase", "decrease", or "no change").
See Also
Examples
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
file_v2 <- system.file("extdata", "gencode.v2.example.gtf.gz", package = "GencoDymo2")
# Example 1: Using data frames with the provided example GTF files
gtf_v1 <- load_file(file_v1)
gtf_v2 <- load_file(file_v2)
comparison <- compare_release(gtf_v1, gtf_v2, type = "gene")
Convert Data Frame to FASTA File
Description
Converts a data frame containing sequence IDs and sequences into a FASTA-formatted file, optionally compressed as gzip.
Usage
df_to_fasta(df, id_col, seq_col, output_file = NULL, gzip = TRUE, verbose = TRUE)
Arguments
df |
A data frame with at least two columns: one for sequence IDs and one for sequences. |
id_col |
A character string specifying the column name containing sequence IDs. |
seq_col |
A character string specifying the column name containing sequence data. |
output_file |
A character string specifying the output file path. If NULL, the function will stop with an informative message. |
gzip |
A logical indicating whether to compress the output as a gzip file. Defaults to |
verbose |
A logical indicating whether to print progress messages. Defaults to |
Details
This function efficiently writes large sequence datasets to FASTA format, handling compression and progress reporting. It validates input columns and manages memory by processing data in chunks.
Value
No return value. Writes a FASTA file to the specified path.
Examples
temp_dir <- tempdir()
temp_output <- file.path(temp_dir, "output.fa.gz")
seq_data <- data.frame(
transcript_id = c("ENST0001", "ENST0002"),
sequence = c("ATGCTAGCTAG", "GCTAGCTAGCT")
)
df_to_fasta(seq_data, "transcript_id", "sequence", temp_output)
Eliminate Redundant Genomic Elements
Description
Removes redundant genomic elements (exons or introns) from a data frame, ensuring each element is uniquely represented. Redundancy is determined by genomic coordinates and gene ID.
Usage
eliminate_redundant_elements(input, element_type = "exon")
Arguments
input |
A data frame containing genomic element coordinates (exons or introns) with columns |
element_type |
The type of genomic element to process. Valid options are |
Details
This function uses genomic coordinates (chromosome, start, end) and gene ID to identify and remove duplicate entries. For exons, coordinates are directly compared. For introns, coordinates are derived from intron_start
and intron_end
columns (check extract_introns
function for more details)
Value
A data frame with redundant elements removed, retaining only unique entries based on coordinates and gene ID.
Examples
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
# Eliminate redundant exons
nonredundant_exons <- eliminate_redundant_elements(gtf_v1, element_type = "exon")
Extract Coding Sequences (CDS) from GTF Annotations
Description
Extracts CDS regions from a GTF annotation file or data frame using genomic coordinates and retrieves corresponding DNA sequences from a BSgenome reference.
Usage
extract_cds_sequences(input, genome, save_fasta, output_file, verbose)
Arguments
input |
A character string (GTF file path) or GRanges object containing CDS annotations. |
genome |
A BSgenome object for the relevant genome (e.g., BSgenome.Hsapiens.UCSC.hg38). |
save_fasta |
Logical indicating whether to save sequences to a FASTA file. |
output_file |
Character string specifying the FASTA output path. |
verbose |
Logical indicating whether to print progress messages. |
Value
A data frame containing CDS annotations with corresponding sequences. If save_fasta = TRUE
, also writes a FASTA file.
Examples
## Not run:
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) {
suppressPackageStartupMessages(library(GenomicRanges))
genome <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1) # Should return GRanges
cds_seqs <- extract_cds_sequences(gtf_v1, genome, save_fasta = FALSE)
}
## End(Not run)
Extract Genomic Elements by Strand
Description
Filters a data frame or GTF/GFF3 file to extract specific genomic elements (genes, transcripts, exons, or introns) located on a specified DNA strand (+ or -).
Usage
extract_element_by_strand(input, type, strand, verbose = TRUE)
Arguments
input |
A data frame derived from a GTF/GFF3 file or containing genomic annotations. |
type |
A character string specifying the type of element to extract. Valid options: |
strand |
A character string indicating the DNA strand to filter. Valid options: |
verbose |
A logical indicating whether to print the count of extracted elements. Defaults to |
Details
This function filters the input data based on the type
and strand
columns. It is useful for strand-specific analyses, such as studying antisense transcripts or strand-biased genomic features.
Value
A data frame containing only the elements of the specified type located on the chosen strand.
Examples
# Example 1: Extract genes on the forward strand using gencode.v1
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
forward_genes_v1 <- extract_element_by_strand(gtf_v1, type = "gene", strand = "+")
# Example 2: Extract exons on the reverse strand using gencode.v1
reverse_exons_v1 <- extract_element_by_strand(gtf_v1, type = "exon", strand = "-")
Extract Intron Coordinates from GENCODE Annotations
Description
Processes a GTF file or data frame to extract intron coordinates, including their genomic positions, transcript associations, and metadata. The function handles both positive and negative strands, ensuring correct orientation of intron boundaries. It is designed to work with GENCODE-formatted annotations.
Usage
extract_introns(input, verbose = TRUE)
Arguments
input |
A character string specifying the file path to a GTF/GFF3 file, or a data frame containing GTF data. The input must include columns: |
verbose |
Logical. If |
Details
Input Handling: Loads the GTF file if a path is provided; uses the data frame directly if supplied.
Exon Filtering: Extracts exon records and classifies them (e.g., first/last exon) if needed.
Multi-Exon Transcripts: Removes single-exon transcripts to focus on spliced transcripts.
Intron Calculation: Determines intron coordinates by identifying gaps between consecutive exons, adjusting for strand orientation.
Output: Returns a data frame with intron coordinates and metadata, sorted by gene and position.
Value
A data frame with the following columns:
-
seqnames
: Chromosome or scaffold name. -
intron_start
,intron_end
: Genomic start/end positions of the intron. -
width
: Length of the intron (intron_end - intron_start + 1). -
gene_id
,transcript_id
,intron_number
: Gene/transcript identifiers and intron position within the transcript. Additional metadata columns from the original GTF (e.g.,
gene_name
).
See Also
Examples
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
# From a pre-loaded data frame
gtf_v1 <- load_file(file_v1)
introns <- extract_introns(gtf_v1)
Identify Single-Exon Genes/Transcripts in GENCODE Data
Description
This function identifies single-exon genes or transcripts from a genomic annotation dataset, typically obtained from the GENCODE database. It processes input data, either from a file or an already loaded data frame, and returns either single-exon genes or single-exon transcripts based on the user's selection. The function works seamlessly with GTF/GFF files or data frames that include the standard GENCODE annotation fields. This is especially useful for identifying genes or transcripts that do not undergo splicing.
Usage
extract_single_exon(input, level = "gene", output_file = NULL)
Arguments
input |
Either a file path to a GTF/GFF3 file or a data frame representing genomic annotations. The input data should contain at least the columns |
level |
A character string specifying the level of analysis. It should be either |
output_file |
Optional. A file path to save the results as a tab-separated file. If not provided, the results will not be saved. The file will include the IDs of the single-exon genes or transcripts along with their respective exon IDs. This feature is useful for exporting results for further analysis or reporting purposes. |
Details
Input Validation: Checks for required columns and valid
level
specification.Exon Counting: Groups exons by gene/transcript and counts occurrences.
Single-Exon Filtering: Retains only entities with exactly one exon.
Output: Returns filtered results; optionally writes to file.
Value
A data frame containing the IDs of single-exon genes or transcripts based on the selected level. The returned data frame contains the following columns:
-
gene_id
ortranscript_id
: The IDs of the single-exon genes or transcripts. -
exon_count
: The count of exons for each entity (always 1 for single-exon entities). -
exon_id
: The ID of the exon associated with the single-exon gene or transcript.
The data frame can then be used for downstream analysis, such as identifying unique single-exon genes or transcripts, or for exporting to a file.
See Also
Examples
# Example input data frame
input <- data.frame(
type = c("exon", "exon", "exon", "exon", "exon"),
gene_id = c("gene1", "gene1", "gene2", "gene3", "gene4"),
transcript_id = c("tx1", "tx1", "tx2", "tx3", "tx4"),
exon_number = c(1, 2, 1, 1, 1),
exon_id = c("e1", "e2", "e1", "e1", "e1")
)
# Identify single-exon genes
single_exon_genes <- extract_single_exon(input, level = "gene")
print(single_exon_genes)
# Identify single-exon transcripts
single_exon_transcripts <- extract_single_exon(input, level = "transcript")
print(single_exon_transcripts)
Extract Splice Site Motifs for MaxEntScan Analysis (5' or 3')
Description
Extracts donor (5') or acceptor (3') splice site motifs from a genomic dataset using BSgenome sequences.
Usage
extract_ss_motif(input, genome, type, verbose, save_fasta, output_file)
Arguments
input |
A data frame with columns: |
genome |
A BSgenome object (e.g., from BSgenome.Hsapiens.UCSC.hg38). |
type |
One of |
verbose |
Logical; print progress messages. |
save_fasta |
Logical; write a FASTA file of extracted motifs. |
output_file |
Name/path of the output FASTA file. |
Value
A data frame including extracted splice site motifs.
See Also
assign_splice_sites
, df_to_fasta
Examples
## Not run:
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) {
genome <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
introns <- extract_introns(gtf_v1)
motifs_df <- extract_ss_motif(introns, genome, "5ss", verbose = FALSE)
}
## End(Not run)
Identify Potential Cryptic Splice Sites.
Description
This function identifies potential cryptic splice sites by comparing sequence motifs in introns to canonical splice site motifs (donor and acceptor). Cryptic splice sites are those that do not match the canonical donor (GT) or acceptor motifs (AG). It compares the identified splice sites with the provided canonical motifs and flags the sites that differ from the canonical patterns, making it useful for studying aberrant splicing events.
Usage
find_cryptic_splice_sites(input, genome, canonical_donor, canonical_acceptor, verbose)
Arguments
input |
A data frame containing intron coordinates, ideally generated
by |
genome |
A BSgenome object representing the genome sequence. This is used to extract the sequence for each intron to identify splice sites. |
canonical_donor |
A character vector of canonical donor splice site motifs.
Default is |
canonical_acceptor |
A character vector of canonical acceptor splice site motifs.
Default is |
verbose |
Logical; if |
Details
This function performs the following steps:
It assigns donor and acceptor splice sites to each intron using the
assign_splice_sites
function.It compares the identified donor and acceptor splice sites against the provided canonical motifs (
GT
for donor andAG
for acceptor by default). If the splice site sequences do not match the canonical motifs, they are flagged as cryptic.The function returns a data frame with the same intron information, including additional columns
cryptic_donor
andcryptic_acceptor
indicating whether the splice sites are cryptic.The progress of the function is printed if the
verbose
argument is set toTRUE
, showing also the total number of cryptic donor and acceptor sites and their respective percentages.
Value
The input data frame with two logical columns:
-
cryptic_donor
:TRUE
if donor site is non-canonical. -
cryptic_acceptor
:TRUE
if acceptor site is non-canonical.
See Also
assign_splice_sites
, extract_ss_motif
Examples
## Not run:
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)) {
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
introns_df <- extract_introns(gtf_v1)
introns_ss <- assign_splice_sites(introns_df, genome = BSgenome.Hsapiens.UCSC.hg38)
cryptic_sites <- find_cryptic_splice_sites(introns_ss, BSgenome.Hsapiens.UCSC.hg38)
head(cryptic_sites)
}
## End(Not run)
Download GFF3 File from the GENCODE Database
Description
Downloads a GFF3 file for a specified species, release version, and annotation type from the GENCODE database. The file is saved to a user-specified directory or the current working directory by default.
Usage
get_gff3(species, release_version, annotation_type, dest_folder)
Arguments
species |
A character string indicating the species. Supported values are:
|
release_version |
A character string specifying the release version. Options include:
|
annotation_type |
A character string specifying the annotation type. Supported values include:
|
dest_folder |
A character string specifying the destination folder. Defaults to the current working directory. |
Details
The function dynamically determines the correct file URL based on the provided parameters and downloads the GFF3 file to the desired location.
If "latest_release" is specified for release_version
, the function will first determine the latest available release using get_latest_release()
.
Value
A character string specifying the full path of the downloaded GFF3 file.
Examples
## Not run:
# Download the latest human GTF file with primary assembly annotations into a temp directory
temp_dir <- tempdir()
gff3_file <- get_gff3(
species = "human",
release_version = "latest_release",
annotation_type = "primary_assembly.basic.annotation.gff3.gz",
dest_folder = temp_dir
)
print(gff3_file)
# Download a specific mouse release with long noncoding RNA annotations into a temp directory
temp_dir <- tempdir()
gff3_file <- get_gff3(
species = "mouse",
release_version = "release_M36",
annotation_type = "long_noncoding_RNAs.gff3.gz",
dest_folder = temp_dir
)
print(gff3_file)
## End(Not run)
Download GTF File from the GENCODE Database
Description
Downloads a GTF file for a specified species, release version, and annotation type from the GENCODE database. The file is saved to a user-specified directory or the current working directory by default.
Usage
get_gtf(species, release_version, annotation_type, dest_folder)
Arguments
species |
A character string indicating the species. Supported values are:
|
release_version |
A character string specifying the release version. Options include:
|
annotation_type |
A character string specifying the annotation type. Valid options are:
|
dest_folder |
A character string specifying the destination folder where the file will be downloaded. Defaults to the current working directory. |
Details
The function dynamically determines the correct file URL based on the provided parameters and downloads the GTF file to the desired location.
If "latest_release" is specified for release_version
, the function will first determine the latest available release using get_latest_release()
.
Value
A character string specifying the path to the downloaded GTF file.
Examples
## Not run:
# Download the latest human GTF file with primary assembly annotations into a temp directory
temp_dir <- tempdir()
gtf_file <- get_gtf(
species = "human",
release_version = "latest_release",
annotation_type = "primary_assembly.basic.annotation.gtf.gz",
dest_folder = temp_dir
)
print(gtf_file)
# Download a specific mouse release with long noncoding RNA annotations into a temp directory
temp_dir <- tempdir()
gtf_file <- get_gtf(
species = "mouse",
release_version = "release_M36",
annotation_type = "long_noncoding_RNAs.gtf.gz",
dest_folder = temp_dir
)
print(gtf_file)
## End(Not run)
Get the Latest Gencode Release Dynamically
Description
This function retrieves the latest available release of the Gencode database for a given species (human or mouse) by querying the relevant FTP directory. It automates the process of identifying the latest release of annotations for human or mouse.
Usage
get_latest_release(species, verbose = TRUE)
Arguments
species |
A character string indicating the species. Supported values are:
|
verbose |
Logical. If TRUE (default), the function prints the latest release. |
Details
The function accesses the GENCODE FTP directory and parses the available folders to determine the latest release. It is particularly useful for bioinformatics workflows that require up-to-date annotation files without manual checks.
Value
A character string representing the latest release version for the specified species (e.g., "release_42" for human or "release_36" for mouse).
Examples
## Not run:
# Retrieve the latest release version for human
human_release <- get_latest_release(species = "human", verbose = FALSE)
cat("Latest human GENCODE release: release_47")
# Get the latest release for mouse
mouse_release <- get_latest_release("mouse", verbose = TRUE)
## End(Not run)
Load a GTF or GFF3 file from GENCODE as a data frame.
Description
This function imports a GTF or GFF3 file (commonly from the GENCODE website) and converts it into a data frame. The function provides flexibility for users to work with genomic feature files easily in the R environment.
Usage
load_file(filename)
Arguments
filename |
A character string representing the path to the GTF or GFF3 file (e.g., "gencode.vM36.annotation.gtf.gz"). The file could be in GTF or GFF3 format and must be downloaded from a reliable source like GENCODE. |
Details
The function uses the rtracklayer
package to import the GTF or GFF3 file and returns it as a data frame.
The user should ensure that the input file is properly formatted and accessible from the specified path.
Files larger than a few hundred MBs may take longer to load and process.
Value
A data frame containing the parsed content of the GTF or GFF3 file. The data frame includes standard columns such as 'seqnames', 'start', 'end', 'strand', 'feature', and 'gene_id', among others.
Examples
# Load example GTF files from the package
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
head(gtf_v1)
Calculate Spliced Transcript Lengths
Description
Computes the spliced length of transcripts by summing the lengths of their constituent exons. This represents the mature RNA length after intron removal.
Usage
spliced_trans_length(input)
Arguments
input |
A data frame containing GENCODE annotations in GTF format, including |
Details
This function processes the input data to:
Filter entries to include only exons.
Group exons by transcript ID.
Sum the widths of all exons per transcript.
The result provides the total length of the mature spliced RNA for each transcript.
Value
A data frame with two columns: transcript_id
and transcript_length
, where the latter is the sum of exon widths for each transcript.
Examples
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
spliced_lengths <- spliced_trans_length(gtf_v1)
Generate Summary Statistics for Genomic Elements
Description
Calculates descriptive summary statistics (mean, median, standard deviation, etc.) for the lengths of exons or introns, grouped by classification (e.g., first exon, inner intron).
Usage
stat_summary(input, type, verbose = TRUE)
Arguments
input |
A data frame containing classified exons or introns. For exons, must include |
type |
A character string specifying the element type. Valid options: |
verbose |
A logical indicating whether to print progress messages. Defaults to |
Details
For exons, statistics are grouped by EXON_CLASSIFICATION
. For introns, groups include first_intron
, inner_intron
, and splice site types (e.g., gc_intron
).
Value
A data frame with summary statistics for each element group, including mean, median, standard deviation, standard error, quartiles, and sample size.
Examples
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
# Exon statistics
exon_stats <- stat_summary(gtf_v1, type = "exon")
Tiny example GTF files
Description
These are minimal GTF files used for examples and testing within the package.
-
gencode.v1.example.gtf.gz contains two genes:
GeneA: a single-exon, unspliced gene.
GeneB: a spliced gene with two transcripts and multiple exons.
-
gencode.v2.example.gtf.gz contains the same two genes as in
gencode.v1.example.gtf.gz
plus:GeneC: a new spliced gene with multiple transcripts and many exons.
These files are stored in the inst/extdata/
directory and can be accessed using
system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
or
system.file("extdata", "gencode.v2.example.gtf.gz", package = "GencoDymo2")
.
Format
Two external GTF files.
Source
Generated manually for testing purposes.
See Also
Examples
tiny_v1_path <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
tiny_v2_path <- system.file("extdata", "gencode.v2.example.gtf.gz", package = "GencoDymo2")
gtf1 <- load_file(tiny_v1_path)
head(gtf1)
gtf2 <- load_file(tiny_v2_path)
head(gtf2)