| Title: | Toolkit for the 'Entrez' API |
| Version: | 0.1.0 |
| Description: | Interact with the 'Entrez' API hosted by the National Center for Biotechnology Information (NCBI), https://www.ncbi.nlm.nih.gov/books/NBK25499/. This package is focused on working with sequence metadata and links. It handles pagination and compensates for some API limitations to simplify these tasks. API calls are printed to the console to highlight how high-level queries are translated into individual HTTP requests. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Collate: | 'count.R' 'process.R' 'fetch.R' 'id_set.R' 'id_set_validate.R' 'info.R' 'jentre-package.R' 'link.R' 'post.R' 'request.R' 'search.R' 'utils.R' |
| Depends: | R (≥ 4.1.0) |
| Imports: | cli, glue, httr2, purrr, rlang (≥ 1.1.0), vctrs (≥ 0.7.0), xml2 |
| Suggests: | httpuv, testthat (≥ 3.0.0), tibble, withr |
| Config/testthat/edition: | 3 |
| URL: | https://github.com/cidm-ph/jentre, https://cidm-ph.github.io/jentre/ |
| BugReports: | https://github.com/cidm-ph/jentre/issues |
| NeedsCompilation: | no |
| Packaged: | 2026-03-25 06:10:45 UTC; csus5157 |
| Author: | Carl Suster |
| Maintainer: | Carl Suster <carl.suster@sydney.edu.au> |
| Repository: | CRAN |
| Date/Publication: | 2026-03-30 08:50:13 UTC |
jentre: Toolkit for the 'Entrez' API
Description
To use this package effectively, you should have some understanding of the design of the Entrez API which is documented at https://www.ncbi.nlm.nih.gov/books/NBK25500/. Helper functions will make it easier to avoid common pitfalls, and to make use of features like pagination, but you'll still need to understand how to structure your requests efficiently to avoid undue load.
Details
Entrez API usage is subject to guidelines that are available at the URL above. Entrez datasets are also subject to copyright. Refer to the NCBI policies at https://www.ncbi.nlm.nih.gov/home/about/policies/ for details.
Author(s)
Maintainer: Carl Suster carl.suster@sydney.edu.au (ORCID) [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/cidm-ph/jentre/issues
Check ID set is well formed
Description
Check ID set is well formed
Usage
check_id_set(
  x,
  database = NULL,
  arg = rlang::caller_arg(x),
  call = rlang::caller_env()
)
check_id_list(x, arg = rlang::caller_arg(x), call = rlang::caller_env())
check_web_history(x, arg = rlang::caller_arg(x), call = rlang::caller_env())
entrez_database(x)
Arguments
x |
ID set object. |
database |
name of intended database.
If |
arg |
name of argument to use in error reporting. |
call |
execution environment, for error reporting.
See rlang::topic-error-call and the |
Value
For check_*, these functions raise an error if the check fails.
For entrez_database(), the name of the database.
Fetch records from Entrez
Description
Fetching can be slow, and Entrez will time out requests that take too long.
This helper supports pagination if you specify retmax.
Usage
efetch(
  id_set,
  ...,
  retstart = 0L,
  retmax = NA,
  retmode = "xml",
  rettype = NULL,
  .method = NA,
  .cookies = NA,
  .paginate = 200L,
  .process = NA,
  .progress = "Fetching",
  .path = NULL,
  .call = rlang::current_env()
)
Arguments
id_set |
ID set object. |
... |
additional API parameters (refer to Entrez documentation).
Any set to |
retstart |
integer: index of first result (starts from 0). |
retmax |
integer: maximum number of results to return.
When |
retmode |
character: requested document file format. |
rettype |
character: requested document type. |
.method |
HTTP verb. If |
.cookies |
path to persist cookies.
If |
.paginate |
controls how multiple API requests are used to complete the call.
Pagination is performed using the |
.process |
function that processes the API results.
Can be a function or builtin processor as described in
|
.progress |
controls progress bar; see the |
.path |
path specification for saving raw responses.
See |
.call |
call environment to use in error messages/traces.
See rlang::topic-error-call and the |
Value
Combined output of .process from each page of results.
For the default where .process does nothing, this will be a list of XML documents.
For other choices, it can be a vector, list, or data frame.
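As a rough illustration of what paging involves, the sketch below computes the retstart/retmax windows needed to cover a result set in pages of a fixed size. The helper name page_windows() and the total of 450 records are hypothetical, and this is not jentre's internal code; it only assumes that each page is requested with its own retstart and retmax, as the request log in the Examples suggests.

```r
# Hypothetical helper (not part of jentre): split `total` records into
# pages of at most `page_size`, expressed as retstart/retmax pairs.
page_windows <- function(total, page_size = 200L) {
  if (total <= 0L) {
    return(data.frame(retstart = integer(), retmax = integer()))
  }
  # Zero-based index of the first record on each page.
  starts <- seq(from = 0L, to = total - 1L, by = page_size)
  # The last page may be shorter than the others.
  data.frame(retstart = starts, retmax = pmin(page_size, total - starts))
}

windows <- page_windows(total = 450L)
print(windows)
```

With a page size of 200, 450 records need three requests, the last of which asks for only the 50 remaining records.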
See Also
Other API methods:
einfo(),
elink(),
entrez_validate(),
epost(),
esearch(),
esummary()
Examples
library(xml2)
id_set <- id_list("sra", c("39889350", "39889348", "39889347"))
## Not run:
efetch(id_set)
# -> efetch db="sra" retstart="0" retmax="3" retmode="xml" * id="39889350,...,39889347"[3]
# [[1]]
# {xml_document}
# <EXPERIMENT_PACKAGE_SET>
# [1] <EXPERIMENT_PACKAGE>\n <EXPERIMENT accession="SRX29833825" alias="24-MYP-0283_50325"> ...
# [2] <EXPERIMENT_PACKAGE>\n <EXPERIMENT accession="SRX29833823" alias="24-MYP-0273_50325"> ...
# [3] <EXPERIMENT_PACKAGE>\n <EXPERIMENT accession="SRX29833822" alias="24-MYP-0270_50325"> ...
extract_alias <- function(document) {
  xml_find_all(document, "//EXPERIMENT/@alias") |> xml_text()
}
efetch(id_set, .process = extract_alias)
# -> efetch db="sra" retstart="0" retmax="3" retmode="xml" * id="39889350,...,39889347"[3]
# [1] "24-MYP-0283_50325" "24-MYP-0273_50325" "24-MYP-0270_50325"
## End(Not run)
Get details about Entrez databases
Description
These functions call the EInfo endpoint. einfo() provides the number
of entries in the databases, the name and description, list of terms
usable in the query syntax, and list of link names usable with the ELink
endpoint.
Usage
einfo(db, ..., retmode = "xml", version = "2.0", .call = rlang::current_env())
einfo_databases(..., retmode = "xml", .call = rlang::current_env())
Arguments
db |
name of database to provide information about. |
... |
additional API parameters (refer to Entrez documentation).
Any set to |
retmode |
response format. |
version |
response format version. |
.call |
call environment to use in error messages/traces.
See rlang::topic-error-call and the |
Value
Character vector of database names for einfo_databases().
An XML document with root node <eInfoResult> for einfo().
See Also
Other API methods:
efetch(),
elink(),
entrez_validate(),
epost(),
esearch(),
esummary()
Examples
library(xml2)
## Not run:
einfo("sra") |> xml_find_first("//Description") |> xml_text()
# [1] "SRA Database"
## End(Not run)
ELink API for fetching links between databases
Description
elink() offers direct access to the ELink API endpoint, which has many different
input and output formats depending on parameters. If you just want a one-to-one
mapping of neighbor links, use elink_map(), which handles this for you.
Usage
elink(
  id_set,
  db,
  ...,
  retmode = "xml",
  cmd = NA,
  .paginate = 100L,
  .process = NA,
  .method = NA,
  .multi = "explode",
  .progress = TRUE,
  .cookies = NA,
  .path = NULL,
  .call = current_env()
)
elink_map(id_set, db, ..., .cookies = NA, .path = NULL, .call = current_env())
Arguments
id_set |
ID set object. |
db |
target database name. |
... |
additional API parameters (refer to Entrez documentation).
Any set to |
retmode |
response format. |
cmd |
ELink command.
If |
.paginate |
maximum number of UIDs to submit per request.
|
.process |
function that processes the API results.
Can be a function or builtin processor as described in |
.method |
HTTP verb.
For |
.multi |
controls how repeated params are handled (see |
.progress |
controls progress bar; see the |
.cookies |
path to persist cookies.
If |
.path |
path specification for saving raw responses.
See |
.call |
call environment to use in error messages/traces.
See rlang::topic-error-call and the |
Value
Concatenated output of .process.
For elink(.process = "sets") a data frame with columns
from: Source link set.
to: Target link set.
linkname: Link name (see https://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html or einfo).
For elink_map() and elink(.process = "flat") a data frame with columns
db_from: Source database name.
id_from: Source identifier. Can be a list column depending on how elink was called.
db_to: Target database name.
linkname: Link name (see https://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html or einfo).
id_to: Target identifier. In general this will be a list column.
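Because the identifier columns can be list columns, it is sometimes handy to flatten the "flat" result into a plain two-column edge list. The sketch below does this in base R on a hand-built stand-in data frame that reuses the IDs from the Examples section; it is not output captured from the API, and it assumes each id_from entry holds exactly one identifier.

```r
# Toy stand-in for the "flat" result shape; a real result would come
# from elink() or elink_map().
links <- data.frame(
  db_from = rep("sra", 3),
  db_to = rep("bioproject", 3),
  linkname = rep("sra_bioproject", 3)
)
links$id_from <- list("39889350", "39889348", "39889347")
links$id_to <- list("1241475", "1241475", "1241475")

# Repeat each source ID once per target it links to, then flatten.
# Assumption: every id_from entry contains exactly one identifier.
edges <- data.frame(
  from = rep(unlist(links$id_from), times = lengths(links$id_to)),
  to = unlist(links$id_to)
)
print(edges)
```

If a source ID linked to several targets, it would simply appear on several rows of edges.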
One-to-one mapping
Note that some ways of calling this API on multiple UIDs result in the one-to-one
association of the input and output sets getting lost. The way around this is to
specify each ID as a separate parameter rather than a single comma-separated param.
This is handled by the default choice of .multi = "explode". When using a web
history token as input, there is no corresponding way to ensure one-to-one mapping.
To ensure that the result is always one-to-one, use elink_map(), which may make
several API requests to achieve the result.
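To make the difference concrete, the base R sketch below builds the id part of the query string both ways. It is only an illustration of the two encodings, not jentre's request-building code.

```r
ids <- c("39889350", "39889348", "39889347")

# One comma-separated parameter: the IDs are submitted as a single set.
comma_query <- paste0("id=", paste(ids, collapse = ","))

# One parameter per ID ("exploded"): each ID is submitted on its own,
# which is what preserves the one-to-one association.
exploded_query <- paste(paste0("id=", ids), collapse = "&")

print(comma_query)
print(exploded_query)
```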
See Also
Other API methods:
efetch(),
einfo(),
entrez_validate(),
epost(),
esearch(),
esummary()
Examples
id_set <- id_list("sra", c("39889350", "39889348", "39889347"))
## Not run:
links <- elink(id_set, "bioproject", linkname = "sra_bioproject")
# -> elink db="bioproject" dbfrom="sra" retmode="xml" cmd="neighbor" linkname="sra_bioproject"
# * id="39889350" id="39889348" id="39889347"
links
# # A tibble: 3 x 5
# db_from id_from db_to linkname id_to
# <chr> <list> <chr> <chr> <list>
# 1 sra <chr [1]> bioproject sra_bioproject <chr [1]>
# 2 sra <chr [1]> bioproject sra_bioproject <chr [1]>
# 3 sra <chr [1]> bioproject sra_bioproject <chr [1]>
links[c("id_from", "id_to")] |> igraph::graph_from_data_frame()
# IGRAPH a807b82 DN-- 4 3 --
# + attr: name (v/c)
# + edges from a807b82 (vertex names):
# [1] 39889350->1241475 39889348->1241475 39889347->1241475
## End(Not run)
Count the number of entries in an ID set
Description
If id_set is an id_list then this is equivalent to length().
If it is a web_history, this may involve an Entrez API call to get the
number of entries. In this case the result is cached so that subsequent
calls don't hit the API again.
Usage
entrez_count(id_set, .call = current_env())
Arguments
id_set |
an ID set object. |
.call |
call environment to use in error messages/traces.
See rlang::topic-error-call and the |
Value
integer number of entries.
Examples
id_set <- id_list("sra", c("39889350", "39889348", "39889347"))
entrez_count(id_set)
Construct a request to the Entrez API
Description
This is a low-level helper that builds a request object but does not
perform the request. In general you'll likely use higher-level methods
like efetch() instead.
Usage
entrez_request(
  endpoint,
  ...,
  .method = "GET",
  .multi = "comma",
  .cookies = NULL,
  .verbose = getOption("jentre.verbose", default = TRUE),
  .call = current_env()
)
entrez_api_key(default = NULL)
Arguments
endpoint |
Entrez endpoint name (e.g. |
... |
additional API parameters (refer to Entrez documentation).
Any set to |
.method |
HTTP verb.
For |
.multi |
controls how repeated params are handled (see |
.cookies |
path to persist cookies.
If |
.verbose |
logical: when TRUE logs all API requests as messages in a compact format.
This uses a summarised format that does not include the request body for POST.
Use normal httr verbosity controls (e.g. |
.call |
call environment to use in error messages/traces.
See rlang::topic-error-call and the |
default |
default value to return if no global configuration is found. |
Details
email, tool, and api_key have default values but these can be
overridden, or can be removed by setting them to NULL.
Value
For entrez_request(), an httr2::request object.
For entrez_api_key(), the API key as a character, or default if no global config exists.
API limits
The Entrez APIs are rate limited.
Requests in this package respect the API headers returned by Entrez.
Without an API key you will be rate limited more aggressively, so it is
recommended to obtain an API key.
jentre searches for the API key in the following order:
1. the API parameter entrez_key provided to any API request function,
2. the option "jentre.api_key", then
3. the environment variable ENTREZ_KEY.
You can check the value is found properly using entrez_api_key().
If no API key is set, a warning will be displayed. This can be suppressed
by setting the option "jentre.silence_api_warning" to TRUE.
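The lookup order above can be restated as a small base R function. This is a hypothetical sketch for illustration only: find_api_key() is not part of jentre, the key values are placeholders, and the package's real logic may differ in detail (for example in how it treats an empty environment variable).

```r
# Hypothetical re-statement of the documented lookup order.
find_api_key <- function(param = NULL, default = NULL) {
  if (!is.null(param)) return(param)            # 1. explicit API parameter
  opt <- getOption("jentre.api_key")
  if (!is.null(opt)) return(opt)                # 2. R option
  env <- Sys.getenv("ENTREZ_KEY", unset = NA)
  if (!is.na(env) && nzchar(env)) return(env)   # 3. environment variable
  default                                       # nothing configured
}

# Remember the current configuration so it can be restored afterwards.
old_opts <- options(jentre.api_key = NULL)
old_env <- Sys.getenv("ENTREZ_KEY", unset = NA)

# Placeholder values, not real keys.
Sys.setenv(ENTREZ_KEY = "key-from-env")
from_env <- find_api_key()
options(jentre.api_key = "key-from-option")
from_option <- find_api_key()
from_param <- find_api_key(param = "key-from-param")

# Restore the previous state.
options(old_opts)
if (is.na(old_env)) Sys.unsetenv("ENTREZ_KEY") else Sys.setenv(ENTREZ_KEY = old_env)
```

In practice you would set either the option or the environment variable once per session and confirm the result with entrez_api_key().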
Examples
library(httr2)
req <- entrez_request("esearch.fcgi", db = "nucleotide", term = "biomol+trna[prop]")
## Not run:
# You'll need to perform the request with httr2 and parse it yourself:
req_perform(req) |> resp_body_xml()
## End(Not run)
Look up accessions and other IDs on Entrez
Description
Passes the provided IDs through Entrez, which normalises the accepted UIDs and
removes invalid ones.
For web history lists, this forces results to be freshly downloaded
(unlike as_id_list() which can use cached results).
Usage
entrez_validate(id_set, .paginate = 5000L, .path = NULL, .call = current_env())
Arguments
id_set |
an |
.paginate |
controls how multiple API requests are used to complete the call.
Pagination is performed using the |
.path |
path specification for saving raw responses. |
.call |
call environment to use in error messages/traces.
See rlang::topic-error-call and the |
Value
id_list object
See Also
Other API methods:
efetch(),
einfo(),
elink(),
epost(),
esearch(),
esummary()
Examples
id_set <- id_list("sra", c("SRX29833825", "SRX29833823", "SRX29833822"))
## Not run:
entrez_validate(id_set)
# <entrez/sra[3]>
# [1] 39889350 39889348 39889347
## End(Not run)
Register UIDs with the Entrez history server
Description
Register UIDs with the Entrez history server
Usage
epost(id_set, ..., WebEnv = NULL, .path = NULL, .call = rlang::current_env())
Arguments
id_set |
an |
... |
additional API parameters (refer to Entrez documentation).
Any set to |
WebEnv |
either a character to pass on as-is, or a |
.path |
path specification for saving raw responses. |
.call |
call environment to use in error messages/traces.
See rlang::topic-error-call and the |
Value
A web_history object usable with other API functions.
See Also
Other API methods:
efetch(),
einfo(),
elink(),
entrez_validate(),
esearch(),
esummary()
Examples
id_set <- id_list("sra", c("39889350", "39889348", "39889347"))
## Not run:
epost(id_set)
# -> epost db="sra" * id="39889350,...,39889347"[3]
# <entrez@/sra[1]>
# [1] MCID_69c36.#1[3]
## End(Not run)
Search Entrez databases
Description
The search term field names are documented in the EInfo API endpoint:
see einfo().
Usage
esearch(
  term,
  db,
  ...,
  retstart = 0L,
  retmax = NA,
  retmode = "xml",
  rettype = "uilist",
  usehistory = is.null(retmax) || is.na(retmax),
  WebEnv = NULL,
  query_key = NULL,
  .cookies = NA,
  .paginate = 10000L,
  .progress = "ESearch",
  .path = NULL,
  .verbose = getOption("jentre.verbose", default = TRUE),
  .call = current_env()
)
Arguments
term |
search query. |
db |
Entrez database name. |
... |
additional API parameters (refer to Entrez documentation).
Any set to |
retstart |
integer: index of first result (starts from 0).
Ignored when |
retmax |
integer: maximum number of results to return.
When |
retmode |
character: currently only |
rettype |
character: currently only |
usehistory |
logical: when |
WebEnv, query_key |
either characters to pass on as-is, or |
.cookies |
path to persist cookies.
If |
.paginate |
controls how multiple API requests are used to complete the call.
Pagination is performed using the |
.progress |
controls progress bar; see the |
.path |
path specification for saving raw responses.
See |
.verbose |
logical: when TRUE logs all API requests as messages in a compact format.
This uses a summarised format that does not include the request body for POST.
Use normal httr verbosity controls (e.g. |
.call |
call environment to use in error messages/traces.
See rlang::topic-error-call and the |
Value
An id set object (either a web_history or an id_list).
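The default for usehistory in the Usage section is computed from retmax. The sketch below restates that default expression as a standalone function to show how it behaves; default_usehistory() is a hypothetical name used only for illustration and is not part of jentre.

```r
# Same expression as the documented default for `usehistory`.
default_usehistory <- function(retmax = NA) {
  is.null(retmax) || is.na(retmax)
}

default_usehistory()      # retmax left at NA: TRUE
default_usehistory(NULL)  # retmax explicitly NULL: TRUE
default_usehistory(100L)  # retmax given: FALSE
```

So a search with no retmax defaults to usehistory = TRUE, which matches the usehistory="y" shown in the request log in the Examples, while supplying retmax turns that default off.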
See Also
https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
Other API methods:
efetch(),
einfo(),
elink(),
entrez_validate(),
epost(),
esummary()
Examples
## Not run:
esearch("mpox virus[orgn]", "biosample")
# -> esearch db="biosample" term="mpox virus[orgn]" retmode="xml" rettype="uilist" usehistory="y"
# i eSearch query "\"Monkeypox virus\"[Organism]" has 7189 results
# <entrez@/biosample[1]>
# [1] MCID_69c36.#1[7189]
## End(Not run)
Fetch document summaries from Entrez
Description
ESummary is faster than EFetch because it only interacts with the frontend rather than the full database, but it returns more limited information.
Usage
esummary(
  id_set,
  ...,
  retstart = 0L,
  retmax = NA,
  retmode = "xml",
  version = "2.0",
  .method = NA,
  .cookies = NA,
  .paginate = 5000L,
  .process = "identity",
  .progress = "Fetching summaries",
  .path = NULL,
  .call = rlang::current_env()
)
Arguments
id_set |
ID set object. |
... |
additional API parameters (refer to Entrez documentation).
Any set to |
retstart |
integer: index of first result (starts from 0). |
retmax |
integer: maximum number of results to return.
When |
retmode |
character: requested document file format. |
version |
character: requested format version. |
.method |
HTTP verb. If |
.cookies |
path to persist cookies.
If |
.paginate |
controls how multiple API requests are used to complete the call.
Pagination is performed using the |
.process |
function that processes the API results.
Can be a function or builtin processor as described in
|
.progress |
controls progress bar; see the |
.path |
path specification for saving raw responses.
See |
.call |
call environment to use in error messages/traces.
See rlang::topic-error-call and the |
Value
Combined output of .process from each page of results.
For the default where .process does nothing, this will be a list of XML documents.
For other choices, it can be a vector, list, or data frame.
See Also
Other API methods:
efetch(),
einfo(),
elink(),
entrez_validate(),
epost(),
esearch()
Entrez identifier sets
Description
Many Entrez APIs accept either a UID list or tokens that point to a result stored on its history server. The classes here wrap these and keep track of the database name that the identifiers belong to. Most of the API helpers in this package are generic over the type of ID set and so can be used the same way with either type.
Usage
id_list(db, ids = character())
web_history(db, WebEnv, query_key, length = NA)
is_id_set(x)
is_id_list(x)
is_web_history(x)
as_id_list(x, .paginate = 5000L, .path = NULL, .call = current_env())
Arguments
db |
name of the associated Entrez database (e.g. |
ids |
UIDs, coercible to a character vector (can be accessions or GI numbers). |
query_key, WebEnv |
history server tokens returned by another Entrez API call. |
length |
number of UIDs in the set, if known. |
x |
object to test or convert. |
.paginate |
controls how multiple API requests are used to complete the call.
Pagination is performed using the |
.path |
path specification for saving raw responses.
See |
.call |
call environment to use in error messages/traces.
See rlang::topic-error-call and the |
Details
It usually will not make sense to create web_history() objects directly - they
are short-lived pointers to results on the Entrez history server and are created
by other API calls.
id_list is a vector and can be manipulated to take subsets (e.g. id_set[1:10] or
tail(id_set)).
web_history is an opaque reference to an ID list stored on the Entrez
history server. Through the course of API calls, information about the length or
the actual list of IDs may be discovered and cached, avoiding subsequent API calls.
as_id_list() can be used to extract the list of IDs.
Convert id_list to web_history with epost().
Convert web_history to id_list with as_id_list().
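A short offline sketch of working with an id_list, using only functions documented in this manual. It assumes the jentre package is installed and makes no API calls; the behaviour noted in the comments is taken from the descriptions in this manual rather than from captured output.

```r
library(jentre)

ids <- id_list("sra", c("39889350", "39889348", "39889347"))

# id_list objects are vectors, so ordinary subsetting applies.
first_two <- ids[1:2]

# Type checks distinguish the two kinds of ID set.
is_id_list(ids)
is_web_history(ids)

# For an id_list this is documented as equivalent to length(),
# so no request is sent.
entrez_count(first_two)
```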
Value
For id_list() and as_id_list() an id_list vector.
For web_history() a web_history object.
For is_id_set(), is_id_list(), and is_web_history() a logical.
See Also
entrez_validate() and entrez_count()
Examples
bioprojects <- id_list("bioproject", c("1241475"))
Process API results
Description
Function to turn the parsed response document into meaningful data.
It must accept one argument, doc, the parsed response document.
The return value must be compatible with vctrs::list_combine(),
e.g. a vector, list, or data frame.
Details
API results are parsed based on the retmode parameter. XML documents will
be parsed into xml2::xml_document objects and an error will be raised if
it contains an <ERROR> node.
Builtin processors can be referred to by name instead of specifying your own function. Some helpers provide additional processors, but these are always available:
- "identity": Puts the parsed output document into a list. Where multiple requests are made (e.g. using the batched APIs like efetch()) these will then be concatenated into a single list.
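A custom processor is just a function of the parsed document. The sketch below defines one that returns a data frame and tries it on a hand-written XML snippet, so it can be tested without any API call. The function name extract_experiments() is hypothetical, and the snippet is a trimmed imitation of the SRA response shown in the efetch() examples, not real API output.

```r
library(xml2)

# Custom processor: takes the parsed document `doc` and returns a data
# frame with one row per <EXPERIMENT> node.
extract_experiments <- function(doc) {
  experiments <- xml_find_all(doc, "//EXPERIMENT")
  data.frame(
    accession = xml_attr(experiments, "accession"),
    alias = xml_attr(experiments, "alias")
  )
}

# Hand-written stand-in for a parsed response, for offline testing.
doc <- read_xml(
  '<EXPERIMENT_PACKAGE_SET>
     <EXPERIMENT_PACKAGE>
       <EXPERIMENT accession="SRX29833825" alias="24-MYP-0283_50325"/>
     </EXPERIMENT_PACKAGE>
     <EXPERIMENT_PACKAGE>
       <EXPERIMENT accession="SRX29833823" alias="24-MYP-0273_50325"/>
     </EXPERIMENT_PACKAGE>
   </EXPERIMENT_PACKAGE_SET>'
)

result <- extract_experiments(doc)
print(result)
```

A function like this could then be supplied as the .process argument, in the same way the efetch() examples pass extract_alias. Returning a data frame keeps the per-page results combinable, as required of processor return values.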