Type: | Package |
Title: | Geographic and Taxonomic Occurrence R-Based Scrubbing |
Version: | 1.0.2 |
Date: | 2024-05-16 |
Imports: | ridigbio, dplyr (≥ 1.1.0), rgbif, magrittr, CoordinateCleaner (≥ 3.0.1), terra, spThin, stringr, leaflet, parsedate, spatstat.geom |
Encoding: | UTF-8 |
VignetteBuilder: | knitr |
LazyData: | true |
RoxygenNote: | 7.2.3 |
Suggests: | knitr, rmarkdown, testthat |
License: | GPL-3 |
Description: | Streamlines downloading and cleaning biodiversity data from Integrated Digitized Biocollections (iDigBio) and the Global Biodiversity Information Facility (GBIF). |
Maintainer: | Natalie N. Patten <natalienpatten@gmail.com> |
URL: | https://nataliepatten.github.io/gatoRs/, https://github.com/nataliepatten/gatoRs |
BugReports: | https://github.com/nataliepatten/gatoRs/issues |
Depends: | R (≥ 3.5.0) |
NeedsCompilation: | no |
Packaged: | 2024-05-17 00:34:15 UTC; Natalie Patten |
Author: | Natalie N. Patten |
Repository: | CRAN |
Date/Publication: | 2024-05-17 04:40:03 UTC |
Pipe operator
Description
See magrittr::%>%
for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs)
.
Locality Cleaning - Remove missing and improbable coordinates
Description
The basic_locality_clean()
function cleans locality by removing missing or impossible coordinates and correcting precision.
This function requires columns named 'latitude' and 'longitude'. These columns should be of type 'numeric'.
Usage
basic_locality_clean(
df,
latitude = "latitude",
longitude = "longitude",
remove.zero = TRUE,
precision = TRUE,
digits = 2,
remove.skewed = TRUE,
info.withheld = "informationWithheld"
)
Arguments
df |
Data frame of occurrence records returned from |
latitude |
Default = "latitude". The name of the latitude column in the data frame. |
longitude |
Default = "longitude". The name of the longitude column in the data frame. |
remove.zero |
Default = TRUE. Indicates that points at (0.00, 0.00) should be removed. |
precision |
Default = TRUE. Indicates that coordinates should be rounded to match the coordinate uncertainty. |
digits |
Default = 2. Indicates digits to round coordinates to when |
remove.skewed |
Default = TRUE. Utilizes the |
info.withheld |
Default = "informationWithheld". The name of the information withheld column in the data frame. |
Details
This function removes any records with missing coordinates, impossible coordinates,
coordinates at (0,0), and any that are flagged as skewed.
These skewed records are identified with the remove_skewed()
function which identifies rows where the ‘InformationWitheld’ column
includes the string "Coordinate uncertainty increased".
We also provide the option to round the provided latitude and longitude values to a specified number of decimal places.
This function requires no additional packages.
Value
Return data frame with specimen removed that had missing or improper coordinate values.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
.
Examples
cleaned_data <- basic_locality_clean(data)
Basis Cleaning - Removes records with certain record basis
Description
The basis_clean()
function removes records based on basisOfRecord column.
Usage
basis_clean(df, basis.list = NA, basis.of.record = "basisOfRecord")
Arguments
df |
Data frame of occurrence records returned from |
basis.list |
A list of basis to keep. If a list is not supplied, the filter will be interactive and users must respond to the function. |
basis.of.record |
Default = "basisOfRecord". The name of the basis of record column in the data frame. |
Details
With the interactive method, the function will print all unique basisOfRecord values in the current data set and then ask the user to respond in the console to prompts regarding which records, if any, should be removed based on their basisOfRecord. This function requires no additional packages.
Value
Returns a data frame with records of desired record basis.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
.
Examples
cleaned_data <- basis_clean(data, basis.list = c("Preserved Specimen","Physical specimen"))
Cite Data - Get GBIF citations
Description
The citation_bellow
function retrieves and returns the citation information
for records from GBIF, or where aggregator = "GBIF".
Usage
citation_bellow(df, id = "ID", aggregator = "aggregator")
Arguments
df |
Data frame of occurrence records returned from |
id |
Default = "ID". The name of the id column in the data frame, which contains unique IDs defined from GBIF (keys) or iDigBio (UUID). |
aggregator |
Default = "aggregator". The name of the column in the data frame that identifies the aggregator that provided the record. This is equal to iDigBio or GBIF. |
Details
This function requires the rgbif package.
Value
Returns a list with citation information for the GBIF data downloaded.
Examples
citations <- citation_bellow(data)
gatoRs Download - Correct classes of data frame columns
Description
The correct_class()
function corrects the classes of each column in a data frame of your queried species.
This function requires no additional packages.
Usage
correct_class(
df,
scientific.name = "scientificName",
genus = "genus",
species = "specificEpithet",
infraspecific.epithet = "infraspecificEpithet",
id = "ID",
occ.id = "occurrenceID",
basis.of.record = "basisOfRecord",
event.date = "eventDate",
year = "year",
month = "month",
day = "day",
inst.code = "institutionCode",
recorded.by = "recordedBy",
country = "country",
county = "county",
state = "stateProvince",
locality = "locality",
latitude = "latitude",
longitude = "longitude",
coord.uncertainty = "coordinateUncertaintyInMeters",
info.withheld = "informationWithheld",
habitat = "habitat",
aggregator = "aggregator"
)
Arguments
df |
Data frame returned by |
scientific.name |
Default = "scientificName". The name of the scientificName column in the data frame. |
genus |
Default = "genus". The name of the genus column in the data frame. |
species |
Default = "specificEpithet". The name of the specificEpithet column in the data frame. |
infraspecific.epithet |
Default = "infraspecificEpithet". The name of the infraspecificEpithet column in the data frame. |
id |
Default = "ID". The name of the id column in the data frame, which contains unique IDs defined from GBIF (keys) or iDigBio (UUID). |
occ.id |
Default = "occurrenceID". The name of the occurrenceID column in the data frame. |
basis.of.record |
Default = "basisOfRecord". The name of the basis of record column in the data frame. |
event.date |
Default = "eventDate". The name of the event date column in the data frame. |
year |
Default = "year". The name of the year column in the data frame. |
month |
Default = "month". The name of the month column in the data frame. |
day |
Default = "day". The name of the day column in the data frame. |
inst.code |
Default = "institutionCode". The name of the institution code column in the data frame. |
recorded.by |
Default = "recordedBy". The recordedBy column in the data frame. |
country |
Default = "country". The name of the country column in the data frame. |
county |
Default = "county". The name of the county column in the data frame. |
state |
Default = "stateProvince". The name of the state/province column in the data frame. |
locality |
Default = "locality". The name of the locality column in the data frame. |
latitude |
Default = "latitude". The name of the latitude column in the data frame. |
longitude |
Default = "longitude". The name of the longitude column in the data frame. |
coord.uncertainty |
Default = "coordinateUncertaintyInMeters". The name of the coordinate uncertainty column in the data frame. |
info.withheld |
Default = "informationWithheld". The name of the information withheld column in the data frame. |
habitat |
Default = "habitat". The name of the habitat column in the data frame. |
aggregator |
Default = "aggregator". The name of the column in the data frame that identifies the aggregator that provided the record. This is equal to iDigBio or GBIF. |
Details
"aggregator", "infraspecific.epithet", and "recorded.by" columns will be skipped if they do not exist. This function requires the dplyr package.
Value
Returns data frame with corrected classes of each column.
Downloaded data from gators_download() for Galax urceolata with default settings and 'limit' set to 5: data <- gators_download(synonyms.list = c("Galax urceolata", "Galax aphylla"), limit = 5)
Description
Downloaded data from gators_download() for Galax urceolata with default settings and 'limit' set to 5: data <- gators_download(synonyms.list = c("Galax urceolata", "Galax aphylla"), limit = 5)
Usage
data
Format
A data frame with 17 rows and 23 columns
Subset Data - Get species, longitude, and latitude columns
Description
The data_chomp()
function "chomps" (subsets) a data frame of occurrence records to only
contain the following columns: "species", "longitude", and "latitude". After using this
function data will be ready for use in Maxent, for example.
Usage
data_chomp(
df,
accepted.name = NA,
longitude = "longitude",
latitude = "latitude"
)
Arguments
df |
Data frame of occurrence records returned from |
accepted.name |
The accepted species name for the records. |
longitude |
Default = "longitude". The name of the longitude column in the data frame. |
latitude |
Default = "latitude". The name of the latitude column in the data frame. |
Details
This function requires the package dplyr.
Value
Returns data frame with three columns: "species", "longitude", and "latitude". The "species" column is set by the accepted.name input. This data frame is ready for downstream applications such as Maxent.
Examples
chomped_data <- data_chomp(data, accepted.name = "Galax urceolata")
Used in gators_download() - Filter iDigBio results by scientific name
Description
The filter_fix_names()
function filters a data frame for relevant results, based on the scientific name given.
Some downloaded results from iDigBio might contain occurrences of other species that have "notes" or "locality" strings that mentioning
the desired species. Here we only retain those where the scientificName column is found to be a fuzzy match to a value in the user-provided list containing the scientific name and applicable synonym
Hence, this function looks for relevant results that are actually occurrences of the
desired species.
Usage
filter_fix_names(
df,
synonyms.list,
filter = "fuzzy",
scientific.name = "scientificName",
accepted.name = NA
)
Arguments
df |
Data frame with name column to be fixed. |
synonyms.list |
A list of synonyms for a species. |
filter |
Default = "fuzzy". Indicates the type of filter to be used–either "exact" or "fuzzy". |
scientific.name |
Default = "scientificName". The name of the scientificName column in the data frame. |
accepted.name |
The accepted scientific name for the species. If provided, an additional column will be added to the data frame with the accepted name for further manual comparison. |
Details
This function requires no additional packages.
Value
Returns data frame with filtered results.
Examples
cleaned_data <- filter_fix_names(data, c("Galax urceolata", "Galax aphylla"), filter = "exact")
cleaned_data <- filter_fix_names(data, c("Galax urceolata", "Galax aphylla"),
accepted.name = "Galax urceolata")
Fix taxonomic capitalization of a species name when there are periods involved.
Description
The fixAfterPeriod()
function fixes taxonomic capitalization of a string when there are periods in the species name.
Usage
fixAfterPeriod(substring)
Arguments
substring |
A substring from the name column of the data frame to be fixed. |
Details
Requires package stringr.
Value
Returns the substring with fixed capitalization.
Used in gators_download() - Fill out taxonomic name columns
Description
The fix_columns()
function fills out the taxonomic name columns based on available information in the data set.
For example, if a row has a name provided for the scientificName column, this information will be used
to generate the respective genus, specificEpithet, and infraspecificEpithet columns for that row.
Usage
fix_columns(
df,
scientific.name = "scientificName",
genus = "genus",
species = "specificEpithet",
infraspecific.epithet = "infraspecificEpithet"
)
Arguments
df |
Data frame of occurrence records. |
scientific.name |
Default = "scientificName". The name of the scientificName column in the data frame. |
genus |
Default = "genus". The name of the genus column in the data frame. |
species |
Default = "specificEpithet". The name of the specificEpithet column in the data frame. |
infraspecific.epithet |
Default = "infraspecificEpithet". The name of the infraspecificEpithet column in the data frame. |
Details
This function requires package stringr.
Value
Returns the original data frame with the specified columns.
Examples
fixed_data <- fix_columns(data)
Used in gators_download() - Fix taxonomic name capitalization
Description
The fix_names()
function fixes the capitalization of species names in the data frame provided to align
with accepted capitalization standards.
Usage
fix_names(df, scientific.name = "scientificName")
Arguments
df |
Data frame with name column to be fixed. |
scientific.name |
Default = "scientificName". The name of the scientificName column in the data frame. |
Details
This function uses the fixAfterPeriod()
function. This function requires package stringr.
Value
Returns df with fixed capitalization in name column.
Examples
fixed_data <- fix_names(data)
Full Cleaning - Wrapper function to speed clean
Description
The full_clean()
function performs automated cleaning steps, including options for: removing
duplicate data points, checking locality precision, removing points with skewed coordinates,
removing plain zero records, removing records based on basis of record, and spatially thinning collection points.
This function also provides the option to interactively inspect and remove types of basis of record.
Usage
full_clean(
df,
synonyms.list,
event.date = "eventDate",
year = "year",
month = "month",
day = "day",
occ.id = "occurrenceID",
remove.NA.occ.id = FALSE,
remove.NA.date = FALSE,
aggregator = "aggregator",
id = "ID",
taxa.filter = "fuzzy",
scientific.name = "scientificName",
accepted.name = NA,
remove.zero = TRUE,
precision = TRUE,
digits = 2,
remove.skewed = TRUE,
basis.list = NA,
basis.of.record = "basisOfRecord",
latitude = "latitude",
longitude = "longitude",
remove.flagged = TRUE,
thin.points = TRUE,
distance = 5,
reps = 100,
one.point.per.pixel = TRUE,
raster = NA,
resolution = 0.5,
remove.duplicates = TRUE
)
Arguments
df |
Data frame of occurrence records. |
synonyms.list |
A list of synonyms for a species. |
event.date |
Default = "eventDate". The name of the event date column in the data frame. |
year |
Default = "year". The name of the year column in the data frame. |
month |
Default = "month". The name of the month column in the data frame. |
day |
Default = "day". The name of the day column in the data frame. |
occ.id |
Default = "occurrenceID". The name of the occurrenceID column in the data frame. |
remove.NA.occ.id |
Default = FALSE. This will remove records with missing occurrence IDs when set to |
remove.NA.date |
Default = FALSE. This will remove records with missing event dates when set to |
aggregator |
Default = "aggregator". The name of the column in the data frame that identifies the aggregator that provided the record. This is equal to iDigBio or GBIF. |
id |
Default = "ID". The name of the id column in the data frame, which contains unique IDs defined from GBIF (keys) or iDigBio (UUID). |
taxa.filter |
The type of filter to be used–either "exact", "fuzzy", or "interactive". |
scientific.name |
Default = "scientificName". The name of the scientificName column in the data frame. |
accepted.name |
The accepted scientific name for the species. If provided, an additional column will be added to the data frame with the accepted name for further manual comparison. |
remove.zero |
Default = TRUE. Indicates that points at (0.00, 0.00) should be removed. |
precision |
Default = TRUE. Indicates that coordinates should be rounded to match the coordinate uncertainty. |
digits |
Default = 2. Indicates digits to round coordinates to when |
remove.skewed |
Default = TRUE. Utilizes the |
basis.list |
A list of basis to keep. If a list is not supplied, this filter will not occur. |
basis.of.record |
Default = "basisOfRecord". The name of the basis of record column in the data frame. |
latitude |
Default = "latitude". The name of the latitude column in the data frame. |
longitude |
Default = "longitude". The name of the longitude column in the data frame. |
remove.flagged |
Default = TRUE. An option to remove points with problematic locality information. |
thin.points |
Default = TRUE. An option to spatially thin occurrence records. |
distance |
Default = 5. Distance in km to separate records. |
reps |
Default = 100. Number of times to perform thinning algorithm. |
one.point.per.pixel |
Default = TRUE. An option to only retain one point per pixel. |
raster |
Raster object which will be used for ecological niche comparisons. A SpatRaster should be used. |
resolution |
Default = 0.5. Options - 0.5, 2.5, 5, and 10 (in min of a degree). 0.5 min of a degree is equal to 30 arc sec. |
remove.duplicates |
Default = TRUE. An option to remove duplicate points. |
Details
This function is entirely automated and thus does not take advantage of the interactive options provided in the individual cleaning functions. Using this wrapper is recommended for data processing that does not require interactive/manual cleaning and inspection. All cleaning steps, except taxonomic harmonization, can be bypassed by setting their associated input variables to FALSE. This function requires packages dplyr, magrittr, and raster.
Value
df is a data frame with the cleaned data.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
. An additional column named "accepted_name" will be returned if an accepted.name was provided.
Examples
cleaned_data <- full_clean(data, synonyms.list = c("Galax urceolata", "Galax aphylla"),
digits = 3, basis.list = c("Preserved Specimen","Physical specimen"),
accepted.name = "Galax urceolata", remove.flagged = FALSE)
Download - Download specimen data from both iDigBio and GBIF
Description
The gators_download()
function downloads data from GBIF and iDigBio for your desired species.
Usage
gators_download(
synonyms.list,
write.file = FALSE,
filename = NA,
gbif.match = "fuzzy",
gbif.prov = FALSE,
idigbio.filter = TRUE,
limit = 1e+05
)
Arguments
synonyms.list |
A list of scientific names including the accepted scientific name and any synonyms for your desired species. For example, |
write.file |
A parameter to choose whether to produce a .csv file containing search results. This parameter is not required and is assigned FALSE by default. |
filename |
The path and file name for the retrieved data. Note that this parameter should include the ".csv"
extension as well. For example, |
gbif.match |
A parameter to select either search by fuzzy matching of scientific name or to search by species code.
For example, |
gbif.prov |
A parameter to obtain the provider/verbatim columns from GBIF. This parameter is optional
and is assigned |
idigbio.filter |
A parameter to remove less relevant search results from iDigBio. Based on the search input, results may
include data points for a different species that mention the desired species in the locality information, for example.
Choosing |
limit |
Default = 100,000 (maximum). Set limit to the number of records requested for each element in synonyms.list from each aggregator. |
Details
With gators_download()
you can obtain biodiversity records for your species of interest from both GBIF and iDigBio.
This function is innovative in how it searches iDigBio. Unlike spocc::occ()
, we do not query the iDigBio API using
the scientific name field, as this will only return exact matches.
Instead, we designed a “pseudo-fuzzy match” to search all fields for
partial matches to the supplied scientific names.
This function uses the get_idigbio()
, get_gbif()
, fix_columns()
, fix_names()
, and filter_fix_names()
functions.
This function requires packages magrittr, rgbif, dplyr, ridigbio, and stringr.
Value
Returns a data frame and writes a csv file as specified in the input. This csv file will contain search results for the desired species from the GBIF and iDigBio databases. The columns are as follows:
ID (contains unique IDs defined from GBIF or iDigBio)
aggregator (either GBIF or iDigBio)
Examples
df <- gators_download(synonyms.list = c("Galax urceolata", "Galax aphylla"), limit = 10)
df <- gators_download(synonyms.list = "Galax urceolata", gbif.match = "code",
idigbio.filter = FALSE, limit = 10)
Merge Retained Data - Combined original data set with georeferenced or retained records.
Description
The gators_merge()
function combines two data sets with identical column names and returns a single data set.
Usage
gators_merge(df1, df2)
Arguments
df1 |
A data frame downloaded with |
df2 |
A data frame with the same columns as df1, but with observations generated through georeferencing or through data requests. |
Details
Prior to combining a data set with georeferenced or retrieved data, please use the remove_missing()
function to limit duplicate records. This function requires no additional packages.
Value
A combined data set.
Examples
removed_missing <- remove_missing(data)
needs_geo <- need_to_georeference(data)
# fill in manually georeferenced data into needs_geo...
merged_data <- gators_merge(removed_missing, needs_geo)
needs_data <- needed_records(data)
# fill in missing information with a data request...
merged_data <- gators_merge(merged_data, needs_data)
Used in gators_download() - Download data from the Global Biodiversity Information Facility
Description
The get_gbif()
function queries the Global Biodiversity Information Facility (GBIF) for your desired species.
Limited to 100,000 record downloads.
Usage
get_gbif(synonyms.list, gbif.match = "fuzzy", gbif.prov = FALSE, limit = 1e+05)
Arguments
synonyms.list |
A list of affiliated names for your query. |
gbif.match |
Default = "fuzzy". Either "fuzzy" for fuzzy matching of name or "code" to search by species code. |
gbif.prov |
Default = FALSE. A parameter to obtain the provider/verbatim columns from GBIF. |
limit |
Default = 100,000 (maximum). Set limit to the number of records requested for each element in synonyms.list. |
Details
This function uses the correct_class()
function.
This function requires the packages rgbif, magrittr, and dplyr.
Value
Returns a data frame with desired columns from GBIF.
Examples
df <- get_gbif(c("Galax urceolata", "Galax aphylla"), limit = 5)
df <- get_gbif(c("Galax urceolata", "Galax aphylla"),
gbif.match = "code", limit = 5)
df <- get_gbif(c("Galax urceolata", "Galax aphylla"), gbif.prov = TRUE, limit = 5)
Used in gators_download() - Download data from Integrated Digitized Biocollections
Description
The get_idigbio()
function queries iDigBio for your desired species.
Limited to 100,000 record downloads.
Usage
get_idigbio(synonyms.list, limit = 1e+05)
Arguments
synonyms.list |
A list of affiliated names for your query. |
limit |
Default = 100,000 (maximum). Set limit to the number of records requested for each element in synonyms.list. |
Details
This function uses the correct_class()
function.
This function requires the packages ridigbio, magrittr, and dplyr.
Value
A data frame with desired columns from iDigBio.
Examples
df <- get_idigbio(c("Galax urceolata", "Galax aphylla"), limit = 100)
matchColClasses
Description
This function allows you to match the column classes of two data frames. This function requires package base. This function does not require user input. Source: https://stackoverflow.com/questions/49215193/r-error-cant-join-on-because-of-incompatible-types
Usage
matchColClasses(df1, df2)
Arguments
df1 |
A data frame |
df2 |
A data frame |
Value
df2 with corrected column classes
Identify Missing Information - Find records which lack coordinate information
Description
The need_to_georeference()
function allows you to find records that are missing coordinates
but contain locality information. These records can then be manually georeferenced.
Usage
need_to_georeference(
df,
longitude = "longitude",
latitude = "latitude",
locality = "locality"
)
Arguments
df |
A data frame downloaded with |
longitude |
Default = "longitude". The name of the longitude column in the data frame. |
latitude |
Default = "latitude". The name of the latitude column in the data frame. |
locality |
Default = "locality". The name of the locality column in the data frame. |
Details
This function requires no additional packages.
Value
Returns a data frame of the points that need to be georeferenced.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
.
Examples
need_coords <- need_to_georeference(data)
Identify Missing Information - Find records with redacted or missing data
Description
The needed_records()
function identifies records with flags, therefore withheld. This indicates that information
is withheld from these records due to endangered species status, for example. Accessing this information may
require a permit. Or, these records can be removed from the data set with remove_redacted()
.
Usage
needed_records(df, info.withheld = "informationWithheld")
Arguments
df |
A data frame downloaded with |
info.withheld |
Default = "informationWithheld". The name of the information withheld column in the data frame. |
Details
This function requires no additional packages.
Value
A data frame with only records for which locality was flagged as redacted or missing.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
.
Examples
need_info <- needed_records(data)
Spatial Correction - One point per pixel
Description
The one_point_per_pixel
function retains only one point per raster pixel. This function is useful
when creating present-absent models.
Usage
one_point_per_pixel(
df,
raster = NA,
resolution = 0.5,
precision = TRUE,
digits = 2,
longitude = "longitude",
latitude = "latitude"
)
Arguments
df |
Data frame of occurrence records. |
raster |
Raster object which will be used for ecological niche comparisons. A SpatRaster should be used. |
resolution |
Default = 0.5. Options - 0.5, 2.5, 5, and 10 (in min of a degree). 0.5 min of a degree is equal to 30 arc sec. |
precision |
Default = TRUE. Indicates that coordinates should be rounded to match the coordinate uncertainty. |
digits |
Default = 2. Indicates digits to round coordinates to when |
longitude |
Default = "longitude". The name of the longitude column in the data frame. |
latitude |
Default = "latitude". The name of the latitude column in the data frame. |
Details
This function requires package raster and spatstat.geom.
Value
df is a data frame with only one point per pixel.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
.
Examples
ready_data <- one_point_per_pixel(data)
Locality Cleaning - Find possibly problematic occurrence records
Description
The process_flagged()
function allows you to visualize and inspect possible problematic points, as well as
manually remove these points, if desired. By default, this function is interactive. When running the function interactively you can
hover over a point to see the record's scientific name, and click on a point to see the record's coordinates.The interactive option plots flagged points in red and non-flagged points in blue.
Usage
process_flagged(
df,
interactive = TRUE,
latitude = "latitude",
longitude = "longitude",
scientific.name = "scientificName"
)
Arguments
df |
Data frame of occurrence records returned from |
interactive |
Default = TRUE. The interactive option allows for a visual display
of possible problematic points and the ability to manually remove these points.
Setting |
latitude |
Default = "latitude". The name of the latitude column in the data frame. |
longitude |
Default = "longitude". The name of the longitude column in the data frame. |
scientific.name |
Default = "scientificName". The name of the scientificName column in the data frame. |
Details
This function is a wrapper to visualize results for the CoordinateCleaner::clean_coordinates()
function.
Briefly, CoordinateCleaner::clean_coordinates()
flags records with coordinates that are unlikely valid,
spatial outliers, or in certain locations including the ocean, state capitals, country centroids,
the GBIF headquarters, and biodiversity institutions (including botanical gardens, museums, herbaria, etc.).
This function requires packages CoordinateCleaner, leaflet, and magrittr.
This function requires interactive user input.
Value
Return cleaned data frame.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
.
Examples
cleaned_data <- process_flagged(data, interactive = FALSE)
Remove Duplicates - Remove records with identical event dates and coordinates
Description
The remove_duplicates()
function removes records with identical event dates and occurrence IDs.
Prior to utilizing this function, longitude and latitude columns should be rounded to match the
coordinate uncertainty using the basic_locality_clean()
function.
Usage
remove_duplicates(
df,
event.date = "eventDate",
aggregator = "aggregator",
id = "ID",
occ.id = "occurrenceID",
year = "year",
month = "month",
day = "day",
latitude = "latitude",
longitude = "longitude",
remove.NA.occ.id = FALSE,
remove.NA.date = FALSE,
remove.unparseable = FALSE
)
Arguments
df |
Data frame of occurrence records returned from |
event.date |
Default = "eventDate". The name of the event date column in the data frame. |
aggregator |
Default = "aggregator". The name of the column in the data frame that identifies the aggregator that provided the record. This is equal to iDigBio or GBIF. |
id |
Default = "ID". The name of the id column in the data frame, which contains unique IDs defined from GBIF (keys) or iDigBio (UUID). |
occ.id |
Default = "occurrenceID". The name of the occurrenceID column in the data frame. |
year |
Default = "year". The name of the year column in the data frame. |
month |
Default = "month". The name of the month column in the data frame. |
day |
Default = "day". The name of the day column in the data frame. |
latitude |
Default = "latitude". The name of the latitude column in the data frame. |
longitude |
Default = "longitude". The name of the longitude column in the data frame. |
remove.NA.occ.id |
Default = FALSE. This will remove records with missing occurrence IDs when set to |
remove.NA.date |
Default = FALSE. This will remove records with missing event dates when set to |
remove.unparseable |
Default = FALSE. If we cannot parse the event date into individual year, month, day categories the user can manually specify. Otherwise, if set to TRUE, these rows will simply be removed. |
Details
Here we identify and remove both (1) specimen duplicates and (2) aggregator duplicates based on each specimens coordinates,
occurrenceID, and eventDate. To leverage all date information
available, set remove.unparseable = FALSE
to manually populate
the year, month, and day columnsl. Dates are parsed based on ISO 8601 which only includes time since the Unix epoch, or January 1st, 1970, therefore dates that occur
before 1970 will not be automatically parsed. If we are unable to parse the included date for particular records,
users can choose to manually enter the year, month, and day for these records when prompted.
If the user chooses to manually enter the event date, the records eventDate will be printed
and the user will be asked to manually enter the year, month, and day of this eventDate into the console.
Users are only prompted to manually parse event dates for records where year, month, and day are absent,
but eventDate is present and cannot be parsed.
This function also we also confirm all ID (UUID and key) are unique to remove any within-aggregator duplicates that may accumulate due to processing errors.
This function requires the parsedate and dplyr packages. Warning, this function will ignore missing occurrence ID
and year, month, day columns if not provided in the data set.
Value
Return data frame with duplicates removed.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
.
Examples
cleaned_data <- remove_duplicates(data)
cleaned_data <- remove_duplicates(data, remove.NA.occ.id = TRUE, remove.NA.date = TRUE)
cleaned_data <- remove_duplicates(data, remove.unparseable = TRUE)
Remove Missing Information - Prepare to merge a data frame with georeferenced and retrieved records
Description
The remove_missing()
function identifies and removes records identified with the need_to_georeference()
and needed_records()
functions. This function should be utilized prior to merging georeferenced or retrieved records.
Usage
remove_missing(
df,
remove.type = "both",
info.withheld = "informationWithheld",
longitude = "longitude",
latitude = "latitude",
locality = "locality",
id = "ID"
)
Arguments
df |
A data frame downloaded with |
remove.type |
Default equal to "both" indicating records identified with the |
info.withheld |
Default = "informationWithheld". The name of the information withheld column in the data frame. |
longitude |
Default = "longitude". The name of the longitude column in the data frame. |
latitude |
Default = "latitude". The name of the latitude column in the data frame. |
locality |
Default = "locality". The name of the locality column in the data frame. |
id |
Default = "ID". The name of the id column in the data frame, which contains unique IDs defined from GBIF (keys) or iDigBio (UUID). |
Details
This function requires no additional packages.
Value
A data frame with records containing missing information removed.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
.
Examples
cleaned_data <- remove_missing(data)
Remove Redacted Information - Remove protected or private records prior to publication
Description
The remove_redacted()
function identifies and removes records where 'aggregator' is not equal to iDigBio or GBIF.
Usage
remove_redacted(df, aggregator = "aggregator")
Arguments
df |
A data frame downloaded with |
aggregator |
Default = "aggregator". The name of the column in the data frame that identifies the aggregator that provided the record. This is equal to iDigBio or GBIF. |
Details
This function requires no additional packages.
Value
A data frame with redacted records removed.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
.
Examples
cleaned_data <- remove_redacted(data)
Used in basic_locality_clean() - Removed skewed locality
Description
The remove_skewed()
function identifies and removes records where locality has been skewed. Records are considered skewed if informationWithheld contains the string "Coordinate uncertainty increased".
Usage
remove_skewed(df, info.withheld = "informationWithheld")
Arguments
df |
A data frame downloaded with |
info.withheld |
Default = "informationWithheld". The name of the information withheld column in the data frame. |
Details
This function requires no additional packages.
Value
A data frame with records remove only records for which locality was skewed.
Examples
cleaned_data <- remove_skewed(data)
Suppress print statements and messages
Description
The suppress_output()
function suppresses print statements and messages from a function. This
does not suppress warnings or errors.
Usage
suppress_output(input)
Arguments
input |
Details
This function requires no additional packages.
Value
output Output is the output of the passed function call.
Taxonomic Cleaning - Filter and resolve taxon names
Description
The taxa_clean()
function filters a data frame for relevant results, based on the scientific name given.
Filtering can be done with scripts by exact or fuzzy match. Or, for a more controlled approach, this function
provides interactive filtering by providing the user with prompts. The interactive method allows the user
to manually determine whether they wish to keep results containing certain scientific names.
Usage
taxa_clean(
df,
synonyms.list,
taxa.filter = "fuzzy",
scientific.name = "scientificName",
accepted.name = NA
)
Arguments
df |
Data frame of occurrence records returned from |
synonyms.list |
A list of synonyms for a species. |
taxa.filter |
The type of filter to be used–either "exact", "fuzzy", or "interactive". |
scientific.name |
Default = "scientificName". The name of the scientificName column in the data frame. |
accepted.name |
The accepted scientific name for the species. If provided, an additional column will be added to the data frame with the accepted name for further manual comparison. |
Details
If users select the interactive approach, the function will first print all unique scientific names in the current data set and then ask the user to respond in the console to prompts regarding which records, if any, should be removed based on their scientific name. After filtering, based on a user-provided taxonomy, an accepted name column can be defined with an optional argument. This function relies on the user-provided taxonomy, we do not utilize any taxonomic backbone. Additionally, this function requires no additional packages.
Value
Returns data frame with filtered results and new column with the accepted name labeled as "accepted_name".
Information about the columns in the returned data frame can be found in the documentation for gators_download()
. An additional column named "accepted_name" will be returned if an accepted.name was provided.
Examples
cleaned_data <- taxa_clean(data, c("Galax urceolata", "Galax aphylla"), taxa.filter = "exact")
cleaned_data <- taxa_clean(data, c("Galax urceolata", "Galax aphylla"),
accepted.name = "Galax urceolata")
Spatial Correction - Spatially thin records
Description
The thin_points
function returns records based on coordinate thinning based on a minimum nearest neighbor distance approach.
Usage
thin_points(
df,
accepted.name = NA,
distance = 5,
reps = 100,
latitude = "latitude",
longitude = "longitude"
)
Arguments
df |
Data frame of occurrence records. |
accepted.name |
Accepted name of your species. This argument is not required if the data frame already contains an accepted_name column. |
distance |
Default = 5. Distance in km to separate records. |
reps |
Default = 100. Number of times to perform thinning algorithm. |
latitude |
Default = "latitude". The name of the latitude column in the data frame. |
longitude |
Default = "longitude". The name of the longitude column in the data frame. |
Details
This function is a wrapper for spatial thinning using the spThin package (Aiello-Lammens et al., 2015) In summary, the thinning algorithm provided by spThin calculates the pairwise distances between data points, then randomly samples a single point from all points less than or equal to the set minimum nearest neighbor distance. This process is repeated until the pairwise distances among points do not fall below the minimum nearest neighbor distance.
Value
df is a data frame with the cleaned data.
Information about the columns in the returned data frame can be found in the documentation for gators_download()
.
Examples
thinned_data <- thin_points(data, accepted.name = "Galax urceolata")