Type: Package
Title: Parse and Test Robots Exclusion Protocol Files and Rules
Version: 0.2.5
Date: 2023-02-07
Author: Bob Rudis (bob@rud.is) [aut, cre], SEOmoz, Inc [aut]
Maintainer: Bob Rudis <bob@rud.is>
Description: The 'Robots Exclusion Protocol' (https://www.robotstxt.org/orig.html) documents a set of standards for allowing or excluding robot/spider crawling of different areas of site content. Tools are provided which wrap the 'rep-cpp' (https://github.com/seomoz/rep-cpp) C++ library for processing these 'robots.txt' files.
NeedsCompilation: yes
URL: https://github.com/hrbrmstr/spiderbar
BugReports: https://github.com/hrbrmstr/spiderbar/issues
License: MIT + file LICENSE
Suggests: covr, robotstxt, tinytest
Depends: R (≥ 3.2.0)
Encoding: UTF-8
Imports: Rcpp
RoxygenNote: 7.2.3
LinkingTo: Rcpp
Packaged: 2023-02-09 16:08:55 UTC; hrbrmstr
Repository: CRAN
Date/Publication: 2023-02-11 10:20:02 UTC
Test URL paths against a 'robxp' robots.txt object
Description
Given a character vector of URL paths and an optional user agent, this function returns a logical vector indicating whether you have permission to fetch the content at each path.
Usage
can_fetch(obj, path = "/", user_agent = "*")
Arguments
obj          a 'robxp' object
path         path to test
user_agent   user agent to test
Value
logical vector indicating whether you have permission to fetch the content
Examples
gh <- paste0(readLines(system.file("extdata", "github-robots.txt",
package="spiderbar")), collapse="\n")
gh_rt <- robxp(gh)
can_fetch(gh_rt, "/humans.txt", "*") # TRUE
can_fetch(gh_rt, "/login", "*") # FALSE
can_fetch(gh_rt, "/oembed", "CCBot") # FALSE
can_fetch(gh_rt, c("/humans.txt", "/login", "/oembed"))
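Because the return value is a plain logical vector in the same order as the input paths, the results can be paired back with the paths using ordinary base R. A minimal sketch (the setNames() pairing is illustrative helper code, not part of spiderbar):
library(spiderbar)
# Build a robxp object from the bundled GitHub robots.txt sample
gh <- paste0(readLines(system.file("extdata", "github-robots.txt",
                                   package = "spiderbar")), collapse = "\n")
gh_rt <- robxp(gh)
# Pair each tested path with its fetch permission
paths <- c("/humans.txt", "/login", "/oembed")
setNames(can_fetch(gh_rt, paths, user_agent = "*"), paths)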
Retrieve all agent crawl delay values in a 'robxp' robots.txt object
Description
Retrieve all agent crawl delay values in a 'robxp' robots.txt object
Usage
crawl_delays(obj)
Arguments
obj          a 'robxp' object
Value
data frame of agents and their crawl delays
Note
-1 will be returned for any listed agent without a crawl delay setting
Examples
gh <- paste0(readLines(system.file("extdata", "github-robots.txt",
package="spiderbar")), collapse="\n")
gh_rt <- robxp(gh)
crawl_delays(gh_rt)
imdb <- paste0(readLines(system.file("extdata", "imdb-robots.txt",
package="spiderbar")), collapse="\n")
imdb_rt <- robxp(imdb)
crawl_delays(imdb_rt)
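To look up the delay for a single agent, the returned data frame can be subset with ordinary base R. A minimal sketch, assuming the columns are named agent and crawl_delay (inspect names() on the result to confirm):
library(spiderbar)
# Build a robxp object from the bundled IMDb robots.txt sample
imdb <- paste0(readLines(system.file("extdata", "imdb-robots.txt",
                                     package = "spiderbar")), collapse = "\n")
imdb_rt <- robxp(imdb)
cd <- crawl_delays(imdb_rt)
# Column names 'agent' and 'crawl_delay' are assumptions -- check names(cd)
cd[cd$agent == "*", "crawl_delay"]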
Custom printer for 'robxp' objects
Description
Custom printer for 'robxp' objects
Usage
## S3 method for class 'robxp'
print(x, ...)
Arguments
x      object to print
...    unused
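The method is dispatched automatically whenever a 'robxp' object is shown at the console; a minimal sketch using the bundled sample file:
library(spiderbar)
gh <- paste0(readLines(system.file("extdata", "github-robots.txt",
                                   package = "spiderbar")), collapse = "\n")
gh_rt <- robxp(gh)
gh_rt          # auto-printing dispatches print.robxp()
print(gh_rt)   # equivalent explicit call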
Parse a 'robots.txt' file & create a 'robxp' object
Description
This function takes in a single element character vector and parses it into a 'robxp' object.
Usage
robxp(x)
Arguments
x      either a length-1 character vector containing a complete 'robots.txt' file, or a character vector of length > 1 that will be concatenated into a single string, or a 'connection' object that will be passed to readLines(), the result of which will be concatenated into a single string and parsed; the connection will then be closed.
Value
a classed object holding an external pointer to parsed robots.txt data
Examples
imdb <- paste0(readLines(system.file("extdata", "imdb-robots.txt",
package="spiderbar")), collapse="\n")
rt <- robxp(imdb)
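As described under the x argument, a connection can be supplied instead of a character vector; a minimal sketch using the bundled sample file (robxp() reads the connection via readLines() and closes it itself):
library(spiderbar)
con <- file(system.file("extdata", "imdb-robots.txt", package = "spiderbar"))
rt <- robxp(con)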
Retrieve a character vector of sitemaps from a parsed robots.txt object
Description
Retrieve a character vector of sitemaps from a parsed robots.txt object
Usage
sitemaps(xp)
Arguments
xp     a 'robxp' object
Value
character vector of all sitemaps found in the parsed robots.txt file
Examples
imdb <- paste0(readLines(system.file("extdata", "imdb-robots.txt",
package="rep")), collapse="\n")
rt <- robxp(imdb)
sitemaps(rt)
Parse and Test Robots Exclusion Protocol Files and Rules
Description
The 'Robots Exclusion Protocol' (https://www.robotstxt.org/orig.html) documents a set
of standards for allowing or excluding robot/spider crawling of different areas of
site content. Tools are provided which wrap the 'rep-cpp' (https://github.com/seomoz/rep-cpp) C++ library for processing these 'robots.txt' files.
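A typical end-to-end use of the package, condensed from the examples above (a minimal sketch using the bundled GitHub sample file):
library(spiderbar)
# Parse the bundled sample robots.txt into a robxp object
gh <- paste0(readLines(system.file("extdata", "github-robots.txt",
                                   package = "spiderbar")), collapse = "\n")
gh_rt <- robxp(gh)
# Test fetch permission, inspect crawl delays, and list sitemaps
can_fetch(gh_rt, c("/humans.txt", "/login"), user_agent = "*")
crawl_delays(gh_rt)
sitemaps(gh_rt)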
Author(s)
Bob Rudis (bob@rud.is)