Type: | Package |
Title: | Regular Expression Removal, Extraction, and Replacement Tools |
Version: | 0.7.10 |
Maintainer: | Tyler Rinker <tyler.rinker@gmail.com> |
Depends: | R (≥ 3.5.0) |
Imports: | stringi (≥ 0.5-5) |
Suggests: | testthat |
LazyData: | TRUE |
Description: | A collection of regular expression tools associated with the 'qdap' package that may be useful outside of the context of discourse analysis. Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, citations, person tags, phone numbers, times, and zip codes. |
License: | GPL-2 |
URL: | https://github.com/trinker/qdapRegex |
BugReports: | https://github.com/trinker/qdapRegex/issues |
Collate: | 'S.R' 'bind.R' 'bind_or.R' 'c.extracted.R' 'case.R' 'cheat.R' 'utils.R' 'rm_default.R' 'escape.R' 'explain.R' 'grab.R' 'group.R' 'group_or.R' 'is.regex.R' 'pastex.R' 'print.extracted.R' 'print.regexr.R' 'qdapRegex-package.R' 'rm_.R' 'rm_abbreviation.R' 'rm_between.R' 'rm_bracket.R' 'rm_caps.R' 'rm_caps_phrase.R' 'rm_citation.R' 'rm_citation_tex.R' 'rm_city_state.R' 'rm_city_state_zip.R' 'rm_date.R' 'rm_dollar.R' 'rm_email.R' 'rm_emoticon.R' 'rm_endmark.R' 'rm_hash.R' 'rm_nchar_words.R' 'rm_non_ascii.R' 'rm_non_words.R' 'rm_number.R' 'rm_percent.R' 'rm_phone.R' 'rm_postal_code.R' 'rm_repeated_characters.R' 'rm_repeated_phrases.R' 'rm_repeated_words.R' 'rm_tag.R' 'rm_time.R' 'rm_title_name.R' 'rm_url.R' 'rm_white.R' 'rm_zip.R' 'validate.R' |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-03-22 16:03:30 UTC; tylerrinker |
Author: | Jason Gray [ctb], Tyler Rinker [aut, cre] |
Repository: | CRAN |
Date/Publication: | 2025-03-24 05:40:02 UTC |
qdapRegex: Regular Expression Removal, Extraction, & Replacement Tools for the qdap Package
Description
qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis. Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, citations, person tags, phone numbers, times, and zip codes.
Details
The qdapRegex package does not aim to compete with string manipulation
packages such as stringr or stringi but is meant to provide access to canned,
common regular expression patterns that can be used within qdapRegex, with R's
own regular expression functions, or with add-on string manipulation packages
such as stringr and stringi.
Author(s)
Maintainer: Tyler Rinker tyler.rinker@gmail.com
Other contributors:
Jason Gray [contributor]
See Also
Useful links:
https://github.com/trinker/qdapRegex
Report bugs at https://github.com/trinker/qdapRegex/issues
Use C-style String Formatting Commands
Description
Convenience wrapper for sprintf that allows recycling of ... of length one.
Usage
S(x, ...)
Arguments
x |
A single string containing |
... |
A vector of substitutions equal in length to the number of
|
Value
Returns a string with "%s" replaced.
See Also
Examples
S("@after_", "the", "the")
# Recycle
S("@after_", "the")
S("@rm_between", "LEFT", "RIGHT")
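The recycling behaviour described for S can be sketched in base R. The helper name s_sketch and the template string below are hypothetical, chosen only for illustration; this is a sketch of the idea, not the package's implementation of S.

```r
# Hypothetical stand-in for S(); illustrates recycling a single substitution.
s_sketch <- function(x, ...) {
  subs <- list(...)
  # Count the "%s" placeholders in the template
  hits <- gregexpr("%s", x, fixed = TRUE)[[1]]
  n <- if (hits[1] == -1L) 0L else length(hits)
  # Recycle a length-one substitution across every placeholder
  if (length(subs) == 1L && n > 1L) subs <- rep(subs, n)
  do.call(sprintf, c(list(x), subs))
}

template <- "(%s)\\s+(%s)"   # illustrative template, not from a dictionary
s_sketch(template, "the")    # same result as s_sketch(template, "the", "the")
```

Plain sprintf would stop with a "too few arguments" error for the one-substitution call, which is the gap the recycling step closes.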
Upper/Lower/Title Case
Description
TC - Capitalize titles according to traditional capitalization rules.
Usage
TC(text.var, lower = NULL, ...)
L(text.var, ...)
U(text.var, ...)
Arguments
text.var |
The text variable. |
lower |
A vector of words to retain lower case for (unless first or last word). |
... |
Other arguments passed to: |
Details
Case wrapper functions for stringi's stri_trans_tolower, stri_trans_toupper,
and stri_trans_totitle. Functions are useful within magrittr style chaining.
Value
Returns a character vector with new case (lower, upper, or title).
Note
TC utilizes additional rules for capitalization beyond stri_trans_totitle
that include:
Capitalize the first & last word
Lowercase articles, coordinating conjunctions, & prepositions
Lowercase "to" in an infinitive
See Also
stri_trans_tolower, stri_trans_toupper, stri_trans_totitle
Examples
y <- c(
"I'm liking it but not too much.",
"How much are you into it?",
"I'd say it's yet awesome yet."
)
L(y)
U(y)
TC(y)
Add Left/Right Character(s) Boundaries
Description
This convenience function wraps each element of a character vector with left
and right boundaries. The default is to use "\b" for both the left and right
boundaries.
Usage
bind(
...,
left = "\\b",
right = left,
dictionary = getOption("regex.library")
)
Arguments
left |
A single length character vector to use as the left bound. |
right |
A single length character vector to use as the right bound. |
dictionary |
A dictionary of canned regular expressions to search within. |
... |
Regular expressions to add grouping parenthesis to a named
expression from the default regular expression dictionary prefixed with
single at ( |
Value
Returns a character vector.
See Also
Examples
bind(LETTERS, "[", "]")
## More useful default parameters/usage
x <- c("Computer is fun. Not too fun.", "No it's not, it's dumb.",
"What should we do?", "You liar, it stinks!", "I am telling the truth!",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on? Good then.",
"I'm hungry. Let's eat. You already?")
Fry25 <- c("the", "of", "and", "a", "to", "in", "is", "you", "that", "it",
"he", "was", "for", "on", "are", "as", "with", "his", "they",
"I", "at", "be", "this", "have", "from")
gsub(pastex(list(bind(Fry25))), "[[ELIM]]", x)
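The effect of the default "\\b" boundaries can be seen with base R alone. The sketch below builds the bounded alternation by hand with paste0 and paste rather than calling bind or pastex, so it illustrates the idea and is not the package's code; the word list and sentence are made up for the example.

```r
words <- c("the", "of", "and")

# Wrap each word in "\\b" boundaries, then join the alternatives with "|"
bounded   <- paste(paste0("\\b", words, "\\b"), collapse = "|")
unbounded <- paste(words, collapse = "|")

txt <- "the other end of the land"
with_bounds    <- gsub(bounded, "[[ELIM]]", txt)    # only whole words are replaced
without_bounds <- gsub(unbounded, "[[ELIM]]", txt)  # also hits "the" in "other" and "and" in "land"

with_bounds
without_bounds
```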
Boundary Wrap (Bind) and 'or' Concatenate Elements
Description
A wrapper for bind and pastex that wraps each sub-expression element with
left/right boundaries (\b by default) and then concatenates/joins the bound
strings with a regex 'or' ("|"). Equivalent to pastex(bind(...), sep = "|").
Usage
bind_or(..., group.all = TRUE, left = "\\b", right = left)
Arguments
group.all |
logical. If |
left |
A single length character vector to use as the left bound. |
right |
A single length character vector to use as the right bound. |
... |
Regular expressions to paste together or a named expression
from the default regular expression dictionary prefixed with single at
( |
Examples
bind_or(LETTERS)
bind_or("them", "those", "that", "these")
bind_or("them", "those", "that", "these", group.all = FALSE)
Combines an extracted Object
Description
Combines an extracted object
Usage
## S3 method for class 'extracted'
c(x, ...)
Arguments
x |
The extracted object |
... |
ignored |
A Cheat Sheet of Common Regex Task Chunks
Description
Print a cheat sheet of common regex task chunks. cheat prints a left
justified version of regex_cheat.
Usage
cheat(dictionary = qdapRegex::regex_cheat, print = TRUE)
Arguments
dictionary |
A dictionary of cheat terms. Default is
|
print |
logical. If |
Value
Prints a cheat sheet of common regex tasks such as lookaheads.
Invisibly returns regex_cheat.
See Also
Examples
cheat()
Escape Strings From Parsing
Description
Escape literal strings that begin with an at sign (@) from qdapRegex parsing.
Usage
escape(pattern)
Arguments
pattern |
A character string that should not be parsed. |
Details
Many qdapRegex functions parse pattern strings beginning with an at character
(@) and compare them against the default and supplemental (regex_supplement)
dictionaries. This means that a string such as "@before_" will be returned as
"\\w+?(?= ((%s|%s)\\b))". If the user wants to use a regular expression that
is literally "@before_", the escape function classes the character string and
tells the qdapRegex functions not to parse it (i.e., keep it as a literal
string).
Value
Returns a character vector of the class "escape" and "character".
Examples
escape("@rm_caps")
x <- "...character vector. Default, \\code{@rm_caps} uses..."
rm_default(x, pattern = "@rm_caps")
rm_default(x, pattern = escape("@rm_caps"))
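The class-based mechanism described in Details can be sketched as follows. Everything here is hypothetical: toy_dictionary, escape_sketch, and resolve_sketch are illustrative names, and the regex stored under rm_caps is a stand-in rather than the package's pattern.

```r
# Toy dictionary; the regex is illustrative, not the package's @rm_caps.
toy_dictionary <- list(rm_caps = "\\b[A-Z]{2,}\\b")

# Tag a string so the resolver leaves it alone
escape_sketch <- function(pattern) {
  structure(pattern, class = c("escape", "character"))
}

# Look up "@name" in the dictionary unless the string carries the "escape" class
resolve_sketch <- function(pattern, dictionary = toy_dictionary) {
  if (inherits(pattern, "escape")) return(unclass(pattern))
  if (startsWith(pattern, "@")) {
    key <- substring(pattern, 2)
    if (!key %in% names(dictionary)) stop("no regex named '", key, "'")
    return(dictionary[[key]])
  }
  pattern
}

resolve_sketch("@rm_caps")                  # dictionary lookup
resolve_sketch(escape_sketch("@rm_caps"))   # kept as the literal "@rm_caps"
```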
Visualize Regular Expressions
Description
Visualize regular expressions using https://regexper.com/
Usage
explain(
pattern,
open = FALSE,
print = TRUE,
dictionary = getOption("regex.library")
)
Arguments
pattern |
A character string containing a regular expression or a
character string starting with |
open |
logical. If |
print |
logical. Should |
dictionary |
A dictionary of canned regular expressions to search within. |
Details
Note that https://regexper.com/ is a JavaScript based regular expression viewer. Lookbehind and negative lookbehinds are not respected.
Value
Prints https://regexper.com/ to the console, attempts to open the url to the visual representation provided by https://regexper.com/, and invisibly returns a list with the URLs.
Author(s)
Ananda Mahto, Matthew Flickinger, and Tyler Rinker <tyler.rinker@gmail.com>.
References
https://stackoverflow.com/a/27489977/1000343
https://regexper.com/
https://stackoverflow.com/a/27574103/1000343
See Also
Examples
explain("\\s*foo[A-Z]\\d{2,3}")
explain("@rm_time")
## Not run:
explain("\\s*foo[A-Z]\\d{2,3}", open = TRUE)
explain("@rm_time", open = TRUE)
## End(Not run)
Grab Regular Expressions from Dictionaries
Description
Convenience function to grab regular expressions from the qdapRegex dictionaries.
Usage
grab(pattern, dictionary = getOption("regex.library"))
Arguments
pattern |
A character string starting with |
dictionary |
A dictionary of canned regular expressions to search within. |
Details
Many R regular expressions contain doubled backslashes that are not
used in other regex interpreters. Using cat can remove the backslash escapes
(see Examples); use URLencode if the regex is going into a url.
Value
Returns a single string regular expression from one of the qdapRegex dictionaries.
Examples
grab("@rm_white")
## Not run:
## Throws an error
grab("@foo")
## End(Not run)
cat(grab("@pages2"))
## Not run:
cat(grab("@pages2"), file="clipboard")
## End(Not run)
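The point in Details about doubled backslashes can be checked with base R. The pattern below is an illustrative regex written for this sketch, not one taken from a qdapRegex dictionary.

```r
pattern <- "\\d{2,3}\\s+"   # illustrative regex, not taken from a dictionary

print(pattern)   # shows the R string literal with doubled backslashes
cat(pattern)     # shows the regex itself, with single backslashes

# Percent-encode the regex for use in a URL and check that it round-trips
encoded <- utils::URLencode(pattern, reserved = TRUE)
identical(utils::URLdecode(encoded), pattern)
```

The doubled backslashes exist only in the R source literal; the string itself holds one backslash per escape, which is what cat writes out.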
Group Regular Expressions
Description
group - A wrapper for paste(collapse="|") that also searches the default and
supplemental (regex_supplement) dictionaries for regular expressions before
pasting them together with a pipe (|) separator.
Usage
group(..., left = "(", right = ")", dictionary = getOption("regex.library"))
Arguments
left |
A single length character vector to use as the left bound. |
right |
A single length character vector to use as the right bound. |
dictionary |
A dictionary of canned regular expressions to search within. |
... |
Regular expressions to add grouping parenthesis to a named
expression from the default regular expression dictionary prefixed with
single at ( |
Value
Returns a single string of regular expressions with grouping parentheses added.
Examples
group(LETTERS)
group(1)
(grouped <- group("(the|them)\\b", "@rm_zip"))
pastex(grouped)
Group Wrap and 'or' Concatenate Elements
Description
A wrapper for group and pastex that wraps each sub-expression element with
grouping parentheses and then concatenates/joins the grouped strings with a
regex 'or' ("|"). Equivalent to pastex(group(...), sep = "|").
Usage
group_or(..., group.all = TRUE)
Arguments
group.all |
logical. If |
... |
Regular expressions to paste together or a named expression
from the default regular expression dictionary prefixed with single at
( |
Examples
group_or("@rm_hash", "@rm_tag")
group_or("them", "those", "that", "these")
group_or("them", "those", "that", "these", group.all = FALSE)
Test Regular Expression Validity
Description
Acts as a logical test of a regular expression's validity. is.regex uses gsub
and tests for errors to determine a regular expression's validity. The regular
expression must conform to R's regular expression rules (see ?regex for
details about how R handles regular expressions).
Usage
is.regex(pattern)
Arguments
pattern |
A regular expression to be tested. |
Value
Returns a logical (TRUE is a valid regular expression).
See Also
Examples
is.regex("I|***")
is.regex("I|i")
sapply(regex_usa, is.regex)
sapply(regex_supplement, is.regex) ## `version` is not a valid regex
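The "run gsub and test for errors" approach from the Description can be sketched with tryCatch. The function name is_regex_sketch is hypothetical, and the choice of perl = TRUE is an assumption made for this sketch; the package's own is.regex may use different settings.

```r
# Hypothetical validity check: a pattern is "valid" if gsub can compile it.
is_regex_sketch <- function(pattern) {
  tryCatch({
    # The subject string is irrelevant; only pattern compilation matters
    suppressWarnings(gsub(pattern, "", "test string", perl = TRUE))
    TRUE
  }, error = function(e) FALSE)
}

is_regex_sketch("I|i")     # a well-formed alternation
is_regex_sketch("I|***")   # quantifiers with nothing to repeat
is_regex_sketch("(abc")    # unbalanced parenthesis
```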
Paste Regular Expressions
Description
pastex - A wrapper for paste(collapse="|") that also searches the default and
supplemental (regex_supplement) dictionaries for regular expressions before
pasting them together with a pipe (|) separator.
%|% - A binary operator version of pastex that joins two character strings
with a regex or ("|"). Equivalent to pastex(x, y, sep="|").
%+% - A binary operator version of pastex that joins two character strings
with no space. Equivalent to pastex(x, y, sep="").
Usage
pastex(..., sep = "|", dictionary = getOption("regex.library"))
x %|% y
x %+% y
Arguments
sep |
The separator to use between the expressions when they are collapsed. |
dictionary |
A dictionary of canned regular expressions to search within. |
x , y |
Two regular expressions to paste together. |
... |
Regular expressions to paste together or a named expression
from the default regular expression dictionary prefixed with single at
( |
Value
Returns a single string of regular expressions pasted together with
pipe(s) (|).
Note
Note that while pastex is designed for pasting purposes it can also be used
to call a single regex from the default regional dictionary or the
supplemental dictionary (regex_supplement) (see Examples).
See Also
Examples
x <- c("There is $5.50 for me.", "that's 45.6% of the pizza",
"14% is $26 or $25.99", "It's 12:30 pm to 4:00 am")
pastex("@rm_percent", "@rm_dollar")
pastex("@rm_percent", "@time_12_hours")
rm_dollar(x, extract=TRUE, pattern=pastex("@rm_percent", "@rm_dollar"))
rm_dollar(x, extract=TRUE, pattern=pastex("@rm_dollar", "@rm_percent", "@time_12_hours"))
## retrieve regexes from dictionary
pastex("@rm_email")
pastex("@rm_url3")
pastex("@version")
## pipe operator (%|%)
"x" %|% "y"
"@rm_url" %|% "@rm_twitter_url"
## paste operator (%+%)
"x" %+% "y"
"@rm_time" %+% "\\s[AP]M"
## Remove Twitter Short URL
x <- c("download file from http://example.com",
"this is the link to my website http://example.com",
"go to http://example.com from more info.",
"Another url ftp://www.example.com",
"And https://www.example.net",
"twitter type: t.co/N1kq0F26tG",
"still another one https://t.co/N1kq0F26tG :-)")
rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)
## Combine removing Twitter URLs and standard URLs
rm_twitter_n_url <- rm_(pattern="@rm_twitter_url" %|% "@rm_url")
rm_twitter_n_url(x)
rm_twitter_n_url(x, extract=TRUE)
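Joining patterns with a regex "or" and then extracting can be reproduced in base R. In the sketch below the percent and dollar patterns are simplified, hand-written stand-ins (assumptions for illustration), not the dictionary entries behind "@rm_percent" and "@rm_dollar".

```r
# Simplified stand-in patterns; the package's dictionary regexes are more thorough.
percent_sketch <- "\\d+(\\.\\d+)?%"
dollar_sketch  <- "\\$\\d+(\\.\\d+)?"

# Join the two alternatives with "|", as pastex does with sep = "|"
combined <- paste(percent_sketch, dollar_sketch, sep = "|")

x <- c("There is $5.50 for me.", "that's 45.6% of the pizza",
       "14% is $26 or $25.99")

# Extract every match of either alternative from each string
found <- regmatches(x, gregexpr(combined, x, perl = TRUE))
found
```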
Prints an explain object
Description
Prints an explain object
Usage
## S3 method for class 'explain'
print(x, ...)
Arguments
x |
The explain object |
... |
ignored |
Prints an extracted Object
Description
Prints an extracted object
Usage
## S3 method for class 'extracted'
print(x, ...)
Arguments
x |
The |
... |
Ignored. |
Prints a regexr Object
Description
Prints a regexr object
Usage
## S3 method for class 'regexr'
print(x, ...)
Arguments
x |
The |
... |
Ignored. |
A dataset containing the regex chunk name, the regex string, and a description of what the chunk does.
Description
A dataset containing the regex chunk name, the regex string, and a description of what the chunk does.
Usage
data(regex_cheat)
Format
A data frame with 6 rows and 3 variables
Details
Name. The name of the regex chunk.
Regex. The regex chunk.
What it Does. Description of what the regex chunk does.
Supplemental Canned Regular Expressions
Description
A dataset containing a list of supplemental, canned regular expressions. The
regular expressions in this data set are considered useful but have not been
included in a formal function (of the type rm_XXX). Users can utilize the rm_
function to generate functions that can sub/replace/extract as desired.
Usage
data(regex_supplement)
Format
A list with 24 elements
Details
The following canned regular expressions are included:
- after_a
single word after the word "a"
- after_the
single word after the word "the"
- after_
find single word after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies (1) n before, (2) the point, & (3) n after)
- around_
find n words (not including punctuation) before or after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies (1) n before, (2) the point, & (3) n after)
- around2_
find n words (plus punctuation) before or after ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own
- before_
find single word before ? word (? = user defined); note contains "%s" that is replaced by sprintf and is not a valid regex on its own
- except_first
find all occurrences of a substring except the first; regex pattern retrieved from StackOverflow's akrun: https://stackoverflow.com/a/31458261/1000343
- hexadecimal
substring beginning with hash (#) followed by either 3 or 6 select characters (a-f, A-F, and 0-9)
- ip_address
substring of four chunks of 1-3 consecutive digits separated with dots (.)
- last_occurrence
last occurrence of a delimiter; note contains "%s" that is replaced by sprintf and is not a valid regex on its own (user supplies the delimiter)
- pages
substring with "pp." or "p.", optionally followed by a space, followed by 1 or more digits, optionally followed by a dash, optionally followed by 1 or more digits, optionally followed by a semicolon, optionally followed by a space, optionally followed by 1 or more digits; intended for extraction/removal purposes
- pages2
substring 1 or more digits, optionally followed by a dash, optionally followed by 1 or more digits, optionally followed by a semicolon, optionally followed by a space, optionally followed by 1 or more digits; intended for validation purposes
- punctuation
punctuation characters ([:punct:]) with the ability to negate; note contains "%s" that is replaced by sprintf and is not a valid regex on its own
- run_split
a regex that is useful for splitting strings in the characters runs (e.g., "wwxyyyzz" becomes "ww", "x", "yyy", "zz"); regex pattern retrieved from Robert Redd: https://stackoverflow.com/a/29383435/1000343
- split_keep_delim
regex string that splits on a delimiter and retains the delimiter
- thousands_separator
chunks digits > 4 into groups of 3 from right to left allowing for easy insertion of thousands separator; regex pattern retrieved from StackOverflow's stema: https://stackoverflow.com/a/10612685/1000343
- time_12_hours
substring of valid hours (1-12) followed by a colon (:) followed by valid minutes (0-60), followed by an optional space and the character chunk am or pm
- version
substring starting with "v" or "version" optionally followed by a space and then period separated digits for <major>.<minor>.<release>.<build>; the build sequence is optional and the "version"/"v" IS NOT contained in the substring
- version2
substring starting with "v" or "version" optionally followed by a space and then period separated digits for <major>.<minor>.<release>.<build>; the build sequence is optional and the "version"/"v" IS contained in the substring
- white_after_comma
substring of white space after a comma
- word_boundary
A true word boundary that only includes alphabetic characters; based on https://www.rexegg.com/'s suggestion taken from discussion of true word boundaries; note contains "%s" that is replaced by sprintf and is not a valid regex on its own
- word_boundary_left
A true left word boundary that only includes alphabetic characters; based on https://www.rexegg.com/'s suggestion taken from discussion of true word boundaries
- word_boundary_right
A true right word boundary that only includes alphabetic characters; based on https://www.rexegg.com/'s suggestion taken from discussion of true word boundaries
- youtube_id
substring of the video id from a YouTube video; taken from Jacob Overgaard's submission found https://regex101.com/r/kU7bP8/1
Regexes from this data set can be added to the pattern argument of any rm_XXX
function via an at sign (@) followed by a regex name from this data set (e.g.,
pattern = "@after_the") provided the regular expression does not contain
non-regex such as the sprintf character string %s.
Use qdapRegex:::examine_regex(regex_supplement) to interactively explore the
regular expressions in regex_supplement. This will provide a browser + console
based break down of each regex in the dictionary.
Warning
Note that regexes containing %s are templates whose %s placeholders are filled
in by sprintf; they are not a valid regex on their own. The S function is
useful for adding these missing %s parameters.
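To illustrate the warning, the sketch below fills a "%s" style template with sprintf before using it. The template is a hypothetical stand-in for a "last occurrence of a delimiter" regex, written for this example; it is not the dictionary's last_occurrence entry.

```r
# Hypothetical template: match a delimiter that is not followed by another
# copy of itself. "%1$s" reuses the first substitution in both positions.
last_occurrence_sketch <- "%1$s(?!.*%1$s)"

# Fill in the delimiter to obtain a usable regex
pattern <- sprintf(last_occurrence_sketch, "@")
pattern

# Split on the final "@" only
parts <- strsplit("test@xyz@rr@lk.edu", pattern, perl = TRUE)
parts
```

Until the placeholder is filled, the template does not describe the intended match, which is why such entries cannot be passed directly as pattern = "@name".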
Examples
time <- rm_(pattern="@time_12_hours")
time("I will go at 12:35 pm")
x <- "v6.0.156 for Windows 2000/2003/XP/Vista
Server version 1.1.20
Client Manager version 1.1.24"
rm_default(x, pattern = "@version", extract=TRUE)
rm_default(x, pattern = "@version2", extract=TRUE)
x <- "this is 1000000 big 4356. And little 123 number."
rm_default(x, pattern="@thousands_separator", replacement="\\1,")
rm_default(x, pattern="@thousands_separator", replacement="\\1.")
rm_default("I was,but it costs 10,000.", pattern="@white_after_comma",
replacement=", ")
x <- "I like; the donuts; a lot"
strsplit(x, ";")
strsplit(x, S(grab("split_keep_delim"), ";"), perl=TRUE)
stringi::stri_split_regex(x, S(grab("split_keep_delim"), ";"))
stringi::stri_split_regex("I like; the donuts; a lot:cool",
S(grab("split_keep_delim"), ";|:"))
## Grab words around a point
x <- c(
"the magic word is e",
"the dog is red and they are blue",
"I am new but she is not new",
"hello world",
"why is it so cold? Perhaps it is Winter.",
"It is not true the 7 is 8.",
"Is that my drink?"
)
rm_default(x, pattern = S("@around_", 1, "is", 1), extract=TRUE)
rm_default(x, pattern = S("@around_", 2, "is", 2), extract=TRUE)
rm_default(x, pattern = S("@around_", 1, "is|are|am", 1), extract=TRUE)
rm_default(x, pattern = S("@around_", 1, "is not|is|are|am", 1), extract=TRUE)
rm_default(x, pattern = S("@around_", 1,
"is not|[Ii]s|[Aa]re|[Aa]m", 1), extract=TRUE)
x <- c(
"hello world",
"45",
"45 & 5 makes 50",
"x and y",
"abc and def",
"her him foo & bar for Jack and Jill then"
)
around_and <- rm_(pattern = S("@around_", 1, "and|\\&", 1), extract=TRUE)
around_and(x)
## Split runs into chunks
x <- "1111100000222000333300011110000111000"
strsplit(x, grab("@run_split"), perl = TRUE)
## Not run:
library(qdap);library(ggplot2);library(reshape2)
out <- setNames(lapply(c("@after_a", "@after_the"), function(x) {
o <- rm_default(stringi::stri_trans_tolower(pres_debates2012$dialogue),
pattern = x, extract=TRUE)
m <- qdapTools::matrix2df(data.frame(freq=sort(table(unlist(o)), TRUE)), "word")
m[m$freq> 7, ]
}), c("a", "the"))
dat <- setNames(Reduce(function(x, y) {
merge(x, y, by = "word", all = TRUE)}, out), c("Word", "A", "THE"))
dat <- reshape2::melt(dat, id="Word", variable.name="Article", value.name="freq")
dat <- dat[order(dat$freq, dat$Word), ]
ord <- aggregate(freq ~ Word, dat, sum)
dat$Word <- factor(dat$Word, levels=ord[order(ord[[2]]), 1])
ggplot(dat, aes(x=freq, y=Word)) + geom_point()+ facet_grid(~Article)
## End(Not run)
## remove/extract page numbers
x <- c("I read p. 36 and then pp. 45-49", "it's on pp. 23-24;28")
rm_pages <- rm_(pattern="@pages", extract=TRUE)
rm_pages(x)
rm_default(x, pattern = "@pages")
rm_default(x, pattern = "@pages", extract=TRUE)
rm_default(x, pattern = "@pages2", extract=TRUE)
## Validate pages
page_val <- validate("@pages2", FALSE)
page_val(c(66, "78-82", "hello world", TRUE, "44-45; 56"))
## Split on last occurrence
x <- c(
"test@aol@fg.mm.com",
"test@hotmail.com",
"test@xyz@rr@lk.edu",
"test@abc.xx@zz.vv.net"
)
strsplit(x, S("@last_occurrence", "\\."), perl=TRUE)
strsplit(x, S("@last_occurrence", "@"), perl=TRUE)
## True Word Boundaries
x <- "this is _not a word666 and this is not a word too."
## Standard regex word boundary
rm_default(x, pattern=bind("not a word"))
## Alphabetic only word boundaries
rm_default(x, pattern=S("@word_boundary", "not a word"))
## Remove all but first occurrence of something
x <- c(
"12-3=4-5=678-9",
"ABC-D=EF2-GHI-JK3=L-MN=",
"9-87=65",
"a - de=4fgh --= i5jkl",
NA
)
rm_default(x, pattern = S("@except_first", "-"))
rm_default(x, pattern = S("@except_first", "="))
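The thousands separator example above relies on a lookahead that finds digits followed by complete groups of three. A base R sketch of that idea follows; the pattern is a simplified assumption written for illustration, and unlike the description of the dictionary entry it also separates four-digit numbers.

```r
x <- "this is 1000000 big 4356. And little 123 number."

# Simplified pattern (an assumption): a digit followed by one or more complete
# groups of three digits that end at a word boundary
sep_sketch <- "(\\d)(?=(\\d{3})+\\b)"

# Re-insert the captured digit followed by a comma
with_commas <- gsub(sep_sketch, "\\1,", x, perl = TRUE)
with_commas
```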
Canned Regular Expressions (United States of America)
Description
A dataset containing a list of U.S.-specific, canned regular expressions for use in various functions within the qdapRegex package.
Usage
data(regex_usa)
Format
A list with 54 elements
Details
The following canned regular expressions are included:
- rm_abbreviation
abbreviations containing single lower case or capital letter followed by a period and then an optional space (this must be repeated 2 or more times)
- rm_between
Remove characters between a left and right boundary including the boundaries; note contains "%s" that is replaced by sprintf and is not a valid regex on its own
- rm_between2
Remove characters between a left and right boundary NOT including the boundaries; note contains "%s" that is replaced by sprintf and is not a valid regex on its own
- rm_caps
words containing 2 or more consecutive upper case letters and no lower case
- rm_caps_phrase
phrases of 1 word or more containing 1 or more consecutive upper case letters and no lower case; if phrase is one word long then phrase must be 2 or more consecutive capital letters
- rm_citation
substring that looks for in-text and parenthetical APA6 style citations (attempts to exclude references)
- rm_citation2
substring that looks for in-text APA6 style citations (attempts to exclude references)
- rm_citation3
substring that looks for parenthetical APA6 style citations (attempts to exclude references)
- rm_city_state
substring with city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters)
- rm_city_state_zip
substring with city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters) & zip code (exactly 5 or 5+4 consecutive digits)
- rm_date
dates in the form of 2 digit month, 2 digit day, and 2 or 4 digit year. Separator between month, day, and year may be dot (.), slash (/), or dash (-)
- rm_date2
dates in the form of 3-9 letters followed by one or more spaces, 2 digits, a comma(,), one or more spaces, and 4 digits
- rm_date3
dates in the form of XXXX-XX-XX; hyphen separated string of 4 digit year, 2 digit month, and 2 digit day
- rm_date4
dates in the form of any of rm_date, rm_date2, and rm_date3
- rm_dollar
substring with dollar sign ($) followed by (1) just dollars (no decimal), (2) dollars and cents (whole number and decimal), or (3) just cents (decimal value); dollars may contain commas
- rm_email
substring with (1) alphanumeric characters or dash (-), plus (+), or underscore (_) (This may be repeated) (2) followed by at (@), followed by the same regex sequence as before the at (@), and ending with dot (.) and 2-14 digits
- rm_emoticon
common emoticons (logic is complicated to explain in words) using ">?[:;=8XB]{1}[-~+o^]?[|\")(>DO>{pP3/]+|</?3|XD+|D:<|x[-~+o^]?[|\")(>DO>{pP3/]+" regex pattern; general pattern is optional hat character, followed by eyes character, followed by optional nose character, and ending with a mouth character
- rm_endmark
substring of the last endmark group in a string; endmarks include (! ? . * OR |)
- rm_endmark2
substring of the last endmark group in a string; endmarks include (! ? OR .)
- rm_endmark3
substring of the last endmark group in a string; endmarks include (! ? . * | ; OR :)
- rm_hash
substring that begins with a hash (#) followed by a word
- rm_nchar_words
substring of letters (that may contain apostrophes) n letters long (apostrophe not counted in length); note contains "%s" that is replaced by sprintf and is not a valid regex on its own
- rm_nchar_words2
substring of letters (that may contain apostrophes) n letters long (apostrophe counted in length); note contains "%s" that is replaced by sprintf and is not a valid regex on its own
- rm_non_ascii
substring of 2 digits or letters a-f inside of a left and right angle brace in the form of "<a4>"
- rm_non_words
substring of any character that isn't a letter, apostrophe, or single space
- rm_number
substring that may begin with dash (-) for negatives, and is (1) just whole number (no decimal), (2) whole number and decimal, or (3) just decimal value; regex pattern provided by Jason Gray
- rm_percent
substring beginning with (1) just whole number (no decimal), (2) whole number and decimal, or (3) just decimal value and followed by a percent sign (%)
- rm_phone
phone numbers in the form of optional country code, valid 3 digit prefix, and 7 digits (may contain hyphens and parenthesis); logic is complex to explain (see https://stackoverflow.com/a/21008254/1000343 for more)
- rm_postal_code
U.S. state abbreviations (and District of Columbia) that is constrained to just possible U.S. state names, not just two consecutive capital letters; taken from Mike Hamilton's submission found https://regexlib.com/REDetails.aspx?regexp_id=2177
- rm_repeated_characters
substring with a repetition of repeated characters within a word; regex pattern retrieved from StackOverflow's, vks: https://stackoverflow.com/a/29438461/1000343
- rm_repeated_phrases
substring with a phrase (a sequence of 1 or more words) that is repeated 2 or more times (case is ignored; separating periods and commas are ignored); regex pattern retrieved from StackOverflow's, BrodieG: https://stackoverflow.com/a/28786617/1000343
- rm_repeated_words
substring with a word (marked with a boundary) that is repeated 2 or more times (case is ignored)
- rm_tag
substring that begins with an at (@) followed by a word
- rm_tag2
Twitter substring that begins with an at (@) followed by a word composed of alpha-numeric characters and underscores, no longer than 15 characters
- rm_title_name
substring beginning with title (Mrs., Mr., Ms., Dr.) that is case independent or full title (Miss, Mizz, mizz) followed by a single lower case word or multiple capitalized words
- rm_time
substring that (1) must begin with 0-2 digits, (2) must be followed by a single colon (:), (3) optionally may be followed by either a colon (:) or a dot (.), (4) optionally may be followed by 1-infinite digits (if previous condition is true)
- rm_time2
substring that is identical to rm_time with the additional search for Ante Meridiem/Post Meridiem abbreviations (e.g., AM, p.m., etc.)
- rm_transcript_time
substring that is specific to transcription time stamps in the form of HH:MM:SS.OS where OS is milliseconds. HH: and .OS are optional. The SS.OS period divide may also be a comma or additional colon. The HH:SS divide may also be a period. String may be affixed with pound sign (#).
- rm_twitter_url
Twitter short link/url; substring optionally beginning with http, followed by t.co ending on a space or end of string (whichever comes first)
- rm_url
substring beginning with http, www., or ftp and ending on a space or end of string (whichever comes first); note that this regex is simple and may not cover all valid URLs or may include invalid URLs
- rm_url2
substring beginning with http, www., or ftp and more constrained than rm_url; based on @imme_emosol's response from https://mathiasbynens.be/demo/url-regex
- rm_url3
substring beginning with http or ftp and more constrained than rm_url & rm_url2 though light-weight, making it ideal for validation purposes; taken from @imme_emosol's response found https://mathiasbynens.be/demo/url-regex
- rm_white
substring of white space(s); this regular expression combines rm_white_bracket, rm_white_colon, rm_white_comma, rm_white_endmark, rm_white_lead, rm_white_trail, and rm_white_multiple
- rm_white_bracket
substring of white space(s) following left brackets ("{", "(", "[") or preceding right brackets ("}", ")", "]")
- rm_white_colon
substring of white space(s) preceding colon(s)/semicolon(s)
- rm_white_comma
substring of white space(s) preceding a comma
- rm_white_endmark
substring of white space(s) preceding a single occurrence/combination of period(s), question mark(s), and exclamation point(s)
- rm_white_lead
substring of leading white space(s)
- rm_white_lead_trail
substring of leading/trailing white space(s)
- rm_white_multiple
substring of multiple, consecutive white spaces
- rm_white_punctuation
substring of white space(s) preceding a comma or a single occurrence/combination of colon(s), semicolon(s), period(s), question mark(s), and exclamation point(s)
- rm_white_trail
substring of trailing white space(s)
- rm_zip
substring of 5 digits optionally followed by a dash and 4 more digits
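Several of the entries above can be approximated with short base R patterns. The block below is a minimal sketch, not the dictionary's actual regular expressions: `zip_sketch` and the white space patterns are assumptions written from the prose descriptions of rm_zip, rm_white_multiple, rm_white_lead_trail, and rm_white_punctuation.

```r
# Hypothetical pattern written from the rm_zip description:
# 5 digits optionally followed by a dash and 4 more digits.
zip_sketch <- "\\b\\d{5}(?:-\\d{4})?\\b"
addr <- "Buffalo, NY 14217-1234 or 90210"
zips <- regmatches(addr, gregexpr(zip_sketch, addr, perl = TRUE))[[1]]

# White space clean up in the spirit of rm_white_multiple,
# rm_white_lead_trail, and rm_white_punctuation (assumed patterns).
x <- "  I  like it , a lot !  "
out <- trimws(gsub("\\s+", " ", x))          # collapse runs, then trim ends
out <- gsub("\\s+([,;:.?!])", "\\1", out)    # drop space before punctuation

zips
out
```

The real dictionary entries may be stricter or looser than these; inspect regex_usa for the exact expressions.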
Extra
Use qdapRegex:::examine_regex() to interactively explore the regular expressions in regex_usa. This will provide a browser + console based breakdown of each regex in the dictionary.
Remove/Replace/Extract Function Generator
Description
Remove/replace/extract substrings from a string. A function generator used to make regex functions that operate like other qdapRegex rm_XXX functions. Use rm_ for removal and ex_ for extraction.
Usage
rm_(...)
ex_(...)
Arguments
... |
Arguments passed to
|
Value
Returns a function that operates like other qdapRegex rm_XXX functions but with user-defined defaults.
See Also
Examples
rm_digit <- rm_(pattern="[0-9]")
rm_digit(" I 12 li34ke ice56cream78. ")
rm_lead <- rm_(pattern="^\\s+", trim = FALSE, clean = FALSE)
rm_lead(" I 12 li34ke ice56cream78. ")
rm_all_except_letters <- rm_(pattern="[^ a-zA-Z]")
rm_all_except_letters(" I 12 li34ke ice56cream78. ")
extract_consec_num <- rm_(pattern="[0-9]+", extract = TRUE)
extract_consec_num(" I 12 li34ke ice56cream78. ")
## Using the supplemental dictionary dataset:
x <- "A man lives there! The dog likes it. I want the map. I want an apple."
extract_word_after_the <- rm_(extract=TRUE, pattern="@after_the")
extract_word_after_a <- rm_(extract=TRUE, pattern="@after_a")
extract_word_after_the(x)
extract_word_after_a(x)
f <- rm_(pattern="@time_12_hours")
f("I will go at 12:35 pm")
x <- c(
"test@aol.fg.com",
"test@hotmail.com",
"test@xyzrr.lk.edu",
"test@abc.xx.zz.vv.net"
)
file_ext2 <- rm_(pattern="(?<=\\.)[a-z]*$", extract=TRUE)
tools::file_ext(x)
file_ext2(x)
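The generator idea (a function that returns a function with preset defaults) can be sketched in base R with a closure. This is only an illustration of the technique: `remove_or_extract()` and `make_rm()` are hypothetical names, and the package's own implementation may differ.

```r
# Hypothetical template: remove (or extract) matches of `pattern` in `text`.
remove_or_extract <- function(text, pattern, replacement = "", extract = FALSE) {
  if (extract) {
    return(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
  }
  gsub(pattern, replacement, text, perl = TRUE)
}

# Hypothetical generator: the defaults are captured in a closure and can
# still be overridden when the generated function is called.
make_rm <- function(...) {
  defaults <- list(...)
  function(text, ...) {
    args <- utils::modifyList(defaults, list(...))
    do.call(remove_or_extract, c(list(text = text), args))
  }
}

rm_digit_sketch <- make_rm(pattern = "[0-9]")
ex_num_sketch <- make_rm(pattern = "[0-9]+", extract = TRUE)

rm_digit_sketch("ice56cream78")
rm_digit_sketch("ice56cream78", replacement = "#")
ex_num_sketch("ice56cream78")[[1]]
```

Capturing the defaults once keeps each generated function's call signature small while leaving every argument adjustable per call.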
Remove/Replace/Extract Abbreviations
Description
Remove/replace/extract abbreviations from a string containing lower case or capital letters followed by a period and then an optional space (this must be repeated 2 or more times).
Usage
rm_abbreviation(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_abbreviation",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_abbreviation(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_abbreviation",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with abbreviations removed.
See Also
Other rm_ functions:
rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("I want $2.33 at 2:30 p.m. to go to A.n.p.",
"She will send it A.S.A.P. (e.g. as soon as you can) said I.",
"Hello world.", "In the U. S. A.")
rm_abbreviation(x)
ex_abbreviation(x)
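The description (a letter followed by a period and an optional space, repeated 2 or more times) can be written as a short base R pattern. This is a sketch under that reading only; `abbrev_sketch` is a hypothetical pattern and is not taken from the package dictionary.

```r
# Hypothetical pattern: (letter + period + optional space), 2 or more times.
abbrev_sketch <- "\\b(?:[A-Za-z]\\.\\s?){2,}"

x <- "Send it A.S.A.P. (e.g. soon) to the U. S. A. office"
hits <- regmatches(x, gregexpr(abbrev_sketch, x, perl = TRUE))[[1]]

# The optional trailing space is part of the match, so trim it off.
abbrevs <- trimws(hits)
abbrevs
```

A single letter followed by a period (for example the end of a sentence such as "said I.") is not matched because the group must repeat at least twice.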
Remove/Replace/Extract Strings Between 2 Markers
Description
Remove/replace/extract strings bounded between a left and right marker.
Usage
rm_between(
text.var,
left,
right,
fixed = TRUE,
trim = TRUE,
clean = TRUE,
replacement = "",
extract = FALSE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
rm_between_multiple(
text.var,
left,
right,
fixed = TRUE,
trim = TRUE,
clean = TRUE,
replacement = "",
extract = FALSE,
include.markers = FALSE,
merge = TRUE
)
ex_between(
text.var,
left,
right,
fixed = TRUE,
trim = TRUE,
clean = TRUE,
replacement = "",
extract = TRUE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
ex_between_multiple(
text.var,
left,
right,
fixed = TRUE,
trim = TRUE,
clean = TRUE,
replacement = "",
extract = TRUE,
include.markers = FALSE,
merge = TRUE
)
Arguments
text.var |
The text variable. |
left |
A vector of character or numeric symbols as the left edge to extract. |
right |
A vector of character or numeric symbols as the right edge to extract. |
fixed |
logical. If |
trim |
logical. If |
clean |
trim logical. If |
replacement |
Replacement for matched |
extract |
logical. If |
include.markers |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
merge |
logical. If |
Value
Returns a character string with markers removed. rm_between returns merged strings and is significantly faster. With rm_between_multiple the strings are optionally merged by left/right symbols. The latter approach is more flexible and names extracted strings by symbol boundaries; however, it is slower than rm_between.
See Also
gsub, rm_bracket, stri_extract_all_regex
Other rm_ functions:
rm_abbreviation(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- "I like [bots] (not)."
rm_between(x, "(", ")")
ex_between(x, "(", ")")
rm_between(x, c("(", "["), c(")", "]"))
ex_between(x, c("(", "["), c(")", "]"))
rm_between(x, c("(", "["), c(")", "]"), include.markers=FALSE)
ex_between(x, c("(", "["), c(")", "]"), include.markers=TRUE)
## multiple (naming and ability to keep separate bracket types but slower)
x <- c("Where is the /big dog#?",
"I think he's @arunning@b with /little cat#.")
rm_between_multiple(x, "@a", "@b")
ex_between_multiple(x, "@a", "@b")
rm_between_multiple(x, c("/", "@a"), c("#", "@b"))
ex_between_multiple(x, c("/", "@a"), c("#", "@b"))
x2 <- c("Where is the L1big dogL2?",
"I think he's 98running99 with L1little catL2.")
rm_between_multiple(x2, c("L1", 98), c("L2", 99))
ex_between_multiple(x2, c("L1", 98), c("L2", 99))
state <- c("Computer is fun. Not too fun.", "No it's not, it's dumb.",
"What should we do?", "You liar, it stinks!", "I am telling the truth!",
"How can we be certain?", "There is no way.", "I distrust you.",
"What are you talking about?", "Shall we move on? Good then.",
"I'm hungry. Let's eat. You already?")
rm_between_multiple(state, c("is", "we"), c("too", "on"))
## Use Grouping
s <- "something before stuff $some text$ in between $1$ and after"
rm_between(s, "$", "$", replacement="<B>\\2<E>")
## Using regular expressions as boundaries (fixed = FALSE)
x <- c(
"There are 2.3 million species in the world",
"There are 2.3 billion species in the world"
)
ex_between(x, left='There', right = '[mb]illion', fixed = FALSE, include.markers=TRUE)
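For readers who want to see the idea behind fixed markers, the following base R sketch quotes the left and right markers literally and matches lazily between them. `between_sketch()` is a hypothetical helper, not the package's implementation, and it assumes the markers do not themselves contain the sequence `\E`.

```r
# Hypothetical helper: extract text between literal left/right markers.
between_sketch <- function(x, left, right, include.markers = FALSE) {
  # \Q...\E quotes the markers so "(" or "$" are treated as plain text;
  # .*? is lazy so each match stops at the nearest right marker.
  pat <- paste0("\\Q", left, "\\E.*?\\Q", right, "\\E")
  hits <- regmatches(x, gregexpr(pat, x, perl = TRUE))
  if (include.markers) return(hits)
  # Strip the markers by position rather than by a second regex.
  lapply(hits, function(h) {
    substring(h, nchar(left) + 1, nchar(h) - nchar(right))
  })
}

x <- "I like [bots] (not)."
s <- "something before stuff $some text$ in between $1$ and after"

between_sketch(x, "(", ")")[[1]]
between_sketch(x, "[", "]", include.markers = TRUE)[[1]]
between_sketch(s, "$", "$")[[1]]
```

Because the markers are consumed by each match, identical left and right markers (such as `$`) pair up in order rather than overlapping.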
Remove/Replace/Extract Brackets
Description
Remove/replace/extract bracketed strings.
Usage
rm_bracket(
text.var,
pattern = "all",
trim = TRUE,
clean = TRUE,
replacement = "",
extract = FALSE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
rm_round(
text.var,
pattern = "(",
trim = TRUE,
clean = TRUE,
replacement = "",
extract = FALSE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
rm_square(
text.var,
pattern = "[",
trim = TRUE,
clean = TRUE,
replacement = "",
extract = FALSE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
rm_curly(
text.var,
pattern = "{",
trim = TRUE,
clean = TRUE,
replacement = "",
extract = FALSE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
rm_angle(
text.var,
pattern = "<",
trim = TRUE,
clean = TRUE,
replacement = "",
extract = FALSE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
rm_bracket_multiple(
text.var,
trim = TRUE,
clean = TRUE,
pattern = "all",
replacement = "",
extract = FALSE,
include.markers = FALSE,
merge = TRUE
)
ex_bracket(
text.var,
pattern = "all",
trim = TRUE,
clean = TRUE,
replacement = "",
extract = TRUE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
ex_bracket_multiple(
text.var,
trim = TRUE,
clean = TRUE,
pattern = "all",
replacement = "",
extract = TRUE,
include.markers = FALSE,
merge = TRUE
)
ex_angle(
text.var,
pattern = "<",
trim = TRUE,
clean = TRUE,
replacement = "",
extract = TRUE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
ex_round(
text.var,
pattern = "(",
trim = TRUE,
clean = TRUE,
replacement = "",
extract = TRUE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
ex_square(
text.var,
pattern = "[",
trim = TRUE,
clean = TRUE,
replacement = "",
extract = TRUE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
ex_curly(
text.var,
pattern = "{",
trim = TRUE,
clean = TRUE,
replacement = "",
extract = TRUE,
include.markers = ifelse(extract, FALSE, TRUE),
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
pattern |
The type of bracket (and encased text) to remove. This is one
or more of the strings |
trim |
logical. If |
clean |
trim logical. If |
replacement |
Replacement for matched |
extract |
logical. If |
include.markers |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
merge |
logical. If |
Value
rm_bracket - returns a character string with multiple brackets removed.
rm_round - returns a character string with round brackets removed.
rm_square - returns a character string with square brackets removed.
rm_curly - returns a character string with curly brackets removed.
rm_angle - returns a character string with angle brackets removed.
rm_bracket_multiple - returns a character string with multiple brackets removed. If extract = TRUE the results are optionally merged and named by bracket type. This is more flexible than rm_bracket but slower.
Author(s)
Martin Morgan and Tyler Rinker <tyler.rinker@gmail.com>.
References
https://stackoverflow.com/q/8621066/1000343
See Also
gsub, rm_between, stri_extract_all_regex
Other rm_ functions:
rm_abbreviation(), rm_between(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
examp <- structure(list(person = structure(c(1L, 2L, 1L, 3L),
.Label = c("bob", "greg", "sue"), class = "factor"), text =
c("I love chicken [unintelligible]!",
"Me too! (laughter) It's so good.[interrupting]",
"Yep it's awesome {reading}.", "Agreed. {is so much fun}")), .Names =
c("person", "text"), row.names = c(NA, -4L), class = "data.frame")
examp
rm_bracket(examp$text, pattern = "square")
rm_bracket(examp$text, pattern = "curly")
rm_bracket(examp$text, pattern = c("square", "round"))
rm_bracket(examp$text)
ex_bracket(examp$text, pattern = "square")
ex_bracket(examp$text, pattern = "curly")
ex_bracket(examp$text, pattern = c("square", "round"))
ex_bracket(examp$text, pattern = c("square", "round"), merge = FALSE)
ex_bracket(examp$text)
ex_bracket(examp$text, include.markers=TRUE)
## Not run:
library(qdap)
ex_bracket(examp$text, pattern="curly") %>%
unlist() %>%
na.omit() %>%
paste2()
## End(Not run)
x <- "I like [bots] (not). And <likely> many do not {he he}"
rm_round(x)
ex_round(x)
ex_round(x, include.markers = TRUE)
rm_square(x)
ex_square(x)
rm_curly(x)
ex_curly(x)
rm_angle(x)
ex_angle(x)
lapply(ex_between('She said, "I am!" and he responded..."Am what?".',
left='"', right='"'), "[", c(TRUE, FALSE))
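Extracting by bracket type and naming the results, as described for the "multiple" variants, can be approximated in base R by keeping one pattern per bracket type in a named vector. The patterns and the name `bracket_patterns` are assumptions for illustration; nested brackets are not handled.

```r
# Hypothetical per-type patterns: opening bracket, anything that is not the
# closing bracket, then the closing bracket.
bracket_patterns <- c(
  round  = "\\([^)]*\\)",
  square = "\\[[^\\]]*\\]",
  curly  = "\\{[^}]*\\}",
  angle  = "<[^>]*>"
)

x <- "I like [bots] (not). And <likely> many do not {he he}"

# lapply() keeps the names, so results are labeled by bracket type.
by_type <- lapply(bracket_patterns, function(p) {
  regmatches(x, gregexpr(p, x, perl = TRUE))[[1]]
})
by_type
```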
Remove/Replace/Extract All Caps
Description
Remove/replace/extract 'all caps' words containing 2 or more consecutive upper case letters from a string.
Usage
rm_caps(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_caps",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_caps(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_caps",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with "all caps" removed.
See Also
Other rm_ functions:
rm_abbreviation(), rm_between(), rm_bracket(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("UGGG! When I use caps I am YELLING!")
rm_caps(x)
rm_caps(x, replacement="\\L\\1")
ex_caps(x)
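The `"\\L\\1"` replacement in the example above relies on Perl-style case conversion of a captured group. A base R sketch of the same idea follows; `caps_sketch` is a hypothetical pattern for words of 2 or more upper case letters and is not the dictionary entry.

```r
# Hypothetical pattern: a whole word made of 2 or more upper case letters.
caps_sketch <- "\\b[A-Z]{2,}\\b"
x <- "UGGG! When I use caps I am YELLING!"

# Extract the all-caps words.
caps <- regmatches(x, gregexpr(caps_sketch, x, perl = TRUE))[[1]]

# With perl = TRUE, \\L lower-cases the rest of the replacement, here the
# captured group \\1.
lowered <- gsub(paste0("(", caps_sketch, ")"), "\\L\\1", x, perl = TRUE)

caps
lowered
```

The single-letter word "I" is left alone because the pattern requires at least two consecutive capitals.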
Remove/Replace/Extract All Caps Phrases
Description
Remove/replace/extract 'all caps' phrases containing 1 or more consecutive upper case letters from a string. If the phrase is a single word, that word must be 3+ letters long.
Usage
rm_caps_phrase(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_caps_phrase",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_caps_phrase(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_caps_phrase",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with "all caps phrases" removed.
See Also
Other rm_ functions:
rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("UGGG! When I use caps I am YELLING!",
"Or it may mean this is VERY IMPORTANT!",
"or trying to make a LITTLE SEEM like IT ISN'T LITTLE"
)
rm_caps_phrase(x)
ex_caps_phrase(x)
Remove/Replace/Extract Citations
Description
Remove/replace/extract APA6 style citations from a string.
Counts of normalized citations ("et al." is converted to the original author, standardizing each citation to author + year).
Usage
rm_citation(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_citation",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_citation(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_citation",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
as_count(x, ...)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Ignored. |
x |
The output from |
Details
The default regular expression used by rm_citation finds in-text and parenthetical citations. This behavior can be altered by using a secondary regular expression from the regex_usa data (or other dictionary) via (pattern = "@rm_citation2" or pattern = "@rm_citation3"). See Examples for example usage.
Value
Returns a character string with citations removed.
as_count returns a data.frame of Authors, Years, and n (counts).
Note
This function is experimental.
See Also
Other rm_ functions:
rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
## All Citations
x <- c("Hello World (V. Raptor, 1986) bye",
"Narcissism is not dead (Rinker, 2014)",
"The R Core Team (2014) has many members.",
paste("Bunn (2005) said, \"As for elegance, R is refined, tasteful, and",
"beautiful. When I grow up, I want to marry R.\""),
"It is wrong to blame ANY tool for our own shortcomings (Baer, 2005).",
"Wickham's (in press) Tidy Data should be out soon.",
"Rinker's (n.d.) dissertation not so much.",
"I always consult xkcd comics for guidance (Foo, 2012; Bar, 2014).",
"Uwe Ligges (2007) says, \"RAM is cheap and thinking hurts\""
)
rm_citation(x)
ex_citation(x)
as_count(ex_citation(x))
rm_citation(x, replacement="[CITATION HERE]")
## Not run:
qdapTools::vect2df(sort(table(unlist(rm_citation(x, extract=TRUE)))),
"citation", "count")
## End(Not run)
## In-Text
ex_citation(x, pattern="@rm_citation2")
## Parenthetical
ex_citation(x, pattern="@rm_citation3")
## Not run:
## Mining Citation
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdap, qdapTools, dplyr, ggplot2)
url_dl("http://umlreading.weebly.com/uploads/2/5/2/5/25253346/whole_language_timeline-updated.docx")
parts <- read_docx("whole_language_timeline-updated.docx") %>%
rm_non_ascii() %>%
split_vector(split = "References", include = TRUE, regex=TRUE)
parts[[1]]
parts[[1]] %>%
unbag() %>%
ex_citation() %>%
c()
## Counts
parts[[1]] %>%
unbag() %>%
ex_citation() %>%
as_count()
## By line
ex_citation(parts[[1]])
## Frequency
cites <- parts[[1]] %>%
unbag() %>%
ex_citation() %>%
c() %>%
tibble(citation=.) %>%
count(citation) %>%
arrange(n) %>%
mutate(citation=factor(citation, levels=citation))
## Distribution of citations (find locations and then plot)
cite_locs <- do.call(rbind, lapply(cites[[1]], function(x){
m <- gregexpr(x, unbag(parts[[1]]), fixed=TRUE)
data.frame(
citation=x,
start = m[[1]] -5,
end = m[[1]] + 5 + attributes(m[[1]])[["match.length"]]
)
}))
ggplot(cite_locs) +
geom_segment(aes(x=start, xend=end, y=citation, yend=citation), size=3,
color="yellow") +
xlab("Duration") +
scale_x_continuous(expand = c(0,0),
limits = c(0, nchar(unbag(parts[[1]])) + 25)) +
theme_grey() +
theme(
panel.grid.major=element_line(color="grey20"),
panel.grid.minor=element_line(color="grey20"),
plot.background = element_rect(fill="black"),
panel.background = element_rect(fill="black"),
panel.border = element_rect(colour = "grey50", fill=NA, size=1),
axis.text=element_text(color="grey50"),
axis.title=element_text(color="grey50")
)
## End(Not run)
Remove/Replace/Extract LaTeX Citations
Description
Remove/replace/extract LaTeX citations from a string.
Usage
rm_citation_tex(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_citation_tex",
replacement = "",
extract = FALSE,
split = extract,
unlist.extract = TRUE,
dictionary = getOption("regex.library"),
...
)
ex_citation_tex(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_citation_tex",
replacement = "",
extract = TRUE,
split = extract,
unlist.extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or character string). |
replacement |
Replacement for matched |
extract |
logical. If |
split |
logical. If |
unlist.extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Additional arguments passed to
|
Value
Returns a character string with citations (bibkeys) removed.
See Also
Other rm_ functions:
rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c(
"I say \\parencite*{Ted2005, Moe1999} go there in \\textcite{Few2010} said to.",
"But then \\authorcite{Ware2013} said it was so \\pcite[see][p. 22]{Get9999c}.",
"then I \\citep[p. 22]{Foo1882c} him")
rm_citation_tex(x)
rm_citation_tex(x, replacement="[[CITATION]]")
ex_citation_tex(x)
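Pulling bibkeys out of LaTeX citation commands can be sketched in base R as a two-step process: match the whole command, then keep and split what sits inside the braces. The pattern below is an assumption (any command whose name contains "cite", an optional star, and optional bracketed arguments); it is not the package's `@rm_citation_tex` expression.

```r
# Hypothetical pattern: \<letters>cite<letters>, optional *, optional
# [..] arguments, then {key1, key2, ...}.
cite_sketch <- "\\\\[A-Za-z]*cite[A-Za-z]*\\*?(?:\\[[^\\]]*\\])*\\{[^}]*\\}"

x <- c(
  "I say \\parencite*{Ted2005, Moe1999} go there in \\textcite{Few2010} said to.",
  "But then \\authorcite{Ware2013} said it was so \\pcite[see][p. 22]{Get9999c}."
)

# Step 1: full citation commands, one character vector per input string.
commands <- regmatches(x, gregexpr(cite_sketch, x, perl = TRUE))

# Step 2: drop everything up to "{" and the closing "}", then split on commas.
extract_keys <- function(cmds) {
  inner <- gsub("^.*\\{|\\}$", "", cmds, perl = TRUE)
  unlist(strsplit(inner, ",\\s*"))
}
keys <- lapply(commands, extract_keys)
keys
```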
Remove/Replace/Extract City & State
Description
Remove/replace/extract city (single lower case word or multiple consecutive capitalized words before a comma and state) & state (2 consecutive capital letters) from a string.
Usage
rm_city_state(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_city_state",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_city_state(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_city_state",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with city & state removed.
See Also
Other rm_ functions:
rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- paste0("I went to Washington Heights, NY for food! ",
"It's in West ven,PA, near Bolly Bolly Bolly, CA!",
"I like Movies, PG13")
rm_city_state(x)
ex_city_state(x)
Remove/Replace/Extract City, State, & Zip
Description
Remove/replace/extract city (single lower case word or multiple consecutive capitalized words before a comma and state) + state (2 consecutive capital letters) + zip code (5 digits or 5 + 4 digits) from a string.
Usage
rm_city_state_zip(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_city_state_zip",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_city_state_zip(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_city_state_zip",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with city, state, & zip removed.
See Also
Other rm_ functions:
rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- paste0("I went to Washington Heights, NY 54321 for food! ",
"It's in West ven,PA 12345, near Bolly Bolly Bolly, CA12345-1234!",
"hello world")
rm_city_state_zip(x)
ex_city_state_zip(x)
Remove/Replace/Extract Dates
Description
Remove/replace/extract dates from a string in the form of (1) XX/XX/XXXX, XX/XX/XX, XX-XX-XXXX, XX-XX-XX, XX.XX.XXXX, or XX.XX.XX OR (2) March XX, XXXX or Mar XX, XXXX OR (3) both forms.
Usage
rm_date(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_date",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_date(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_date",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Details
The default regular expression used by rm_date finds numeric representations, not words/abbreviations. This means that "June 13, 2002" is not matched. This behavior can be altered (to include month names/abbreviations) by using a secondary regular expression from the regex_usa data (or other dictionary) via (pattern = "@rm_date2", pattern = "@rm_date3", or pattern = "@rm_date4"). See Examples for example usage.
Value
Returns a character string with dates removed.
See Also
Other rm_ functions:
rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
## Numeric Date Representation
x <- paste0("Format dates as 04/12/2014, 04-12-2014, 04.12.2014. or",
" 04/12/14 but leaves mismatched: 12.12/2014")
rm_date(x)
ex_date(x)
## Word/Abbreviation Date Representation
x2 <- paste0("Format dates as Sept 09, 2002 or October 22, 1887",
" but not 04-12-2014 and may match good 00, 9999")
rm_date(x2, pattern="@rm_date2")
ex_date(x2, pattern="@rm_date2")
## Year-Month-Day Representation
x3 <- sprintf("R uses time in this format %s.", Sys.time())
rm_date(x3, pattern="@rm_date3")
## Grab all types
ex_date(c(x, x2, x3), pattern="@rm_date4")
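The first example notes that a date with mismatched separators (12.12/2014) is left alone. One way to get that behavior is a backreference that forces the second separator to equal the first. The pattern below is a base R sketch written for illustration; `date_sketch` is hypothetical and is not the dictionary's `@rm_date` expression.

```r
# Hypothetical pattern: 1-2 digits, a separator captured as group 1,
# 1-2 digits, the SAME separator (\\1), then a 4- or 2-digit year.
date_sketch <- "\\b\\d{1,2}([/.-])\\d{1,2}\\1(?:\\d{4}|\\d{2})\\b"

x <- paste0("Format dates as 04/12/2014, 04-12-2014, 04.12.2014. or",
            " 04/12/14 but leaves mismatched: 12.12/2014")

dates <- regmatches(x, gregexpr(date_sketch, x, perl = TRUE))[[1]]
dates
```

Like the numeric default described above, this sketch does not validate that the month and day values are in range.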
Remove/Replace/Extract Template
Description
Remove/replace/extract substring from a string. This is the template used by other qdapRegex rm_XXX functions.
Usage
rm_default(
text.var,
trim = !extract,
clean = TRUE,
pattern,
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_default(
text.var,
trim = !extract,
clean = TRUE,
pattern,
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with substring removed.
See Also
rm_, gsub, stri_extract_all_regex
Other rm_ functions:
rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
## Built in regex dictionary
rm_default("I live in Buffalo, NY 14217", pattern="@rm_city_state_zip")
## User defined regular expression
pat <- "(\\s*([A-Z][\\w-]*)+),\\s([A-Z]{2})\\s(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"
rm_default("I live in Buffalo, NY 14217", pattern=pat)
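The two examples above contrast a dictionary lookup (a pattern written as "@name") with a user-supplied regular expression. How the package resolves the "@" form internally is not shown here; the base R sketch below only illustrates the general idea with a hypothetical `resolve_pattern()` helper and a hypothetical one-entry dictionary.

```r
# Hypothetical dictionary: a named list of canned regular expressions.
toy_dictionary <- list(
  toy_zip = "\\b\\d{5}(?:-\\d{4})?\\b"
)

# Hypothetical resolver: "@name" is looked up in the dictionary; any other
# string is treated as a regular expression and passed through unchanged.
resolve_pattern <- function(pattern, dictionary) {
  if (!startsWith(pattern, "@")) return(pattern)
  key <- substring(pattern, 2)
  if (is.null(dictionary[[key]])) stop("no dictionary entry named ", key)
  dictionary[[key]]
}

x <- "I live in Buffalo, NY 14217"
from_dict <- gsub(resolve_pattern("@toy_zip", toy_dictionary), "", x, perl = TRUE)
from_user <- gsub(resolve_pattern("\\d+", toy_dictionary), "", x, perl = TRUE)

trimws(from_dict)
trimws(from_user)
```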
Remove/Replace/Extract Dollars
Description
Remove/replace/extract dollar amounts from a string.
Usage
rm_dollar(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_dollar",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_dollar(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_dollar",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with dollars removed.
See Also
Other rm_ functions:
rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("There is $5.50 for me.", "that's 45.6% of the pizza",
"14% is $26 or $25.99", "Really?...$123,234.99 is not cheap.")
rm_dollar(x)
ex_dollar(x)
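A dollar-amount pattern needs to accept plain digits, comma-grouped thousands, and an optional decimal part while ignoring percentages. The base R sketch below shows one way to do that; `dollar_sketch` is a hypothetical pattern and may differ from the dictionary's `@rm_dollar` entry.

```r
# Hypothetical pattern: "$", then either comma-grouped digits (1,234,567)
# or plain digits, then an optional decimal part.
dollar_sketch <- "\\$(?:\\d{1,3}(?:,\\d{3})+|\\d+)(?:\\.\\d+)?"

x <- c("There is $5.50 for me.", "that's 45.6% of the pizza",
       "14% is $26 or $25.99", "Really?...$123,234.99 is not cheap.")

# One character vector of matches per input string, flattened for display.
amounts <- unlist(regmatches(x, gregexpr(dollar_sketch, x, perl = TRUE)))
amounts
```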
Remove/Replace/Extract Email Addresses
Description
Remove/replace/extract email addresses from a string.
Usage
rm_email(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_email",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_email(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_email",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with email addresses removed.
Author(s)
Barry Rowlingson and Tyler Rinker <tyler.rinker@gmail.com>.
References
The email regular expression was taken from: https://stackoverflow.com/a/25077704/1000343
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- paste("fred is fred@foo.com and joe is joe@example.com - but @this is a
twitter handle for twit@here.com or foo+bar@google.com/fred@foo.fnord")
x2 <- c("fred is fred@foo.com and joe is joe@example.com - but @this is a",
"twitter handle for twit@here.com or foo+bar@google.com/fred@foo.fnord",
"hello world")
rm_email(x)
rm_email(x, replacement = '<a href="mailto:\\1" target="_blank">\\1</a>')
ex_email(x)
ex_email(x2)
Remove/Replace/Extract Emoticons
Description
Remove/replace/extract common emoticons from a string.
Usage
rm_emoticon(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_emoticon",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_emoticon(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_emoticon",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with emoticons removed.
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("are :-)) it 8-D he XD on =-D they :D of :-) is :> for :o) that :-/",
"as :-D I xD with :^) a =D to =) the 8D and :3 in =3 you 8) his B^D was")
rm_emoticon(x)
ex_emoticon(x)
Remove/Replace/Extract Endmarks
Description
Remove/replace/extract endmarks from a string.
Usage
rm_endmark(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_endmark",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_endmark(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_endmark",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Details
The default regular expression used by rm_endmark finds endmark punctuation used in the qdap package; this includes ! . ? * and |. This behavior can be altered (to also cover ; and :, or to use just ! . and ?) by using a secondary regular expression from the regex_usa data (or other dictionary) via pattern = "@rm_endmark2" or pattern = "@rm_endmark3". See Examples for example usage.
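The three behaviors described above can be approximated in base R. The following is only a sketch: the helper name strip_endmarks and the hand-written character classes are assumptions made for illustration, not the expressions stored in regex_usa, and no mapping to "@rm_endmark2" or "@rm_endmark3" is implied.

```r
x <- c("I like the dog.", "I want it *|", "I;",
       "Who is| that?", "Hello world", "You...")

# Hypothetical helper: drop a trailing run of endmark characters,
# then trim any white space left behind.
strip_endmarks <- function(text, marks) {
  trimws(sub(paste0("[", marks, "]+$"), "", text))
}

default_like <- strip_endmarks(x, "!.?*|")    # ! . ? * and |
basic_only   <- strip_endmarks(x, "!.?")      # just ! . and ?
with_semi    <- strip_endmarks(x, "!.?*|;:")  # also ; and :

default_like
basic_only
with_semi
```

Only punctuation at the very end of each string is touched, so the | inside "Who is| that?" survives in every variant.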
Value
Returns a character string with endmarks removed.
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("I like the dog.", "I want it *|", "I;",
"Who is| that?", "Hello world", "You...")
rm_endmark(x)
ex_endmark(x)
rm_endmark(x, pattern="@rm_endmark2")
ex_endmark(x, pattern="@rm_endmark2")
rm_endmark(x, pattern="@rm_endmark3")
ex_endmark(x, pattern="@rm_endmark3")
Remove/Replace/Extract Hash Tags
Description
Remove/replace/extract hash tags from a string.
Usage
rm_hash(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_hash",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_hash(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_hash",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with hash tags removed.
Author(s)
stackoverflow's hwnd and Tyler Rinker <tyler.rinker@gmail.com>.
References
The hash tag regular expression was taken from: https://stackoverflow.com/a/25096474/1000343
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("@hadley I like #rstats for #ggplot2 work.",
"Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats:
http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
"Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization
presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1"
)
rm_hash(x)
rm_hash(rm_tag(x))
ex_hash(x)
## remove just the hash symbol
rm_hash(x, replace="\\3")
Remove/Replace/Extract N Letter Words
Description
Remove/replace/extract words that are n letters in length (apostrophes not counted).
Usage
rm_nchar_words(
text.var,
n,
trim = !extract,
clean = TRUE,
pattern = "@rm_nchar_words",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_nchar_words(
text.var,
n,
trim = !extract,
clean = TRUE,
pattern = "@rm_nchar_words",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
n |
The number of letters counted in the word. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Details
The default regular expression used by rm_nchar_words counts letter length, not characters. This means that apostrophes are not included in the character count. This behavior can be altered (to include apostrophes in the character count) by using a secondary regular expression from the regex_usa data (or other dictionary) via pattern = "@rm_nchar_words2". See Examples for example usage.
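To make the distinction between letter count and character count concrete, here is a small base R sketch. It tokenizes on spaces rather than using the package's regular expressions, so it is only an illustration of the two counting rules; the variable names are assumptions.

```r
x <- "This is Jon's dogs' 'bout there in a word Mike's re'y."

# Keep letters, apostrophes and spaces, then split into words
words <- strsplit(gsub("[^A-Za-z' ]", "", x), " +")[[1]]

letters_only     <- nchar(gsub("'", "", words))  # apostrophes not counted
with_apostrophes <- nchar(words)                 # apostrophes counted

four_letter_words <- words[letters_only == 4]
five_char_words   <- words[with_apostrophes == 5]

four_letter_words
five_char_words
```

Under the letter rule, "Jon's" has length 4; under the character rule it has length 5, which is why the two selections differ.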
Value
Returns a character string with n letter words removed.
Author(s)
stackoverflow's CharlieB and Tyler Rinker <tyler.rinker@gmail.com>.
References
The n letter/character word regular expression was taken from: https://stackoverflow.com/a/25243885/1000343
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- "This is Jon's dogs' 'bout there in a word Mike's re'y."
rm_nchar_words(x, 4)
ex_nchar_words(x, 4)
## Count characters (apostrophes and letters)
ex_nchar_words(x, 5, pattern = "@rm_nchar_words2")
## nchar range
rm_nchar_words(x, "1,2")
## Not run:
## Larger example
library(qdap)
ex_nchar_words(hamlet[["dialogue"]], 5)
## End(Not run)
Remove/Replace/Extract Non-ASCII
Description
Remove/replace/extract non-ASCII substrings from a string. This is the template used by other qdapRegex rm_XXX functions.
Usage
rm_non_ascii(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_non_ascii",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
ascii.out = TRUE,
...
)
ex_non_ascii(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_non_ascii",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
ascii.out = TRUE,
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
ascii.out |
logical. If |
... |
ignored. |
Value
Returns a character string with all non-ASCII characters removed.
Note
macOS 14 (Sonoma), and likely all later versions, has a different implementation of iconv, which may not produce the expected results.
Warning
iconv is used within rm_non_ascii. iconv's behavior across operating systems may not be consistent.
Author(s)
stackoverflow's MrFlick, hwnd, and Tyler Rinker <tyler.rinker@gmail.com>.
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("Hello World", "Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher")
Encoding(x) <- "latin1"
x
rm_non_ascii(x)
rm_non_ascii(x, replacement="<<FLAG>>")
ex_non_ascii(x)
ex_non_ascii(x, ascii.out=FALSE)
## simple regex to remove non-ascii
rm_default(x, pattern="[^ -~]")
ex_default(x, pattern="[^ -~]")
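The simple character class used in the last two calls can also be tried without iconv or qdapRegex. The base R sketch below is an illustration only; the variable names are assumptions, and perl = TRUE is chosen here to keep the range matching by code point.

```r
# Strings written with \u escapes so the example does not depend on file encoding
x <- c("Hello World", "Ekstr\u00f8m", "J\u00f6reskog")

# [^ -~] matches any character outside the printable ASCII range
# (space through tilde), so control characters such as tabs match too.
has_non_ascii <- grepl("[^ -~]", x, perl = TRUE)
ascii_only    <- gsub("[^ -~]", "", x, perl = TRUE)
flagged       <- gsub("[^ -~]", "<<FLAG>>", x, perl = TRUE)

has_non_ascii
ascii_only
flagged
```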
Remove/Replace/Extract Non-Words
Description
rm_non_words - Remove/replace/extract non-words (anything that's not a letter or apostrophe; also removes multiple white spaces) from a string.
Usage
rm_non_words(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_non_words",
replacement = " ",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_non_words(
text.var,
trim = !extract,
clean = TRUE,
pattern = "[^A-Za-z' ]+",
replacement = " ",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with non-words removed.
Note
Setting the argument extract = TRUE is not very useful. Use the following setup instead (see Examples for a demonstration).
rm_default(x, pattern = "[^A-Za-z' ]", extract=TRUE)
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c(
"I like 56 dogs!",
"It's seventy-two feet from the px290.",
NA,
"What",
"that1is2a3way4to5go6.",
"What do you*% want? For real%; I think you'll see.",
"Oh some <html>code</html> to remove"
)
rm_non_words(x)
ex_non_words(x)
Remove/Replace/Extract Numbers
Description
rm_number - Remove/replace/extract numbers from a string (works on numbers with commas, decimals and negatives).
as_numeric - A wrapper for as.numeric(gsub(",", "", x)), which removes commas and converts a list of vectors of strings to numeric. If a string cannot be converted to numeric, NA is returned.
as_numeric2 - A convenience function for as_numeric that unlists and returns a vector rather than a list.
Usage
rm_number(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_number",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
as_numeric(x)
as_numeric2(x)
ex_number(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_number",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
x |
a character vector to convert to a numeric vector. |
Value
rm_number - Returns a character string with numbers removed.
as_numeric - Returns a list of vectors of numbers.
as_numeric2 - Returns an unlisted vector of numbers.
References
The number regular expression was created by Jason Gray.
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("-2 is an integer. -4.3 and 3.33 are not.",
"123,456 is 0 alot -123456 more than -.2", "and 3456789123 fg for 345.",
"fg 12,345 23 .44 or 18.", "don't remove this 444,44", "hello world -.q")
rm_number(x)
ex_number(x)
##Convert to numeric
as_numeric(ex_number(x)) # retain list
as_numeric2(ex_number(x)) # unlist
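The conversion step can be pictured with base R alone. The sketch below applies the as.numeric(gsub(",", "", x)) idea from the Description to a hand-written list shaped like extraction output; the helper name to_num and the input list are assumptions for illustration, not the package's implementation.

```r
# Hand-written stand-in for a list of extracted number strings
extracted <- list(c("-2", "-4.3", "3.33"), c("123,456", "0", "-.2"), NA_character_)

# Hypothetical helper: strip commas, then convert each vector to numeric
to_num <- function(x) lapply(x, function(v) as.numeric(gsub(",", "", v)))

as_list <- to_num(extracted)  # list in, list out
as_vec  <- unlist(as_list)    # flattened to one numeric vector

as_list
as_vec
```

Keeping the list form preserves which input element each number came from; unlisting trades that away for a plain vector.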
Remove/Replace/Extract Percentages
Description
Remove/replace/extract percentages from a string.
Usage
rm_percent(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_percent",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_percent(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_percent",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with percentages removed.
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("There is $5.50 for me.", "that's 45.6% of the pizza",
"14% is $26 or $25.99")
rm_percent(x)
ex_percent(x)
Remove/Replace/Extract Phone Numbers
Description
Remove/replace/extract phone numbers from a string.
Usage
rm_phone(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_phone",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_phone(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_phone",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with phone numbers removed.
Author(s)
stackoverflow's Marius and Tyler Rinker <tyler.rinker@gmail.com>.
References
The phone regular expression was taken from: https://stackoverflow.com/a/21008254/1000343
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 - 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566",
"I like 1234567 dogs"
)
rm_phone(x)
ex_phone(x)
Remove/Replace/Extract Postal Codes
Description
Remove/replace/extract postal codes.
Usage
rm_postal_code(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_postal_code",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_postal_code(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_postal_code",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with postal codes removed.
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")
rm_postal_code(x)
ex_postal_code(x)
Remove/Replace/Extract Words With Repeating Characters
Description
Remove/replace/extract words with repeating characters. The word must contain characters, each repeating at least 2 times.
Usage
rm_repeated_characters(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_repeated_characters",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_repeated_characters(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_repeated_characters",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with words containing repeating characters removed.
Author(s)
stackoverflow's vks and Tyler Rinker <tyler.rinker@gmail.com>.
References
https://stackoverflow.com/a/29438461/1000343
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- "aaaahahahahaha that was a good joke peep and pepper and pepe"
rm_repeated_characters(x)
ex_repeated_characters(x)
Remove/Replace/Extract Repeating Phrases
Description
Remove/replace/extract repeating phrases from a string.
Usage
rm_repeated_phrases(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_repeated_phrases",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_repeated_phrases(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_repeated_phrases",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with repeating phrases removed.
Author(s)
stackoverflow's BrodieG and Tyler Rinker <tyler.rinker@gmail.com>.
References
https://stackoverflow.com/a/28786617/1000343
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c(
"this is a big is a Big deal",
"I want want to see",
"I want, want to see",
"I want...want to see see see how",
"I like it. It is cool",
"this is a big is a Big deal for those of, those of you who are."
)
rm_repeated_phrases(x)
ex_repeated_phrases(x)
Remove/Replace/Extract Repeating Words
Description
Remove/replace/extract repeating words from a string.
Usage
rm_repeated_words(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_repeated_words",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_repeated_words(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_repeated_words",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Value
Returns a character string with repeating words removed.
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c(
"this is a big is a Big deal",
"I want want to see",
"I want, want to see",
"I want...want to see see see how",
"I like it. It is cool",
"this is a big is a Big deal for those of, those of you who are."
)
rm_repeated_words(x)
ex_repeated_words(x)
Remove/Replace/Extract Person Tags
Description
Remove/replace/extract person tags from a string.
Usage
rm_tag(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_tag",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_tag(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_tag",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
Details
The default regex pattern "(?<![@\w])@([a-z0-9_]+)\b" is more liberal and searches for the at (@) symbol followed by any word. This can be accessed via pattern = "@rm_tag". Twitter user names are more constrained. A second regex ("(?<![@\w])@([a-z0-9_]{1,15})\b") is provided that restricts the match to a substring that begins with an at (@) followed by a word composed of alphanumeric characters and underscores, no longer than 15 characters. This can be accessed via pattern = "@rm_tag2" (see Examples).
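The two patterns quoted above can be tried directly with base R's PCRE engine. This sketch uses the expressions exactly as printed in this section (lowercase classes, no case-insensitive flag) on lowercase input; whether the dictionary entries carry additional flags or groups is not assumed here, and the variable names are illustrative.

```r
x <- c("@hadley I like #rstats",
       "tyler.rinker@gamil.com is my email",
       "A non valid Twitter is @abcdefghijklmnopqrstuvwxyz")

liberal <- "(?<![@\\w])@([a-z0-9_]+)\\b"       # @ followed by any word
twitter <- "(?<![@\\w])@([a-z0-9_]{1,15})\\b"  # word capped at 15 characters

# The lookbehind (?<![@\w]) rejects an @ that follows a word character,
# which is what keeps email addresses from matching.
tags_liberal <- regmatches(x, gregexpr(liberal, x, perl = TRUE))
tags_twitter <- regmatches(x, gregexpr(twitter, x, perl = TRUE))

tags_liberal
tags_twitter
```

The 26-character handle matches the liberal pattern but not the capped one, because the trailing \b cannot be satisfied after only 15 of its characters.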
Value
Returns a character string with person tags removed.
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_time(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("@hadley I like #rstats for #ggplot2 work.",
"Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats:
http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
"Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization
presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1",
"tyler.rinker@gamil.com is my email",
"A non valid Twitter is @abcdefghijklmnopqrstuvwxyz"
)
rm_tag(x)
rm_tag(rm_hash(x))
ex_tag(x)
## more restrictive Twitter regex
ex_tag(x, pattern="@rm_tag2")
## Remove only the @ sign
rm_tag(x, replacement = "\\3")
rm_tag(x, replacement = "\\3", pattern="@rm_tag2")
Remove/Replace/Extract Time
Description
rm_time - Remove/replace/extract time from a string.
rm_transcript_time - Remove/replace/extract transcript specific time stamps from a string.
as_time - Convert a time stamp extracted by rm_time or rm_transcript_time to a standard time format (HH:MM:SS.OS) and optionally convert to POSIXlt.
as_time2 - A convenience function for as_time that unlists and returns a vector rather than a list.
Usage
rm_time(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_time",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
rm_transcript_time(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_transcript_time",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
as_time(x, as.POSIXlt = FALSE, millisecond = TRUE)
as_time2(x, ...)
ex_time(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_time",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
ex_transcript_time(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_transcript_time",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var |
The text variable. |
trim |
logical. If |
clean |
trim logical. If |
pattern |
A character string containing a regular expression (or
character string for |
replacement |
Replacement for matched |
extract |
logical. If |
dictionary |
A dictionary of canned regular expressions to search within
if |
... |
Other arguments passed to |
x |
A list with extracted time stamps. |
as.POSIXlt |
logical. If |
millisecond |
logical. If |
Details
The default regular expression used by rm_time finds time with no AM/PM. This behavior can be altered by using a secondary regular expression from the regex_usa data (or other dictionary) via pattern = "@rm_time2". See Examples for example usage.
Value
Returns a character string with time removed.
Note
... in as_time2 are the other arguments passed to as_time.
Author(s)
stackoverflow's hwnd and Tyler Rinker <tyler.rinker@gmail.com>.
References
The time regular expression was taken from: https://stackoverflow.com/a/25111133/1000343
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_title_name(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("R uses 1:5 for 1, 2, 3, 4, 5.",
"At 3:00 we'll meet up and leave by 4:30:20",
"We'll meet at 6:33.", "He ran it in :22.34")
rm_time(x)
ex_time(x)
## With AM/PM
x <- c(
"I'm getting 3:04 AM just fine, but...",
"for 10:47 AM I'm getting 0:47 AM instead.",
"no time here",
"Some time has 12:04 with no AM/PM after it",
"Some time has 12:04 a.m. or the form 1:22 pm"
)
ex_time(x)
ex_time(x, pat="@rm_time2")
rm_time(x, pat="@rm_time2")
ex_time(x, pat=pastex("@rm_time2", "@rm_time"))
# Convert to standard format
as_time(ex_time(x))
as_time(ex_time(x), as.POSIXlt = TRUE)
as_time(ex_time(x), as.POSIXlt = FALSE, millisecond = FALSE)
# Transcript specific time stamps
x2 <- c(
'08:15 8 minutes and 15 seconds 00:08:15.0',
'3:15 3 minutes and 15 seconds not 1:03:15.0',
'01:22:30 1 hour 22 minutes and 30 seconds 01:22:30.0',
'#00:09:33-5# 9 minutes and 33.5 seconds 00:09:33.5',
'00:09.33,75 9 minutes and 33.5 seconds 00:09:33.75'
)
rm_transcript_time(x2)
(out <- ex_transcript_time(x2))
as_time(out)
as_time(out, TRUE)
as_time(out, millisecond = FALSE)
## Not run:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(chron)
lapply(as_time(out), chron::times)
lapply(as_time(out, , FALSE), chron::times)
## End(Not run)
Remove/Replace/Extract Title + Person Name
Description
Remove/replace/extract title (honorific) + person name(s) from a string.
Usage
rm_title_name(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_title_name",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_title_name(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_title_name",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var | The text variable. |
trim | logical. If |
clean | logical. If |
pattern | A character string containing a regular expression (or character string for |
replacement | Replacement for matched |
extract | logical. If |
dictionary | A dictionary of canned regular expressions to search within if |
... | Other arguments passed to |
Value
Returns a character string with titles and person names removed.
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_url(), rm_white(), rm_zip()
Examples
x <- c("Dr. Brend is mizz hart's in mrs. Holtz's.",
"Where is mr. Bob Jr. and Ms. John Kennedy?")
rm_title_name(x)
ex_title_name(x)
Remove/Replace/Extract URLs
Description
rm_url - Remove/replace/extract URLs from a string.
rm_twitter_url - Remove/replace/extract Twitter Short URLs from a string.
Usage
rm_url(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_url",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
rm_twitter_url(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_twitter_url",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_url(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_url",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
ex_twitter_url(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_twitter_url",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var | The text variable. |
trim | logical. If |
clean | logical. If |
pattern | A character string containing a regular expression (or character string for |
replacement | Replacement for matched |
extract | logical. If |
dictionary | A dictionary of canned regular expressions to search within if |
... | Other arguments passed to |
Details
The default regex pattern, "(http[^ ]*)|(www\.[^ ]*)", is more liberal. More constrained versions can be accessed via pattern = "@rm_url2" and pattern = "@rm_url3" (see Examples).
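As a rough illustration of why the default pattern is described as liberal, the sketch below applies the same regular expression with base R only (gsub and regmatches) rather than through qdapRegex. The variable names are illustrative. The qdapRegex functions also trim and clean white space by default, which this sketch does not attempt.

```r
# Default pattern quoted in Details, written as an R string (backslash doubled).
url_pat <- "(http[^ ]*)|(www\\.[^ ]*)"

x <- "I like www.talkstats.com and http://stackoverflow.com, really"

# Extract: every run of non-space characters that starts at "http" or "www."
urls <- regmatches(x, gregexpr(url_pat, x, perl = TRUE))[[1]]
print(urls)
# "www.talkstats.com" "http://stackoverflow.com,"

# Remove: the trailing comma is swallowed too, because [^ ]* stops only at a space.
removed <- gsub(url_pat, "", x, perl = TRUE)
print(removed)
```

The second match keeps the trailing comma, which is the kind of over-matching a more constrained pattern is meant to avoid.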
Value
Returns a character string with URLs removed.
References
The more constrained URL regular expressions ("@rm_url2" and "@rm_url3") were adapted from imme_emosol's response: https://mathiasbynens.be/demo/url-regex
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_white(), rm_zip()
Examples
x <- " I like www.talkstats.com and http://stackoverflow.com"
rm_url(x)
rm_url(x, replacement = '<a href="\\1" target="_blank">\\1</a>')
ex_url(x)
ex_url(x, pattern = "@rm_url2")
ex_url(x, pattern = "@rm_url3")
## Remove Twitter Short URL
x <- c("download file from http://example.com",
"this is the link to my website http://example.com",
"go to http://example.com from more info.",
"Another url ftp://www.example.com",
"And https://www.example.net",
"twitter type: t.co/N1kq0F26tG",
"still another one https://t.co/N1kq0F26tG :-)")
rm_twitter_url(x)
ex_twitter_url(x)
## Combine removing Twitter URLs and standard URLs
rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_n_url(x)
rm_twitter_n_url(x, extract=TRUE)
Remove/Replace/Extract White Space
Description
rm_white - Remove multiple white space (> 1 becomes a single white space), white space before a comma, white space before a single or consecutive combination of a colon, semicolon, or endmark (period, question mark, or exclamation point), white space after a left bracket ("{", "(", "[") or before a right bracket ("}", ")", "]"), and leading or trailing white space.
rm_white_bracket - Remove white space after a left bracket ("{", "(", "[") or before a right bracket ("}", ")", "]").
rm_white_colon - Remove white space before a single or consecutive combination of a colon or semicolon.
rm_white_comma - Remove white space before a comma.
rm_white_endmark - Remove white space before endmark(s) (".", "?", "!").
rm_white_lead - Remove leading white space.
rm_white_lead_trail - Remove leading or trailing white space.
rm_white_trail - Remove trailing white space.
rm_white_multiple - Remove multiple white space (> 1 becomes a single white space).
rm_white_punctuation - Remove multiple white space before a comma, white space before a single or consecutive combination of a colon, semicolon, or endmark (period, question mark, or exclamation point).
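The operations described above can be approximated with base R alone. The sketch below is only an illustration: the regular expressions are assumptions written for this example, not the expressions stored in the qdapRegex dictionary, and the intermediate variable names are hypothetical.

```r
x <- "  There is ( $5.50 )  for , me .  "

# Collapse runs of two or more spaces into one (cf. rm_white_multiple).
step1 <- gsub(" {2,}", " ", x)

# Drop leading and trailing white space (cf. rm_white_lead_trail).
step2 <- gsub("^\\s+|\\s+$", "", step1)

# Drop white space before a comma, colon, semicolon, or endmark
# (cf. rm_white_punctuation).
step3 <- gsub("\\s+([,.?!;:])", "\\1", step2)

# Drop white space after a left bracket, then before a right bracket
# (cf. rm_white_bracket).
step4 <- gsub("([\\[({])\\s+", "\\1", step3, perl = TRUE)
step4 <- gsub("\\s+([\\])}])", "\\1", step4, perl = TRUE)

print(step4)
# "There is ($5.50) for, me."
```

Running the steps separately makes it easier to see which rule changed which part of the string. rm_white is described as performing all of them in one call.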
Usage
rm_white(
text.var,
trim = FALSE,
clean = FALSE,
pattern = "@rm_white",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_white(
text.var,
trim = FALSE,
clean = FALSE,
pattern = "@rm_white",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
rm_white_bracket(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_bracket",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_white_bracket(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_bracket",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
rm_white_colon(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_colon",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_white_colon(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_colon",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
rm_white_comma(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_comma",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_white_comma(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_comma",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
rm_white_endmark(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_endmark",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_white_endmark(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_endmark",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
rm_white_lead(
text.var,
trim = FALSE,
clean = FALSE,
pattern = "@rm_white_lead",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_white_lead(
text.var,
trim = FALSE,
clean = FALSE,
pattern = "@rm_white_lead",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
rm_white_lead_trail(
text.var,
trim = FALSE,
clean = FALSE,
pattern = "@rm_white_lead_trail",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_white_lead_trail(
text.var,
trim = FALSE,
clean = FALSE,
pattern = "@rm_white_lead_trail",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
rm_white_trail(
text.var,
trim = FALSE,
clean = FALSE,
pattern = "@rm_white_trail",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_white_trail(
text.var,
trim = FALSE,
clean = FALSE,
pattern = "@rm_white_trail",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
rm_white_multiple(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_multiple",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_white_multiple(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_multiple",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
rm_white_punctuation(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_punctuation",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_white_punctuation(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_white_punctuation",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var | The text variable. |
trim | logical. If |
clean | logical. If |
pattern | A character string containing a regular expression (or character string for |
replacement | Replacement for matched |
extract | logical. If |
dictionary | A dictionary of canned regular expressions to search within if |
... | Other arguments passed to |
Value
Returns a character string with extra white space removed.
Author(s)
rm_white_endmark/rm_white_punctuation - stackoverflow's hwnd and Tyler Rinker <tyler.rinker@gmail.com>.
References
The rm_white_endmark/rm_white_punctuation regular expression was taken from: https://stackoverflow.com/a/25464921/1000343
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_zip()
Examples
x <- c(" There is ( $5.50 ) for , me . ", " that's [ 45.6% ] of! the pizza !",
" 14% is { $26 } or $25.99 ?", "Oh ; here's colon : Yippee !")
rm_white(x)
rm_white_bracket(x)
rm_white_colon(x)
rm_white_comma(x)
rm_white_endmark(x)
rm_white_lead(x)
rm_white_trail(x)
rm_white_lead_trail(x)
rm_white_multiple(x)
rm_white_punctuation(x)
Remove/Replace/Extract Zip Codes
Description
Remove/replace/extract zip codes from a string.
Usage
rm_zip(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_zip",
replacement = "",
extract = FALSE,
dictionary = getOption("regex.library"),
...
)
ex_zip(
text.var,
trim = !extract,
clean = TRUE,
pattern = "@rm_zip",
replacement = "",
extract = TRUE,
dictionary = getOption("regex.library"),
...
)
Arguments
text.var | The text variable. |
trim | logical. If |
clean | logical. If |
pattern | A character string containing a regular expression (or character string for |
replacement | Replacement for matched |
extract | logical. If |
dictionary | A dictionary of canned regular expressions to search within if |
... | Other arguments passed to |
Value
Returns a character string with U.S. 5 and 5+4 zip codes removed.
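To make the "5 and 5+4" wording concrete, here is a minimal base R sketch. The pattern is an assumption modelled on the lookaround expressions used in the Examples for this function, not the actual "@rm_zip" dictionary entry, and zip_pat is an illustrative name.

```r
x <- c("Rat Race, XX, 12345",
       "Zip with dash 12345-6789",
       "I like 1234567 dogs")

# Assumed pattern: five digits, optionally followed by "-" and four digits,
# and not embedded in a longer run of digits (lookbehind / lookahead).
zip_pat <- "(?<!\\d)\\d{5}(?:-\\d{4})?(?!\\d)"

# One character vector of matches per input element.
zips <- regmatches(x, gregexpr(zip_pat, x, perl = TRUE))
print(zips)
```

The third element yields no match: the lookarounds reject any five digits that sit inside the longer number 1234567.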
Author(s)
stackoverflow's hwnd and Tyler Rinker <tyler.rinker@gmail.com>.
References
The zip code regular expression was taken from: https://stackoverflow.com/a/25223890/1000343
See Also
Other rm_ functions: rm_abbreviation(), rm_between(), rm_bracket(), rm_caps(), rm_caps_phrase(), rm_citation(), rm_citation_tex(), rm_city_state(), rm_city_state_zip(), rm_date(), rm_default(), rm_dollar(), rm_email(), rm_emoticon(), rm_endmark(), rm_hash(), rm_nchar_words(), rm_non_ascii(), rm_non_words(), rm_number(), rm_percent(), rm_phone(), rm_postal_code(), rm_repeated_characters(), rm_repeated_phrases(), rm_repeated_words(), rm_tag(), rm_time(), rm_title_name(), rm_url(), rm_white()
Examples
x <- c("Mr. Bean bought 2 tickets 2-613-213-4567",
"43 Butter Rd, Brossard QC K0A 3P0 - 613 213 4567",
"Rat Race, XX, 12345",
"Ignore phone numbers(613)2134567",
"Grab zips with dashes 12345-6789 or no space before12345-6789",
"Grab zips with spaces 12345 6789 or no space before12345 6789",
"I like 1234567 dogs"
)
rm_zip(x)
ex_zip(x)
## ======================= ##
## BUILD YOUR OWN FUNCTION ##
## ======================= ##
## example from: https://stackoverflow.com/a/26092576/1000343
zips <- data.frame(id = seq(1, 6),
address = c("Company, 18540 Main Ave., City, ST 12345",
"Company 18540 Main Ave. City ST 12345-0000",
"Company 18540 Main Ave. City State 12345",
"Company, 18540 Main Ave., City, ST 12345 USA",
"Company, One Main Ave Suite 18540m, City, ST 12345",
"company 12345678")
)
## Function to grab even if a character follows the zip
# paste together a more flexible regular expression
pat <- pastex(
"@rm_zip",
"(?<!\\d)\\d{5}(?!\\d)",
"(?<!\\d)\\d{5}-\\d{4}(?!\\d)"
)
# Create your own function with extract set to TRUE
ex_zip2 <- rm_(pattern=pat, extract=TRUE)
ex_zip2(zips$address)
## Function to extract just 5 digit zips
ex_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract=TRUE)
ex_zip3(zips$address)
Regex Validation Function Generator
Description
Generate function to validate regular expressions.
Usage
validate(
pattern,
single = TRUE,
trim = FALSE,
clean = FALSE,
dictionary = getOption("regex.library")
)
Arguments
pattern | A character string containing a regular expression (or character string for |
single | logical. If |
trim | logical. If |
clean | logical. If |
dictionary | A dictionary of canned regular expressions to search within if |
Value
Returns a function that operates like other qdapRegex rm_XXX functions but with user-defined defaults.
Warning
validate uses qdapRegex's built-in regular expressions. As these patterns are used for text analysis, they tend to be flexible and thus liberal. The user may wish to define more conservative validation regular expressions and supply them to pattern.
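The idea of generating a validator from a more conservative pattern can be sketched in base R with a closure. This is not how validate is implemented: make_validator is a hypothetical helper, and the address pattern is an assumption written for this example.

```r
# Hypothetical helper, not part of qdapRegex: returns a validator closure
# that is TRUE only when the whole element matches the pattern.
make_validator <- function(pattern) {
  anchored <- paste0("^(?:", pattern, ")$")
  function(x) grepl(anchored, x, perl = TRUE)
}

# A deliberately strict pattern: "City, ST 12345" with a capitalized city.
strict_address <- make_validator("[A-Z][A-Za-z-]*, [A-Z]{2} \\d{5}")

result <- strict_address(c("Buffalo, NY 14217", "buffalo,NY14217"))
print(result)
# TRUE FALSE
```

Anchoring the pattern with ^ and $ is what makes the check conservative: a string that merely contains an address-like substring is rejected.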
Examples
## Single element email
valid_email <- validate("@rm_email")
valid_email(c("tyler.rinker@gmail.com", "@trinker"))
## Multiple elements
valid_email_1 <- validate("@rm_email", single=FALSE)
valid_email_1(c("tyler.rinker@gmail.com", "@trinker"))
## single element address
valid_address <- validate("@rm_city_state_zip")
valid_address("Buffalo, NY 14217")
valid_address("buffalo,NY14217")
valid_address("buffalo NY 14217")
valid_address2 <- validate(paste0("(\\b([A-Z][\\w-]*)+),",
"\\s([A-Z]{2})\\s(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b"))
valid_address2("Buffalo, NY 14217")
valid_address2("buffalo, NY 14217")
valid_address2("buffalo,NY14217")
valid_address2("buffalo NY 14217")