oshkaWe will implement simplified versions of dplyr and data.table to illustrate how to write programmable NSE functions with oshka. The implementations are intentionally limited in functionality, robustness, and speed for the sake of simplicity.
dplyrThe interface is as follows:
group_r <- function(x, ...) {...} # similar to dplyr::group_by
filter_r <- function(x, subset) {...} # similar to dplyr::filter
summarize_r <- function(x, ...) {...} # similar to dplyr::summarise
`%$%` <- function(x, y) {...} # similar to the magrittr pipeOur functions mimic the corresponding dplyr ones:
CO2 %$% # built-in dataset
filter_r(grepl("[12]", Plant)) %$%
group_r(Type, Treatment) %$%
summarize_r(mean(conc), mean(uptake)) Type Treatment mean.conc. mean.uptake.
1 Quebec nonchilled 435 34.19286
2 Mississippi nonchilled 435 26.87143
3 Quebec chilled 435 31.33571
4 Mississippi chilled 435 15.07143
Most of the implementation is not directly related to oshka NSE, but we will go over summarize_r to highlight how those parts integrate with the rest. summarize_r is just a forwarding function:
summarize_r <- function(x, ...)
eval(bquote(.(summarize_r_l)(.(x), .(substitute(list(...))))), parent.frame())We use the eval/bquote pattern to forward NSE arguments. We retrieve summarize_r_l from the current function frame with .(), because there is no guarantee we would find it on the search path starting from the parent frame. In this case it happens to be available, but it would not be if these functions were in a package.
We present summarize_r_l in full for reference, but feel free to skip as we highlight the interesting bits next:
summarize_r_l <- function(x, els) {
frm <- parent.frame()
exps.sub <- expand(substitute(els), x, frm)
if(is.null(exps.sub)) x else {
# compute groups and splits
grps <- make_grps(x) # see appendix
splits <- lapply(grps, eval, x, frm)
dat.split <- split(x, splits, drop=TRUE)
grp.split <- if(!is.null(grps)) lapply(splits, split, splits, drop=TRUE)
# aggregate
res.list <- lapply(
dot_list(exps.sub), # see appendix
function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
)
list_to_df(res.list, grp.split, splits) # see appendix
}
}The only oshka specific line is the second one:
exps.sub <- expand(substitute(els), x, frm)els is the language captured and forwarded by summarize_r. We run expand on that language with our data x as the environment and the parent frame as the enclosure. We then compute the groups:
grps <- make_grps(x) # see appendix
splits <- lapply(grps, eval, x, frm)make_grps extracts the grouping expressions generating by group_r. These have already been substituted so we evaluate each one with x as the environment and the parent frame as the enclosure. We use this to split our data into groups:
dat.split <- split(x, splits, drop=TRUE)Finally we can evaluate our expanded expressions within each of the groups:
# aggregate
res.list <- lapply(
dot_list(exps.sub), # see appendix
function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
)
list_to_df(res.list, grp.split, splits) # see appendixdot.list turns exps.sub into a list of expressions. Each expression is then evaluated with each data chunk as the environment and the parent frame as the enclosure. Finally list_to_df turns our lists of vectors into a data frame.
You can see the rest of the implementation in the appendix.
That single expand line enables a programmable NSE:
f.exp <- quote(grepl("[12]", Plant))
s.exp <- quote(mean(uptake))
CO2 %$%
filter_r(f.exp & conc > 500) %$%
group_r(Type, Treatment) %$%
summarize_r(round(s.exp)) Type Treatment round.mean.uptake..
1 Quebec nonchilled 41
2 Mississippi nonchilled 33
3 Quebec chilled 38
4 Mississippi chilled 18
Because %$% uses expand you can even do the following:
f.exp.b <- quote(filter_r(grepl("[12]", Plant) & conc > 500))
g.exp.b <- quote(group_r(Type, Treatment))
s.exp.b <- quote(summarize_r(mean(conc), mean(uptake)))
exp <- quote(f.exp.b %$% g.exp.b %$% s.exp.b)
CO2 %$% exp Type Treatment mean.conc. mean.uptake.
1 Quebec nonchilled 837.5 41.150
2 Mississippi nonchilled 837.5 32.625
3 Quebec chilled 837.5 38.500
4 Mississippi chilled 837.5 18.050
data.tableWe wish to re-use our ersatz dplyr functions to create a data.table-like interface:
as.super_df <- function(x) {
class(x) <- c("super_df", class(x))
x
}
"[.super_df" <- function(x, i=NULL, j=NULL, by=NULL) {
frm <- parent.frame() # as per docs, safer to do this here
x <- as.data.frame(x)
x <- eval(bquote(.(filter_r)( .(x), .(substitute(i)))), frm)
x <- eval(bquote(.(group_r_l)( .(x), .(substitute(by)))), frm)
x <- eval(bquote(.(summarize_r_l)(.(x), .(substitute(j)))), frm)
as.super_df(x)
}Again, we use the eval/bquote pattern to forward the NSE arguments to our NSE functions filter_r, group_r_l, and summarize_r_l. The pattern is not trivial, but it only took six lines of code to transmogrify our faux-dplyr into a faux-data.table.
After we add the super_df class to our data we can start using it with data.table semantics, but with programmable NSE:
co2 <- as.super_df(CO2)
co2[f.exp, s.exp, by=Type] Type mean.uptake.
1 Quebec 32.76429
2 Mississippi 20.97143
exp.a <- quote(max(conc))
exp.b <- quote(min(conc))
co2[f.exp, list(exp.a, exp.b), by=list(Type, Treatment)][1:3,] Type Treatment max.conc. min.conc.
1 Quebec nonchilled 1000 95
2 Mississippi nonchilled 1000 95
3 Quebec chilled 1000 95
exp.c <- quote(list(exp.a, exp.b))
exp.d <- quote(list(Type, Treatment))
co2[f.exp, exp.c, by=exp.d][1:3,] Type Treatment max.conc. min.conc.
1 Quebec nonchilled 1000 95
2 Mississippi nonchilled 1000 95
3 Quebec chilled 1000 95
Despite the forwarding layers the symbols resolve as expected in complex circumstances:
exps <- quote(list(stop("boo"), stop("ya"))) # don't use this
g.exp <- quote(Whatever) # nor this
local({
summarize_r_l <- function(x, y) stop("boom") # nor this
max.upt <- quote(max(uptake)) # use this
min.upt <- quote(min(uptake)) # and this
exps <- list(max.upt, min.upt)
g.exp <- quote(Treatment) # and this
lapply(exps, function(y) co2[f.exp, y, by=g.exp])
})[[1]]
Treatment max.uptake.
1 nonchilled 44.3
2 chilled 42.4
[[2]]
Treatment min.uptake.
1 nonchilled 10.6
2 chilled 7.7
And we can even nest our dplyr and data.table for an unholy abomination:
exp <- quote(data.frame(upt=uptake) %$% summarize_r(new.upt=upt * 1.2))
local({
exps <- list(quote(sum(exp$new.upt)), quote(sum(uptake)))
g.exp <- quote(Treatment)
lapply(exps, function(y) co2[f.exp, y, by=g.exp])
})[[1]]
Treatment V1
1 nonchilled 1025.88
2 chilled 779.64
[[2]]
Treatment sum.uptake.
1 nonchilled 854.9
2 chilled 649.7
Ersatz dplyr implementation:
## - Summarize -----------------------------------------------------------------
summarize_r <- function(x, ...)
eval(bquote(.(summarize_r_l)(.(x), .(substitute(list(...))))), parent.frame())
summarize_r_l <- function(x, els) {
frm <- parent.frame()
exps.sub <- expand(substitute(els), x, frm)
if(is.null(exps.sub)) x else {
# compute groups and splits
grps <- make_grps(x) # see appendix
splits <- lapply(grps, eval, x, frm)
dat.split <- split(x, splits, drop=TRUE)
grp.split <- if(!is.null(grps)) lapply(splits, split, splits, drop=TRUE)
# aggregate
res.list <- lapply(
dot_list(exps.sub), # see appendix
function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
)
list_to_df(res.list, grp.split, splits) # see appendix
}
}
## - Grouping ------------------------------------------------------------------
group_r <- function(x, ...)
eval(bquote(.(group_r_l)(.(x), .(substitute(list(...))))), parent.frame())
group_r_l <- function(x, els) {
exps.sub <- expand(substitute(els), x, parent.frame())
if(is.null(exps.sub)) x else {
if(!is.call(exps.sub) || exps.sub[[1L]] != quote(list))
exps.sub <- call("list", exps.sub)
structure(x, .GRP=dot_list(exps.sub, "G"))
} }
## - Filtering -----------------------------------------------------------------
filter_r <- function(x, subset) {
sub.exp <- expand(substitute(subset), x, parent.frame())
sub.val <- eval(sub.exp, x, parent.frame())
as.data.frame(
if(!is.null(sub.val)) {
as.data.frame(x)[
if(is.numeric(sub.val)) sub.val else !is.na(sub.val) & sub.val,
]
} else x
)
}
## - Pipe ----------------------------------------------------------------------
`%$%` <- function(x, y) {
x.sub <- expand(substitute(x), parent.frame())
y.sub <- expand(substitute(y), parent.frame())
y.list <- if(!is.call(y.sub)) list(y.sub) else as.list(y.sub)
eval(sub_dat(y.sub, x), parent.frame())
}
## - Helper Funs ---------------------------------------------------------------
# Takes result of `substitute(list(...))` and returns a list of quoted language
# object with nice names.
dot_list <- function(x, pre="V") {
if(!is.call(x) || x[[1L]] != quote(list)) x <- call("list", x)
dots <- tail(as.list(x), -1L)
if(is.null(names(dots))) names(dots) <- character(length(dots))
for(i in seq_along(dots)[!nzchar(names(dots))])
names(dots)[i] <- if(
is.language(dots[[i]]) && nchar(deparse(dots[[i]])[[1]]) < 20
) deparse(dots[[i]])[[1]] else sprintf("%s%d", pre, i)
dots
}
# Used by the `%$%` pipe operator to find the correct point in the RHS to
# substitute the forwarded argument in
sub_dat <- function(z, dat) {
if(is.call(z)) {
if(z[[1]] == as.name('%$%')) z[[2]] <- sub_dat(z[[2]], dat)
else {
z.list <- as.list(z)
z <- as.call(c(z.list[1], list(dat), tail(z.list, -1)))
} }
z
}
# convert the ".GRP" attribute into usable form
make_grps <- function(x)
if(is.null(attr(x, ".GRP")) || !length(attr(x, ".GRP")))
list(rep_len(1, nrow(x))) else attr(x, ".GRP")
# Takes result list and makes into a data.frame by recycling elements so they
# are the same length a longest, and also adds in cols for the group vars
list_to_df <- function(dat, grp, splits) {
lens <- do.call(pmax, lapply(dat, lengths, integer(length(splits))))
as.data.frame(
lapply(c(grp, dat), function(x) unname(unlist(Map(rep_len, x, lens))))
)
}