Fun Standardizing Non-Standard Evaluation

Using dplyr calls on the back end of the rfishbase rewrite means working around non-standard evaluation (NSE), as described in the dplyr vignette.

Grab the data I was using for this:

library("dplyr")
downloader::download("https://github.com/cboettig/2015/raw/fc0d9185659e7976927d0ec91981912537ac6018/assets/data/2015-02-06-taxa.csv", "taxa.csv")
all_taxa <- read.csv("taxa.csv")

Consider a simple NSE dplyr call:

x <- filter(all_taxa, Family == 'Scaridae')

The best SE version of this just needs the formula expression (~), the underscored SE version of the function, and its .dots argument:

.dots <- list(~Family == 'Scaridae')
x1 <- filter_(all_taxa, .dots=.dots)

identical(x, x1)
[1] TRUE

This lets us treat the arguments (e.g. values of the factor on which we filter) as variables:

family <- 'Scaridae'
.dots <- list(~Family == family)
x2 <- filter_(all_taxa, .dots=.dots)
identical(x, x2)
[1] TRUE

If we want both the key and value to vary, we need to get pretty fancy to subvert the non-standard evaluation:

library(lazyeval)
family <- 'Scaridae'
field <- 'Family'
.dots <- list(interp(~y == x, 
                     .values = list(y = as.name(field), x = family)))
x3 <- filter_(all_taxa, .dots=.dots)
identical(x, x3)
[1] TRUE

It's a bit more fun to wrap this up so that it takes an arbitrary number of arguments as name-value pairs (see the function sketch after the code below):

query <- list(Family = 'Scaridae', SpecCode = 5537)
dots <- lapply(names(query), function(level){
  value <- query[[level]]
  interp(~y == x, 
         .values = list(y = as.name(level), x = value))
})

x4 <- filter_(all_taxa, .dots = dots)
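A minimal sketch of wrapping this into a reusable function (the name filter_by and its interface are my own invention here, not part of dplyr or rfishbase):

library(dplyr)
library(lazyeval)

# Hypothetical helper: filter a data.frame by an arbitrary named list of
# column = value pairs, testing exact equality on each column.
filter_by <- function(df, query){
  dots <- lapply(names(query), function(level){
    value <- query[[level]]
    interp(~y == x, .values = list(y = as.name(level), x = value))
  })
  filter_(df, .dots = dots)
}

x5 <- filter_by(all_taxa, list(Family = 'Scaridae', SpecCode = 5537))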

More fun standardizing NSE

The previous examples show only applications of filter_(). While the general idea is the same, this pattern doesn't translate directly to other functions, such as mutate_(). Here are some common patterns I've adopted when using mutate_(). First consider the familiar NSE usage:

df <- mutate(mtcars, displ_l = disp / 61.0237)
head(df)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb  displ_l
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 2.621932
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 2.621932
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1.769804
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 4.227866
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 5.899347
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 3.687092

Again we use the list(interp(...)) pattern, but note that we specify the name of our new column using setNames() (naming the elements of the list).

dots <- setNames(list(lazyeval::interp(~x / y, x = quote(disp), y = 61.0237)), "displ_l")
df2 <- mutate_(mtcars, .dots = dots)
identical(df, df2)
[1] TRUE

Of course, the use of y could be skipped in favor of writing the value directly into the formula if it were not a variable.
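For example, a minimal sketch of the same call with the divisor written directly into the formula, so only x is interpolated:

dots <- setNames(list(lazyeval::interp(~x / 61.0237, x = quote(disp))), "displ_l")
df3 <- mutate_(mtcars, .dots = dots)
identical(df, df3)   # same arithmetic on the same data, so again TRUE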

More dplyr patterns

I also thought I would scribble down some other common dplyr patterns I find myself re-using.

  • applying a function that returns a data.frame to each element of a list and coercing the combined output to a data.frame:
mylist %>% lapply(myfun) %>% dplyr::bind_rows() 

To place this deeper in the hadleyverse, purrr::map could be dropped in for lapply in the above example.
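For instance, a toy sketch of this pattern (mylist and myfun here are only illustrations, not from any package):

library(dplyr)
mylist <- split(mtcars, mtcars$cyl)   # a toy list of data.frames
myfun <- function(df) data.frame(n = nrow(df), mean_mpg = mean(df$mpg))

mylist %>% lapply(myfun) %>% bind_rows()
# or, equivalently, deeper in the hadleyverse:
# mylist %>% purrr::map(myfun) %>% bind_rows()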

That example also includes how to define group_by_all(), since that is usually the grouping I need from an expand.grid() call (that is, I want to apply over all combinations of some parameter settings, etc.); a rough sketch of one such definition follows.
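A minimal sketch of one way to define it, assuming the goal is simply to group by every column of the data frame (a small hand-rolled helper built on group_by_()):

# Group a data.frame by all of its columns at once,
# e.g. the output of an expand.grid() call.
group_by_all <- function(df){
  dplyr::group_by_(df, .dots = names(df))
}

params <- expand.grid(a = 1:2, b = c(0.1, 0.5))
params %>% group_by_all()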

Something I hope is not a common pattern, but one I struggled with for a bit: making recursive calls of the above pattern for nested lists. This code in RNeXML illustrates my solution, which required both function recursion and a function closure.
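The RNeXML code is more involved, but a toy sketch of the recursion-plus-closure idea might look like this (make_binder, flatten_rows, and the leaf function are hypothetical, not the RNeXML implementation):

library(dplyr)

# Hypothetical sketch: build a recursive function (closed over `myfun`) that
# applies `myfun` to each leaf of a nested list and bind_rows() the results
# back together at every level on the way up.
make_binder <- function(myfun){
  recurse <- function(x){
    if (!is.list(x) || is.data.frame(x))
      return(myfun(x))
    x %>% lapply(recurse) %>% bind_rows()
  }
  recurse
}

flatten_rows <- make_binder(function(leaf) data.frame(value = leaf))
flatten_rows(list(a = 1, b = list(c = 2, d = list(e = 3))))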