How do you program the column manipulation dynamically when working with dplyr verb?

Say, we have a dataframe with multiple columns in pair,

library(tidyverse)

# simulate some data
df <- rerun(3 * 2, rnorm(10, 5, 2)) %>% bind_cols()
# imagine some column names
colnames(df) <- map(c("weight", "height", "length"), paste0, c(".x", ".y")) %>% unlist()
df
## # A tibble: 10 x 6
##    weight.x weight.y height.x height.y length.x length.y
##       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1    3.36      8.00     4.09     5.64     3.87     5.11
##  2    4.20      5.04     3.38     4.66     4.95     3.42
##  3    3.79      4.75     7.54     5.81     4.35     9.59
##  4   -0.856     5.01     6.93     1.97     1.96     2.94
##  5    5.82      4.66     5.65     5.29     5.99     3.72
##  6    2.53      6.72     7.89     5.56     5.54     7.20
##  7    6.06      4.20     2.93     4.29     3.39     4.95
##  8    5.76      3.99     2.47     3.66     6.85     4.55
##  9    4.60      4.25     6.53     3.06     4.44     2.26
## 10    4.52      5.49     6.39     4.26     5.31     5.48

and we try to calculate the difference between metrices .x and .y

Expression

Let formulate a form of “expression”,

calc_diff <- function(var) {
    # the desire expression is (var.x - var.y)
    sprintf("%s.x - %s.y", var, var)
}
# example
calc_diff("weight")
## [1] "weight.x - weight.y"

Rlang

So we get our desire expression, but how do we pass them to dplyr to understand? We can use rlang to transform text into R expressions.

calc_diff <- function(df, var) {
    # parse as R code
    expr <- sprintf("%s.x - %s.y", var, var) %>% 
        rlang::parse_expr()
    df %>% mutate(!!expr)
}

df %>% 
    # add new column
    calc_diff("weight") %>% 
    # focus on result 
    select(contains("weight"))
## # A tibble: 10 x 3
##    weight.x weight.y `weight.x - weight.y`
##       <dbl>    <dbl>                 <dbl>
##  1    3.36      8.00                -4.64 
##  2    4.20      5.04                -0.843
##  3    3.79      4.75                -0.960
##  4   -0.856     5.01                -5.86 
##  5    5.82      4.66                 1.17 
##  6    2.53      6.72                -4.19 
##  7    6.06      4.20                 1.86 
##  8    5.76      3.99                 1.77 
##  9    4.60      4.25                 0.351
## 10    4.52      5.49                -0.970

Column Name

Well, we need a way to add name to the new column.

calc_diff <- function(df, var) {
    # to transform a quosure to a string 
    new_col_name <- quo_name(paste0(var, ".diff"))
    expr <- sprintf("%s.x - %s.y", var, var) %>% 
        rlang::parse_expr()
    # so that we can use it in dplyr context
    df %>% mutate(!!new_col_name := !!expr)
}

df %>% 
    # it works!
    calc_diff("weight") %>% 
    select(contains("weight"))
## # A tibble: 10 x 3
##    weight.x weight.y weight.diff
##       <dbl>    <dbl>       <dbl>
##  1    3.36      8.00      -4.64 
##  2    4.20      5.04      -0.843
##  3    3.79      4.75      -0.960
##  4   -0.856     5.01      -5.86 
##  5    5.82      4.66       1.17 
##  6    2.53      6.72      -4.19 
##  7    6.06      4.20       1.86 
##  8    5.76      3.99       1.77 
##  9    4.60      4.25       0.351
## 10    4.52      5.49      -0.970

Many Columns

So now we know how to do with 1 variable, how do we apply it to many?

  1. Use map as per normal to loop and generate multiple expressions
  2. Use set_names to give expressions list a name
  3. Use !!! (3x ! not 2) to splice multiple arguments in mutate. It will assign new variable names using names given by list (in step 2).
calc_diff <- function(df, ...) {
    # first, an expressions list
    n_exprs <- 
        # loop over multiple inputs
        map(list(...), ~ sprintf("%s.x - %s.y", .x, .x) %>% 
                       rlang::parse_expr()) %>% 
        set_names(map(list(...), ~ paste0(.x, ".diff")))
    
    # splices above expressions
    df %>% mutate(!!!n_exprs)
}

# tada~~
df %>% 
    calc_diff("weight", "height", "length")
## # A tibble: 10 x 9
##    weight.x weight.y height.x height.y length.x length.y weight.diff
##       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>       <dbl>
##  1    3.36      8.00     4.09     5.64     3.87     5.11      -4.64 
##  2    4.20      5.04     3.38     4.66     4.95     3.42      -0.843
##  3    3.79      4.75     7.54     5.81     4.35     9.59      -0.960
##  4   -0.856     5.01     6.93     1.97     1.96     2.94      -5.86 
##  5    5.82      4.66     5.65     5.29     5.99     3.72       1.17 
##  6    2.53      6.72     7.89     5.56     5.54     7.20      -4.19 
##  7    6.06      4.20     2.93     4.29     3.39     4.95       1.86 
##  8    5.76      3.99     2.47     3.66     6.85     4.55       1.77 
##  9    4.60      4.25     6.53     3.06     4.44     2.26       0.351
## 10    4.52      5.49     6.39     4.26     5.31     5.48      -0.970
## # ... with 2 more variables: height.diff <dbl>, length.diff <dbl>