Keep your R scripts locally sourced

[This article was first published on Higher Order Functions, and kindly contributed to R-bloggers.]

A few weeks ago, I had a bad debugging session. The code was just not doing what I expected, and I went down a lot of dead ends trying to fix or simplify things. I could not get the problem to happen in a reproducible example (reprex) or interactively (in RStudio). Eventually, the most minimal example of the problem completely broke my mental model for how the code should work.

The problem had to do with names and what they mean. select() is a function that lives in both the MASS package and the dplyr package, and I always intend for select() to point to dplyr::select(). But sometimes a statistics package will load MASS and overwrite select() to point to MASS::select(). And in this case, my attempts to use select() in a source()-ed file kept reverting to MASS::select() instead of dplyr::select(). A tweet from the session shows the minimal example and my wracked brain. (I will describe the example in more detail below.)

i'm dry heaving here wtf is going pic.twitter.com/KIeRJT6kwY

— tj mahr (@tjmahr) July 21, 2021

Here’s what happens:

  1. I explicitly assign select to dplyr::select().
  2. I make a function f() that prints the environment of select (where the name/function is defined), store the function in a .R text file and source() in the text file. (source() runs the code in an R script.)
  3. I print the value of select and see that it is indeed from the dplyr environment.
  4. I call my function, and it says that select is actually in the MASS package.
  5. I check the value of select, and it reports the dplyr environment once again.

A similar problem using functions

This problem only happened while knitting one of my analysis notebooks (which was a clue). Right now, it’s proving difficult for me to write examples of this problem for this blogpost, so I’m going to show the source of the problem using functions.

First, let’s set up things so that select belongs to the MASS package. We are also going to use the conflicted package, which normally prevents package name conflicts from happening. This part isn’t necessary or helpful; I just want to illustrate that this is not a simple name conflict problem.

library(conflicted)
library(MASS)
environment(select)
#> <environment: namespace:MASS>

We are going to make a function that does what my original code example tried to do:

source_in_my_code <- function(...) {
  # set dplyr select
  select <- dplyr::select
  
  # write a script to temporary file
  temp_script <- tempfile(fileext = ".R")
  my_code <- "
    f <- function() environment(select)
  "
  writeLines(my_code, temp_script)
  
  # run the script
  source(temp_script, ...)
  
  list(
    source_select_environment = f(),
    function_select_environment = environment(select)
  )
}


default_results <- source_in_my_code()

What do you think the select environment should be? dplyr, right? That’s what select means everywhere else inside of the function. source() is just like dropping in some R code and running it, right? That’s what I thought.

default_results
#> $source_select_environment
#> <environment: namespace:MASS>
#> 
#> $function_select_environment
#> <environment: namespace:dplyr>

No, it’s the MASS environment.

Local and parent environments

In order to understand what’s happening, let’s first note that R works by evaluating expressions in an environment. The environment defines the values of names. If a name is not found in an environment, R searches the parent environment for the name (or the parent’s parent, and so on). This idea is illustrated beautifully in Advanced R using diagrams.
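The lookup rule can be sketched with two environments built by hand. This is only an illustration: the names parent_env, child_env, a, and b are made up for the example and do not come from my original code.

```r
# Build a parent environment and a child environment that points to it
parent_env <- new.env()
child_env <- new.env(parent = parent_env)

# `a` is defined in both environments; `b` is defined only in the parent
assign("a", "child value", envir = child_env)
assign("a", "parent value", envir = parent_env)
assign("b", "parent value", envir = parent_env)

# Evaluate the names in the child environment
found_a <- eval(quote(a), child_env)  # found in child_env itself
found_b <- eval(quote(b), child_env)  # not in child_env, so R checks parent_env

# inherits = FALSE restricts the search to the given environment
b_in_child <- exists("b", envir = child_env, inherits = FALSE)
b_via_parents <- exists("b", envir = child_env, inherits = TRUE)
```

Here found_a takes the child’s value, found_b falls back to the parent’s value, and b is only visible from child_env when the search is allowed to continue into parents.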

As an analogy, you might think of looking up a name in nested environments like looking up someone in an office directory, then a building directory, then an area directory:

I like the multi-company building analogy. If you want to call Jim, first you look in your company directory. If there isn’t a Jim there, you look in the all-building maintenance dir. If not there, you look in the city services dir. You don’t look in another company-specific dir

— Brenton Wiernik (@bmwiernik) April 27, 2021

Here is a small example showing a local function environment, its parent environment, and how a name will take different values depending on the context.

where_am_i <- "outside of the function"
where_are_you <- "outside of the function too"

where_is_everyone <- function() {
  where_am_i <- "inside of the function"
  list(
    where_am_i = where_am_i,
    where_are_you = where_are_you
  )
} 

where_am_i
#> [1] "outside of the function"
where_is_everyone()
#> $where_am_i
#> [1] "inside of the function"
#> 
#> $where_are_you
#> [1] "outside of the function too"
where_am_i
#> [1] "outside of the function"

Outside of the function, where_am_i is "outside of the function", but in the body of the function, it is defined as "inside of the function". The variable where_are_you is defined only outside of the function, so the function has to search for the variable in its parent environment.

"parent" environment suggests a family metaphor. if you cant find what a symbol means, ask a parent.

— tj mahr (@tjmahr) April 27, 2021

Locally sourced R code

Reading the documentation for source(), we find the solution to the original problem:

Arguments

local
TRUE, FALSE or an environment, determining where the parsed expressions are evaluated. FALSE (the default) corresponds to the user’s workspace (the global environment) and TRUE to the environment from which source is called.

By default, the code evaluated by source() runs in the global environment, that is, “outside” of the body of the function. The code breaks out of the function environment and runs in the top-level environment instead.

My mental model for source() was completely wrong. source() is not like dropping in the R code from a file and running it. It is more like pausing everything that you’re doing in your current context, backing out to the highest level context, running that code, and then resuming what you’re doing.
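One way to see this behavior directly is to check where a sourced assignment ends up. The following is a minimal sketch; the script contents and the names sourced_value and run_script() are invented for the illustration.

```r
# Write a one-line script to a temporary file
script <- tempfile(fileext = ".R")
writeLines("sourced_value <- 'set by the script'", script)

run_script <- function(...) {
  source(script, ...)
  # Was the variable created in this function's own environment?
  exists("sourced_value", envir = environment(), inherits = FALSE)
}

# Default (local = FALSE): the assignment lands in the global environment
created_locally_default <- run_script()
created_globally_default <- exists(
  "sourced_value", envir = globalenv(), inherits = FALSE
)

# Clean up, then repeat with local = TRUE
rm("sourced_value", envir = globalenv())
created_locally_local <- run_script(local = TRUE)
created_globally_local <- exists(
  "sourced_value", envir = globalenv(), inherits = FALSE
)
```

With the default, sourced_value shows up in the global environment and not inside run_script(). With local = TRUE, it is the other way around.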

Fortunately, if we ask source to run locally (local = TRUE), select has the same environment inside the function and in the code run using source().

# I defined the function so it could pass arguments to source()
source_in_my_code(local = TRUE)
#> $source_select_environment
#> <environment: namespace:dplyr>
#> 
#> $function_select_environment
#> <environment: namespace:dplyr>

When we’re using source() as one of the first few lines of an R script, the default global environment for source() doesn’t really matter. But in contexts like the function example or code stored in a custom knitr/RMarkdown setup (my original problem), this difference is a problem. Therefore, in the future, I’m going to abide by the motto Keep it locally sourced. This way fits my mental model for source() as something that drops in R code and runs it in place.
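If source() gets wrapped in a helper function, local = TRUE would refer to the helper’s own environment, not to the code that called the helper. A rough sketch of one way around that, assuming a hypothetical helper named source_here() (not part of base R or any package), is to pass the caller’s environment explicitly:

```r
# Hypothetical helper: source a file into the environment of whoever calls it
source_here <- function(file, ...) {
  caller_env <- parent.frame()
  source(file, local = caller_env, ...)
}

# A script whose function depends on a name defined by the caller
script <- tempfile(fileext = ".R")
writeLines("get_label <- function() label", script)

use_helper <- function() {
  label <- "defined inside use_helper()"
  source_here(script)
  # get_label() was created in this function's environment, so it finds `label`
  get_label()
}

helper_result <- use_helper()
```

Because the script is evaluated in the environment of use_helper(), get_label() is defined there and sees the local label instead of searching the global environment first.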

And by the way, yes, even though I cited Advanced R above, I clearly did not do all of the exercises:

20.2.4 Exercises

  1. Carefully read the documentation for source(). What environment does it use by default? What if you supply local = TRUE? How do you provide a custom environment?
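For the last part of that exercise, the documentation quoted above says that local can also be an environment. A small sketch of what that looks like, using a made-up sandbox environment and a made-up one-line script:

```r
# A script that relies on a name `x` and creates a name `doubled`
script <- tempfile(fileext = ".R")
writeLines("doubled <- x * 2", script)

# A custom environment that supplies `x`
sandbox <- new.env()
sandbox$x <- 21

# Evaluate the script inside the custom environment
source(script, local = sandbox)

sandbox_result <- sandbox$doubled
leaked_to_global <- exists("doubled", envir = globalenv(), inherits = FALSE)
```

The script finds x in sandbox and stores doubled there, so nothing new is created in the global environment.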

Last knitted on 2021-08-16. Source code on GitHub. [1]

  1. sessioninfo::session_info()
    #> - Session info ---------------------------------------------------------------
    #>  setting  value                       
    #>  version  R version 4.1.0 (2021-05-18)
    #>  os       Windows 10 x64              
    #>  system   x86_64, mingw32             
    #>  ui       RTerm                       
    #>  language (EN)                        
    #>  collate  English_United States.1252  
    #>  ctype    English_United States.1252  
    #>  tz       America/Chicago             
    #>  date     2021-08-16                  
    #> 
    #> - Packages -------------------------------------------------------------------
    #>  package     * version    date       lib source                     
    #>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.1.0)             
    #>  cachem        1.0.5      2021-05-15 [1] CRAN (R 4.1.0)             
    #>  cli           3.0.1      2021-07-17 [1] CRAN (R 4.1.0)             
    #>  conflicted  * 1.0.4      2019-06-21 [1] CRAN (R 4.1.0)             
    #>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.1.0)             
    #>  DBI           1.1.1      2021-01-15 [1] CRAN (R 4.1.0)             
    #>  dplyr         1.0.7      2021-06-18 [1] CRAN (R 4.1.0)             
    #>  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.0)             
    #>  emo           0.0.0.9000 2021-06-28 [1] Github (hadley/emo@3f03b11)
    #>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.1.0)             
    #>  fansi         0.5.0      2021-05-25 [1] CRAN (R 4.1.0)             
    #>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.1.0)             
    #>  generics      0.1.0      2020-10-31 [1] CRAN (R 4.1.0)             
    #>  git2r         0.28.0     2021-01-10 [1] CRAN (R 4.1.0)             
    #>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.1.0)             
    #>  here          1.0.1      2020-12-13 [1] CRAN (R 4.1.0)             
    #>  knitr       * 1.33       2021-04-24 [1] CRAN (R 4.1.0)             
    #>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.1.0)             
    #>  lubridate     1.7.10     2021-02-26 [1] CRAN (R 4.1.0)             
    #>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.1.0)             
    #>  MASS        * 7.3-54     2021-05-03 [1] CRAN (R 4.1.0)             
    #>  pillar        1.6.2      2021-07-29 [1] CRAN (R 4.1.0)             
    #>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.1.0)             
    #>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.1.0)             
    #>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.1.0)             
    #>  ragg          1.1.3      2021-06-09 [1] CRAN (R 4.1.0)             
    #>  Rcpp          1.0.7      2021-07-07 [1] CRAN (R 4.1.0)             
    #>  rlang         0.4.11     2021-04-30 [1] CRAN (R 4.1.0)             
    #>  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.1.0)             
    #>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.1.0)             
    #>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.1.0)             
    #>  stringi       1.7.3      2021-07-16 [1] CRAN (R 4.1.0)             
    #>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.1.0)             
    #>  systemfonts   1.0.2      2021-05-11 [1] CRAN (R 4.1.0)             
    #>  textshaping   0.3.5      2021-06-09 [1] CRAN (R 4.1.0)             
    #>  tibble        3.1.3      2021-07-23 [1] CRAN (R 4.1.0)             
    #>  tidyselect    1.1.1      2021-04-30 [1] CRAN (R 4.1.0)             
    #>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.1.0)             
    #>  vctrs         0.3.8      2021-04-29 [1] CRAN (R 4.1.0)             
    #>  withr         2.4.2      2021-04-18 [1] CRAN (R 4.1.0)             
    #>  xfun          0.24       2021-06-15 [1] CRAN (R 4.1.0)             
    #> 
    #> [1] C:/Users/Tristan/Documents/R/win-library/4.1
    #> [2] C:/Program Files/R/R-4.1.0/library
    
