Keep your R scripts locally sourced
A few weeks ago, I had a bad debugging session. The code was just not doing what I expected, and I went down a lot of dead ends trying to fix or simplify things. I could not get the problem to happen in a reproducible example (reprex) or interactively (in RStudio). Eventually, the most minimal example of the problem completely broke my mental model for how the code should work.
The problem had to do with names and what they mean. select() is a function that lives in the MASS package and the dplyr package, and I always intend for select() to point to dplyr::select(). But sometimes a statistics package will load in MASS and overwrite select() to point to MASS::select(). And in this case, my attempts to use select() in a source()-ed file kept reverting to MASS::select() instead of dplyr::select(). A tweet from the session shows the minimal example and my wracked brain. (I will describe the example in more detail below.)
i'm dry heaving here wtf is going pic.twitter.com/KIeRJT6kwY
tj mahr (@tjmahr), July 21, 2021
Here’s what happens:
- I explicitly assign select to dplyr::select().
- I make a function f() that prints the environment of select (where the name/function is defined), store the function in a .R text file and source() in the text file. (source() runs the code in an R script.)
- I print the value of select and see that it is indeed from the dplyr environment.
- I call my function, and it says that select is actually in the MASS package.
- I check the value of select, and it reports the dplyr environment once again.
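The name masking behind this story can be imitated without MASS or dplyr. The following is only a sketch: first_pkg and second_pkg are hypothetical stand-ins (plain lists attached to the search path), not real packages, and it assumes nothing else in the session defines select.

```r
# Two stand-in "packages" (hypothetical, plain lists) that both define select()
first_pkg <- list(select = function(...) "first_pkg version")
second_pkg <- list(select = function(...) "second_pkg version")

# Attaching puts each list on the search path; R reports the masking
attach(first_pkg, name = "first_pkg")
attach(second_pkg, name = "second_pkg")

# The most recently attached definition is found first
masked_result <- select()

# Reaching into a specific search-path entry sidesteps the masking,
# in the same spirit as writing pkg::fun() for real packages
explicit_result <- as.environment("first_pkg")$select()

# Clean up the search path
detach("second_pkg")
detach("first_pkg")

masked_result
explicit_result
```

Here masked_result comes from whichever entry was attached last, which is the same mechanism that lets a later library(MASS) take over the bare name select.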
A similar problem using functions
This problem only happened while knitting one of my analysis notebooks (which was a clue). Right now, it’s proving difficult for me to write examples of this problem for this blogpost, so I’m going to show the source of the problem using functions.
First, let’s set up things so that select belongs to the MASS package. We are also going to use the conflicted package which normally prevents package name conflicts from happening. This part isn’t necessary or helpful; I just want to illustrate that this is not a simple name conflict problem.
```r
library(conflicted)
library(MASS)
environment(select)
#> <environment: namespace:MASS>
```
We are going to make a function that does what my original code example tried to do:
- set select to dplyr explicitly
- source() in a file that gives the environment of select
- return the environment of select, both using the source()-ed function and directly.
```r
source_in_my_code <- function(...) {
  # set dplyr select
  select <- dplyr::select

  # write a script to temporary file
  temp_script <- tempfile(fileext = ".R")
  my_code <- "
  f <- function() environment(select)
  "
  writeLines(my_code, temp_script)

  # run the script
  source(temp_script, ...)

  list(
    source_select_environment = f(),
    function_select_environment = environment(select)
  )
}

default_results <- source_in_my_code()
```
What do you think the select environment should be? dplyr, right? That’s what select means everywhere else inside of the function. source() is just like dropping in some R code and running it, right?

That’s what I thought.
```r
default_results
#> $source_select_environment
#> <environment: namespace:MASS>
#>
#> $function_select_environment
#> <environment: namespace:dplyr>
```
No, it’s the MASS environment.
Local and parent environments
In order to understand what’s happening, let’s first note that R works by evaluating expressions in an environment. The environment defines the values of names. If a name is not found in an environment, R searches the parent environment for the name (or the parent’s parent, and so on). This idea is illustrated beautifully in Advanced R using diagrams.
For an analogy, you might think of environments as looking up someone in an office, a building directory, then an area directory:
I like the multi-company building analogy. If you want to call Jim, first you look in your company directory. If there isn’t a Jim there, you look in the all-building maintenance dir. If not there, you look in the city services dir. You don’t look in another company-specific dir
Brenton Wiernik (@bmwiernik), April 27, 2021
Here is a small example showing a local function environment, its parent environment and how a name will take different values depending on the context.
```r
where_am_i <- "outside of the function"
where_are_you <- "outside of the function too"

where_is_everyone <- function() {
  where_am_i <- "inside of the function"
  list(
    where_am_i = where_am_i,
    where_are_you = where_are_you
  )
}

where_am_i
#> [1] "outside of the function"

where_is_everyone()
#> $where_am_i
#> [1] "inside of the function"
#>
#> $where_are_you
#> [1] "outside of the function too"

where_am_i
#> [1] "outside of the function"
```
Outside of the function, where_am_i is "outside of the function", but in the body of the function, it is defined to be "inside of the function". The variable where_are_you is only defined outside of the function (as "outside of the function too"), so the function has to search for the variable in its parent environment.
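The search through parent environments can also be walked by hand. This is a rough sketch using only base R; where_is(), outer_env and inner_env are hypothetical names made up for the illustration, and R’s real lookup is of course built in rather than written like this.

```r
# Hypothetical helper: starting from `env`, return the first environment
# that defines `name`, moving to the parent each time the name is missing.
where_is <- function(name, env) {
  while (!identical(env, emptyenv())) {
    if (exists(name, envir = env, inherits = FALSE)) {
      return(env)
    }
    env <- parent.env(env)
  }
  NULL
}

# Two explicit environments standing in for "outside" and "inside" a function
outer_env <- new.env()
inner_env <- new.env(parent = outer_env)

assign("where_am_i", "outside", envir = outer_env)
assign("where_are_you", "outside too", envir = outer_env)
assign("where_am_i", "inside", envir = inner_env)

# Defined locally, so the search stops immediately
found_locally <- identical(where_is("where_am_i", inner_env), inner_env)

# Not defined locally, so the search moves up to the parent
found_in_parent <- identical(where_is("where_are_you", inner_env), outer_env)

# get() performs the same kind of search and returns the value it finds
local_value <- get("where_am_i", envir = inner_env)
inherited_value <- get("where_are_you", envir = inner_env)
```

Both found_locally and found_in_parent should be TRUE, mirroring how where_is_everyone() found one name in its own body and the other one level up.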
"parent" environment suggests a family metaphor. if you cant find what a symbol means, ask a parent.
tj mahr (@tjmahr), April 27, 2021
Locally sourced R code
Reading the documentation to source(), we find the solution to the original problem:
Arguments

local
TRUE, FALSE or an environment, determining where the parsed expressions are evaluated. FALSE (the default) corresponds to the user’s workspace (the global environment) and TRUE to the environment from which source is called.
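By default, the code evaluated by source() runs in the global environment, that is, “outside” of the body of the function. The code breaks out of the function environment and runs in the higher environment. In the example, this means f() is created in the global environment, so its lookup of select never sees the local assignment and lands on MASS::select() instead.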
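The effect of the default can be checked without any packages. The sketch below is an illustration only: run_script() and sourced_value are hypothetical names, and the script is a one-liner written to a temporary file.

```r
# Write a one-line script to a temporary file
temp_script <- tempfile(fileext = ".R")
writeLines("sourced_value <- 'set by the script'", temp_script)

run_script <- function(...) {
  source(temp_script, ...)
  # Was the variable created here, in the function's own environment?
  exists("sourced_value", envir = environment(), inherits = FALSE)
}

# Default (local = FALSE): the assignment lands in the global environment
created_locally_default <- run_script()
created_globally <- exists(
  "sourced_value", envir = globalenv(), inherits = FALSE
)

# Remove the global copy, then try again with local = TRUE
rm("sourced_value", envir = globalenv())
created_locally_true <- run_script(local = TRUE)
absent_globally <- !exists(
  "sourced_value", envir = globalenv(), inherits = FALSE
)

c(
  created_locally_default = created_locally_default, # FALSE
  created_globally = created_globally,               # TRUE
  created_locally_true = created_locally_true,       # TRUE
  absent_globally = absent_globally                  # TRUE
)
```

With the default, the variable shows up in the global environment and not in the function; with local = TRUE, it is the other way around.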
My mental model for source() was completely wrong. source() is not like dropping in the R code from a file and running it. It is more like pausing everything that you’re doing in your current context, backing out to the highest level context, running that code, and then resuming what you’re doing.
Fortunately, if we ask source to run locally (local = TRUE), select has the same environment inside the function and in the code run using source().
```r
# I defined the function so it could pass arguments to source()
source_in_my_code(local = TRUE)
#> $source_select_environment
#> <environment: namespace:dplyr>
#>
#> $function_select_environment
#> <environment: namespace:dplyr>
```
When we’re using source() as one of the first few lines of an R script, the default global environment for source() doesn’t really matter. But in contexts like the function example or code stored in a custom knitr/RMarkdown setup (my original problem), this difference is a problem. Therefore, in the future, I’m going to abide by the motto Keep it locally sourced. This way fits my mental model for source() as something that drops in R code and runs it in place.
And by the way, yes, even though I cited Advanced R above, I clearly did not do all of the exercises:
- Carefully read the documentation for source(). What environment does it use by default? What if you supply local = TRUE? How do you provide a custom environment?
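For the last part of that exercise, one possible answer is to pass an environment as local. This is a sketch under assumptions: script_env and get_select() are hypothetical names, and a plain string stands in for a real select function so that no packages are needed.

```r
# A script that defines a function which looks up the name `select`
temp_script <- tempfile(fileext = ".R")
writeLines("get_select <- function() select", temp_script)

# A dedicated environment with its own definition of `select`
# (a placeholder string here, standing in for something like dplyr::select)
script_env <- new.env()
assign("select", "my preferred select", envir = script_env)

# Evaluate the script inside that environment
source(temp_script, local = script_env)

# get_select() was created inside script_env, so it searches there first
defined_in_env <- exists("get_select", envir = script_env, inherits = FALSE)
custom_result <- script_env$get_select()

defined_in_env
custom_result
```

Everything the script defines stays inside script_env, and names used by the script are resolved starting from that environment.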
Last knitted on 2021-08-16. Source code on GitHub.[1]

[1] Session info:

```r
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.0 (2021-05-18)
#>  os       Windows 10 x64
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/Chicago
#>  date     2021-08-16
#>
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.1.0)
#>  cachem        1.0.5      2021-05-15 [1] CRAN (R 4.1.0)
#>  cli           3.0.1      2021-07-17 [1] CRAN (R 4.1.0)
#>  conflicted  * 1.0.4      2019-06-21 [1] CRAN (R 4.1.0)
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.1.0)
#>  DBI           1.1.1      2021-01-15 [1] CRAN (R 4.1.0)
#>  dplyr         1.0.7      2021-06-18 [1] CRAN (R 4.1.0)
#>  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
#>  emo           0.0.0.9000 2021-06-28 [1] Github (hadley/emo@3f03b11)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi         0.5.0      2021-05-25 [1] CRAN (R 4.1.0)
#>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.1.0)
#>  generics      0.1.0      2020-10-31 [1] CRAN (R 4.1.0)
#>  git2r         0.28.0     2021-01-10 [1] CRAN (R 4.1.0)
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.1.0)
#>  here          1.0.1      2020-12-13 [1] CRAN (R 4.1.0)
#>  knitr       * 1.33       2021-04-24 [1] CRAN (R 4.1.0)
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.1.0)
#>  lubridate     1.7.10     2021-02-26 [1] CRAN (R 4.1.0)
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.1.0)
#>  MASS        * 7.3-54     2021-05-03 [1] CRAN (R 4.1.0)
#>  pillar        1.6.2      2021-07-29 [1] CRAN (R 4.1.0)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.1.0)
#>  ragg          1.1.3      2021-06-09 [1] CRAN (R 4.1.0)
#>  Rcpp          1.0.7      2021-07-07 [1] CRAN (R 4.1.0)
#>  rlang         0.4.11     2021-04-30 [1] CRAN (R 4.1.0)
#>  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.1.0)
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi       1.7.3      2021-07-16 [1] CRAN (R 4.1.0)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.1.0)
#>  systemfonts   1.0.2      2021-05-11 [1] CRAN (R 4.1.0)
#>  textshaping   0.3.5      2021-06-09 [1] CRAN (R 4.1.0)
#>  tibble        3.1.3      2021-07-23 [1] CRAN (R 4.1.0)
#>  tidyselect    1.1.1      2021-04-30 [1] CRAN (R 4.1.0)
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.3.8      2021-04-29 [1] CRAN (R 4.1.0)
#>  withr         2.4.2      2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun          0.24       2021-06-15 [1] CRAN (R 4.1.0)
#>
#> [1] C:/Users/Tristan/Documents/R/win-library/4.1
#> [2] C:/Program Files/R/R-4.1.0/library
```