Site icon R-bloggers

Using a Data Dictionary to Recode Columns with dplyr

[This article was first published on Mostly Counting | Tim Tiefenbach, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Intro

Today’s blog post is all about recoding columns using a data dictionary and dplyr::recode().

Many datasets, especially from surveys, come along with a proper documentation often in form of a so called “data dictionary”. A data dictionary contains at least three pieces of information: the (i) column names that are used in the dataset as well as corresponding (ii) numeric values and (iii) labels which translate those abstract numbers into meaningful terms.

At times, you may need to transform the raw values into their associated labels for tasks like reporting or plotting. Here, dplyr::recode() serves as an efficient tool to programmatically recode, and also rename, columns according to a data dictionary.

Recently I revamped an old ETL script I had written, which recoded multiple datasets based on a data dictionary. This script was from the pre-dplyr 1.0 era, so updating it provided a great opportunity to revisit this task, this time armed with the latest dplyr version. The resulting approach was such a significant improvement over my original method, that I felt compelled to share it here, serving as both a personal reference and a resource for anyone confronted with similar challenges.

We’ll start with a straightforward example to demonstrate the basic workflow. We then look at a more advanced scenario involving multiple datasets and a comprehensive data dictionary. Lastly, this blog concludes with some thoughts about the recent changes in dplyr version 1.1.0 and recode‘s new successor case_match().

Recoding one dataset

Setup

Lets start with a really simple dataset composed of three columms and fives rows:

library(dplyr)

# dataset
dat <- tibble(a = 1:5,
              b = c(10:14),
              c = c(20:24))
dat
#> # A tibble: 5 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1    10    20
#> 2     2    11    21
#> 3     3    12    22
#> 4     4    13    23
#> 5     5    14    24

Assume we have a data dictionary that looks like this:

# dictionary
dat_dict <- tibble(
  
  col_nm = c(
    rep("b", 5),
    rep("c", 5)
  ),
  
  value = c(
    10:14,
    20:24
  ),
  
  label = c(
    letters[1:5],
    letters[6:10]
  )
)

dat_dict
#> # A tibble: 10 × 3
#>    col_nm value label
#>    <chr>  <int> <chr>
#>  1 b         10 a    
#>  2 b         11 b    
#>  3 b         12 c    
#>  4 b         13 d    
#>  5 b         14 e    
#>  6 c         20 f    
#>  7 c         21 g    
#>  8 c         22 h    
#>  9 c         23 i    
#> 10 c         24 j

The dictionary has three columns: col_nm records our dataset’s column names, value holds the values, as seen in our raw data dat, and label stores the corresponding labels.

Note, that our dataset contains a column a that is not part of the data dictionary, just to show that the dictionary doesn’t need to hold value-label pairs for all columns.

Typically, attached data dictionaries come in form of csv or xlsx files, but after reading them into R, they should be in a similar form as our mock dictionary above.

Preparations

Next, we need to prepare two things: (i) a custom function to recode a single column according to the dictionary and (ii) a vector of columns names we want to recode. We will then use both inside dplyr::mutate(across(...)).

Taking a closer look at our custom recode function, it accepts the column to recode, x, as its only argument. The function works in two steps. Initially, we create a named vector of matching label-value pairs. Next, we splice this vector as arguments into dplyr::recode(x, ...) using rlang’s triple bang operator !!!.

# recode function to be used within `dplyr::mutate(across(...))`
recode_col <- function(x) {
  
  recode_vec <- dat_dict |>
    filter(col_nm == cur_column()) |>
    pull(label, name = value)
  
  dplyr::recode(x, !!! recode_vec)
}

To create a named vector of matching label-value pairs we start with our data dictionary dat_dict. We filter the current column using cur_column(), which is possible since we’re going use this function inside dplyr::across() where cur_column() yields us the string name of the current column. Finally, we use pull(label, name = value) to get the labels as character vector, along with their matching values as names.

In the final step of our custom function, we “splice” this named vector of label-value pairs into dplyr::recode(). Despite “splicing” sounding rather technical, it’s essentially an early evaluation. We evaluate the vector recode_vec prior to processing the dplyr::recode() call. Assume we want to recode column b in dat and recode_vec looks like this:

recode_vec <- c("10" = "a", "11" = "b", "12" = "c", "13" = "d", "14" = "e")

Then …

dplyr::recode(x, !!! recode_vec)

… becomes:

dplyr::recode(x, c("10" = "a", "11" = "b", "12" = "c", "13" = "d", "14" = "e"))

Keep in mind, we’re presuming hat the data dictionary is available in the global environment, and that our dataset dat doesn’t contain a column with an identical name as our data dictionary dat_dict. We can bolster the safety of our approach by supplying the data dictionary as a second argument to our recode_col() function. For interested readers this is shown in the info box below.

If we want to make sure that our recode_col() function uses the correct dictionary then we can supply it as second argument:

# recode function to be used within `dplyr::mutate(across(...))`
recode_col_safe <- function(x, dict) {
  
  recode_vec <- dict |>
    filter(col_nm == cur_column()) |>
    pull(label, name = value)
  
  dplyr::recode(x, !!! recode_vec)
}

In that case, it is not enough to only supply the bare function recode_col to across. We need to create an anonymous function \(x, dic) ... that calls recode_col. Here we can use .env$dat_dict to tell dplyr to look for an object dat_dict in a parent environment (and not inside our data.frame). This would prevent an error in case our data would contain an column with the same name as our data dictionary dat_dict.

# vector of columns to recode
cols_vec <- unique(dat_dict$col_nm)

# recoding defined columns
dat |> 
  mutate(across(all_of(cols_vec),
                \(x, dic) recode_col_safe(x, .env$dat_dict))
  )
#> # A tibble: 5 × 3
#>       a b     c    
#>   <int> <chr> <chr>
#> 1     1 a     f    
#> 2     2 b     g    
#> 3     3 c     h    
#> 4     4 d     i    
#> 5     5 e     j

With our custom recode function ready, the next thing we need is a vector of column names that we want to apply this function to. A straightforward way to do this is to get all unique column names from our data dictionary.

# vector of columns to recode
cols_vec <- unique(dat_dict$col_nm)
cols_vec
#> [1] "b" "c"

Recoding

Now we are all set, and the only thing left to do is to call across() on all of the column names in our data dictionary all_of(col_vec) and let our custom recode function recode_col() do its job.

# recoding defined columns
dat |> 
  mutate(across(all_of(cols_vec),
                recode_col)
  )
#> # A tibble: 5 × 3
#>       a b     c    
#>   <int> <chr> <chr>
#> 1     1 a     f    
#> 2     2 b     g    
#> 3     3 c     h    
#> 4     4 d     i    
#> 5     5 e     j

Renaming columns

Occasionally, data dictionaries offer not just corresponding values and labels, but also new, typically more descriptive, column names.

I’ve dedicated an entire blog post to the subject of renaming columns based on a lookup table. However, as the approach is slightly different when using a dictionary compared to a pure lookup table, and since this topic is quite relevant, let’s take a brief look at it.

Suppose our data dictionary, dat_dict2 includes the original abbreviated column names short_nm and corresponding descriptive column names long_nm:

#> # A tibble: 10 × 4
#>    short_nm long_nm value label
#>    <chr>    <chr>   <int> <chr>
#>  1 b        new_b      10 a    
#>  2 b        new_b      11 b    
#>  3 b        new_b      12 c    
#>  4 b        new_b      13 d    
#>  5 b        new_b      14 e    
#>  6 c        new_c      20 f    
#>  7 c        new_c      21 g    
#>  8 c        new_c      22 h    
#>  9 c        new_c      23 i    
#> 10 c        new_c      24 j
# dictionary
dat_dict2 <- tibble(
  
  short_nm = c(
    rep("b", 5),
    rep("c", 5)
  ),
  
  long_nm = c(
    rep("new_b", 5),
    rep("new_c", 5)
  ),
  
  value = c(
    10:14,
    20:24
  ),
  
  label = c(
    letters[1:5],
    letters[6:10]
  )
)

This time, we only require a named vector of corresponding old and new column names. To create that we use our data dictionary and filter it for distinct entries in short_nm and long_nm. Then we use again pull() together with its name argument, but note that the old and new values are in reverse positions compared to recode().

rename_vec <- dat_dict2 |>
  distinct(short_nm, long_nm) |>
  pull(short_nm, name = long_nm)

rename_vec
#> new_b new_c 
#>   "b"   "c"

Once we have this named vector of corresponding short and long column names we use all_of() inside rename():

dat |>
  rename(all_of(rename_vec))
#> # A tibble: 5 × 3
#>       a new_b new_c
#>   <int> <int> <int>
#> 1     1    10    20
#> 2     2    11    21
#> 3     3    12    22
#> 4     4    13    23
#> 5     5    14    24

As we’ve demonstrated above, it’s remarkably straightforward to not just recode, but also rename columns according to a data dictionary. If your work solely involves single datasets, you can skip the next section, which will expand upon the previous approach, demonstrating how to recode a list of datasets.

Recoding many datasets

Setup and Reasoning

Initially, the necessity of having a list of datasets and one comprehensive data dictionary may not be evident. A plausible scenario, for instance, could be a customer survey program composed of numerous surveys featuring similar, yet not identical, survey items. These survey results are stored in a generic table within a database, with columns simply labeled item1, item2, etc. The number of survey items may vary among surveys. Each row contains a respondent ID to identify a respondent and a survey ID to indicate the specific customer journey under which a respondent was surveyed. As the surveys are similar but not identical, the same column (e.g., item1) may contain different values across different surveys. Even if column values are consistent, they could correspond to different labels.

Again, our objective is to recode—and while we’re at it, also rename—all columns listed in the data dictionary across all surveys.

To keep this example as straightforward as possible, we’ll use a minimal setup. In addition to dat from before, let’s construct another small toy dataset dat2 and nest both within a data.frame consisting of two columns: the id of each dataset and the actual data.

# another dataset
dat2 <- tibble(a = 1:5,
               d = 10:14,
               e = 7:11)


# a list of datasets
dat_ls <- tibble(id = c("dat1", "dat2"),
                 data = tibble::lst(dat, dat2))

dat_ls
#> # A tibble: 2 × 2
#>   id    data            
#>   <chr> <named list>    
#> 1 dat1  <tibble [5 × 3]>
#> 2 dat2  <tibble [5 × 3]>

This is how the “data” list-column looks like:

dat_ls$data
#> $dat
#> # A tibble: 5 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1    10    20
#> 2     2    11    21
#> 3     3    12    22
#> 4     4    13    23
#> 5     5    14    24
#> 
#> $dat2
#> # A tibble: 5 × 3
#>       a     d     e
#>   <int> <int> <int>
#> 1     1    10     7
#> 2     2    11     8
#> 3     3    12     9
#> 4     4    13    10
#> 5     5    14    11

We assume once more that we have a data dictionary, dat_dict3, which contains old, short column names short_nm, new long column names long_nm, as well as a mapping between values and labels. The only difference from the previous example is that we now have an additional id column, indicating to which dataset the value-label mappings belong.

#> # A tibble: 20 × 5
#>    id    short_nm long_nm value label
#>    <chr> <chr>    <chr>   <int> <chr>
#>  1 dat1  b        new_b      10 a    
#>  2 dat1  b        new_b      11 b    
#>  3 dat1  b        new_b      12 c    
#>  4 dat1  b        new_b      13 d    
#>  5 dat1  b        new_b      14 e    
#>  6 dat1  c        new_c      20 f    
#>  7 dat1  c        new_c      21 g    
#>  8 dat1  c        new_c      22 h    
#>  9 dat1  c        new_c      23 i    
#> 10 dat1  c        new_c      24 j    
#> 11 dat2  d        new_d      10 f    
#> 12 dat2  d        new_d      11 g    
#> 13 dat2  d        new_d      12 h    
#> 14 dat2  d        new_d      13 i    
#> 15 dat2  d        new_d      14 j    
#> 16 dat2  e        new_e       7 17   
#> 17 dat2  e        new_e       8 18   
#> 18 dat2  e        new_e       9 19   
#> 19 dat2  e        new_e      10 20   
#> 20 dat2  e        new_e      11 21
# a data dictionary containing codes for several datasets
dat_dict3 <- tibble(
  
  id = c(
    rep("dat1", 10),
    rep("dat2", 10)
  ),
  
  short_nm = c(
    rep("b", 5),
    rep("c", 5),
    rep("d", 5),
    rep("e", 5)
  ),
  
  long_nm = c(
    rep("new_b", 5),
    rep("new_c", 5),
    rep("new_d", 5),
    rep("new_e", 5)
  ),
  
  value = 
    c(
    10:14,
    20:24,
    10:14,
    7:11
  ),
  
  label = c(
    letters[1:5],
    letters[6:10],
    letters[6:10],
    17:21
  )
)

Preparations

Before the actual recoding can take place, we have to prepare two functions:

  1. a custom function, recode_col2(), that recodes one column according to a specified dictionary and

  2. another custom function, recode_df(), which applies recode_col2() inside across() to all specified columns.

The first function is pretty similar to what we have seen earlier, with the only difference being the use of two arguments, the column to recode, x and the dictionary according to which the column should be recoded dict:

# recode function
recode_col2 <- function(x, dict) {
  
  col_nm <- cur_column()
  
  recode_vec <- dict |>
    filter(short_nm == cur_column()) |>
    pull(label, name = value)
  
  dplyr::recode(x, !!! recode_vec)
}

The second function, recode_df, basically wraps the actual recoding that we’ve used in the section above into a function. It takes two arguments: the dataset we want to recode, dat, and the id of the dataset as specified in the data dictionary (which should be the same as in our nested data.frame dat_ls).

recode_df <- function(dat, dat_id) {
  
  # 1. get the current entries 
  cur_dat_dict <- dat_dict3 |>
    filter(id == dat_id) 
  
  # 2. vector of columns to recode
  cols_vec <- unique(cur_dat_dict[["short_nm"]])
  
  # 3. use across with both inputs and recode_col2
  dat |> 
    mutate(across(all_of(cols_vec),
                  \(x) recode_col2(x, cur_dat_dict))
    )
}

This function will be applied iteratively to each individual dataset in our list of data.frames dat_ls. The function consists of three steps:

  1. We subset our data dictionary with the supplied data ID to ensure only the value-label mappings of the current dataset remain. We call this subset of our dictionary cur_dat_dict.

  2. We then create a vector of column names, cols_vec, that we want to recode. These will consist of all unique column names in the current dictionary cur_dat_dict.

  3. Finally, we use dplyr::across() on all_of the columns in cols_vec and supply the current column x and the current dictionary cur_dat_dict to recode_col2().

Recoding

The last step is to iteratively apply our new function recode_df() to our nested data.frame dat_ls. This requires us first to call rowwise() on dat_ls, which applies all subsequent dplyr functions row-by-row. We then overwrite our column holding the data with list(recode_df(data, id)). It’s important to note that our custom function must be wrapped in list(), as it returns a non-atomic vector (a list of data.frames).

dat_ls |> 
  rowwise() |> 
  mutate(data = list(recode_df(data, id))) |>
  pull(data) # <= for better printing
#> [[1]]
#> # A tibble: 5 × 3
#>       a b     c    
#>   <int> <chr> <chr>
#> 1     1 a     f    
#> 2     2 b     g    
#> 3     3 c     h    
#> 4     4 d     i    
#> 5     5 e     j    
#> 
#> [[2]]
#> # A tibble: 5 × 3
#>       a d     e    
#>   <int> <chr> <chr>
#> 1     1 f     17   
#> 2     2 g     18   
#> 3     3 h     19   
#> 4     4 i     20   
#> 5     5 j     21

Rename many datasets based on a data dicitonary

Analogous to recode_df(), we can create a function that renames all columns of a dataset, below called rename_df(). The function works in two steps. Initially, we create vector of old and new name pairs based on the distinct entries of our data dictionary that are relevant for this dataset filter(id == dat_id). Next, we use this named vector within rename(all_of():

rename_df <- function(data, dat_id) {
  
  # 1. create a vector of old and new name pairs ...
  # ... based on the current dictionary
  rename_vec <- dat_dict3 |>
    filter(id == dat_id ) |>
    distinct(short_nm, long_nm) |>
    pull(short_nm, name = long_nm)

  # 2. use this vector on the supplied data
  data |>
    rename(all_of(rename_vec))
}

Applying this function to our nested data.frame of datasets is basically the same approach as outlined above:

dat_ls |> 
  rowwise() |> 
  mutate(data = list(rename_df(data, id))) |>
  pull(data) # <= for betters printing
#> [[1]]
#> # A tibble: 5 × 3
#>       a new_b new_c
#>   <int> <int> <int>
#> 1     1    10    20
#> 2     2    11    21
#> 3     3    12    22
#> 4     4    13    23
#> 5     5    14    24
#> 
#> [[2]]
#> # A tibble: 5 × 3
#>       a new_d new_e
#>   <int> <int> <int>
#> 1     1    10     7
#> 2     2    11     8
#> 3     3    12     9
#> 4     4    13    10
#> 5     5    14    11

Final Thoughts

Readers who follow me on Twitter might know that, although being a dplyr fan-boy, my feelings towards dplyr::recode() were less enthusiastic. My main issue was the unconventional order of arguments, which diverges from what we know from rename() or mutate().

< e-frame src="https://twitter.com/timteafan/status/1254898416402018310?s=61&t=i8eoMCFkLgkJKWoZHxVvOQ">

The order of arguments was in part a reason why recode() was flagged as “questioning” in dplyr version 1.0. Since dplyr version 1.1.0 recode() moved one stage further in its life cycle and is now “superseded” by dplyr::case_match().

I’m not sure of the full implications of this development. If I understand the life cycle stages correctly, then “superseded” means that dplyr::recode() is not going away any time soon and will continue to be maintained, though it will not see new features.

However, if there’s a chance that dplyr::recode() might become deprecated in future major releases, we would need to think about a workaround, since case_match() doesn’t support the splicing of named vectors as arguments that we have used above. In that case, I will certainly update this blog post.

Despite my initial skepticism towards dplyr::recode(), I have to concede that, particularly in combination with across(), it provides a clear and straight-forward workflow. With its ability to be used programmatically and to handel complex cases, I hope that this blog post has convincingly shown the benefits of this approach.

My original dplyr workflow was much more convoluted. The curious reader can find it in the answers to my question on StackOverflow from a couple of years ago.

But even outside of dplyr I haven’t encountered a similarly seamless approach to recoding multiple columns across several datasets. If you are up for a challenge, I’d love to see what base R or data.table solutions you can come with to tackle this problem. Let me know in the comments or via Twitter or Mastodon if you have an alternative approach.

< details class="sess"> < summary class="session-header"> Session Info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2023-06-30
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package * version date (UTC) lib source
#>  dplyr   * 1.1.0   2023-01-29 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
To leave a comment for the author, please follow the link and comment on their blog: Mostly Counting | Tim Tiefenbach.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version