Site icon R-bloggers

tidyr::spread() and dplyr::rename_at() in action

[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< !-- BLOGDOWN-HEAD --> < !-- /BLOGDOWN-HEAD --> < !-- BLOGDOWN-BODY-BEFORE --> < !-- /BLOGDOWN-BODY-BEFORE -->

I was recently confronted to a situation that required going from a long dataset to a wide dataset, but with a small twist: there were two datasets, which I had to merge into one. You might wonder what kinda crappy twist that is, right? Well, let’s take a look at the data:

data1; data2
## # A tibble: 20 x 4
##    country       date       variable_1 value
##      <chr>      <chr>            <chr> <int>
##  1      lu 01/01/2005            maybe    22
##  2      lu 01/07/2005            maybe    13
##  3      lu 01/01/2006            maybe    40
##  4      lu 01/07/2006            maybe    25
##  5      lu 01/01/2005    totally_agree    42
##  6      lu 01/07/2005    totally_agree    17
##  7      lu 01/01/2006    totally_agree    25
##  8      lu 01/07/2006    totally_agree    16
##  9      lu 01/01/2005 totally_disagree    39
## 10      lu 01/07/2005 totally_disagree    17
## 11      lu 01/01/2006 totally_disagree    23
## 12      lu 01/07/2006 totally_disagree    21
## 13      lu 01/01/2005   kinda_disagree    69
## 14      lu 01/07/2005   kinda_disagree    12
## 15      lu 01/01/2006   kinda_disagree    10
## 16      lu 01/07/2006   kinda_disagree     9
## 17      lu 01/01/2005      kinda_agree    38
## 18      lu 01/07/2005      kinda_agree    31
## 19      lu 01/01/2006      kinda_agree    19
## 20      lu 01/07/2006      kinda_agree    12
## # A tibble: 20 x 4
##    country       date       variable_2 value
##      <chr>      <chr>            <chr> <int>
##  1      lu 01/01/2005      kinda_agree    22
##  2      lu 01/07/2005      kinda_agree    13
##  3      lu 01/01/2006      kinda_agree    40
##  4      lu 01/07/2006      kinda_agree    25
##  5      lu 01/01/2005    totally_agree    42
##  6      lu 01/07/2005    totally_agree    17
##  7      lu 01/01/2006    totally_agree    25
##  8      lu 01/07/2006    totally_agree    16
##  9      lu 01/01/2005 totally_disagree    39
## 10      lu 01/07/2005 totally_disagree    17
## 11      lu 01/01/2006 totally_disagree    23
## 12      lu 01/07/2006 totally_disagree    21
## 13      lu 01/01/2005            maybe    69
## 14      lu 01/07/2005            maybe    12
## 15      lu 01/01/2006            maybe    10
## 16      lu 01/07/2006            maybe     9
## 17      lu 01/01/2005   kinda_disagree    38
## 18      lu 01/07/2005   kinda_disagree    31
## 19      lu 01/01/2006   kinda_disagree    19
## 20      lu 01/07/2006   kinda_disagree    12

As explained in Hadley (2014), this is how you should keep your data… But for a particular purpose, I had to transform these datasets. What I was asked to do was to merge these into a single wide data frame. Doing this for one dataset is easy:

data1 %>%
  spread(variable_1, value)
## # A tibble: 4 x 7
##   country       date kinda_agree kinda_disagree maybe totally_agree
## *   <chr>      <chr>       <int>          <int> <int>         <int>
## 1      lu 01/01/2005          38             69    22            42
## 2      lu 01/01/2006          19             10    40            25
## 3      lu 01/07/2005          31             12    13            17
## 4      lu 01/07/2006          12              9    25            16
## # ... with 1 more variables: totally_disagree <int>

But because data1 and data2 have the same levels for variable_1 and variable_2, this would not work. So the solution I found online, in this SO thread was to use tidyr::spread() with dplyr::rename_at() like this:

data1 <- data1 %>%
  spread(variable_1, value) %>%
  rename_at(vars(-country, -date), funs(paste0("variable1:", .)))

glimpse(data1)
## Observations: 4
## Variables: 7
## $ country                    <chr> "lu", "lu", "lu", "lu"
## $ date                       <chr> "01/01/2005", "01/01/2006", "01/07/...
## $ variable1:kinda_agree      <int> 38, 19, 31, 12
## $ variable1:kinda_disagree   <int> 69, 10, 12, 9
## $ variable1:maybe            <int> 22, 40, 13, 25
## $ variable1:totally_agree    <int> 42, 25, 17, 16
## $ variable1:totally_disagree <int> 39, 23, 17, 21
data2 <- data2 %>%
  spread(variable_2, value) %>%
  rename_at(vars(-country, -date), funs(paste0("variable2:", .)))

glimpse(data2)
## Observations: 4
## Variables: 7
## $ country                    <chr> "lu", "lu", "lu", "lu"
## $ date                       <chr> "01/01/2005", "01/01/2006", "01/07/...
## $ variable2:kinda_agree      <int> 22, 40, 13, 25
## $ variable2:kinda_disagree   <int> 38, 19, 31, 12
## $ variable2:maybe            <int> 69, 10, 12, 9
## $ variable2:totally_agree    <int> 42, 25, 17, 16
## $ variable2:totally_disagree <int> 39, 23, 17, 21

rename_at() needs variables which you pass to vars(), a helper function to select variables, and a function that will do the renaming, passed to funs(). The function I use is simply paste0(), which pastes a string, for example “variable1:” with the name of the columns, given by the single ‘.’, a dummy argument. Now these datasets can be merged:

data1 %>%
  full_join(data2) %>%
  glimpse()
## Joining, by = c("country", "date")
## Observations: 4
## Variables: 12
## $ country                    <chr> "lu", "lu", "lu", "lu"
## $ date                       <chr> "01/01/2005", "01/01/2006", "01/07/...
## $ variable1:kinda_agree      <int> 38, 19, 31, 12
## $ variable1:kinda_disagree   <int> 69, 10, 12, 9
## $ variable1:maybe            <int> 22, 40, 13, 25
## $ variable1:totally_agree    <int> 42, 25, 17, 16
## $ variable1:totally_disagree <int> 39, 23, 17, 21
## $ variable2:kinda_agree      <int> 22, 40, 13, 25
## $ variable2:kinda_disagree   <int> 38, 19, 31, 12
## $ variable2:maybe            <int> 69, 10, 12, 9
## $ variable2:totally_agree    <int> 42, 25, 17, 16
## $ variable2:totally_disagree <int> 39, 23, 17, 21

Hope this post helps you understand the difference between long and wide datasets better, as well as dplyr::rename_at()!

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.