Site icon R-bloggers

Pivoting in tidyr and data.table

[This article was first published on Data By John, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We all need to pivot data at some point, so these are just some notes for my own benefit really, because gather and spread are no longer in favour within tidyr.

I tended to only ever need gather, and nearly always relied on the same key and value names, so it was an easy function for me to use.

I have discovered that pivot_longer and pivot_wider are much more flexible, they just take a little bit more thinking about.

For example, my old approach has now changed along these lines

# old way with `gather`
df %>%
  mutate(row = row_number()) %>%
  gather('column', 'source', -row, -N)  # key = column, value = source, retain row and N
  # further transforms

# new way with pivot_longer
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(!c(row , N), names_to = 'column', values_to  = 'source') 
  # further transforms

However, what I really want to do is show how to replicate much of the tidyr pivot functionality with data.table.

Once again, this is not intended to be in-depth.
I have simply used the tidyr help file code, and tried to replicate it with data.table.
I’d be interested in improvements to my data.table code.

Let’s pivot!

Note – in all examples, I’ll create a copy of the data set as a data.table using setDT(copy(source_data))

Also, I intended to use code folding to show the datasets and results, but that’s gone horribly wrong, so you can run the code yourself.

You only need:

library(tidyr)
library(data.table)

I’m using the base pipe for simplicity.

tidyr::pivot_longer() ~ data.table::melt()

Using the built-in relig_income dataset:

Code comparison

relig_income |>
  pivot_longer(!religion, # keep religion as a column
               names_to = "income", # desired name for new column
               values_to = "count") # what data goes into the new column?


melt(DT, id.vars = "religion",
     variable.name = "income",
     value.name = "count",
     variable.factor = FALSE) # added to keep output consistent with tidyr

With data.table, you can often get away with only supplying either measure.vars or id.vars, and nothing else, and it does a pretty great job of guessing what to do.
Obviously it’s better to be specific, but worth bearing in mind.

Drop missing values

Now, a slightly more complex case where columns have common prefix, and missings are structural so should be dropped, using the billboard dataset

Code comparison:

billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  )

melt(DT,
     measure.vars = patterns("^wk"),
     variable.name = "week",
     value.name = "rank",
     na.rm = TRUE)

Multiple variables stored in column names

who |> pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"),
  names_pattern = "new_?(.*)_(.)(.*)", # Whoa
  values_to = "count"
)


DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][,variable := gsub("new_?","", variable)
              ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
                ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE)
                  ][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2)
                    ][, !c("variable","temp")
                      ][,c(1:4,6:8,5)][]

OK – we’re going to need to break this one down. There’s a lot of magic going on with that names_pattern call in pivot_longer, where various parts of the tidyverse are working in the background.

I’m not going to dissect pivot_longer, because calling debugonce show you that various methods are involved, which is harder to debug. Instead, I will build up the data.table solution.

Firstly, note that this time round, I am using melt inside of DT, rather than melt(DT)

DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][]

This returns a new variablecolumn, which needs splitting up into more columns.

First, we need to get rid of that new_, using gsub to replace it with nothing at all:

DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][,variable := gsub("new_?","", variable)
              ][]

Now I want the diagnosis part, so I create a diagnosis column, and a “temp” column for the remaining bits, which I’ll split in the next passes

DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][,variable := gsub("new_?","", variable)
              ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
                ][]

Now to split out the gender column, from the new temp column. This is going to be either an “m” or an “f”.

DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][,variable := gsub("new_?","", variable)
              ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
                ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE)
                  ][]

Then retrieve the age, again from the temp column. This generates two vectors, I only want the second one, hence the keep = 2

DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][,variable := gsub("new_?","", variable)
              ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
                ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE)
                  ][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2)
                    ][]

The last two lines discard the variable and temp columns, and sort the columns into the same order as that returned by pivot_longer

DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][,variable := gsub("new_?","", variable)
              ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
                ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE)
                  ][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2)
                    ][, !c("variable","temp")
                      ][,c(1:4,6:8,5)][]

Giving us these results:

DT[,melt(.SD, measure.vars = 5:60, value.name = "count")
            ][,variable := gsub("new_?","", variable)
              ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE)
                ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE)
                  ][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2)
                    ][, !c("variable","temp")
                      ][,c(1:4,6:8,5)][]

##             country iso2 iso3 year diagnosis gender age count
##      1: Afghanistan   AF  AFG 1980        sp      m 014    NA
##      2: Afghanistan   AF  AFG 1981        sp      m 014    NA
##      3: Afghanistan   AF  AFG 1982        sp      m 014    NA
##      4: Afghanistan   AF  AFG 1983        sp      m 014    NA
##      5: Afghanistan   AF  AFG 1984        sp      m 014    NA
##     ---                                                      
## 405436:    Zimbabwe   ZW  ZWE 2009       rel      f  65    NA
## 405437:    Zimbabwe   ZW  ZWE 2010       rel      f  65    NA
## 405438:    Zimbabwe   ZW  ZWE 2011       rel      f  65    NA
## 405439:    Zimbabwe   ZW  ZWE 2012       rel      f  65    NA
## 405440:    Zimbabwe   ZW  ZWE 2013       rel      f  65   725

Matrix to long

Using the anscombe dataset

Code comparison:

anscombe |>
  pivot_longer(
    everything(),
    cols_vary = "slowest",
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)" 
  )

DT[,melt(.SD,
            variable.name = "set",
            value.name = c("x","y"),
            variable.factor = FALSE,
            measure.vars = patterns("^x","^y"))]

tidyr::pivot_wider() ~ data.table::dcast()

I have to be honest, dcast is not a term that makes me think long to wide, but you can’t have everything.

We’re using this fish_encounters dataset for this bit. It’s about time fish made an appearance on this blog. Probably.

fish_encounters

## # A tibble: 114 × 3
##    fish  station  seen
##    <fct> <fct>   <int>
##  1 4842  Release     1
##  2 4842  I80_1       1
##  3 4842  Lisbon      1
##  4 4842  Rstr        1
##  5 4842  Base_TD     1
##  6 4842  BCE         1
##  7 4842  BCW         1
##  8 4842  BCE2        1
##  9 4842  BCW2        1
## 10 4842  MAE         1
## # … with 104 more rows

Code comparison:

fish_encounters |>
  pivot_wider(names_from = station, values_from = seen)

dcast(DT, fish ~ station, value.var = "seen") 

Technically, we can avoid stating that value.var = "seen", but we do then get a large informational message in the console, which is always disconcerting. (If data.table tells you something, you’re well advised to act on it, because the help is scarily accurate)

Fill in missing values

Let’s get rid of those NA’s.

fish_encounters |>
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)

dcast(DT, fish ~ station, value.var = "seen", fill = 0)

Rather than have the DT inside dcast, we can use .SD and have dcast inside DT, which is helpful for further chaining. The same applies to melt, as you’ll have seen above.

DT[, dcast(.SD, fish ~ station, value.var = "seen", fill = 0)]

Generate column names from multiple variables

Using the us_rent_income data

Comparison:

us_rent_income |>
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )

dcast(DT, GEOID + NAME ~ variable, value.var = c("estimate","moe"))
#  alternatively we  can pass `...` to indicate all other unspecified columns
# E.G. :
# dcast(DT, ... ~ variable, value.var = c("estimate","moe"))

Specify a different names separator

us_rent_income |>
  pivot_wider(
    names_from = variable,
    names_sep = ".",
    values_from = c(estimate, moe)
  )

dcast(DT, GEOID + NAME ~ variable,
      value.var = c("estimate","moe"), 
      sep = ".")


## alternatively

DT[, dcast(.SD, GEOID + NAME ~ variable,
          value.var = c("estimate","moe"), 
          sep = ".")]

Names vary

From the tidyr help: “You can control whether names_from values vary fastest or slowest relative to the values_from column names using names_vary”.

us_rent_income |>
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_vary = "slowest"
  ) |> names()

## [1] "GEOID"           "NAME"            "estimate_income" "moe_income"     
## [5] "estimate_rent"   "moe_rent"

To mimic this in data.table, we specify the final column order

DT[, dcast(.SD, GEOID + NAME ~ variable,
           value.var = c("estimate","moe"))
   ][,c(1:3,5,4,6)] |> names()

## [1] "GEOID"           "NAME"            "estimate_income" "moe_income"     
## [5] "estimate_rent"   "moe_rent"

The fastest option is exactly the same as the original

us_rent_income |>
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_vary = "fastest"
  ) |> names()

## [1] "GEOID"           "NAME"            "estimate_income" "estimate_rent"  
## [5] "moe_income"      "moe_rent"

DT[, dcast(.SD, GEOID + NAME ~ variable,
           value.var = c("estimate","moe"))] |> names()

## [1] "GEOID"           "NAME"            "estimate_income" "estimate_rent"  
## [5] "moe_income"      "moe_rent"

Performing aggregation with values_fn

Using the warpbreaks dataset:

warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = mean
  )

dcast(DT, tension ~ wool, value.var = "breaks", fun = mean)

warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = mean
  )

## # A tibble: 3 × 3
##   tension     A     B
##   <fct>   <dbl> <dbl>
## 1 L        44.6  28.2
## 2 M        24    28.8
## 3 H        24.6  18.8

dcast(DT, tension ~ wool, value.var = "breaks", fun = mean)

##    tension        A        B
## 1:       L 44.55556 28.22222
## 2:       M 24.00000 28.77778
## 3:       H 24.55556 18.77778

Or, do it inside DT:

DT[,dcast(.SD, tension ~ wool, value.var = "breaks", fun = mean)]

I hope this is useful to others.
If nothing else, please at least give me credit for not using the Ross Geller GIF.

To leave a comment for the author, please follow the link and comment on their blog: Data By John.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version