Pivoting in tidyr and data.table

Posted on February 19, 2023 by HighlandR in R bloggers

[This article was first published on HighlandR, and kindly contributed to R-bloggers].

We all need to pivot data at some point, so these are just some notes for my own benefit really, because gather and spread are no longer in favour within tidyr.

NB – this post has been updated with collapsible sections to show/hide the data and outputs.

I tended to only ever need gather, and nearly always relied on the same key and value names, so it was an easy function for me to use.

pivot_longer and pivot_wider are much more flexible; they just take a little more thinking about.

For example, my old approach has now changed along these lines:

# old way with `gather`
df %>%
  mutate(row = row_number()) %>%
  gather('column', 'source', -row, -N)  # key = column, value = source, retain row and N
  # further transforms

# new way with pivot_longer
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(!c(row, N),
               names_to = 'column',
               values_to = 'source')
  # further transforms
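
To check that the two versions above really do the same job, here is a minimal sketch on a made-up data frame. The toy `df` and its columns `a` and `b` are my own assumptions for illustration, and it needs dplyr loaded for `mutate()`, `row_number()` and `arrange()`. Note that `gather()` stacks column by column while `pivot_longer()` works row by row, so both results are sorted before comparing.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: N is an identifier, a and b are the columns to stack
df <- tibble(N = c(10, 20), a = c("x", "y"), b = c("p", "q"))

# old way with gather
old <- df %>%
  mutate(row = row_number()) %>%
  gather('column', 'source', -row, -N)

# new way with pivot_longer
new <- df %>%
  mutate(row = row_number()) %>%
  pivot_longer(!c(row, N),
               names_to = 'column',
               values_to = 'source')

# Row order differs between the two, so sort both the same way first
old_sorted <- as.data.frame(arrange(old, row, column))
new_sorted <- as.data.frame(arrange(new, row, column))

all.equal(old_sorted, new_sorted)
```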

However, what I really want to do is show how to replicate much of the tidyr pivot functionality with data.table.

Once again, this is not intended to be in-depth.
I have simply used the tidyr help file code, and tried to replicate it with data.table.
I’d be interested in improvements to my data.table code.

Let’s pivot!

Note – in all examples, I’ll create a copy of the data set as a data.table using setDT(copy(source_data))

Also, I intended to use code folding to show the datasets and results, but that’s gone horribly wrong, so you can run the code yourself.

You only need:

library(tidyr)
library(data.table)

I’m using the base pipe for simplicity.
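
As a rough sketch of the copy step mentioned above, this is what `setDT(copy(source_data))` looks like with one of the tidyr datasets. The name `DT` is just my own convention. `copy()` takes a deep copy first, so `setDT()` converts the copy by reference and the original tibble is left alone.

```r
library(tidyr)       # provides the relig_income dataset
library(data.table)

# Deep copy first, then convert the copy to a data.table by reference
DT <- setDT(copy(relig_income))

class(DT)            # now a data.table
class(relig_income)  # the original is still a tibble
```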

tidyr::pivot_longer() ~ data.table::melt()

Using the built-in relig_income dataset:


## # A tibble: 18 × 11
##    religion      `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
##    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Agnostic           27      34      60      81      76     137     122     109
##  2 Atheist            12      27      37      52      35      70      73      59
##  3 Buddhist           27      21      30      34      33      58      62      39
##  4 Catholic          418     617     732     670     638    1116     949     792
##  5 Don’t know/r…      15      14      15      11      10      35      21      17
##  6 Evangelical …     575     869    1064     982     881    1486     949     723
##  7 Hindu               1       9       7       9      11      34      47      48
##  8 Historically…     228     244     236     238     197     223     131      81
##  9 Jehovah's Wi…      20      27      24      24      21      30      15      11
## 10 Jewish             19      19      25      25      30      95      69      87
## 11 Mainline Prot     289     495     619     655     651    1107     939     753
## 12 Mormon             29      40      48      51      56     112      85      49
## 13 Muslim              6       7       9      10       9      23      16       8
## 14 Orthodox           13      17      23      32      32      47      38      42
## 15 Other Christ…       9       7      11      13      13      14      18      14
## 16 Other Faiths       20      33      40      46      49      63      46      40
## 17 Other World …       5       2       3       4       2       7       3       4
## 18 Unaffiliated      217     299     374     365     341     528     407     321
## # … with 2 more variables: `>150k` <dbl>, `Don't know/refused` <dbl>, and
## #   abbreviated variable names ¹`$10-20k`, ²`$20-30k`, ³`$30-40k`, ⁴`$40-50k`,
## #   ⁵`$50-75k`, ⁶`$75-100k`, ⁷`$100-150k`

Code comparison

# tidyr
relig_income |>
  pivot_longer(!religion, # keep religion as a column
               names_to = "income", # name of the new column holding the old column names
               values_to = "count") # name of the new column holding the cell values

# data.table
DT <- setDT(copy(relig_income)) # work on a data.table copy, as noted earlier
melt(DT, id.vars = "religion",
     variable.name = "income",
     value.name = "count",
     variable.factor = FALSE) # added to keep output consistent with tidyr

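With data.table, you can often get away with supplying only measure.vars or only id.vars, and nothing else. melt() treats every column you did not name as belonging to the other group, and falls back to the default column names variable and value, so it does a pretty great job of guessing what to do.
It's better to be specific, but this is worth bearing in mind.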
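
Here is a small sketch of that shortcut, again using relig_income. The object names `long_id` and `long_measure` are just for illustration. With only one of the two arguments supplied, the new columns get the default names, and the variable column comes back as a factor unless you set `variable.factor = FALSE`.

```r
library(tidyr)       # only needed for the relig_income dataset
library(data.table)

DT <- setDT(copy(relig_income))

# Only id.vars supplied: every other column is treated as a measure column
long_id <- melt(DT, id.vars = "religion")

# Only measure.vars supplied: every other column is treated as an id column
long_measure <- melt(DT, measure.vars = setdiff(names(DT), "religion"))

names(long_id)   # "religion" "variable" "value"
nrow(long_id)    # 180, i.e. 18 religions x 10 income columns

# Both calls should give the same result
all.equal(long_id, long_measure)
```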

You can compare outputs here:

pivot_longer output

## # A tibble: 180 × 3
##    religion income             count
##    <chr>    <chr>              <dbl>
##  1 Agnostic <$10k                 27
##  2 Agnostic $10-20k               34
##  3 Agnostic $20-30k               60
##  4 Agnostic $30-40k               81
##  5 Agnostic $40-50k               76
##  6 Agnostic $50-75k              137
##  7 Agnostic $75-100k             122
##  8 Agnostic $100-150k            109
##  9 Agnostic >150k                 84
## 10 Agnostic Don't know/refused    96
## # … with 170 more rows
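
If you would rather check the two results in code than by eye, here is a minimal sketch. The object names are my own, and the comparison looks at values only. One thing to be aware of: melt() stacks the data column by column (all religions for the first income band, then the next band), whereas pivot_longer() goes row by row as shown above, so both are sorted the same way before comparing.

```r
library(tidyr)
library(data.table)

DT <- setDT(copy(relig_income))

tidy_long <- relig_income |>
  pivot_longer(!religion,
               names_to = "income",
               values_to = "count")

dt_long <- melt(DT, id.vars = "religion",
                variable.name = "income",
                value.name = "count",
                variable.factor = FALSE)

# Same shape in both cases
dim(tidy_long)
dim(dt_long)

# Sort both by religion and income, then compare the values
tidy_sorted <- setorder(as.data.table(tidy_long), religion, income)
dt_sorted <- setorder(copy(dt_long), religion, income)

all.equal(tidy_sorted, dt_sorted, check.attributes = FALSE)
```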
