[This article was first published on HighlandR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We all need to pivot data at some point, so these are just some notes
for my own benefit really, because gather and spread are no longer
in favour within tidyr.
NB – this post has been updated with collapsible sections to show/hide the data and outputs.
I tended to only ever need gather, and nearly always relied on the
same key and value names, so it was an easy function for me to use.
pivot_longer and pivot_wider are much more flexible, they just take a little bit more thinking about.
For example, my old approach has now changed along these lines
# old way with `gather`
df %>%
mutate(row = row_number()) %>%
gather('column', 'source', -row, -N) # key = column, value = source, retain row and N
# further transforms
# new way with pivot_longer
df %>%
mutate(row = row_number()) %>%
pivot_longer(!c(row , N),
names_to = 'column',
values_to = 'source')
# further transforms
However, what I really want to do is show how to replicate much of the
tidyr pivot functionality with data.table.
Once again, this is not intended to be in-depth.
I have simply used the tidyr help file code, and tried to replicate it
with data.table.
I’d be interested in improvements to my data.table code.
Let’s pivot!
Note – in all examples, I’ll create a copy of the data set as a
data.table using setDT(copy(source_data))
Also, I intended to use code folding to show the datasets and results, but that’s gone horribly wrong, so you can run the code yourself.
relig_income |>
pivot_longer(!religion, # keep religion as a column
names_to = "income", # desired name for new column
values_to = "count") # what data goes into the new column?
melt(DT, id.vars = "religion",
variable.name = "income",
value.name = "count",
variable.factor = FALSE) # added to keep output consistent with tidyr
With data.table, you can often get away with only supplying either
measure.vars or id.vars, and nothing else, and it does a pretty
great job of guessing what to do.
Obviously it’s better to be specific, but worth bearing in mind.
## A tibble: 7,240 × 60
## country iso2 iso3 year new_s…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… AF AFG 1980 NA NA NA NA NA NA
## 2 Afghan… AF AFG 1981 NA NA NA NA NA NA
## 3 Afghan… AF AFG 1982 NA NA NA NA NA NA
## 4 Afghan… AF AFG 1983 NA NA NA NA NA NA
## 5 Afghan… AF AFG 1984 NA NA NA NA NA NA
## 6 Afghan… AF AFG 1985 NA NA NA NA NA NA
## 7 Afghan… AF AFG 1986 NA NA NA NA NA NA
## 8 Afghan… AF AFG 1987 NA NA NA NA NA NA
## 9 Afghan… AF AFG 1988 NA NA NA NA NA NA
## 10 Afghan… AF AFG 1989 NA NA NA NA NA NA
## … with 7,230 more rows, 50 more variables: new_sp_m65 <dbl>,
## new_sp_f014 <dbl>, new_sp_f1524 <dbl>, new_sp_f2534 <dbl>,
## new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>,
## new_sp_f65 <dbl>, new_sn_m014 <dbl>, new_sn_m1524 <dbl>,
## new_sn_m2534 <dbl>, new_sn_m3544 <dbl>, new_sn_m4554 <dbl>,
## new_sn_m5564 <dbl>, new_sn_m65 <dbl>, new_sn_f014 <dbl>,
## new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, new_sn_f3544 <dbl>, …
## ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
OK – we’re going to need to break this one down. There’s a lot of magic
going on with that names_pattern call in pivot_longer, where various
parts of the tidyverse are working in the background.
I’m not going to dissect pivot_longer, because calling debugonce
show you that various methods are involved, which is harder to debug.
Instead, I will build up the data.table solution.
Firstly, note that this time round, I am using melt inside of DT,
rather than melt(DT)
## country iso2 iso3 year diagnosis gender age count
## 1: Afghanistan AF AFG 1980 sp m 014 NA
## 2: Afghanistan AF AFG 1981 sp m 014 NA
## 3: Afghanistan AF AFG 1982 sp m 014 NA
## 4: Afghanistan AF AFG 1983 sp m 014 NA
## 5: Afghanistan AF AFG 1984 sp m 014 NA
## ---
## 405436: Zimbabwe ZW ZWE 2009 rel f 65 NA
## 405437: Zimbabwe ZW ZWE 2010 rel f 65 NA
## 405438: Zimbabwe ZW ZWE 2011 rel f 65 NA
## 405439: Zimbabwe ZW ZWE 2012 rel f 65 NA
## 405440: Zimbabwe ZW ZWE 2013 rel f 65 725
We’ve done wide to long, now for the opposite.
I have to be honest, dcast is not a term that makes me think long to wide, but you can’t have everything.
We’re using this fish_encounters dataset for this bit.
It’s about time fish made an appearance on this blog.
Probably.
## # A tibble: 19 × 12
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 4842 1 1 1 1 1 1 1 1 1 1 1
## 2 4843 1 1 1 1 1 1 1 1 1 1 1
## 3 4844 1 1 1 1 1 1 1 1 1 1 1
## 4 4845 1 1 1 1 1 NA NA NA NA NA NA
## 5 4847 1 1 1 NA NA NA NA NA NA NA NA
## 6 4848 1 1 1 1 NA NA NA NA NA NA NA
## 7 4849 1 1 NA NA NA NA NA NA NA NA NA
## 8 4850 1 1 NA 1 1 1 1 NA NA NA NA
## 9 4851 1 1 NA NA NA NA NA NA NA NA NA
## 10 4854 1 1 NA NA NA NA NA NA NA NA NA
## 11 4855 1 1 1 1 1 NA NA NA NA NA NA
## 12 4857 1 1 1 1 1 1 1 1 1 NA NA
## 13 4858 1 1 1 1 1 1 1 1 1 1 1
## 14 4859 1 1 1 1 1 NA NA NA NA NA NA
## 15 4861 1 1 1 1 1 1 1 1 1 1 1
## 16 4862 1 1 1 1 1 1 1 1 1 NA NA
## 17 4863 1 1 NA NA NA NA NA NA NA NA NA
## 18 4864 1 1 NA NA NA NA NA NA NA NA NA
## 19 4865 1 1 1 NA NA NA NA NA NA NA NA
Technically, we can avoid stating that value.var = "seen", but we do
then get a large informational message in the console, which is always
disconcerting. (If data.table tells you something, you’re well advised
to act on it, because the help is scarily accurate)
Here are the data.table outputs
< details>
< summary>show dcast output
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## 1: 4842 1 1 1 1 1 1 1 1 1 1 1
## 2: 4843 1 1 1 1 1 1 1 1 1 1 1
## 3: 4844 1 1 1 1 1 1 1 1 1 1 1
## 4: 4845 1 1 1 1 1 NA NA NA NA NA NA
## 5: 4847 1 1 1 NA NA NA NA NA NA NA NA
## 6: 4848 1 1 1 1 NA NA NA NA NA NA NA
## 7: 4849 1 1 NA NA NA NA NA NA NA NA NA
## 8: 4850 1 1 NA 1 1 1 1 NA NA NA NA
## 9: 4851 1 1 NA NA NA NA NA NA NA NA NA
## 10: 4854 1 1 NA NA NA NA NA NA NA NA NA
## 11: 4855 1 1 1 1 1 NA NA NA NA NA NA
## 12: 4857 1 1 1 1 1 1 1 1 1 NA NA
## 13: 4858 1 1 1 1 1 1 1 1 1 1 1
## 14: 4859 1 1 1 1 1 NA NA NA NA NA NA
## 15: 4861 1 1 1 1 1 1 1 1 1 1 1
## 16: 4862 1 1 1 1 1 1 1 1 1 NA NA
## 17: 4863 1 1 NA NA NA NA NA NA NA NA NA
## 18: 4864 1 1 NA NA NA NA NA NA NA NA NA
## 19: 4865 1 1 1 NA NA NA NA NA NA NA NA
Rather than have the DT inside dcast, we can use .SD and have
dcast inside DT, which is helpful for further chaining. The same
applies to melt, as you’ll have seen above.
DT[, dcast(.SD, fish ~ station, value.var = "seen", fill = 0)]
Generate column names from multiple variables
Using the us_rent_income data
< details>
< summary>show us_rent_income data
## # A tibble: 104 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 01 Alabama income 24476 136
## 2 01 Alabama rent 747 3
## 3 02 Alaska income 32940 508
## 4 02 Alaska rent 1200 13
## 5 04 Arizona income 27517 148
## 6 04 Arizona rent 972 4
## 7 05 Arkansas income 23789 165
## 8 05 Arkansas rent 709 5
## 9 06 California income 29454 109
## 10 06 California rent 1358 3
## # … with 94 more rows
## breaks wool tension
## 1 26 A L
## 2 30 A L
## 3 54 A L
## 4 25 A L
## 5 70 A L
## 6 52 A L
## 7 51 A L
## 8 26 A L
## 9 67 A L
## 10 18 A M
## 11 21 A M
## 12 29 A M
## 13 17 A M
## 14 12 A M
## 15 18 A M
## 16 35 A M
## 17 30 A M
## 18 36 A M
## 19 36 A H
## 20 21 A H
## 21 24 A H
## 22 18 A H
## 23 10 A H
## 24 43 A H
## 25 28 A H
## 26 15 A H
## 27 26 A H
## 28 27 B L
## 29 14 B L
## 30 29 B L
## 31 19 B L
## 32 29 B L
## 33 31 B L
## 34 41 B L
## 35 20 B L
## 36 44 B L
## 37 42 B M
## 38 26 B M
## 39 19 B M
## 40 16 B M
## 41 39 B M
## 42 28 B M
## 43 21 B M
## 44 39 B M
## 45 29 B M
## 46 20 B H
## 47 21 B H
## 48 24 B H
## 49 17 B H
## 50 13 B H
## 51 15 B H
## 52 15 B H
## 53 16 B H
## 54 28 B H
warpbreaks %>%
pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = mean
)
dcast(DT, tension ~ wool,
value.var = "breaks", fun = mean)
warpbreaks %>%
pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = mean
)
## # A tibble: 3 × 3
## tension A B
## <fct> <dbl> <dbl>
## 1 L 44.6 28.2
## 2 M 24 28.8
## 3 H 24.6 18.8
dcast(DT, tension ~ wool,
value.var = "breaks", fun = mean)
## tension A B
## 1: L 44.55556 28.22222
## 2: M 24.00000 28.77778
## 3: H 24.55556 18.77778