Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Dplyr 0.8.0 launched recently, which you probably already know, but just in case you missed it..
Two new functions have been catching my eye : group_map and group_split.
The aim of this post – take a first look at these and try and get a new blog post up on github before February is out.
Time now: 23:38(UK)
What it says on the tin
Load packages and create a function for demo purposes , basically grab the first 5 rows of whatever we throw at it:
library(nycflights13) ## Warning: package 'nycflights13' was built under R version 3.5.2 library(dplyr) ## Warning: package 'dplyr' was built under R version 3.5.2 ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union data(flights) header <- function(x,n = 5){ head(x,n) }
Let’s check the function works first of all:
## # A tibble: 5 x 19 ## # Groups: origin [3] ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## 3 2013 1 1 542 540 2 923 ## 4 2013 1 1 544 545 -1 1004 ## 5 2013 1 1 554 600 -6 812 ## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dttm>
Good, that just returns 5 rows from the entire datset. Now let’s replace group_by with group_split, and group by origin
split_test <- flights %>% group_by(origin)
Then we use group_split to create individual tibbles for each group
group_split(split_test) ## [[1]] ## # A tibble: 120,835 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 554 558 -4 740 ## 3 2013 1 1 555 600 -5 913 ## 4 2013 1 1 558 600 -2 923 ## 5 2013 1 1 559 600 -1 854 ## 6 2013 1 1 601 600 1 844 ## 7 2013 1 1 606 610 -4 858 ## 8 2013 1 1 607 607 0 858 ## 9 2013 1 1 608 600 8 807 ## 10 2013 1 1 615 615 0 833 ## # ... with 120,825 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ## ## [[2]] ## # A tibble: 111,279 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 542 540 2 923 ## 2 2013 1 1 544 545 -1 1004 ## 3 2013 1 1 557 600 -3 838 ## 4 2013 1 1 558 600 -2 849 ## 5 2013 1 1 558 600 -2 853 ## 6 2013 1 1 558 600 -2 924 ## 7 2013 1 1 559 559 0 702 ## 8 2013 1 1 606 610 -4 837 ## 9 2013 1 1 611 600 11 945 ## 10 2013 1 1 613 610 3 925 ## # ... with 111,269 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm> ## ## [[3]] ## # A tibble: 104,662 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 533 529 4 850 ## 2 2013 1 1 554 600 -6 812 ## 3 2013 1 1 557 600 -3 709 ## 4 2013 1 1 558 600 -2 753 ## 5 2013 1 1 559 600 -1 941 ## 6 2013 1 1 600 600 0 851 ## 7 2013 1 1 600 600 0 837 ## 8 2013 1 1 602 610 -8 812 ## 9 2013 1 1 602 605 -3 821 ## 10 2013 1 1 623 610 13 920 ## # ... with 104,652 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm>
We can verify how the grouping was carried out by using group_keys:
group_keys(split_test) ## # A tibble: 3 x 1 ## origin ## <chr> ## 1 EWR ## 2 JFK ## 3 LGA
Now, let’s get rid of that, and try group_map, again, by origin. We may as well use my highly pointless function. If all goes well, we will return a tibble of 15 rows
rm(split_test) test5 <- flights %>% group_by(origin) %>% group_map(~ header(.x))
The dplyr help tells us that ‘.x’ refers to the subset of rows for each group that is passed in. That returned 5 rows per origin, as expected.Let’s try again and obtain some more rows per group:
test10 <- flights %>% group_by(origin) %>% group_map(~ header(.x,10L))
Now we have 30 observations in our new tibble, so this all looks good. Can we split and map together?
#test_group_and_map <- flights %>% # group_split(origin) %>% #group_map(~ first5(.x,20))
No we cannot. Group_map works on grouped tibbles, and by splitting, we’re passing it a list. Instant fail. RTFM folks.
At this point, I can see group_map being pretty useful, but I will need to do a bit more work on the use case for group_split – at present, I’m not seeing what it offers over and above group_by, for staying in a purely dplyr workflow. I understand if you want to throw some purrr into the mix, it becomes more useful. As always, feel free to chip in if you have any thoughts or answers..I will try to play around with this more as well.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.