Lesser known dplyr tricks
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this blog post I share some lesser-known (at least I believe they are) tricks that use mainly functions from dplyr
.
Removing unneeded columns
Did you know that you can use -
in front of a column name to remove it from a data frame?
mtcars %>% select(-disp) %>% head() ## mpg cyl hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 105 2.76 3.460 20.22 1 0 3 1
Re-ordering columns
Still using select()
, it is easy te re-order columns in your data frame:
mtcars %>% select(cyl, disp, hp, everything()) %>% head() ## cyl disp hp mpg drat wt qsec vs am gear carb ## Mazda RX4 6 160 110 21.0 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 6 160 110 21.0 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 4 108 93 22.8 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 6 258 110 21.4 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 8 360 175 18.7 3.15 3.440 17.02 0 0 3 2 ## Valiant 6 225 105 18.1 2.76 3.460 20.22 1 0 3 1
As its name implies everything()
simply means all the other columns.
Renaming columns with rename()
mtcars <- rename(mtcars, spam_mpg = mpg) mtcars <- rename(mtcars, spam_disp = disp) mtcars <- rename(mtcars, spam_hp = hp) head(mtcars) ## spam_mpg cyl spam_disp spam_hp drat wt qsec vs am ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 ## gear carb ## Mazda RX4 4 4 ## Mazda RX4 Wag 4 4 ## Datsun 710 4 1 ## Hornet 4 Drive 3 1 ## Hornet Sportabout 3 2 ## Valiant 3 1
Selecting columns with a regexp
It is easy to select the columns that start with “spam” with some helper functions:
mtcars %>% select(contains("spam")) %>% head() ## spam_mpg spam_disp spam_hp ## Mazda RX4 21.0 160 110 ## Mazda RX4 Wag 21.0 160 110 ## Datsun 710 22.8 108 93 ## Hornet 4 Drive 21.4 258 110 ## Hornet Sportabout 18.7 360 175 ## Valiant 18.1 225 105
take also a look at starts_with()
, ends_with()
, contains()
, matches()
, num_range()
, one_of()
and everything()
.
Create new columns with mutate()
and if_else()
mtcars %>% mutate(vs_new = if_else( vs == 1, "one", "zero", NA_character_)) %>% head() ## spam_mpg cyl spam_disp spam_hp drat wt qsec vs am gear carb vs_new ## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 zero ## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 zero ## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 one ## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 one ## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 zero ## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 one
You might want to create a new variable conditionally on several values of another column:
mtcars %>% mutate(carb_new = case_when(.$carb == 1 ~ "one", .$carb == 2 ~ "two", .$carb == 4 ~ "four", TRUE ~ "other")) %>% head(15) ## spam_mpg cyl spam_disp spam_hp drat wt qsec vs am gear carb ## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 ## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 ## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 ## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 ## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 ## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 ## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 ## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 ## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 ## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 ## carb_new ## 1 four ## 2 four ## 3 one ## 4 one ## 5 two ## 6 one ## 7 four ## 8 two ## 9 two ## 10 four ## 11 four ## 12 other ## 13 other ## 14 other ## 15 four
Mind the .$
before the variable carb
. There is a github issue about this, and it is already fixed in the development version of dplyr
, which means that in the next version of dplyr
, case_when()
will work as any other specialized dplyr
function inside mutate()
.
Apply a function to certain columns only, by rows
mtcars %>% select(am, gear, carb) %>% purrr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2 head(mtcars2) ## # A tibble: 6 × 4 ## am gear carb sum_am_gear_carb ## <dbl> <dbl> <dbl> <dbl> ## 1 1 4 4 9 ## 2 1 4 4 9 ## 3 1 4 1 6 ## 4 0 3 1 4 ## 5 0 3 2 5 ## 6 0 3 1 4
For this, I had to use purrr
’s by_row()
function. You can then add this column to your original data frame:
mtcars <- cbind(mtcars, "sum_am_gear_carb" = mtcars2$sum_am_gear_carb) head(mtcars) ## spam_mpg cyl spam_disp spam_hp drat wt qsec vs am ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 ## gear carb sum_am_gear_carb ## Mazda RX4 4 4 9 ## Mazda RX4 Wag 4 4 9 ## Datsun 710 4 1 6 ## Hornet 4 Drive 3 1 4 ## Hornet Sportabout 3 2 5 ## Valiant 3 1 4
Use do()
to do any arbitrary operation
mtcars %>% group_by(cyl) %>% do(models = lm(spam_mpg ~ drat + wt, data = .)) %>% broom::tidy(models) ## Source: local data frame [9 x 6] ## Groups: cyl [3] ## ## cyl term estimate std.error statistic p.value ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 4 (Intercept) 33.2493403 17.0987286 1.9445504 0.087727622 ## 2 4 drat 1.3244329 3.4519717 0.3836743 0.711215433 ## 3 4 wt -5.2400608 2.2150213 -2.3656932 0.045551615 ## 4 6 (Intercept) 30.6544931 7.5141648 4.0795609 0.015103868 ## 5 6 drat -0.4435744 1.1740862 -0.3778039 0.724768945 ## 6 6 wt -2.9902720 1.5685053 -1.9064468 0.129274249 ## 7 8 (Intercept) 29.6519180 7.0878976 4.1834574 0.001527613 ## 8 8 drat -1.4698722 1.6285054 -0.9025897 0.386081744 ## 9 8 wt -2.4518017 0.7985112 -3.0704664 0.010651044
do()
is useful when you want to use any R function (user defined functions work too!) with dplyr
functions. First I grouped the observations by cyl
and then ran a linear model for each group. Then I converted the output to a tidy data frame using broom::tidy()
.
Using dplyr()
functions inside your own functions
extract_vars <- function(data, some_string){ data %>% select_(lazyeval::interp(~contains(some_string))) -> data return(data) } extract_vars(mtcars, "spam") ## spam_mpg spam_disp spam_hp ## Mazda RX4 21.0 160.0 110 ## Mazda RX4 Wag 21.0 160.0 110 ## Datsun 710 22.8 108.0 93 ## Hornet 4 Drive 21.4 258.0 110 ## Hornet Sportabout 18.7 360.0 175 ## Valiant 18.1 225.0 105 ## Duster 360 14.3 360.0 245 ## Merc 240D 24.4 146.7 62 ## Merc 230 22.8 140.8 95 ## Merc 280 19.2 167.6 123 ## Merc 280C 17.8 167.6 123 ## Merc 450SE 16.4 275.8 180 ## Merc 450SL 17.3 275.8 180 ## Merc 450SLC 15.2 275.8 180 ## Cadillac Fleetwood 10.4 472.0 205 ## Lincoln Continental 10.4 460.0 215 ## Chrysler Imperial 14.7 440.0 230 ## Fiat 128 32.4 78.7 66 ## Honda Civic 30.4 75.7 52 ## Toyota Corolla 33.9 71.1 65 ## Toyota Corona 21.5 120.1 97 ## Dodge Challenger 15.5 318.0 150 ## AMC Javelin 15.2 304.0 150 ## Camaro Z28 13.3 350.0 245 ## Pontiac Firebird 19.2 400.0 175 ## Fiat X1-9 27.3 79.0 66 ## Porsche 914-2 26.0 120.3 91 ## Lotus Europa 30.4 95.1 113 ## Ford Pantera L 15.8 351.0 264 ## Ferrari Dino 19.7 145.0 175 ## Maserati Bora 15.0 301.0 335 ## Volvo 142E 21.4 121.0 109
About this last point, you can read more about it here.
Hope you liked this small list of tricks!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.