[This article was first published on MilanoR, and kindly contributed to R-bloggers].
This post was originally posted on Quantide blog – see here.
If you want to compute arbitrary operations on a data frame that return more than one number, use dplyr do()! This post covers the basic concepts of do(), along with some advice on using it and programming with it.
do() is a verb (function) of dplyr. dplyr is a powerful R package for data manipulation, written and maintained by Hadley Wickham. It lets you perform the common data manipulation tasks on data frames: filtering rows, selecting specific columns, re-ordering rows, adding new columns, summarizing data and computing arbitrary operations.
First of all, you have to install the dplyr package:

    install.packages("dplyr")

and load it:

    require(dplyr)

We will analyze the use of do() with the following dataset, created with random data:
    set.seed(100)
    ds <- data.frame(group=c(rep("a",100), rep("b",100), rep("c",100)),
                     x=rnorm(n = 300, mean = 3, sd = 2),
                     y=rnorm(n = 300, mean = 2, sd = 2))

We first transform it into a tbl_df object to get a better print method. The data themselves are not changed.

    ds <- tbl_df(ds)
    ds

    Source: local data frame [300 x 3]

        group        x           y
       (fctr)    (dbl)       (dbl)
    1       a 1.995615 -1.71089045
    2       a 3.263062 -0.03712943
    3       a 2.842166 -0.09022217
    4       a 4.773570  0.69742469
    5       a 3.233943  2.76536531
    6       a 3.637260  4.06379942
    7       a 1.836419  2.26214995
    8       a 4.429065  2.75438347
    9       a 1.349481 -1.77539016
    10      a 2.280276  3.04043881
    ..    ...      ...         ...
Base Concepts of do() (Non Standard Evaluation Version)
As we already said, do() computes arbitrary operations on a data frame that return more than one number.
To use do(), you must know that:
- it always returns a data frame
- unlike the other data manipulation verbs of dplyr, do() needs the . placeholder to be specified inside the function to apply, referring to the data it has to work with.

    # Head of ds
    ds %>% do(head(.))

    Source: local data frame [6 x 3]

       group        x           y
      (fctr)    (dbl)       (dbl)
    1      a 1.995615 -1.71089045
    2      a 3.263062 -0.03712943
    3      a 2.842166 -0.09022217
    4      a 4.773570  0.69742469
    5      a 3.233943  2.76536531
    6      a 3.637260  4.06379942
- it is conceived to be used with dplyr group_by() to compute operations within groups:

    # Head of ds by group
    ds %>% group_by(group) %>% do(head(.))

    Source: local data frame [18 x 3]
    Groups: group [3]

        group          x           y
       (fctr)      (dbl)       (dbl)
    1       a 1.99561530 -1.71089045
    2       a 3.26306233 -0.03712943
    3       a 2.84216582 -0.09022217
    4       a 4.77356962  0.69742469
    5       a 3.23394254  2.76536531
    6       a 3.63726018  4.06379942
    7       b 2.33415330 -0.56965729
    8       b 5.72622741  1.71643653
    9       b 2.06170532  4.87756954
    10      b 4.68575126 -0.08011508
    11      b 0.08401255 -0.04767590
    12      b 2.19938816  4.18954758
    13      c 3.05634353 -0.89257491
    14      c 2.28659319  2.63171152
    15      c 4.70525275  1.31450497
    16      c 4.02673050 -1.86270620
    17      c 5.03640599  2.48564201
    18      c 0.95704183  1.27446410
- the argument of do() can be named or unnamed:
  - named arguments (more than one can be supplied) become list-columns, with one element for each group:

    # Tail (last 3 obs) of x by group
    ds %>% group_by(group) %>% do(out=tail(.$x, 3))

    Source: local data frame [3 x 2]
    Groups: <by row>

       group      out
      (fctr)    (chr)
    1      a <dbl[3]>
    2      b <dbl[3]>
    3      c <dbl[3]>

  - an unnamed argument (only one can be supplied) must be a data frame, and the group labels will be duplicated accordingly:

    # Tail (last 3 obs) of x by group
    ds %>% group_by(group) %>% do(data.frame(out=tail(.$x, 3)))

    Source: local data frame [9 x 2]
    Groups: group [3]

       group       out
      (fctr)     (dbl)
    1      a 3.8270397
    2      a 0.6426337
    3      a 0.6519305
    4      b 3.3238824
    5      b 0.8290942
    6      b 4.1538746
    7      c 6.5861213
    8      c 4.6280643
    9      c 0.3599512
Consider the following function, which returns a one-row data frame:

    my_fun <- function(x, y){
      res_x = mean(x) + 2
      res_y = mean(y) * 5
      return(data.frame(res_x, res_y))
    }

If the argument is named, the result is:

    # Apply my_fun() function to ds by group
    ds %>% group_by(group) %>% do(out=my_fun(x=.$x, y=.$y))

    Source: local data frame [3 x 2]
    Groups: <by row>

       group                out
      (fctr)              (chr)
    1      a <data.frame [1,2]>
    2      b <data.frame [1,2]>
    3      c <data.frame [1,2]>

Otherwise, if the argument is unnamed, the result is:

    # Apply my_fun() function to ds by group
    ds %>% group_by(group) %>% do(my_fun(x=.$x, y=.$y))

    Source: local data frame [3 x 3]
    Groups: group [3]

       group    res_x     res_y
      (fctr)    (dbl)     (dbl)
    1      a 5.005825  9.167546
    2      b 5.022282  8.683619
    3      c 5.025586 11.240558
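The named form stores each group's result as one element of a list-column, so it has to be unpacked before it can be used. The sketch below shows one way to do that; it rebuilds the same ds so that it runs on its own, and the names res and by_group are illustrative, not taken from the original post.

```r
library(dplyr)

# Same random dataset as in the post
set.seed(100)
ds <- data.frame(group = c(rep("a", 100), rep("b", 100), rep("c", 100)),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

# Named argument: 'out' is a list-column with one element per group
res <- ds %>% group_by(group) %>% do(out = tail(.$x, 3))

# Each element holds the result computed for that group
first_group_tail <- res$out[[1]]

# Turn the list-column into a plain list named by group
by_group <- setNames(res$out, as.character(res$group))
```

Here by_group$a is the numeric vector holding the last three x values of group a, which is the content hidden behind the <dbl[3]> entry in the printed output.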
Programming with do_() (Standard Evaluation Version)
How can we enclose the previous operations inside a function? Simple! Use do_() (the SE version of do()) and the interp() function of the lazyeval package.
lazyeval is an R package, written and maintained by Hadley Wickham. It represents a new approach to Non Standard Evaluation (NSE) for R. The difference between the SE and NSE approaches is the quoting of input variable names. NSE is suitable for interactive use (see the previous section), but not for programming, for which the SE approach is recommended.
Install and load lazyeval, if you haven't already done so.

    install.packages("lazyeval")

    require(lazyeval)
interp() helps to build up an expression from a mixture of constants and variables, to be passed to the .dots argument of dplyr verbs. For more details see the Non Standard Evaluation vignette.
In the following example interp() is used to build the expression passed to the .dots argument of group_by_() (the SE version of group_by()), which consists of the grouping variable name. It is also used to build the expression passed to the .dots argument of do_(). This expression consists of the function call, with its arguments in parentheses.
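To see what interp() produces before it is handed to a dplyr verb, it can help to call it on its own. The following is a minimal sketch, assuming lazyeval is installed; the placeholder names grp, v and k are illustrative only.

```r
library(lazyeval)

# A variable name held as a string, as it would arrive in a function argument
group_var_name <- "group"

# Replace the placeholder 'grp' with the symbol `group`
f <- interp(~ grp, grp = as.name(group_var_name))

# Mix a variable (a symbol) and a constant (a number) in one expression
g <- interp(~ mean(v) + k, v = as.name("x"), k = 2)

# Inspect the right-hand sides of the resulting formulas
f[[2]]
g[[2]]
```

The right-hand side of f is the bare symbol group and that of g is the call mean(x) + 2, which is the kind of unevaluated expression the .dots argument expects.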
    fun <- function(data, x_var_name, y_var_name, group_var_name){
      # group_by_() .dots argument
      group_dots <- interp(~ group_var_name, group_var_name = as.name(group_var_name))
      # do_() .dots argument
      do_dots = interp( ~ my_fun(x = .[[x_var_name]], y = .[[y_var_name]]))
      # Operations
      out <- data %>%
        group_by_(.dots = group_dots) %>%
        do_(.dots = do_dots)
      return(out)
    }
Let us apply the previous function to the ds dataset:

    fun(data=ds, x_var_name="x", y_var_name="y", group_var_name="group")

    Source: local data frame [3 x 3]
    Groups: group [3]

       group    res_x     res_y
      (fctr)    (dbl)     (dbl)
    1      a 5.005825  9.167546
    2      b 5.022282  8.683619
    3      c 5.025586 11.240558
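The underscored verbs such as group_by_() and do_() were deprecated in later dplyr releases. As a hedged sketch, and not the author's method, the same wrapper can be written with group_by() and do() themselves. This assumes dplyr 1.0.0 or later, where across() and all_of() are available; the name fun_alt is hypothetical.

```r
library(dplyr)

# Same random dataset and helper function as in the post
set.seed(100)
ds <- data.frame(group = c(rep("a", 100), rep("b", 100), rep("c", 100)),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

my_fun <- function(x, y) {
  res_x <- mean(x) + 2
  res_y <- mean(y) * 5
  data.frame(res_x, res_y)
}

# Hypothetical wrapper: column names arrive as strings
fun_alt <- function(data, x_var_name, y_var_name, group_var_name) {
  data %>%
    # all_of() selects the grouping column named by the string
    group_by(across(all_of(group_var_name))) %>%
    # .[[name]] extracts a column of the current group by its string name
    do(my_fun(x = .[[x_var_name]], y = .[[y_var_name]]))
}

res <- fun_alt(ds, x_var_name = "x", y_var_name = "y", group_var_name = "group")
```

The result has one row per group with the columns group, res_x and res_y, like the output of fun() above.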
Other Examples
do() is often used to fit models and to display the results.
Look at the following functions!
Let us define a function that fits a linear model and returns the coefficients as a data frame, and apply it to ds by group:
    # Function that fits linear model and returns coefficients as a data frame
    # Note: the formula x~y refers to the columns named x and y in data;
    # the x and y arguments themselves are never evaluated
    my_fun_2 <- function(data, x, y){
      mod = lm(formula = x~y, data = data)
      out = data.frame(intercept=mod$coefficients[1], slope=mod$coefficients[2])
      return(out)
    }
    # Apply my_fun_2() function (unnamed elements and nse version) to ds by group
    ds %>% group_by(group) %>% do(my_fun_2(x=x, y=y, data=.))

    Source: local data frame [3 x 3]
    Groups: group [3]

       group intercept       slope
      (fctr)     (dbl)       (dbl)
    1      a  2.939123  0.03637955
    2      b  3.149110 -0.07302733
    3      c  3.249187 -0.09946141
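The named form of do() is also handy here: instead of returning only the coefficients, the whole fitted model can be kept in a list-column and inspected later. The sketch below rebuilds ds so that it runs on its own; the names models and coefs are illustrative.

```r
library(dplyr)

# Same random dataset as in the post
set.seed(100)
ds <- data.frame(group = c(rep("a", 100), rep("b", 100), rep("c", 100)),
                 x = rnorm(n = 300, mean = 3, sd = 2),
                 y = rnorm(n = 300, mean = 2, sd = 2))

# Named argument: one fitted lm object per group, stored in the 'mod' list-column
models <- ds %>% group_by(group) %>% do(mod = lm(x ~ y, data = .))

# Extract the coefficients of each model into a list named by group
coefs <- lapply(models$mod, coef)
names(coefs) <- as.character(models$group)
```

Keeping the model objects lets you call summary(), residuals() or predict() on each group's fit without refitting.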
The previous operations can be enclosed inside a function and applied to ds by group:

    # Enclose the previous operations inside a function
    fun_2 <- function(data, x_var_name, y_var_name, group_var_name){
      # group_by_() .dots argument
      group_dots <- interp(~ group_var_name, group_var_name = as.name(group_var_name))
      # do_() .dots argument
      do_dots = interp( ~ my_fun_2(data=., x = x_var_name, y = y_var_name),
                        x_var_name=as.name(x_var_name), y_var_name=as.name(y_var_name))
      # Operations
      res <- data %>%
        group_by_(.dots = group_dots) %>%
        do_(.dots = do_dots)
      return(res)
    }
    # Apply fun_2() function (se version) to ds by group
    fun_2(data=ds, x_var_name="x", y_var_name="y", group_var_name="group")

    Source: local data frame [3 x 3]
    Groups: group [3]

       group intercept       slope
      (fctr)     (dbl)       (dbl)
    1      a  2.939123  0.03637955
    2      b  3.149110 -0.07302733
    3      c  3.249187 -0.09946141