Understanding the *apply() functions and others in R by asking questions
The *apply() functions are powerful but poorly designed, and the R documentation is poorly written. Combined, they make R very hard to learn: some explanation always seems to be missing. (Another bad example of documentation is the Python docs. I cannot retrieve information from those long, long pages with multi-level bullets. Why don’t they make them shorter?)
I was revisiting R data manipulation today. I am so tired of looking at these *apply() functions all the time and not knowing exactly how I can use them. I knew them for a while and quickly forgot them, which to me is a good indicator of bad design.
Here is a Q&A to help understand *apply().
What’s the usage of apply?
apply(X, MARGIN, FUN, ...)
The other *apply() functions follow a similar pattern.
Hmm… What’s MARGIN?
When X is a matrix,
If MARGIN=1, the FUN is applied to rows.
If MARGIN=2, the FUN is applied to columns.
If MARGIN=c(1,2), “c(1, 2) indicates rows and columns”.
If X has named dimnames, MARGIN can also be a character vector selecting dimension names.
Hmm… What does it mean by “c(1, 2) indicates rows and columns”?
It means a cell. So,
If MARGIN=c(1,2), the FUN is applied to a cell.
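To make the three MARGIN values concrete, here is a minimal sketch on a tiny matrix. The matrix `small` and the variable names are made up for illustration; they are not from the examples in this post.

```r
# A small 2 x 3 matrix so the results are easy to check by hand:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
small <- matrix(1:6, nrow = 2, ncol = 3)

row_sums <- apply(small, 1, sum)                      # FUN is called once per row
col_sums <- apply(small, 2, sum)                      # FUN is called once per column
doubled  <- apply(small, c(1, 2), function(x) x * 2)  # FUN is called once per cell

print(row_sums)  # 9 12
print(col_sums)  # 3 7 11
print(doubled)   # same shape as small, every value doubled
```

With MARGIN=c(1, 2) the result keeps the shape of the input, which is why it reads as "apply to each cell".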
Hmm… Why use 1,2, not “row”, “column”, “cell” for easier understanding? Why call it “MARGIN”, not “DIMENSION” ?
R is for smart people.
Somebody uses the following code. I don’t understand the function part.
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2);
apply(m, 1:2, function(x) x/2)
It means
apply(m, 1:2, function(x){x/2})
or even better:
apply(X=m, MARGIN=1:2, FUN=function(x) {x/2})
Dropping whatever is droppable is a way to make yourself look like a guru.
What gets passed to the function? It is not explained.
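A quick way to convince yourself that the three spellings are the same call is to compare their results. This is only a sketch; the names `short_form`, `braces_form` and `named_form` are made up for illustration.

```r
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

short_form  <- apply(m, 1:2, function(x) x/2)
braces_form <- apply(m, 1:2, function(x) {x/2})
named_form  <- apply(X = m, MARGIN = 1:2, FUN = function(x) {x/2})

# All three produce the same 10 x 2 matrix of halved values.
print(identical(short_form, braces_form))
print(identical(short_form, named_form))
print(all(short_form == m / 2))
```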
Code:
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
foo <- function(x) {
  print(c("", x, " "))
  return(NULL)
}
apply(X=m, MARGIN=1, FUN=foo)
Output:
> m
      [,1] [,2]
 [1,]    1   11
 [2,]    2   12
 [3,]    3   13
 [4,]    4   14
 [5,]    5   15
 [6,]    6   16
 [7,]    7   17
 [8,]    8   18
 [9,]    9   19
[10,]   10   20
> apply(X=m, MARGIN=1, FUN=foo)
[1] "" "1" "11" " "
[1] "" "2" "12" " "
[1] "" "3" "13" " "
[1] "" "4" "14" " "
[1] "" "5" "15" " "
[1] "" "6" "16" " "
[1] "" "7" "17" " "
[1] "" "8" "18" " "
[1] "" "9" "19" " "
[1] "" "10" "20" " "
NULL
As you can see, each row is passed to the function and printed separately.
> apply(X=m, MARGIN=2, FUN=foo)
[1] "" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" " "
[1] "" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" " "
NULL
>
As you can see, each column is printed out separately.
> apply(X=m, MARGIN=c(1,2), FUN=foo)
[1] "" "1" " "
[1] "" "2" " "
[1] "" "3" " "
[1] "" "4" " "
[1] "" "5" " "
[1] "" "6" " "
[1] "" "7" " "
[1] "" "8" " "
[1] "" "9" " "
[1] "" "10" " "
[1] "" "11" " "
[1] "" "12" " "
[1] "" "13" " "
[1] "" "14" " "
[1] "" "15" " "
[1] "" "16" " "
[1] "" "17" " "
[1] "" "18" " "
[1] "" "19" " "
[1] "" "20" " "
NULL
>
As you can see, each cell is printed out separately.
So there is a hidden rule that what you extract from X will be passed to FUN as the first parameter?
True.
And, sometimes you can override the first parameter, sometimes you cannot:
> foo <- function(x)
+ {
+ print( c("", x, " "));
+ return(NULL)
+ }
>
> apply(X=m, MARGIN=1, FUN=foo, x = 1)
Error in FUN(newX[, i], ...) : unused argument(s) (newX[, i])
> foo <- function(x,y)
+ {
+ print( c("", x, " "));
+ return(NULL)
+ }
>
> apply(X=m, MARGIN=1, FUN=foo, x = 1)
[1] "" "1" " "
[1] "" "1" " "
[1] "" "1" " "
[1] "" "1" " "
[1] "" "1" " "
[1] "" "1" " "
[1] "" "1" " "
[1] "" "1" " "
[1] "" "1" " "
[1] "" "1" " "
NULL
> foo <- function(x,y)
+ {
+ print( c("", x, " "));
+ return(NULL)
+ }
>
> apply(X=m, MARGIN=1, FUN=foo, y = 1)
[1] "" "1" "11" " "
[1] "" "2" "12" " "
[1] "" "3" "13" " "
[1] "" "4" "14" " "
[1] "" "5" "15" " "
[1] "" "6" "16" " "
[1] "" "7" "17" " "
[1] "" "8" "18" " "
[1] "" "9" "19" " "
[1] "" "10" "20" " "
NULL
>
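The usual reason to rely on this rule is to hand extra arguments to FUN: whatever you extract from X fills the first parameter, and anything you name after FUN is forwarded to the remaining ones. A minimal sketch, where `add_offset` and `offset` are hypothetical names invented for this example:

```r
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

# add_offset() is a hypothetical helper: x receives one row of m,
# offset is supplied through the extra arguments of apply().
add_offset <- function(x, offset) sum(x) + offset

res <- apply(m, 1, add_offset, offset = 100)
print(res)  # 112 114 116 118 120 122 124 126 128 130
```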
Why did you do “return (NULL)” in the examples above?
To hide how the returned values are organized, so that I can explain it here.
Code:
foo <- function(x,y) {
  print(c("", x, " "))
  return(x)
}
apply(X=m, MARGIN=1, FUN=foo)
Output:
> apply(X=m, MARGIN=1, FUN=foo)
[1] "" "1" "11" " "
[1] "" "2" "12" " "
[1] "" "3" "13" " "
[1] "" "4" "14" " "
[1] "" "5" "15" " "
[1] "" "6" "16" " "
[1] "" "7" "17" " "
[1] "" "8" "18" " "
[1] "" "9" "19" " "
[1] "" "10" "20" " "
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]   11   12   13   14   15   16   17   18   19    20
The returned values are arranged column by column: the first returned value becomes the first column, the second returned value becomes the second column, and so on.
> apply(X=m, MARGIN=2, FUN=foo)
[1] "" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" " "
[1] "" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" " "
      [,1] [,2]
 [1,]    1   11
 [2,]    2   12
 [3,]    3   13
 [4,]    4   14
 [5,]    5   15
 [6,]    6   16
 [7,]    7   17
 [8,]    8   18
 [9,]    9   19
[10,]   10   20
> apply(X=m, MARGIN=c(1,2), FUN=foo)
[1] "" "1" " "
[1] "" "2" " "
[1] "" "3" " "
[1] "" "4" " "
[1] "" "5" " "
[1] "" "6" " "
[1] "" "7" " "
[1] "" "8" " "
[1] "" "9" " "
[1] "" "10" " "
[1] "" "11" " "
[1] "" "12" " "
[1] "" "13" " "
[1] "" "14" " "
[1] "" "15" " "
[1] "" "16" " "
[1] "" "17" " "
[1] "" "18" " "
[1] "" "19" " "
[1] "" "20" " "
      [,1] [,2]
 [1,]    1   11
 [2,]    2   12
 [3,]    3   13
 [4,]    4   14
 [5,]    5   15
 [6,]    6   16
 [7,]    7   17
 [8,]    8   18
 [9,]    9   19
[10,]   10   20
>
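A practical consequence of the column-by-column layout: when FUN returns a whole row under MARGIN=1, the result comes back transposed, and t() puts it back. A minimal sketch (the names `by_row` and `restored` are made up for illustration):

```r
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

# Each row of m is returned unchanged, but apply() stores it as a column.
by_row <- apply(m, 1, function(x) x)
print(dim(by_row))   # 2 10

# Transposing restores the original 10 x 2 layout.
restored <- t(by_row)
print(dim(restored)) # 10 2
print(all(restored == m))
```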
What does by() do?
It is like ddply() from the plyr package: it applies a function to each subset of a data frame, where the subsets are defined by one or more factors.
To explain it, let me build a data frame and use it as an example.
m1 <- data.frame(c1=1:10, c2=11:20, type=rep(c("a", "b"), 5))
foo1 <- function(x) {
  print(x)
  return(NULL)
}
by(data=m1, INDICES=m1$type, FUN=foo1)
Output:
c1 c2 type
1 1 11 a
3 3 13 a
5 5 15 a
7 7 17 a
9 9 19 a
c1 c2 type
2 2 12 b
4 4 14 b
6 6 16 b
8 8 18 b
10 10 20 b
m1$type: a
NULL
------------------------------------------------------------
m1$type: b
NULL
You can see that the rows of the data frame that share the same factor level (type) are passed together to the foo1() function. Knowing what has been passed, you can figure out for yourself what you can do with it.
The inconvenient thing about by() is that it returns a list-like object of class “by”, which is hard to operate on. ddply() is better since it returns a data frame. The equivalent ddply() call is
ddply(m, .(type), foo1)
> ddply(m, .(type), foo1)
c1 c2 ctype
1 3 13 a
2 4 14 b
3 5 15 a
c1 c2 ctype
1 1 11 a
2 2 12 b
data frame with 0 columns and 0 rows
>
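If plyr is not available, the result of by() can still be turned into a data frame by hand. The sketch below is one possible base R approach, not the only one; the `people` data frame and its values are made up for illustration. It relies on FUN returning a single number per group, in which case by() gives back a one-dimensional array whose dimnames hold the group labels.

```r
# Hypothetical data; the values are invented for this example.
people <- data.frame(
  sex    = c("F", "M", "F", "M"),
  weight = c(60, 80, 64, 76)
)

# Each call to FUN receives the sub-data-frame for one level of sex.
avg <- by(data = people, INDICES = people$sex,
          FUN = function(d) mean(d$weight))

# Rebuild a plain data frame from the group labels and the group values.
avg_df <- data.frame(
  sex        = dimnames(avg)[[1]],
  avg_weight = as.numeric(avg)
)
print(avg_df)
```

The resulting `avg_df` has one row per group, which is roughly what ddply() would hand back directly.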
Which one should I use?
I’ll use special cases here for easier understanding. You can generalize from them yourself.
- apply():
  - use case #1: I have a matrix and I want to get the sum of every row/column.
  - use case #2: I have a matrix and I want a new matrix in which each cell holds 1/2 of the original value.
- by():
  - use case #1: I have a data frame that has “sex” and “weight” columns. I want to see the average weight of each sex.
- eapply():
  - “e” means environment; you may think of it as an object that can hold other objects.
  - use case #1: You build an environment for a class of students. Each variable in the environment holds the info of one student. You want to apply a function to each student to get their scores and return them as a list.
- lapply():
  - “l” stands for “list”.
  - use case #1: You have a list of vectors, each holding some numbers. You want to apply mean() to each vector in the list and return all the means in a list.
- sapply():
  - It is the same as lapply(), but it simplifies the result to a vector (or matrix) when it can.
- vapply():
  - You have requirements similar to sapply(), but you declare up front, through FUN.VALUE, what each call must return. If FUN.VALUE is a named vector of length greater than one, its names label the rows of the result so that you know what they are.
- mapply():
  - You have many vector variables and you want the sum of the first elements of these variables, then of the second elements, and so on. mapply(sum, v1, v2, v3) is like getting the sum of each row in the following matrix:
    v1[1] v2[1] v3[1]
    v1[2] v2[2] v3[2]
    v1[3] v2[3] v3[3]
- rapply():
  - “r” stands for “recursive”. You want the functionality of lapply() applied to every element of a nested list, and the how argument decides whether you get back a vector or a list.
- tapply():
  - “t” might stand for “type (factor)”.
  - use case: You have a vector whose elements belong to different types (groups given by a factor). You want to get the average of each type. The return value can be a list or a vector.
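The use cases above can be sketched in a few lines each. Everything below is base R; the data (`scores`, `nested`, `weight`, `sex`, `class_env`) is made up for illustration and is not taken from any real data set.

```r
scores <- list(alice = c(90, 80), bob = c(70, 60, 50))

# lapply(): always returns a list.
means_list <- lapply(scores, mean)     # list(alice = 85, bob = 60)

# sapply(): same calls, simplified to a named vector when possible.
means_vec <- sapply(scores, mean)      # c(alice = 85, bob = 60)

# vapply(): like sapply(), but FUN.VALUE declares what each call must return.
means_checked <- vapply(scores, mean, FUN.VALUE = numeric(1))

# A named FUN.VALUE of length > 1 labels the rows of the resulting matrix.
ranges <- vapply(scores, function(v) c(min(v), max(v)),
                 FUN.VALUE = c(min = 0, max = 0))

# mapply(): walks several vectors in parallel, element by element.
v1 <- 1:3; v2 <- 4:6; v3 <- 7:9
sums <- mapply(sum, v1, v2, v3)        # 12 15 18

# rapply(): recurses into nested lists; `how` picks the return shape.
nested <- list(a = 1, b = list(c = 2, d = 3))
flat <- rapply(nested, function(x) x * 10, how = "unlist")  # named vector
deep <- rapply(nested, function(x) x * 10, how = "list")    # nested list

# tapply(): splits a vector by a grouping factor and applies FUN per group.
weight <- c(60, 80, 64, 76)
sex    <- c("F", "M", "F", "M")
avg    <- tapply(weight, sex, mean)    # F = 62, M = 78

# eapply(): applies FUN to every object stored in an environment.
class_env <- new.env()
assign("alice", c(90, 80), envir = class_env)
assign("bob", c(70, 60, 50), envir = class_env)
env_means <- eapply(class_env, mean)   # a named list; order is not guaranteed
```

Of the three list-walking variants, vapply() is the strictest: it stops with an error if a call returns something other than what FUN.VALUE promised, whereas sapply() silently changes the shape of its result.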
Where can I learn other functions?
http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/