Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this new post about the apply family functions, we’ll show some examples of three new functions: lapply()
, sapply()
, and vapply()
. These functions generally works with objects of class list
; however, other variants can work with vectors.
Suppose now we have several databases on which we need to obtain the sum of each column. We’ll generate a simple function that returns a database of random numbers in the range $ [1,100] $ and also contains missing values, with a certain number of rows and columns.
dat <- function(rows, columns){ values <- sample(x = c(NA,1:100), size = rows*columns, replace = TRUE) as.data.frame(matrix(data = values, nrow = rows, ncol = columns)) }
Thus, for example, we can generate a database with 50 rows and 5 columns by executing the following command:
set.seed(20191208) df1 <- dat(50, 5)
Let’s generate more other databases of dimensions \(100\times5\) and \(150\times5\).
set.seed(20191208) df2 <- dat(100, 5) df3 <- dat(150, 5)
As we saw in this post, we could use the apply()
function to get the sum of the columns.
apply(df1,2,sum) ## V1 V2 V3 V4 V5 ## 2546 2602 2699 1993 2482 apply(df2,2,sum) ## V1 V2 V3 V4 V5 ## 5148 4692 NA 5024 5177 apply(df3,2,sum) ## V1 V2 V3 V4 V5 ## NA 7554 NA NA NA
Or, something simpler would be to use the function colSums()
.
colSums(df1) ## V1 V2 V3 V4 V5 ## 2546 2602 2699 1993 2482 colSums(df2) ## V1 V2 V3 V4 V5 ## 5148 4692 NA 5024 5177 colSums(df3) ## V1 V2 V3 V4 V5 ## NA 7554 NA NA NA
However, we could have a large number of databases, so the above procedures are not viable since writing so many lines of code is quite dull.
That’s the time the lists come into play. The three databases we generate can be stored in a single list as follows.
list1 <- list(df1, df2, df3)
By having a list, each database becomes an element of that list. For example, element 1 is the set df1, while the sets df2 and df3 are elements 2 and three, respectively. One way to call the elements within a list is by [[]]
, as follows we call the data set df2:
list1[[2]] ## V1 V2 V3 V4 V5 ## 1 35 90 35 84 79 ## 2 51 36 4 100 5 ## 3 76 47 65 90 36 ## 4 17 22 97 72 1 ## 5 95 16 8 86 65 ## 6 73 31 47 52 87 ## 7 37 23 64 85 13 ## 8 81 62 33 82 27 ## 9 59 63 79 40 45 ## 10 38 25 34 9 24 ## 11 92 82 53 75 80 ## 12 94 51 50 35 87 ## 13 44 54 10 49 71 ## 14 96 23 54 62 99 ## 15 73 87 8 64 16 ## 16 58 94 67 41 100 ## 17 92 93 78 34 59 ## 18 30 64 53 49 80 ## 19 14 21 94 59 86 ## 20 14 9 51 42 19 ## 21 55 99 12 70 54 ## 22 30 34 93 76 68 ## 23 19 40 85 29 51 ## 24 46 83 98 41 88 ## 25 48 34 100 49 37 ## 26 29 34 72 11 62 ## 27 39 78 68 41 20 ## 28 68 62 51 68 76 ## 29 92 69 37 15 95 ## 30 15 25 63 72 24 ## 31 72 90 4 63 35 ## 32 7 34 91 95 5 ## 33 62 43 1 66 57 ## 34 32 69 29 87 30 ## 35 96 67 21 32 13 ## 36 13 71 26 96 91 ## 37 58 72 30 66 5 ## 38 80 45 84 45 86 ## 39 84 89 83 86 2 ## 40 44 78 13 6 23 ## 41 24 70 63 93 1 ## 42 22 65 32 85 40 ## 43 32 54 79 34 88 ## 44 10 12 37 31 21 ## 45 48 28 9 61 75 ## 46 19 95 22 71 41 ## 47 60 48 4 26 30 ## 48 79 34 59 80 65 ## 49 25 77 61 5 95 ## 50 69 7 71 75 60 ## 51 33 53 100 43 19 ## 52 16 82 9 98 52 ## 53 53 60 35 3 77 ## 54 9 31 33 61 72 ## 55 42 17 64 40 80 ## 56 71 87 95 31 72 ## 57 79 20 11 6 67 ## 58 67 55 57 55 31 ## 59 82 82 5 87 93 ## 60 84 34 20 22 9 ## 61 38 22 92 45 24 ## 62 48 75 60 5 74 ## 63 44 78 78 29 66 ## 64 10 70 14 37 16 ## 65 12 12 93 66 19 ## 66 88 29 63 94 30 ## 67 77 44 34 59 96 ## 68 5 46 81 11 72 ## 69 79 40 73 26 99 ## 70 96 24 46 94 7 ## 71 46 64 12 2 30 ## 72 3 38 50 89 85 ## 73 86 17 14 21 37 ## 74 57 37 NA 15 36 ## 75 23 12 48 20 46 ## 76 41 7 82 69 46 ## 77 59 5 4 93 98 ## 78 47 52 23 59 94 ## 79 19 47 58 12 47 ## 80 44 34 15 15 47 ## 81 15 38 57 84 67 ## 82 98 8 83 10 64 ## 83 50 38 48 95 40 ## 84 17 18 24 1 59 ## 85 73 12 2 30 48 ## 86 41 79 38 48 68 ## 87 12 84 77 25 79 ## 88 66 52 5 85 41 ## 89 80 39 60 29 29 ## 90 100 43 15 16 3 ## 91 9 37 75 52 94 ## 92 94 22 19 48 52 ## 93 53 9 100 83 52 ## 94 37 39 78 31 65 ## 95 96 26 66 87 22 ## 96 64 25 29 7 55 ## 97 62 50 91 21 45 ## 98 18 44 94 15 18 ## 99 61 11 48 48 50 ## 100 98 45 27 17 68
Now, if we need to apply the colSums()
function to each data set, we can use the lapply()
function:
lapply(list1, colSums) ## [[1]] ## V1 V2 V3 V4 V5 ## 2546 2602 2699 1993 2482 ## ## [[2]] ## V1 V2 V3 V4 V5 ## 5148 4692 NA 5024 5177 ## ## [[3]] ## V1 V2 V3 V4 V5 ## NA 7554 NA NA NA
The result is a list with the sum of the columns of each database. If we need to do the sums, but without counting the missing values, we have to incorporate the respective argument of the colSums()
function.
lapply(list1, colSums, na.rm=TRUE) ## [[1]] ## V1 V2 V3 V4 V5 ## 2546 2602 2699 1993 2482 ## ## [[2]] ## V1 V2 V3 V4 V5 ## 5148 4692 4887 5024 5177 ## ## [[3]] ## V1 V2 V3 V4 V5 ## 7780 7554 7470 7826 7033
As you can see, the lapply()
function works with three arguments: a list -in this case, the list object-, a function that we want to apply to each element of that list -in this case, colSums()
-, and if necessary, the arguments requested by the indicated function -in this case na.rm = TRUE
of the colSums()
– function.
The previous result returns the calculations in an object of class list
; however, in many cases, it is desirable to obtain a more orderly format. The sapply()
function works identically to lapply()
, with the proviso that if the result of each item in the list has the same length, the sapply()
function groups the result. By using the lapply()
motion, we obtain a list of three elements, where each element is a vector of length five, that is, they all have the same length, so the sapply()
function would return the following:
sapply(list1, colSums) ## [,1] [,2] [,3] ## V1 2546 5148 NA ## V2 2602 4692 7554 ## V3 2699 NA NA ## V4 1993 5024 NA ## V5 2482 5177 NA sapply(list1, colSums, na.rm=TRUE) ## [,1] [,2] [,3] ## V1 2546 5148 7780 ## V2 2602 4692 7554 ## V3 2699 4887 7470 ## V4 1993 5024 7826 ## V5 2482 5177 7033
Although the sapply()
function seems more useful than lapply()
, it has a small inconvenience, and it is always going to work… How can this be an inconvenience? Let’s incorporate a new data set, but this time with six columns instead of five like the previous ones.
df4 <- dat(150, 6) list2 <- list(df1, df2, df3, df4)
If we use the lapply()
function again, we would again obtain the sums by columns in the form of a list:
lapply(list2, colSums, na.rm=TRUE) ## [[1]] ## V1 V2 V3 V4 V5 ## 2546 2602 2699 1993 2482 ## ## [[2]] ## V1 V2 V3 V4 V5 ## 5148 4692 4887 5024 5177 ## ## [[3]] ## V1 V2 V3 V4 V5 ## 7780 7554 7470 7826 7033 ## ## [[4]] ## V1 V2 V3 V4 V5 V6 ## 8205 6712 6794 8049 7562 7335
While if we apply the sapply()
function:
sapply(list2, colSums, na.rm=TRUE) ## [[1]] ## V1 V2 V3 V4 V5 ## 2546 2602 2699 1993 2482 ## ## [[2]] ## V1 V2 V3 V4 V5 ## 5148 4692 4887 5024 5177 ## ## [[3]] ## V1 V2 V3 V4 V5 ## 7780 7554 7470 7826 7033 ## ## [[4]] ## V1 V2 V3 V4 V5 V6 ## 8205 6712 6794 8049 7562 7335
Now we get a list, that is, the same result as with sapply()
. The reason is that not all elements have the same length, before there were three vectors of length five, while now also a vector of length six. This fact seems to be irrelevant, since a result is still being obtained, however depending on the context that result will not always be valid.
Suppose that based on the analysis we are doing, we know that the sum of the columns can only return a vector of length five and that if the result is something else, it may be due to an error in one of the databases, such as additional columns. The vapply()
function allows us, like sapply()
, to apply a function to the elements of a list, but specifying that the expected result, in this case, is a numerical vector of length five. Let’s first make a comparison between sapply()
and vapply()
with the object list
, which has three databases with five columns:
sapply(list1, colSums, na.rm=TRUE) ## [,1] [,2] [,3] ## V1 2546 5148 7780 ## V2 2602 4692 7554 ## V3 2699 4887 7470 ## V4 1993 5024 7826 ## V5 2482 5177 7033 vapply(list1, colSums, numeric(5), na.rm=TRUE) ## [,1] [,2] [,3] ## V1 2546 5148 7780 ## V2 2602 4692 7554 ## V3 2699 4887 7470 ## V4 1993 5024 7826 ## V5 2482 5177 7033
The results are identical. But now let’s repeat the previous example but for the object list2
, which contains a data set with six columns.
sapply(list2, colSums, na.rm=TRUE) ## [[1]] ## V1 V2 V3 V4 V5 ## 2546 2602 2699 1993 2482 ## ## [[2]] ## V1 V2 V3 V4 V5 ## 5148 4692 4887 5024 5177 ## ## [[3]] ## V1 V2 V3 V4 V5 ## 7780 7554 7470 7826 7033 ## ## [[4]] ## V1 V2 V3 V4 V5 V6 ## 8205 6712 6794 8049 7562 7335
The sapply()
function performs the calculation, but under the assumption that the expected result is vectors of length five, this result is incorrect. The vapply()
function helps us control this problem.
vapply(list2, colSums, numeric(5), na.rm=TRUE) ## Error in vapply(list2, colSums, numeric(5), na.rm = TRUE): Los valores deben ser de longitud 5, ## pero el resultado FUN(X [[4]]) es la longitud 6
When trying to execute the code, we get an error, since evaluating the function in element number four of the list returns a vector of length six and not five as expected.
The vapply()
function is generally more advisable than sapply()
because it allows you to have more control over the expected results.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.