Overview
In this tutorial on the purrr package in R, you will learn how to use functions from the purrr package to improve the quality of your code and understand the advantages of purrr functions compared to equivalent base R functions.
Is R a Functional Programming Language?
Most of us don’t pay attention to such questions or features of a programming language. However, I have realized that this understanding is fundamental to writing efficient and effective code, which is easy to understand and execute.
Although R is not a purely functional language, it does have some technical properties which allow us to style our code in a way that is centered around solving problems using functions. To learn more about functional programming in R, I encourage you to read the book Advanced R by Hadley Wickham. For now, we will continue with our tutorial covering essential functions from the purrr package in R.
Installing purrr package
The purrr package can be installed using three different methods. As it is part of the tidyverse, I guess the easiest of all is to install the tidyverse package. The other options are to install purrr on its own, or to install the development version directly from GitHub using the install_github() function from the devtools package in R.
# The easiest way - install the tidyverse
install.packages("tidyverse")

# Install just purrr
install.packages("purrr")

# Install development version directly from GitHub
# install.packages("devtools")
devtools::install_github("tidyverse/purrr")
The purrr package is famous for its apply-style functions, as it provides a consistent set of tools for working with functions and vectors in R. So, let’s start the purrr tutorial by understanding the apply functions in the purrr package.
Eliminating for loops using map() function
Just like the apply family of functions in base R (apply(), lapply(), tapply(), vapply(), etc.), the purrr package provides a more consistent and easier-to-learn set of functions that solve similar problems. The consistency is in regard to the output data type: the map() function always returns a list. Here we will look into the following three functions.
- map() – Use if you want to apply a function to each element of the list or a vector.
- map2() – Use if you’re going to apply a function to a pair of elements from two different lists or vectors.
- pmap() – Use if you need to apply a function to a group of elements from a list of lists.
The following example will help you understand each function in a better way. The goal of using functions from the purrr package instead of regular for loop is to divide the complex problem into smaller independent pieces.
Example map() function
In the example below, we apply a user-defined square function to each element of a vector. You will notice that the output is a list, as mentioned above.
library(purrr)

# Define a function which returns the square of its input
square <- function(x) {
  return(x * x)
}

# Create a vector of numbers
vector1 <- c(2, 4, 5, 6)

# Use the map() function to generate squares
map(vector1, square)

[[1]]
[1] 4

[[2]]
[1] 16

[[3]]
[1] 25

[[4]]
[1] 36
Example map2() function
Sometimes a calculation involves two vectors or lists. In that case, you can use the map2() function. The two inputs must have the same length, or one of them must have length one (it is then recycled); otherwise, map2() throws an error about inconsistent lengths.
Let’s say we have two vectors x and y, and we want to raise x to the power y. So first, we define a function that returns the desired output, and then use the map2() function to get the expected outcome.
x <- c(2, 4, 5, 6)
y <- c(2, 3, 4, 5)

to_Power <- function(x, y) {
  return(x^y)
}

map2(x, y, to_Power)

[[1]]
[1] 4

[[2]]
[1] 64

[[3]]
[1] 625

[[4]]
[1] 7776
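The length rule for map2() can be checked with a small sketch. It assumes purrr is installed; the names mismatch_failed and squares are made up for illustration, and tryCatch() is only there so the script keeps running after the error.

```r
library(purrr)

x <- c(2, 4, 5, 6)

# Lengths 4 and 2 are incompatible, so map2() signals an error.
# tryCatch() converts that error into a TRUE flag.
mismatch_failed <- tryCatch(
  {
    map2(x, c(2, 3), ~ .x^.y)
    FALSE
  },
  error = function(e) TRUE
)

# A length-one second input is recycled to match x.
squares <- map2(x, 2, ~ .x^.y)
```

Here mismatch_failed ends up TRUE, while squares holds the square of each element of x.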
It is not necessary to define a named function first. You can also pass an anonymous function written in purrr's formula shorthand, where .x and .y refer to the current elements of the two inputs. Say I want to get the sum of each pair of values in x and y.
map2(x, y, ~ .x + .y)

[[1]]
[1] 4

[[2]]
[1] 7

[[3]]
[1] 9

[[4]]
[1] 11
Example pmap() function
Using the pmap() function, you can map a function over any number of inputs simultaneously. The inputs are iterated in parallel, meaning that the i-th call receives the i-th element of every input. Parallel here does not mean that the work is spread across multiple cores.
The example below is only for illustration purposes. The calculation may not make sense in business terms, but that’s fine. Here we are generating the sum of the mpg, hp and disp variables for the first five rows of the mtcars dataset using the pmap() function.
mtcars_sub <- mtcars[1:5, c("mpg", "hp", "disp")]
pmap(mtcars_sub, sum)

[[1]]
[1] 291

[[2]]
[1] 291

[[3]]
[1] 223.8

[[4]]
[1] 389.4

[[5]]
[1] 553.7
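Because a data frame is a list of named columns, pmap() passes each row's values to the function as named arguments. The sketch below illustrates this with a hypothetical ratio, hp_per_mpg, that is not part of the original example, and uses pmap_dbl() to get a double vector back instead of a list.

```r
library(purrr)

mtcars_sub <- mtcars[1:5, c("mpg", "hp", "disp")]

# Column names are matched to argument names, so each call receives
# one row's mpg, hp and disp values.
hp_per_mpg <- pmap_dbl(mtcars_sub, function(mpg, hp, disp) hp / mpg)

# The same row sums as pmap(mtcars_sub, sum), returned as a double vector.
row_totals <- pmap_dbl(mtcars_sub, sum)
```

The function must accept every column name as an argument (or use ...), otherwise the call fails with an unused-argument error.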
Unlike the apply functions, you don’t have to worry about different types of outputs when it comes to the map() functions from the purrr package.
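To make the point about output types concrete, here is a minimal comparison, assuming purrr is installed. The variable names are illustrative only. sapply() picks its return type based on the results, whereas map() always returns a list and map_dbl() always returns a double vector (or fails).

```r
library(purrr)

nums <- c(2, 4, 5, 6)

# sapply() simplifies when it can, so the shape depends on the results.
sapply_same_len <- sapply(nums, function(x) x * x)      # simplified to a vector
sapply_diff_len <- sapply(nums, function(x) seq_len(x)) # stays a list

# map() returns a list regardless of what the function produces.
map_result <- map(nums, function(x) x * x)

# map_dbl() returns a double vector, or errors if that is not possible.
squares <- map_dbl(nums, function(x) x * x)
```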
Working with lists using purrr package
To be productive with purrr functions in R, it is crucial to understand how to work with lists, as most of the functions return a list as output. The tasks related to lists can be put into five buckets as given below:
- Filtering lists
- Summarizing lists
- Transforming lists
- Reshaping Lists
- Join or Combine Lists
We will now look at a number of functions and tasks falling within each group.
Filtering Lists
The functions which we find of help and interest here are
- pluck() or chuck() – Using these functions, you can extract or select a particular element from a list by using its name or index. The only difference is that when the element is not present in the list, pluck() consistently returns NULL, whereas chuck() always throws an error. Let us look at the example given below:
ls1 <- list("R", "Statistics", "Blog")
pluck(ls1, 2)

[1] "Statistics"
If you pass an index of 4, which does not exist in the list, the pluck() function returns a NULL value.
ls1 <- list("R", "Statistics", "Blog")
pluck(ls1, 4)

NULL
Why don’t you go ahead and experiment with the chuck() function for better understanding and practice?
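As a starting point for that experiment, the sketch below contrasts the two functions on a missing index. It assumes purrr is installed; the variable names are illustrative, and tryCatch() is used only to capture the error from chuck().

```r
library(purrr)

ls1 <- list("R", "Statistics", "Blog")

# pluck() returns NULL for a missing element ...
missing_pluck <- pluck(ls1, 4)

# ... or a fallback value supplied through .default.
with_default <- pluck(ls1, 4, .default = NA)

# chuck() signals an error instead; tryCatch() turns it into a flag.
chuck_failed <- tryCatch(
  {
    chuck(ls1, 4)
    FALSE
  },
  error = function(e) TRUE
)
```

Use chuck() when a missing element indicates a bug you want to surface early, and pluck() when absence is an expected case.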
- keep() – A handy function: as the name suggests, it keeps only those elements of the list which pass a logical test. Here we keep only the elements that are greater than five.
ls2 <- list(23, 12, 14, 7, 2, 0, 24, 98)
keep(ls2, function(x) x > 5)

[[1]]
[1] 23

[[2]]
[1] 12

[[3]]
[1] 14

[[4]]
[1] 7

[[5]]
[1] 24

[[6]]
[1] 98
- discard() – The opposite of keep(): the function drops those elements for which the logical test returns TRUE. Say we want to drop NA values; then we can use is.na() to discard the elements which are NA in the list.
ls3 <- list(23, NA, 14, 7, NA, NA, 24, 98)
discard(ls3, is.na)

[[1]]
[1] 23

[[2]]
[1] 14

[[3]]
[1] 7

[[4]]
[1] 24

[[5]]
[1] 98
- compact() – A simple, straightforward function that drops all the NULL values present in the list. Please do not confuse NA values with that of NULL values. These are two different types in R.
ls4 <- list(23, NULL, NA, 34)
compact(ls4)

[[1]]
[1] 23

[[2]]
[1] NA

[[3]]
[1] 34
- head_while() – An interesting function: it checks the logical condition for each element starting from the top of the list and returns the leading elements up to, but not including, the first one that fails the test. In the example below, we check whether each element is a character.
ls5 <- list("R", "Statistics", "Blog", 2, 3, 1)
head_while(ls5, is.character)

[[1]]
[1] "R"

[[2]]
[1] "Statistics"

[[3]]
[1] "Blog"
If you are interested in tail elements, then the purrr package provides tail_while() function. With this, we end the list filtering functions. These are some of the most common functions which you will find of interest in day to day working.
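For completeness, here is a short sketch of tail_while() on the same list, assuming purrr is installed. The result names are illustrative. It also shows that head_while() returns an empty list when the very first element fails the test.

```r
library(purrr)

ls5 <- list("R", "Statistics", "Blog", 2, 3, 1)

# Walk backwards from the end and keep elements while they are numeric.
tail_nums <- tail_while(ls5, is.numeric)

# The first element is a character, so no leading elements are numeric.
head_nums <- head_while(ls5, is.numeric)
```

Here tail_nums contains the last three elements (2, 3, 1) and head_nums is an empty list.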
Summarising Lists
There are a number of functions which purrr provides for this, but in this purrr tutorial, we will talk about the five most widely used ones.
- every() – This function returns TRUE if all the elements in a list pass a condition or test. In the example below, the every() function returns FALSE, as one of the elements inside the list is not a character.
sm1 <- list("R", 2, "Rstatistics", "Blog")
every(sm1, is.character)

[1] FALSE
- some() – It is similar to every() in that it checks a condition against the elements inside a list, but it returns TRUE if even one value passes the test.
sm2 <- list("R", 2, "Rstatistics", "Blog")
some(sm2, is.character)

[1] TRUE
- has_element() – The function returns true if the list contains the element mentioned.
sm2 <- list("R", 2, "Rstatistics", "Blog")
has_element(sm2, 2)

[1] TRUE
- detect() – Returns the first element that passes the test or logical condition. Here the function returns the element itself. Below we are looking for elements that are numeric in the given list. Although we have two numeric elements in the list, the function only returns the first one, i.e., 2.
sm3 <- list("R", 2, "Rstatistics", "Blog", 3)
detect(sm3, is.numeric)

[1] 2
- detect_index() – Just like detect(), this function looks for the first element which passes the test, but it returns the index of that element instead of the element itself.
sm4 <- list(2, "Rstatistics", "Blog", TRUE)
detect_index(sm4, is.logical)

[1] 4
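It is also worth knowing what these two functions return when nothing matches. The sketch below, which assumes a reasonably recent purrr (the .dir argument is not available in very old releases), uses a made-up list sm with no numeric elements.

```r
library(purrr)

sm <- list("R", "Rstatistics", "Blog")

# No element is numeric: detect() gives NULL, detect_index() gives 0.
no_match <- detect(sm, is.numeric)
no_index <- detect_index(sm, is.numeric)

# Searching from the end returns the last matching element instead.
last_chr <- detect(sm, is.character, .dir = "backward")
```

Checking for NULL or 0 is therefore a simple way to test whether a match was found.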
Reshaping Lists
Flattening and transposing a list are two tasks that you will find yourself doing regularly as part of data wrangling. If you have made it this far in the tutorial, you know that flattening is something you will be doing often. The tasks mentioned here can be achieved using the following functions.
flatten() – The function removes one level of hierarchy from a list of lists. The closest equivalent in base R is the unlist() function. Although the two are similar, flatten() only removes a single layer of hierarchy and is type-stable, which means that you always know the output type. There are variants with a type suffix which ensure that you get the desired output type. These variants are listed below:
- flatten_lgl() returns a logical vector
- flatten_int() returns an integer vector
- flatten_dbl() returns a double vector
- flatten_chr() returns a character vector
- flatten_dfr() returns a data frame created by row-binding
- flatten_dfc() returns a data frame created by column-binding
Let’s look at the output generated by flatten() and its variants. First, let us create a list of numbers.
x <- rerun(2, sample(6))
x

[[1]]
[1] 2 5 1 3 6 4

[[2]]
[1] 6 1 4 3 2 5

So our list consists of 2 integer vectors, each containing the numbers 1 to 6 in random order (your values will differ from run to run). We will now flatten the list using the flatten_int() function.
flatten_int(x)

 [1] 2 5 1 3 6 4 6 1 4 3 2 5
All the functions mentioned have a simple and straightforward syntax. We believe the above example is good enough; however, in case you still face some issue, feel free to drop a comment, and we will assist you with the implementation.
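The difference between removing one level and removing all levels is easiest to see on a deeper list. The sketch below uses a made-up list called nested. Note, as a caveat I have not verified for every release, that recent purrr versions may flag flatten() as deprecated or superseded in favour of list_flatten(), so check the documentation of your installed version.

```r
library(purrr)

# A list with two levels of nesting.
nested <- list(list(1, list(2, 3)), list(4))

# flatten() removes only the outer level; the inner list(2, 3) survives.
one_level <- flatten(nested)

# unlist() removes every level and returns an atomic vector.
all_levels <- unlist(nested)
```

one_level is a list of three elements whose second element is still a list, while all_levels is the numeric vector 1, 2, 3, 4.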
transpose() – The function converts a pair of lists into a list of pairs. Let us look at an example, and I am sure it will make much more sense when you compare the before and after outputs.
x <- rerun(2, x = runif(1), y = runif(3))
x

[[1]]
[[1]]$x
[1] 0.956008

[[1]]$y
[1] 0.4784622 0.7901005 0.7429528


[[2]]
[[2]]$x
[1] 0.8055662

[[2]]$y
[1] 0.3681470 0.9886638 0.7591404

x %>% transpose() %>% str()

List of 2
 $ x:List of 2
  ..$ : num 0.956
  ..$ : num 0.806
 $ y:List of 2
  ..$ : num [1:3] 0.478 0.79 0.743
  ..$ : num [1:3] 0.368 0.989 0.759
Join or Combine Lists
You can join two lists in different ways: you can append one list at the end of the other, or you can add it at the beginning of the other. Base R's append() and purrr's prepend() cover these two tasks. Let us see, given two lists, how we can achieve them.
- append() – This base R function appends one list at the end of another. Here we are appending list b to list a. So, let’s first create two lists named a and b. Then we append and finally flatten the list using the flatten_dbl() function.
a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)
flatten_dbl(append(a, b))

[1] 22 11 44 55 11 99 77
- prepend() – Using this function, we can add a list before another list. In the example below, list b is placed in front of list a.
a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)
flatten_dbl(prepend(a, b))

[1] 11 99 77 22 11 44 55
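The same result can be obtained with base R alone, because append() accepts an after argument that controls the insertion position. This is a small sketch using only base R; b_first is an illustrative name.

```r
a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)

# after = 0 inserts b before the first element of a.
b_first <- append(a, b, after = 0)

# unlist() collapses the list of single numbers into a numeric vector.
b_first_vec <- unlist(b_first)
```

This may be a useful fallback if prepend() is unavailable or flagged as deprecated in your purrr version.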
Other useful functions
In this section, we will cover functions that do not necessarily fall into the above categories. But we believe knowing these functions will improve your programming skills tremendously.
- cross_df() – The function returns a data frame where each row is one combination of the list elements.
df <- list(
  empId = c(100, 101, 102, 103),
  name = c("John", "Jack", "Jill", "Cathy"),
  exp = c(4, 10, 6, 8)
)
df

$empId
[1] 100 101 102 103

$name
[1] "John"  "Jack"  "Jill"  "Cathy"

$exp
[1]  4 10  6  8

Here we have three vectors stored in a list. We can now use the cross_df() function to get the data frame.
cross_df(df)

# A tibble: 64 x 3
   empId name    exp
   <dbl> <chr> <dbl>
 1   100 John      4
 2   101 John      4
 3   102 John      4
 4   103 John      4
 5   100 Jack      4
 6   101 Jack      4
 7   102 Jack      4
 8   103 Jack      4
 9   100 Jill      4
10   101 Jill      4
# ... with 54 more rows
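The 64 rows come from 4 x 4 x 4 combinations. As a cross-check that does not depend on purrr, base R's expand.grid() builds the same kind of table from the list. This is offered as a comparable base R approach, not as what cross_df() does internally; combos is an illustrative name.

```r
df <- list(
  empId = c(100, 101, 102, 103),
  name = c("John", "Jack", "Jill", "Cathy"),
  exp = c(4, 10, 6, 8)
)

# expand.grid() accepts a list of vectors and returns every combination;
# stringsAsFactors = FALSE keeps name as a character column.
combos <- expand.grid(df, stringsAsFactors = FALSE)

n_combos <- nrow(combos)
```

As in the cross_df() output, the first column varies fastest.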
- rerun() – You can use rerun() to repeat an expression n times and collect the results in a list. It behaves much like base R's replicate() with simplify = FALSE. The rerun() function is very useful when it comes to generating sample data in R.
rerun(1, print("Hello, World!"))
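If rerun() is not available in your purrr version (as far as I know, it is deprecated in recent releases), the same kind of sample data can be generated with base R's replicate() or with map(). The sketch below is one way to do it; the seed and variable names are arbitrary choices for illustration.

```r
library(purrr)

set.seed(42)

# Base R: evaluate the expression 3 times and keep the results as a list.
samples_base <- replicate(3, sample(6), simplify = FALSE)

# purrr: map over a dummy index and ignore it inside the function.
samples_map <- map(1:3, ~ sample(6))
```

Both objects are lists of three integer vectors, each a random permutation of 1 to 6.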
- reduce() – The reduce() function combines the elements of a list or vector into a single value by repeatedly applying a two-argument function: the result so far is combined with the next element. For example, say I want to add all the numbers of a vector. Notice that the operator is wrapped in backticks rather than quotation marks.
reduce(c(4, 12, 30, 16), `+`)

[1] 62
Let’s look at another example. Say I want to concatenate the elements of several vectors position by position. Because paste() is vectorised, reducing with it joins the first elements of each vector together, then the second elements, and so on.
x <- list(c(0, 1), c(2, 3), c(4, 5))
reduce(x, paste)

[1] "0 2 4" "1 3 5"
The function also has a variant named reduce2(). If your work involves two vectors or lists, you can use reduce2() instead of reduce().
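To see what reduce() and reduce2() do step by step, consider the sketch below. It assumes purrr is installed; paste2 is a hypothetical helper written for this illustration. With reduce2(), the second input supplies an extra argument for each step, so it must be one element shorter than the first when no initial value is given.

```r
library(purrr)

# reduce() folds from the left: ((4 + 12) + 30) + 16
total <- reduce(c(4, 12, 30, 16), `+`)
by_hand <- ((4 + 12) + 30) + 16

# Hypothetical helper: join the accumulated string and the next
# element using the separator supplied for this step.
paste2 <- function(acc, nxt, sep) paste(acc, nxt, sep = sep)

# Four elements need three separators, one for each joining step.
joined <- reduce2(c("a", "b", "c", "d"), c("-", ".", "-"), paste2)
```

Here joined is the single string "a-b.c-d".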
- accumulate() – The function sequentially applies a function to a vector or list. It works just like reduce(), but also returns the intermediate results. At each iteration, the function takes two arguments: the first is the initial value or the result from the previous step, and the second is the next value in the vector. For further understanding, let’s take a look at the example below, which returns the cumulative sum of the values in a vector.
accumulate(c(1, 2, 3, 4, 5), sum)

[1]  1  3  6 10 15
The function can be implemented on two different lists through the use of accumulate2().
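A quick way to convince yourself of how accumulate() relates to reduce() is to compare it with base R's cumsum() and to try it with a non-numeric function. This is a minimal sketch assuming purrr is installed; the variable names are illustrative.

```r
library(purrr)

values <- c(1, 2, 3, 4, 5)

# Every intermediate sum is kept, matching base R's cumsum().
running <- accumulate(values, `+`)
same_as_cumsum <- isTRUE(all.equal(running, cumsum(values)))

# The last intermediate result equals what reduce() returns.
final <- reduce(values, `+`)

# With paste(), each step extends the previous string.
words <- accumulate(c("a", "b", "c"), paste)
```

Here words is the character vector "a", "a b", "a b c".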
Bonus – Creating Nested Data Frames
A nested data frame stores multiple tables within the rows of a larger table. You can create nested data for tables in which groups exist within the data. For example, the world-famous iris dataset contains data about three different species of flowers. Here we will convert iris into a nested data frame. The following are the steps you need to follow to convert any data (with groups) into a nested data frame.
- Group the data using the dplyr::group_by() function
library(dplyr)

iris_grouped <- iris %>% group_by(Species)
iris_grouped

# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length
          <dbl>       <dbl>        <dbl>
 1          5.1         3.5          1.4
 2          4.9         3            1.4
 3          4.7         3.2          1.3
 4          4.6         3.1          1.5
 5          5           3.6          1.4
 6          5.4         3.9          1.7
 7          4.6         3.4          1.4
 8          5           3.4          1.5
 9          4.4         2.9          1.4
10          4.9         3.1          1.5
# ... with 140 more rows, and 2 more variables:
#   Petal.Width <dbl>, Species <fct>
- Use the tidyr::nest() function on the grouped data to create a nested data frame where each row holds the subset of data for one group.
library(tidyr)

nested_iris <- iris_grouped %>% nest()
nested_iris

# A tibble: 3 x 2
# Groups:   Species [3]
  Species    data
  <fct>      <list>
1 setosa     <tibble [50 x 4]>
2 versicolor <tibble [50 x 4]>
3 virginica  <tibble [50 x 4]>
Now that we have the tables saved in each row by each species as a tibble, you can call any function on them using map() function.
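As a small illustration of calling functions on the nested tables, the sketch below adds two summary columns. It assumes dplyr, tidyr and purrr are installed; the column names n_rows and mean_sepal are made up for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)

nested_iris <- iris %>%
  group_by(Species) %>%
  nest()

# map_int() and map_dbl() apply a function to each nested tibble and
# return an integer or double vector that fits in a regular column.
summary_iris <- nested_iris %>%
  mutate(
    n_rows = map_int(data, nrow),
    mean_sepal = map_dbl(data, ~ mean(.x$Sepal.Length))
  )
```

Each row of summary_iris now carries the number of observations and the mean sepal length of its species alongside the nested data.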
Practice Question
Develop a linear regression model that predicts the mileage of a car for each cylinder type. Once you have the linear regression models, save the intercept in a column named intercept.
First, we create the groups and then get the nested data frame.
library(dplyr)
library(tidyr)
library(purrr)

mtcars_by_cyl <- mtcars %>% group_by(cyl)
nested_mtcars <- mtcars_by_cyl %>% nest()

# Define the lm function
lm_fun <- function(data) lm(mpg ~ ., data = data)

# Use mutate and map to build the models and save the result
lm_mtcars <- nested_mtcars %>%
  mutate(model = map(data, lm_fun))
Let’s see what’s inside the model column in the lm_mtcars object.
lm_mtcars[[3]]

[[1]]

Call:
lm(formula = mpg ~ ., data = data)

Coefficients:
(Intercept)         disp           hp
   32.78649      0.07456     -0.04252
       drat           wt         qsec
    1.52367      5.12418     -2.33333
         vs           am         gear
   -1.75289           NA           NA
       carb
         NA


[[2]]

Call:
lm(formula = mpg ~ ., data = data)

Coefficients:
(Intercept)         disp           hp
   60.85893     -0.34522     -0.03325
       drat           wt         qsec
   -4.19300      4.48273     -0.10639
         vs           am         gear
   -3.64277     -6.32631      4.06653
       carb
    3.22483


[[3]]

Call:
lm(formula = mpg ~ ., data = data)

Coefficients:
(Intercept)         disp           hp
    6.25438     -0.02342      0.15195
       drat           wt         qsec
   -5.74240     -0.72632      1.35856
         vs           am         gear
         NA      4.87476           NA
       carb
   -4.77330
You will notice that three different models are created and stored as a list inside the column named model. We will write a function to extract the intercept and save that information in the column called intercept.
# Function for extraction of the first coefficient (the intercept)
beta_extract_fun <- function(mod) coefficients(mod)[[1]]

# Extract the intercept value for each model
lm_mtcars %>%
  transmute(data, intercept = map_dbl(model, beta_extract_fun))

# A tibble: 3 x 3
# Groups:   cyl [3]
    cyl data               intercept
  <dbl> <list>                 <dbl>
1     6 <tibble [7 x 10]>      32.8
2     4 <tibble [11 x 10]>     60.9
3     8 <tibble [14 x 10]>      6.25
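The same split-model-extract pattern can be written without nested data frames by using base R's split() together with map(). The sketch below is a variation, not the solution above: it fits a simpler model, mpg ~ wt, which is my own choice for illustration because the small groups cannot support all ten predictors (hence the NA coefficients in the output above).

```r
library(purrr)

# One data frame per cylinder count, named "4", "6" and "8".
by_cyl <- split(mtcars, mtcars$cyl)

# Fit one simple regression per group.
models <- map(by_cyl, function(d) lm(mpg ~ wt, data = d))

# Pull the intercept out of each model as a named double vector.
intercepts <- map_dbl(models, function(m) coef(m)[["(Intercept)"]])
```

The names of intercepts identify the cylinder group each value belongs to.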
In this article on the purrr package in R, we learned some very useful functions which will help you write better code with a focus on the functional side of R programming. I hope you find this tutorial helpful and that, going forward, you will be able to decide when to fall back on functions from the purrr package.