One Stop Tutorial On purrr Package In R

datasciencebeginners

2 years ago

[This article was first published on R Statistics Blog, and kindly contributed to R-bloggers].


Overview

In this tutorial on purrr package in R, you will learn how to use functions from the purrr package in R to improve the quality of your code and understand the advantages of purrr functions compared to equivalent base R functions.

Is R a Functional Programming Language?

Most of us don’t pay attention to such questions or features of a programming language. However, I have realized that this understanding is fundamental to writing efficient and effective code, which is easy to understand and execute.

Although R is not a purely functional language, it does have technical properties which allow us to style our code around solving problems with functions. To learn more about functional programming with regard to R, I encourage you to read the book Advanced R by Hadley Wickham. For now, we will continue with our tutorial covering essential functions from the purrr package in R.

Installing purrr package

The purrr package can be installed in three different ways. Because it is part of the tidyverse, the easiest option is to install the tidyverse package. Alternatively, you can install purrr on its own from CRAN, or install the development version from GitHub using the install_github() function from the devtools package.

# The easiest way - install the tidyverse
install.packages("tidyverse")

# Install just purrr
install.packages("purrr")

# Install development version directly from GitHub
# install.packages("devtools")
devtools::install_github("tidyverse/purrr")

The purrr package is best known for its apply-style functions, as it provides a consistent set of tools for working with functions and vectors in R. So, let’s start the purrr tutorial by understanding the apply functions in the purrr package.

Eliminating for loops using map() function

Just like the apply family of functions in base R (apply(), lapply(), tapply(), vapply(), etc.), the purrr package provides a set of functions that solve similar problems but are more consistent and easier to learn. The consistency is with regard to the output data type: the map() function always returns a list.

Here we will look at the following three functions.

map() – Use if you want to apply a function to each element of the list or a vector.
map2() – Use if you’re going to apply a function to a pair of elements from two different lists or vectors.
pmap() – Use if you need to apply a function to a group of elements from a list of lists.

The following example will help you understand each function in a better way. The goal of using functions from the purrr package instead of regular for loop is to divide the complex problem into smaller independent pieces.

Example map() function

In the example below, we apply a user-defined square() function to each element of a vector. You will notice that the output is a list, as mentioned above.

# Load purrr (also attached by library(tidyverse))
library(purrr)

# Define a function which returns the square of its input
square <- function(x){
  return(x*x)
}


# Create a vector of numbers
vector1 <- c(2,4,5,6)

# Use the map() function to generate squares
map(vector1, square)

[[1]]
[1] 4

[[2]]
[1] 16

[[3]]
[1] 25

[[4]]
[1] 36
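To make the "eliminating for loops" idea concrete, here is a minimal sketch that writes the same computation as an explicit for loop and as a map() call. The names squares_loop and squares_map are only illustrative.

```r
library(purrr)

square <- function(x) x * x
vector1 <- c(2, 4, 5, 6)

# Loop version: pre-allocate a list, then fill it element by element
squares_loop <- vector("list", length(vector1))
for (i in seq_along(vector1)) {
  squares_loop[[i]] <- square(vector1[[i]])
}

# map() version: the same result in a single call
squares_map <- map(vector1, square)

# Both approaches produce the same list
identical(squares_loop, squares_map)
```

The loop has to manage the output container and the index itself, while map() leaves only the function that describes the work.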

Example map2() function

Sometimes the calculations involve two variables, vectors, or lists. In that case, you can use the map2() function. The only requirement is that the two vectors have the same length (a vector of length one is recycled); otherwise, an error is thrown reporting the inconsistency between the vector lengths.

Let’s say we have two vectors, x and y, and we want to raise each element of x to the power of the corresponding element of y. First, we define a function that returns the desired output, and then we use the map2() function to get the expected outcome.

x <- c(2, 4, 5, 6)
y <- c(2, 3, 4, 5)

# Raise x to the power y (^ is the idiomatic spelling of **)
to_Power <- function(x, y){
  return(x^y)
}

map2(x, y, to_Power)

[[1]]
[1] 4

[[2]]
[1] 64

[[3]]
[1] 625

[[4]]
[1] 7776

It is not necessary to pass a named function. You can also pass a formula, in which .x and .y refer to the current elements of the two inputs, as given below. Say I want to get the sum of each pair of values in x and y.

map2(x, y, ~ .x + .y)

[[1]]
[1] 4

[[2]]
[1] 7

[[3]]
[1] 9

[[4]]
[1] 11
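The length requirement of map2() can be checked with a small sketch. The fallback string "length mismatch" and the variable names are illustrative; tryCatch() is used only so that the script keeps running after the error.

```r
library(purrr)

x <- c(2, 4, 5, 6)

# A length-1 second argument is recycled to match the length of x
recycled <- map2(x, 10, ~ .x + .y)

# Lengths 4 and 3 cannot be recycled, so map2() signals an error;
# tryCatch() captures it and returns a marker string instead
result <- tryCatch(
  map2(x, c(1, 2, 3), ~ .x + .y),
  error = function(e) "length mismatch"
)

result
```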

Example pmap() function

Using the pmap() function, you can map a function over multiple inputs simultaneously. The inputs are processed in parallel, in the sense that the function receives the i-th element of every input in the same call. The word parallel here does not mean that the work is spread across multiple cores.

The example below is only for illustration purposes. The calculation may not make sense in business terms, but that’s fine. Here we compute the row-wise sum of the mpg, hp, and disp variables from the mtcars dataset using the pmap() function. Because a data frame is a list of columns, pmap() calls sum() once per row.

mtcars_sub <- mtcars[1:5,c("mpg", "hp", "disp")]
pmap(mtcars_sub, sum)

[[1]]
[1] 291

[[2]]
[1] 291

[[3]]
[1] 223.8

[[4]]
[1] 389.4

[[5]]
[1] 553.7

Unlike apply functions, you don’t have to worry about different types of outputs when it comes to map() functions from purrr package.
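When a plain vector is more convenient than a list, purrr offers typed variants of the mapping functions. The sketch below reuses the square() and mtcars examples from above with map_dbl() and pmap_dbl(); the variable names are illustrative.

```r
library(purrr)

square <- function(x) x * x

# map_dbl() returns a double vector and fails if any result
# is not a single number
squares <- map_dbl(c(2, 4, 5, 6), square)

# pmap_dbl() does the same for row-wise operations on a data frame
mtcars_sub <- mtcars[1:5, c("mpg", "hp", "disp")]
row_sums <- pmap_dbl(mtcars_sub, sum)

squares
row_sums
```

The suffix in the function name states the output type up front, which is the kind of consistency that sapply() in base R does not guarantee.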

Working with lists using purrr package

It is crucial to understand how to work productively with lists when using purrr functions in R, as most of the functions return a list as output. The tasks related to lists can be put into five buckets, as given below:

Filtering lists
Summarizing lists
Transforming lists
Reshaping Lists
Join or Combine Lists

We will now look at a number of functions and tasks falling within these groups.

Filtering Lists

The functions which we find most helpful here are:

pluck() or chuck() – Using these functions, you can extract or select a particular element from a list by using its name or index. The only difference is that when the element is not present in the list, pluck() returns NULL, whereas chuck() throws an error. Let us look at the example given below:

ls1 <- list("R", "Statistics", "Blog")
pluck(ls1, 2)

[1] "Statistics"

You will notice that if you pass an index of 4, which does not exist in the list, the pluck() function returns a NULL value.

ls1 <- list("R", "Statistics", "Blog")
pluck(ls1, 4)

NULL

Why don’t you go ahead and experiment with the chuck() function for better understanding and practice.
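As a starting point for that experiment, here is a minimal sketch contrasting the two functions on a missing index. The fallback strings are illustrative, and tryCatch() is used so that the error from chuck() does not stop the script.

```r
library(purrr)

ls1 <- list("R", "Statistics", "Blog")

# pluck() returns NULL for a missing index
missing_value <- pluck(ls1, 4)

# The .default argument supplies a fallback instead of NULL
with_default <- pluck(ls1, 4, .default = "not found")

# chuck() throws an error instead, caught here with tryCatch()
chuck_result <- tryCatch(
  chuck(ls1, 4),
  error = function(e) "chuck() raised an error"
)

chuck_result
```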

keep() – A handy function. As the name suggests, it keeps only those elements of the list which pass a logical test. Here we keep only the elements that are greater than five.

ls2 <- list(23, 12, 14, 7, 2, 0, 24, 98)
keep(ls2, function(x) x > 5)

[[1]]
[1] 23

[[2]]
[1] 12

[[3]]
[1] 14

[[4]]
[1] 7

[[5]]
[1] 24

[[6]]
[1] 98

discard() – The function drops those values which pass the logical test; it is the opposite of keep(). Say we want to drop NA values; then we can use is.na() to discard the elements which are NA in the list.

ls3 <- list(23, NA, 14, 7, NA, NA, 24, 98)
discard(ls3, is.na)

[[1]]
[1] 23

[[2]]
[1] 14

[[3]]
[1] 7

[[4]]
[1] 24

[[5]]
[1] 98

compact() – A simple, straightforward function that drops all the NULL values present in the list. Please do not confuse NA values with that of NULL values. These are two different types in R.

ls4 <- list(23, NULL, NA, 34)
compact(ls4)

[[1]]
[1] 23

[[2]]
[1] NA

[[3]]
[1] 34

head_while() – An interesting function. It checks the logical condition for each element in the list, starting from the top, and returns the leading elements up to (but not including) the first one that fails the test. In the example below, we check whether each element is a character.

ls5 <- list("R", "Statistics", "Blog", 2, 3, 1)
head_while(ls5, is.character)

[[1]]
[1] "R"

[[2]]
[1] "Statistics"

[[3]]
[1] "Blog"

If you are interested in tail elements, then the purrr package provides tail_while() function. With this, we end the list filtering functions. These are some of the most common functions which you will find of interest in day to day working.
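For completeness, a short sketch of tail_while() on the same list, this time collecting the trailing numeric elements. The variable name tail_elements is illustrative.

```r
library(purrr)

ls5 <- list("R", "Statistics", "Blog", 2, 3, 1)

# Walk backwards from the end and keep elements while they are numeric
tail_elements <- tail_while(ls5, is.numeric)

tail_elements
```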

Summarising Lists

purrr provides several functions for summarising lists; in this purrr tutorial, we will talk about the five most widely used ones.

every() – This function returns TRUE if all the elements in a list pass a condition or test. In the below example, every() function returns FALSE as one of the elements inside the list is not a character.

sm1 <- list("R", 2, "Rstatistics", "Blog")
every(sm1, is.character)

[1] FALSE

some() – It is similar to every() in that it checks a condition against the elements of a list, but it returns TRUE if even one value passes the test.

sm2 <- list("R", 2, "Rstatistics", "Blog")
some(sm2, is.character)

[1] TRUE

has_element() – The function returns TRUE if the list contains the given element.

sm2 <- list("R", 2, "Rstatistics", "Blog")
has_element(sm2, 2)

[1] TRUE

detect() – Returns the first element that passes the test or logical condition. The function returns the element itself. Below we are looking for elements that are numeric in the given list. Although the list contains two numeric elements, the function only returns the first one, i.e. 2.

sm3 <- list("R", 2, "Rstatistics", "Blog", 3)
detect(sm3, is.numeric)

[1] 2

detect_index() – Just like detect(), this function checks for elements which pass the test, but it returns the index of the first matching element (or 0 if no element matches) instead of the element itself.

sm4 <- list(2, "Rstatistics", "Blog", TRUE)
detect_index(sm4, is.logical)

[1] 4
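Both detect() and detect_index() search from the start of the list by default. As a small sketch, the .dir argument (available in recent purrr versions) reverses the search so that the last match is returned; the variable names are illustrative.

```r
library(purrr)

sm3 <- list("R", 2, "Rstatistics", "Blog", 3)

# Search from the end of the list instead of the start
last_numeric <- detect(sm3, is.numeric, .dir = "backward")
last_numeric_index <- detect_index(sm3, is.numeric, .dir = "backward")

# detect_index() returns 0 when nothing matches
no_match_index <- detect_index(sm3, is.logical)

last_numeric
last_numeric_index
```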

Reshaping Lists

Flattening and transposing a list are two tasks that you will find yourself doing regularly as part of data wrangling. If you have made it this far in the tutorial, you know that purrr functions return lists very often, so flattening is something you will need frequently. These tasks can be achieved using the following functions.

flatten() – The function removes one level of hierarchy from a list of lists. The closest equivalent in base R is the unlist() function. Although the two are similar, flatten() removes only a single layer of hierarchy and is type-stable, which means that you always know the output type. There are typed variants which ensure that you get the desired output type, as listed below:

flatten_lgl() returns a logical vector
flatten_int() returns an integer vector
flatten_dbl() returns a double vector
flatten_chr() returns a character vector
flatten_dfr() returns a data frame created by row-binding
flatten_dfc() returns a data frame created by column-binding

Let’s look at the output generated by flatten() and its typed variants. First, let us create a list of numbers. Because sample() draws at random, your values will differ from the ones shown here.

x <- rerun(2, sample(6))
x

[[1]]
[1] 2 5 3 6 4 1

[[2]]
[1] 3 1 6 4 2 5

So our list consists of 2 integer vectors, each containing the numbers 1 to 6 in random order. We will now flatten the list using the flatten_int() function.

flatten_int(x)

[1] 2 5 3 6 4 1 3 1 6 4 2 5
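The difference between removing one level and removing all levels is easiest to see on a deeper list. The sketch below assumes purrr 1.0.0 or later, where list_flatten() and list_c() are the newer counterparts of the flatten() family; the list named nested is purely illustrative.

```r
library(purrr)

nested <- list(1, list(2, list(3, 4)))

# list_flatten() removes exactly one level; the inner list(3, 4) survives
one_level <- list_flatten(nested)

# unlist() recursively collapses every level into an atomic vector
all_levels <- unlist(nested)

# list_c() concatenates the elements of a list into a single vector,
# much like flatten_int() does for integer vectors
combined <- list_c(list(1:3, 4:6))

str(one_level)
all_levels
combined
```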

All the functions mentioned have very straightforward and simple syntax. We believe the above example is good enough; however, in case you still face some issue, feel free to drop a comment, and we will assist you with the implementation.

transpose() – The function converts a pair of lists into a list of pairs. Let us look at an example, and I am sure it will make much sense when you compare the before and after outputs.

x <- rerun(2, x = runif(1), y = runif(3))
x


[[1]]
[[1]]$x
[1] 0.956008

[[1]]$y
[1] 0.4784622 0.7901005 0.7429528


[[2]]
[[2]]$x
[1] 0.8055662

[[2]]$y
[1] 0.3681470 0.9886638 0.7591404


x %>% transpose() %>% str()

List of 2
 $ x:List of 2
  ..$ : num 0.956
  ..$ : num 0.806
 $ y:List of 2
  ..$ : num [1:3] 0.478 0.79 0.743
  ..$ : num [1:3] 0.368 0.989 0.759

Join or Combine Lists

You can join two lists in different ways: you can append one list after the other, or you can insert it at the beginning of the other list. Let us see how we can achieve these tasks given two lists.

append() – This base R function appends one list at the end of another. Here we are appending list b to list a. So, let’s first create two lists named a and b. Then we append and finally flatten the result using the flatten_dbl() function.

a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)

flatten_dbl(append(a, b))

[1] 22 11 44 55 11 99 77

prepend() – Using this purrr function, we can place a list before another list. Note that prepend() is deprecated as of purrr 1.0.0. The following example code illustrates how it works.

a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)

flatten_dbl(prepend(a, b))

[1] 11 99 77 22 11 44 55
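The same ordering can be obtained with base R alone, which may be useful if prepend() is not available in your purrr version. This is a minimal sketch; the names b_then_a and b_then_a_alt are illustrative.

```r
a <- list(22, 11, 44, 55)
b <- list(11, 99, 77)

# c() joins lists in the order given, so c(b, a) puts b first
b_then_a <- c(b, a)

# append() with after = 0 inserts b before the first element of a
b_then_a_alt <- append(a, b, after = 0)

unlist(b_then_a)
```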

Other useful functions

In this section, we will cover functions that do not necessarily fall into the above categories. But we believe knowing these functions will improve your programming skills tremendously.

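cross_df() – The function returns a data frame where each row is one combination of the list elements, i.e. the Cartesian product of the inputs. Note that cross_df() is deprecated as of purrr 1.0.0 in favour of tidyr::expand_grid().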

df <- list( empId = c(100, 101, 102, 103),
            name = c("John", "Jack", "Jill", "Cathy"),
            exp = c(4, 10, 6, 8))

df

$empId
[1] 100 101 102 103

$name
[1] "John"  "Jack"  "Jill"  "Cathy"

$exp
[1]  4 10  6  8

Here we have three vectors of four elements each stored in a list. We can now use the cross_df() function to get a data frame of all 4 x 4 x 4 = 64 combinations.

cross_df(df)

# A tibble: 64 x 3
   empId name    exp
   <dbl> <chr> <dbl>
 1   100 John      4
 2   101 John      4
 3   102 John      4
 4   103 John      4
 5   100 Jack      4
 6   101 Jack      4
 7   102 Jack      4
 8   103 Jack      4
 9   100 Jill      4
10   101 Jill      4
# ... with 54 more rows
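To see why the result has 64 rows rather than 4, it helps to compare a row-wise data frame with the full set of combinations. The sketch below uses only base R: as.data.frame() treats the vectors as columns, while expand.grid() builds every combination, which is comparable to what cross_df() produces. The variable names are illustrative.

```r
df <- list(empId = c(100, 101, 102, 103),
           name = c("John", "Jack", "Jill", "Cathy"),
           exp = c(4, 10, 6, 8))

# One row per employee: the vectors are treated as columns
employee_table <- as.data.frame(df)

# Every combination of the three vectors (4 * 4 * 4 = 64 rows)
all_combinations <- expand.grid(df, stringsAsFactors = FALSE)

nrow(employee_table)
nrow(all_combinations)
```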

rerun() – You can use rerun() to evaluate an expression n times and collect the results in a list. It is similar to base R’s replicate() with simplify = FALSE. The rerun() function is very useful when it comes to generating sample data in R, although it is deprecated as of purrr 1.0.0.

rerun(1, print("Hello, World!"))
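Two ways to get the same repeat-and-collect behaviour without rerun() are sketched below: map() over an index, and replicate() from base R. The seed and the variable names are illustrative choices.

```r
library(purrr)

set.seed(42)

# map() over 1:3 and ignore the index: three independent draws
samples_map <- map(1:3, ~ runif(2))

# Base R: replicate() with simplify = FALSE also returns a list
samples_rep <- replicate(3, runif(2), simplify = FALSE)

str(samples_map)
```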

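reduce() – The reduce() function combines the elements of a list or vector into a single value by repeatedly applying a two-argument function: first to the first two elements, then to that result and the third element, and so on. For example, say I want to add all the numbers of a vector. Notice that we wrap the + operator in backticks instead of quotation marks here.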

reduce(c(4,12,30, 16), `+`)

[1] 62

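Let’s look at another example. Say I want to concatenate the elements of each vector inside a list position by position. To achieve this, we can use the paste() function as shown below. Because paste() is vectorised, the first elements are pasted together, and so are the second elements.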

x <- list(c(0, 1), c(2, 3), c(4, 5))
reduce(x, paste)

[1] "0 2 4" "1 3 5"

The function also has a variant named reduce2(). If the reducing function needs an additional argument that changes at each step, you can supply it as a second vector or list, which must be one element shorter than the first.
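A minimal sketch of reduce2(): the second vector supplies a different separator for each step. The helper paste_sep() is a hypothetical function written for this illustration.

```r
library(purrr)

# Hypothetical helper: join the running result and the next value
# with the separator supplied for this step
paste_sep <- function(acc, nxt, sep) paste(acc, nxt, sep = sep)

# Four values need three separators (one fewer than the values)
joined <- reduce2(c("a", "b", "c", "d"), c("-", ".", "-"), paste_sep)

joined
```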

accumulate() – The function sequentially applies a function to a vector or list. It works just like reduce(), but also returns intermediate results. At each iteration, the function takes two arguments. One is the initial value or the result from the previous step, and the second is the next value in the vector. For further understanding, let’s take a look at the below example, which returns the cumulative sum of values in a vector.

accumulate(c(1,2,3,4,5), sum)

[1]  1  3  6 10 15

The function can be implemented on two different lists through the use of accumulate2().
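The following sketch shows two variations: accumulate() with a different operator and an initial value, and accumulate2() with a per-step separator. The helper paste_sep() is hypothetical, and unlist() is applied because the container type returned by accumulate2() may differ between purrr versions.

```r
library(purrr)

# Running product instead of a running sum
running_product <- accumulate(c(1, 2, 3, 4, 5), `*`)

# .init seeds the accumulation, so the result has one extra element
running_sum_from_10 <- accumulate(c(1, 2, 3), `+`, .init = 10)

# Hypothetical helper: join two strings with a given separator
paste_sep <- function(acc, nxt, sep) paste(acc, nxt, sep = sep)

# accumulate2() keeps every intermediate string
steps <- unlist(accumulate2(c("a", "b", "c"), c("-", "."), paste_sep))

running_product
running_sum_from_10
steps
```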

Bonus – Creating Nested Data Frames

A nested data frame stores multiple tables within the rows of a larger table. You can create nested data for tables in which the data fall into groups. For example, the well-known iris dataset contains data about three different species of flowers. Here we will convert iris into a nested data frame. The following are the steps you need to follow to convert any data (with groups) into a nested data frame.

Group the data using the dplyr::group_by() function

iris_grouped <- iris %>% 
  group_by(Species)

# A tibble: 150 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length
          <dbl>       <dbl>        <dbl>
 1          5.1         3.5          1.4
 2          4.9         3            1.4
 3          4.7         3.2          1.3
 4          4.6         3.1          1.5
 5          5           3.6          1.4
 6          5.4         3.9          1.7
 7          4.6         3.4          1.4
 8          5           3.4          1.5
 9          4.4         2.9          1.4
10          4.9         3.1          1.5
# ... with 140 more rows, and 2 more variables:
#   Petal.Width <dbl>, Species <fct>

Use the nest() function from the tidyr package on the grouped data to create a nested data frame where each row holds the subset of data for one group.

nested_iris <- iris_grouped %>%
  nest()

# A tibble: 3 x 2
# Groups:   Species [3]
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 x 4]>
2 versicolor <tibble [50 x 4]>
3 virginica  <tibble [50 x 4]>

Now that the table for each species is saved in its own row as a tibble, you can call any function on them using the map() function.
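As a minimal sketch of that idea, the code below applies two functions to every per-species tibble and stores the results as ordinary columns. It assumes dplyr, tidyr, and purrr are installed; the column names n_rows and mean_sepal_length are illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

nested_iris <- iris %>%
  group_by(Species) %>%
  nest()

# Apply a function to each per-species tibble stored in the data column
iris_summary <- nested_iris %>%
  mutate(
    n_rows = map_int(data, nrow),
    mean_sepal_length = map_dbl(data, ~ mean(.x$Sepal.Length))
  )

iris_summary
```

Using map_int() and map_dbl() instead of map() keeps the new columns as plain vectors rather than list columns.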

Practice Question

Develop a linear regression model that predicts the mileage of a car for each cylinder type. Once you have the linear regression models, save the intercept of each in a column named intercept.

Solution

First, we create the groups and then get the nested data frame.

mtcars_by_cyl <- mtcars %>% 
  group_by(cyl)

nested_mtcars <- mtcars_by_cyl %>%
  nest()

# Defining the lm function
lm_fun <- function(data)
 lm(mpg ~ ., data = data) 

# Using mutate and map to build the models and save the result
lm_mtcars <- nested_mtcars %>%
 mutate(model = map(data, lm_fun))

Let’s see what’s inside the model column in lm_mtcars object.

lm_mtcars[[3]]

[[1]]

Call:
lm(formula = mpg ~ ., data = data)

Coefficients:
(Intercept)         disp           hp  
   32.78649      0.07456     -0.04252  
       drat           wt         qsec  
    1.52367      5.12418     -2.33333  
         vs           am         gear  
   -1.75289           NA           NA  
       carb  
         NA  


[[2]]

Call:
lm(formula = mpg ~ ., data = data)

Coefficients:
(Intercept)         disp           hp  
   60.85893     -0.34522     -0.03325  
       drat           wt         qsec  
   -4.19300      4.48273     -0.10639  
         vs           am         gear  
   -3.64277     -6.32631      4.06653  
       carb  
    3.22483  


[[3]]

Call:
lm(formula = mpg ~ ., data = data)

Coefficients:
(Intercept)         disp           hp  
    6.25438     -0.02342      0.15195  
       drat           wt         qsec  
   -5.74240     -0.72632      1.35856  
         vs           am         gear  
         NA      4.87476           NA  
       carb  
   -4.77330

You will notice three different models are created and stored as a list inside the column named model. We will write a function to extract the intercept and save that information in the column called intercept.

# Function to extract the intercept (the first coefficient)
beta_extract_fun <- function(mod)
 coefficients(mod)[[1]]

# Extracting intercept values for each model 
lm_mtcars %>% transmute(data,
 intercept = map_dbl(model, beta_extract_fun))

# A tibble: 3 x 3
# Groups:   cyl [3]
    cyl data               intercept
  <dbl> <list>                 <dbl>
1     6 <tibble [7 x 10]>      32.8 
2     4 <tibble [11 x 10]>     60.9 
3     8 <tibble [14 x 10]>      6.25
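The same pattern also works without a nested data frame. The sketch below is an alternative, not the solution above: it splits mtcars with base R's split() and, as a simplifying assumption, fits mpg against wt only, so the intercepts differ from the ones shown in the table.

```r
library(purrr)

# split() returns a named list with one data frame per cylinder count
by_cyl <- split(mtcars, mtcars$cyl)

# Fit one model per group (simplified formula: mpg explained by wt only)
models <- map(by_cyl, function(d) lm(mpg ~ wt, data = d))

# Pull the intercept out of every model as a named numeric vector
intercepts <- map_dbl(models, function(m) coef(m)[["(Intercept)"]])

intercepts
```

The names of the result ("4", "6", "8") come from split(), so each intercept stays labelled with its cylinder group.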

In this article on the purrr package in R, we learned some very useful functions which will help you write better code with a focus on R programming’s functional aspect. I hope you find this tutorial helpful, and that going forward you will be able to decide when to fall back on functions from the purrr package.
