Site icon R-bloggers

Automating exploratory plots with ggplot2 and purrr

[This article was first published on Very statisticious on Very statisticious, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When you have a lot of variables and need to make a lot exploratory plots it’s usually worthwhile to automate the process in R instead of manually copying and pasting code for every plot. However, the coding approach needed to automate plots can look pretty daunting to a beginner R user. It can look so daunting, in fact, that it can appear easier to manually make the plots (like in Excel) rather than using R at all.

Unfortunately making plots manually can backfire. The efficiency of using a software program you already know is quickly out-weighed by being unable to easily reproduce the plots when needed. I know I invariably have to re-make even exploratory plots, and it’d be a bummer if I had to remake them all manually rather than re-running some code.

So while I often assure students working under time constraints that it is perfectly OK to use software they already know rather than spending the time to learn how to do something in R, making many plots is a special case. To get them started I will provide students who need to automate plotting in R some example code (with explanation).

This post is based on an example I was working on recently, which involves plotting bivariate relationships between many continuous variables.

Table of Contents

Load R packages

I’ll be plotting with ggplot2 and looping with purrr. I’ll also be using package cowplot later to combine individual plots into one, but will use the package functions via cowplot:: instead of loading the package.

library(ggplot2) # v. 3.2.0
library(purrr) # v. 0.3.2

The set-up

Today I’m going to make an example dataset with 3 response (y) variables and 4 explanatory (x) variables for plotting. (The real dataset had 9 response and 9 explanatory variables.)

set.seed(16)
dat = data.frame(elev = round( runif(20, 100, 500), 1),
                 resp = round( runif(20, 0, 10), 1),
                 grad = round( runif(20, 0, 1), 2),
                 slp = round( runif(20, 0, 35),1),
                 lat = runif(20, 44.5, 45),
                 long = runif(20, 122.5, 123.1),
                 nt = rpois(20, lambda = 25) )
head(dat)
#    elev resp grad  slp      lat     long nt
# 1 373.2  9.7 0.05  8.8 44.54626 122.8547 18
# 2 197.6  8.1 0.42 33.3 44.79495 122.5471 26
# 3 280.0  5.4 0.38 19.3 44.99027 122.9645 18
# 4 191.8  4.3 0.07 29.6 44.95022 122.7290 19
# 5 445.4  2.3 0.43 16.5 44.79784 122.9836 15
# 6 224.5  6.5 0.78  4.1 44.96576 122.9836 21

The goal is to make scatterplots for every response variable vs every explanatory variable. I’ve deemed the first three variables in the dataset to be the response variables (elev, resp, grad).

The plan is to loop through the variables and make the desired plots. I’m going to use vectors of the variable names for this, one vector for the response variables and one for the explanatory variables.

If all of your response or explanatory variables share some unique pattern in the variable names there are some clever ways to pull out the names with some of the select helper functions in dplyr::select(). Alas, my variable names are all unique. My options are to either write the vectors out manually or pull the names out by index. I’ll do the latter since the different types of variables are grouped together.

response = names(dat)[1:3]
expl = names(dat)[4:7]

When I know I’m going to be looping through character vectors I like to use named vectors. This helps me keep track of things in the output.

The set_names() function in purrr is super handy for naming character vectors, since it can use the values of the vector as names (i.e., the vector will be named by itself). (I don’t recommend trying this with lists of data.frames like I have in the past, though, since it turns out that naming a data.frame with a data.frame isn’t so useful. ????)

response = set_names(response)
response
#   elev   resp   grad 
# "elev" "resp" "grad"
expl = set_names(expl)
expl
#    slp    lat   long     nt 
#  "slp"  "lat" "long"   "nt"

Create a plotting function

Since I’m going to make a bunch of plots that will all have the same basic form, I will make a plotting function. I am going to make a function where only the x and y variables can vary (so are arguments to the function).

Since I’m making a function to plot variables from a single dataset I’m going to hard-code the dataset into the function. If you have multiple datasets or you are making a function for use across projects you’ll probably want to add the dataset as a function argument.

My functions inputs are based on the variable names, so I need to pass strings into the ggplot2 functions. Strings cannot be used directly in aes(), but can be used with the .data pronoun.

I’m making pretty basic graphs since these are exploratory plots, not publication-ready plots. I will make a scatterplot and add locally weighted regression (loess) lines via geom_smooth(). I use such lines with great caution, as it can be easy to get too attached any pattern the loess line shows.

scatter_fun = function(x, y) {
     ggplot(dat, aes(x = .data[[x]], y = .data[[y]]) ) +
          geom_point() +
          geom_smooth(method = "loess", se = FALSE, color = "grey74") +
          theme_bw() +
          labs(x = x,
               y = y)
}

If using older versions of ggplot2 (and/or rlang), use the now deprecated aes_string() for working with strings.

scatter_fun = function(x, y) {
     ggplot(dat, aes_string(x = x, y = y ) ) +
          geom_point() +
          geom_smooth(method = "loess", se = FALSE, color = "grey74") +
          theme_bw() 
}

Here’s an example of the function output, passing in x and y as strings.

scatter_fun("lat", "elev")

Looping through one vector of variables

One way to make all the plots I want is to loop through each explanatory variable for a fixed response variable. With this approach I would need a separate loop for each response variable.

I will use map() from package purrr for the looping.

I pass each explanatory variable to the first argument in scatter_fun() and I fix the second argument to "elev". I use the formula coding in map() and so refer to the element of the explanatory vector via .x within scatter_fun().

elev_plots = map(expl, ~scatter_fun(.x, "elev") )

The output is a list of 4 plots (since there are 4 explanatory variables). You’ll notice that each element of the list has the variable name associated with it. This is why I used set_names() earlier, since this is convenient for printing the plots and, you’ll see later, is convenient when saving the plots in files with understandable names.

elev_plots
# $slp

# 
# $lat

# 
# $long

# 
# $nt

Looping through both vectors

For only a few response variables we could easily copy and paste the code above, changing the hard-coded response variable each time. This process can get burdensome if there are a lot of response variables, though.

Another option is to loop through both vectors of variables and make all the plots at once. Because we want a plot for each combination of variables, this is a job for a nested loop. This means one map() loop will be nested inside another. I will refer to the first map() loop as the outer loop and the second one as the inner loop.

I’m going to have the response variables in the outer loop and the explanatory variables in the inner loop. That way I can graph all of the explanatory variables for each response variable before moving on to the next response variable. This puts the output, a nested list, in a logical order.

A nested loop involves more complicated code, of course.. For example, it took some effort for me to wrap my head around how to refer to the list element from the outer loop within the inner loop when using the map() formula coding. I found the answers/comments to this question on Stack Overflow to be helpful. Note that one approach is to avoid the formula coding all together and use anonymous functions for either or both the inner and outer loops.

Since my scatterplot function is so simple I ended up using formula coding for the outer loop and the function as is in the inner loop. The inner list elements are fed to the first argument of scatter_fun() by default, which works out great since the first argument is the x variable and the inner loop loops through the explanatory variables. The .x then refers to the outer list elements (the response variable names), and is passed to the y argument of the function in the inner loop.

all_plots = map(response,
                ~map(expl, scatter_fun, y = .x) )

The output is a list of lists. Each sublist contains all the plots for a single response variable. Because I set the names for both vectors of variable names, the inner and outer lists both have names. These names can be used to pull out individual plots.

For example, if I want to see all the plots for the grad response variable I can print that sublist by name. (I’m going to display only two of four grad plots here to save space.)

all_plots$grad[1:2]
# $slp

# 
# $lat

If I want to print a single plot, I can first extract one of the sublists using an outer list name and then extract the individual plot via an inner list name.

all_plots$grad$long

I find the names convenient, but you can also extract plots via position. Here’s the same graph, the third element of the third list.

all_plots[[3]][[3]]

Saving the plots

Once all the graphs are made we can look at them in R by printing the list or parts of the list as above. But if you want to peruse them at your leisure later or send them to a collaborator you’ll want to save them outside of R.

This next section is dedicated to exploring some of the ways you can do this.

Saving all plots to one PDF

If you want to save every plot as a separate page in a PDF, you can do so with the pdf() function. The code below shows an example of how this works. First, a graphics device to save the plots into is created and given a name via pdf(). Then all the plots are put into that device. Finally, the device is turned off with dev.off(). The last step is important, as you can’t open the file until the device is turned off.

This is a pretty coarse way to save everything, but it allows you to easily page through all the plots. I’ve used this method when I had many exploratory plots for a single response variable that I wanted to share with collaborators.

In this example code I save the file, which I name all_scatterplots.pdf, into the working directory.

pdf("all_scatterplots.pdf")
all_plots
dev.off()

Saving groups of plots together

Another option is to save each group of plots in a separate document. This might make sense in a case like this where there are a set of plots for each response variable and we might want a separate file for each set.

To save each sublist separately we’ll need to loop through all_plots and save the plots for each response variable into a separate file. The list names can be used in the file names to keep the output organized.

The functions in purrr that start with i are special functions that loop through a list and the names of that list simultaneously. This is useful here where we want to use the list names to identify the output files while we save them.

The walk() function is part of the map family, to be used when you want a function for its side effect instead of for a return value. Saving plots is a classic example of when we want walk() instead of map().

Combining the i and the walk gives us the iwalk() function. In the formula interface, .x refers to the list elements and .y refers to the names of the list. You can see I create the plot file names using the list name combined with “scatterplots.pdf”, using _ as the separator.

The code below makes three files, one for each response variable, with four plots each. The files are named “elev_scatterplots.pdf”, “resp_scatterplots.pdf”, and “grad_scatterplots.pdf”.

iwalk(all_plots, ~{
     pdf(paste0(.y, "_scatterplots.pdf") )
     print(.x)
     dev.off()
})

Saving all plots separately

All plots can be saved separately instead of combined in a single document. This might be necessary if you want to insert the plots into some larger document later.

We’ll want to use the names of both the outer and inner lists to appropriately identify each plot we save. I decided to do this by looping through the all_plots list and the names of the list via imap() to make the file names in a separate step. This time I’m going to save these as PNG files so use .png at the end of the file name.

The result is a list of lists, so I flatten this into a single list via flatten(). If I were to use flatten() earlier in the process I’d lose the names of the outer list. This process of combining names prior to flattening should be simplified once the proposed flatten_names() function is added to purrr.

plotnames = imap(all_plots, ~paste0(.y, "_", names(.x), ".png")) %>%
     flatten()
plotnames
# [[1]]
# [1] "elev_slp.png"
# 
# [[2]]
# [1] "elev_lat.png"
# 
# [[3]]
# [1] "elev_long.png"
# 
# [[4]]
# [1] "elev_nt.png"
# 
# [[5]]
# [1] "resp_slp.png"
# 
# [[6]]
# [1] "resp_lat.png"
# 
# [[7]]
# [1] "resp_long.png"
# 
# [[8]]
# [1] "resp_nt.png"
# 
# [[9]]
# [1] "grad_slp.png"
# 
# [[10]]
# [1] "grad_lat.png"
# 
# [[11]]
# [1] "grad_long.png"
# 
# [[12]]
# [1] "grad_nt.png"

Once the file names are created I can loop through all the file names and plots simultaneously with walk2() and save things via ggsave(). The height and width of each output file can be set as needed in ggsave().

You can see I flattened the nested list of plots into a single list to use in walk2().

walk2(plotnames, flatten(all_plots), ~ggsave(filename = .x, plot = .y, 
                                             height = 7, width = 7))

Combining plots

Another way to get a set of plots together is to combine them into one plot. How useful this is will depend on how many plots you have per set. This option is a lot like faceting, except we didn’t reshape our dataset to allow the use faceting.

I like the cowplot function plot_grid() for combining multiple plots into one. A list of plots can be passed via the plotlist argument.

Here’s what that looks like for the first response variable, elev.

cowplot::plot_grid(plotlist = all_plots[[1]])

We can use a loop to combine the plots for each response variable sublist. The result could then be saved using any of the approaches shown above. If you have many subplots per combined plot you likely will want to save the plots at a larger size so the individual plots can be clearly seen.

response_plots = map(all_plots, ~cowplot::plot_grid(plotlist = .x))
response_plots
# $elev

# 
# $resp

# 
# $grad

Just the code, please

Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.

library(ggplot2) # v. 3.2.0
library(purrr) # v. 0.3.2

set.seed(16)
dat = data.frame(elev = round( runif(20, 100, 500), 1),
                 resp = round( runif(20, 0, 10), 1),
                 grad = round( runif(20, 0, 1), 2),
                 slp = round( runif(20, 0, 35),1),
                 lat = runif(20, 44.5, 45),
                 long = runif(20, 122.5, 123.1),
                 nt = rpois(20, lambda = 25) )
head(dat)

response = names(dat)[1:3]
expl = names(dat)[4:7]

response = set_names(response)
response

expl = set_names(expl)
expl

scatter_fun = function(x, y) {
     ggplot(dat, aes(x = .data[[x]], y = .data[[y]]) ) +
          geom_point() +
          geom_smooth(method = "loess", se = FALSE, color = "grey74") +
          theme_bw() +
          labs(x = x,
               y = y)
}

scatter_fun = function(x, y) {
     ggplot(dat, aes_string(x = x, y = y ) ) +
          geom_point() +
          geom_smooth(method = "loess", se = FALSE, color = "grey74") +
          theme_bw() 
}

scatter_fun("lat", "elev")

elev_plots = map(expl, ~scatter_fun(.x, "elev") )
elev_plots

all_plots = map(response,
                ~map(expl, scatter_fun, y = .x) )

all_plots$grad[1:2]
all_plots$grad$long
all_plots[[3]][[3]]

pdf("all_scatterplots.pdf")
all_plots
dev.off()

iwalk(all_plots, ~{
     pdf(paste0(.y, "_scatterplots.pdf") )
     print(.x)
     dev.off()
})

plotnames = imap(all_plots, ~paste0(.y, "_", names(.x), ".png")) %>%
     flatten()
plotnames

walk2(plotnames, flatten(all_plots), ~ggsave(filename = .x, plot = .y, 
                                             height = 7, width = 7))

cowplot::plot_grid(plotlist = all_plots[[1]])

response_plots = map(all_plots, ~cowplot::plot_grid(plotlist = .x))
response_plots

To leave a comment for the author, please follow the link and comment on their blog: Very statisticious on Very statisticious.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.