Writing better R functions part one – April 6, 2018

[This article was first published on Chuck Powell, and kindly contributed to R-bloggers].

One of the nicest things about working with R is that with very little effort you can customize and automate activities to produce the output you want, just the way you want it. Contrast that with more monolithic packages that may allow a bit of scripting but where, for the most part, the price of a GUI or of bundling everything into one package is that you lose the ability to have things just your way. Since nearly everything in R is a function already, you may as well invest a little time and energy in making functions your way, to exactly your tastes and needs. This post is not meant to be an exhaustive or complete treatment of writing a function. For that you probably want a book, or at least a chapter like the one Hadley Wickham has in Advanced R. This post will focus on a single, very practical, and hopefully useful example.

In my last three posts I have been writing about automating activities in R. You can review everything that happened in the first post, as well as the second, or the third (which I strongly recommend), or you can start on this page. There is no need to save the same dataset or go through the process of building that dataset if you don’t want to. For our purposes in this post we’re going to make use of the built-in dataframe known as mtcars. We’re doing that to make sure that whatever we do in this post works from a known starting point, so we can compare and contrast. One of the most painful learning experiences you can have in R is to discover that you have written something so specific it won’t generalize to other data or other situations.

Background and catch-up

In our earlier postings we dealt with our desire to have some automated tools (functions) that took pairs of variables from a dataset and produced some nice useful ggplot plots from them. We started with the simplest case like plotting counts of how two variables cross-tabulate and then worked our way up to being able to automate the process of plotting lots of pairings of variables from the same dataframe. Today we’ll improve our functions even more and add some features.

First some basic setup.

knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
theme_set(theme_bw()) # set theme to my personal preference

Where we left off

At the end of the last post we had accomplished two important feats:

  1. We had a function called PlotMe that took the name of a dataframe and two variables, cross-tabulated their counts, and created a nice plot for us.
  2. We had some lines of code (not yet a function) that took the name of a dataframe and the numbers of the columns we were interested in, and created two lists that we could feed to mapply so that we could make lots of plots with little additional effort.

Along the way we learned the “tricks” of working with dplyr and ggplot2 inside of functions ^[1]^. So using the mtcars dataset as our example data we started here with something that works in the console:

### with dplyr and ggplot manually
mtcars %>%
  filter(!is.na(am), !is.na(cyl))  %>%
  group_by(am,cyl) %>%
  count() %>%
  ggplot(aes(fill=am, y=n, x=cyl)) +
    geom_bar(position="dodge", stat="identity")

then turned it into a function after we learned about NSE:

PlotMe <- function(dataframe,x,y){
   aaa <- enquo(x)
   bbb <- enquo(y)
   dataframe %>%
      filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
      group_by(!! aaa,!! bbb) %>%
      count() %>%
      ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
         geom_bar(position="dodge", stat="identity") ->p
   plot(p)
}
PlotMe(mtcars,am,cyl)

Note that with dplyr, if we don’t filter out the NAs, we will see them plotted, which may or may not be what you want substantively!
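
To see what that means in practice, here is a small sketch. It assumes a hypothetical copy of mtcars, called cars_na here, with one cyl value set to NA. Without the filter step, count() keeps NA as a group of its own, and that group would be drawn as an extra bar.

```r
library(dplyr)

# cars_na is a hypothetical copy of mtcars with one missing value inserted
cars_na <- mtcars
cars_na[1, "cyl"] <- NA

# no filter() step here, so the NA survives as its own group
na_counts <- cars_na %>%
  group_by(am, cyl) %>%
  count()

na_counts
```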

From this point forward I’m going to print the plots in a smaller size. I’m doing that via RMarkdown and it won’t happen automatically for you if you download and use the code.

We also wrote some code that allows us to be more efficient if we want to print multiple pairings in the same data set. The cat statement is unnecessary; I simply inserted it so you can see how the loops provide what we need. We’ll most likely remove it in the final version of the function.

# Build two vectors
xwhich <- c(2,10:11) # let's put cyl, gear, and carb in here
ywhich <- c(8:9) # let's put vs and am in here
indvars<-list() # create empty list to add to
depvars<-list() # create empty list to add to
totalcombos <- 1 # keep track of where we are
# loop through the vectors and build our lists
for (j in seq_along(xwhich)) {
  for (k in seq_along(ywhich)) {
    depvars[[totalcombos]] <- as.name(colnames(mtcars[xwhich[[j]]]))
    indvars[[totalcombos]] <- as.name(colnames(mtcars[ywhich[[k]]]))
    cat("iteration #", totalcombos, 
        " xwhich=", xwhich[[j]], " depvars = ", as.name(colnames(mtcars[xwhich[[j]]])),
        " ywhich=", ywhich[[k]], " indvars = ", as.name(colnames(mtcars[ywhich[[k]]])),
        "\n", sep = "")
    totalcombos <- totalcombos +1
  }
}

## iteration #1 xwhich=2 depvars = cyl ywhich=8 indvars = vs
## iteration #2 xwhich=2 depvars = cyl ywhich=9 indvars = am
## iteration #3 xwhich=10 depvars = gear ywhich=8 indvars = vs
## iteration #4 xwhich=10 depvars = gear ywhich=9 indvars = am
## iteration #5 xwhich=11 depvars = carb ywhich=8 indvars = vs
## iteration #6 xwhich=11 depvars = carb ywhich=9 indvars = am

This code produces two lists with the column names varying in the way we want them. We can then pass them to our PlotMe function to get our 6 plots back as desired, with mapply(PlotMe, x=indvars, y=depvars, MoreArgs = list(dataframe=mtcars)).
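
As an aside, the same two lists can be built without nested loops. The sketch below is an alternative to the loop code, not what the earlier posts used. It relies on base R’s expand.grid, and the names combos, depvars2 and indvars2 are made up for illustration.

```r
xwhich <- c(2, 10:11) # cyl, gear, and carb
ywhich <- 8:9         # vs and am

# one row per pairing; the first column varies fastest,
# which matches the order produced by the nested loops
combos <- expand.grid(y = ywhich, x = xwhich)

# turn the column numbers into lists of names (symbols)
depvars2 <- lapply(colnames(mtcars)[combos$x], as.name)
indvars2 <- lapply(colnames(mtcars)[combos$y], as.name)

# print the pairings so they can be compared with the loop output
for (i in seq_along(depvars2)) {
  cat("pair #", i, " depvars = ", as.character(depvars2[[i]]),
      " indvars = ", as.character(indvars2[[i]]), "\n", sep = "")
}
```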

Making our function better

Other things we’d like to accomplish:

  1. Do a better job of labeling the plot properly.
  2. Add some basic error checking and simple fixes.
  3. Let the user choose from different options for which graph type communicates their points about the data.
  4. Convert the second block of code into a proper function.

Let’s start with the first item. As a minimum we can add a title with ggtitle("Crosstabulation of mtcars variables") and a y-axis label with ylab("Count").

PlotMe <- function(dataframe,x,y){
   aaa <- enquo(x)
   bbb <- enquo(y)
   dataframe %>%
      filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
      group_by(!! aaa,!! bbb) %>%
      count() %>%
      ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
         geom_bar(position="dodge", stat="identity") +
         ggtitle("Crosstabulation of mtcars variables") +
         ylab("Count") ->p
   plot(p)
}
PlotMe(mtcars,am,cyl)

Totally uninspired but serviceable. Better yet is to use bquote and the .() notation to make the title more pertinent and portable. Notice that we had to create a new object called dfname to hold the name of the dataframe, and that the name is quoted. Inside our function, dataframe refers to the whole dataset mtcars: all the rows, the columns, and the data itself. dfname, on the other hand, is just a way for us to print out the word mtcars without having to hard-code it. No matter what dataframe we pass in to the function, the right name gets printed. NB: a friendly reminder that if you try to cheat and pass x or y directly to bquote, it will fail miserably; the enquo is essential.

PlotMe <- function(dataframe,x,y){
   aaa <- enquo(x)
   bbb <- enquo(y)
   dfname <- enquo(dataframe)
   dataframe %>%
      filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
      group_by(!! aaa,!! bbb) %>%
      count() %>%
      ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
         geom_bar(position="dodge", stat="identity") +
         ggtitle(bquote("Crosstabs"~.(dfname)*.(aaa)~"by"*.(bbb))) +
         ylab("Count") ->p
   plot(p)
}
PlotMe(mtcars,am,cyl)
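
The distinction between the data and its name is easy to see in a stripped-down function. This sketch uses base R’s deparse(substitute()) rather than enquo, purely to illustrate the idea. ShowName is a hypothetical helper, not part of PlotMe.

```r
ShowName <- function(dataframe) {
  # the name the caller typed, as a character string
  dfname <- deparse(substitute(dataframe))
  # dataframe itself is still the full data, so nrow() works on it
  paste0("'", dfname, "' has ", nrow(dataframe), " rows")
}

ShowName(mtcars)
ShowName(ToothGrowth)
```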

Okay, enough for now. Maybe later we’ll do something about am and cyl as labels. If the mtcars dataframe used better column names we wouldn’t have this problem, but we’re good enough for now. We have a much “bigger” problem in my mind. Because am is stored as a number, our ggplot is sort of ugly. Remember we built it for a different data set where the variables of interest were already factors, not numbers. am is a factor (whether or not the car has an automatic transmission) posing as a number, but ggplot doesn’t know that and tries to helpfully give us a display suitable for a number, not a factor.

That’s the fun of building and testing a function: looking for all the ways you can go wrong. So let’s fix this, because that funny shaded bar for am is driving me crazy.
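
A quick look in base R shows why ggplot treats am as a continuous quantity, and what the conversion we are after would give. The name am_factor is used only for illustration.

```r
# how am is stored in mtcars
class(mtcars$am)

# the distinct values it takes
sort(unique(mtcars$am))

# after conversion it is a factor with one level per distinct value
am_factor <- factor(mtcars$am)
class(am_factor)
levels(am_factor)
```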

Now if we were doing this with dplyr outside of a function it would be simple. What we want is just…

mtcars %>%
  filter(!is.na(am), !is.na(cyl))  %>%
  mutate(am = factor(am), cyl = factor(cyl)) %>%
  group_by(am,cyl) %>%
  count()

## # A tibble: 6 x 3
## # Groups:   am, cyl [6]
##   am    cyl       n
##   <fct> <fct> <int>
## 1 0     4         3
## 2 0     6         4
## 3 0     8        12
## 4 1     4         8
## 5 1     6         3
## 6 1     8         2

The problem is we’re inside a function and, just as with our filter and group_by commands, we need to make it clear to mutate exactly what objects we’re talking about. We know that !!aaa is what we have used, so factor(!!aaa) makes sense on the right-hand side of any mutate because we are trying to make a factor of the variable aaa in the dataframe. The left-hand side is in no way intuitive to me, but the right answer is !!quo_name(aaa). To make matters even more complex (and I’m quoting the help pages here), you can’t use the equals sign:

Unfortunately R is very strict about the kind of expressions supported on the LHS of =. This is why we have made the more flexible := operator an alias of =. You can use it to supply names, e.g. a := b is equivalent to a = b. Since its syntax is more flexible you can unquote on the LHS:

So instead of = we’ll use :=. Most importantly, as you’ll see from the code below, it works and produces the tibble we need to drive ggplot to produce the output we’d like…

PlotMe <- function(dataframe,x,y){
  aaa <- enquo(x)
  bbb <- enquo(y)
  dfname <- enquo(dataframe)
  dataframe %>%
    filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
    mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>%
    group_by(!! aaa,!! bbb) %>%
    count() %>%
    ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
      geom_bar(position="dodge", stat="identity") +
      ggtitle(bquote("Crosstabs"~.(dfname)*.(aaa)~"by"*.(bbb))) +
      ylab("Count") ->p
  plot(p)
}
PlotMe(mtcars,am,cyl)
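
Stripped of the plotting, the mutate step can be tried on its own. This is a minimal sketch that assumes dplyr is installed. MakeFactor is a hypothetical helper name used only here.

```r
library(dplyr)

MakeFactor <- function(dataframe, x) {
  aaa <- enquo(x)
  # quo_name(aaa) supplies the column name as a string on the left,
  # !!aaa refers to the column itself on the right
  dataframe %>%
    mutate(!!quo_name(aaa) := factor(!!aaa))
}

converted <- MakeFactor(mtcars, cyl)
class(converted$cyl)
levels(converted$cyl)
```

Because the name on the left matches an existing column, the column is overwritten in place and no new column is added.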

Okay, we’ve made improvements to our labeling. We’ve also fixed a minor problem caused by our assumption that the variables would always be factors (because that’s what you cross-tabulate): we now force them to factors.

Seems a good time to add some error checking inside the function to make sure it works across a variety of situations.

Everyone makes mistakes

The first few functions I wrote I didn’t worry about error-checking. After all I was the only user and I’d be fine. Little did I know that I would forget the next time I used a function months later. Or that a simple typo would drive me to distraction because the error message R threw would offer me no understanding. So these days unless it is a very simple function, I add some error checking early on. The sorts of things I check for are in this list:

  1. If your function relies on certain libraries, test for them with a require statement.
  2. Good chance to set some defaults you like, such as theme_set(theme_bw()).
  3. Did the user pass you the right number of arguments?
  4. Is the first argument a valid dataframe?
  5. If, like me, you’re asking for a dataframe and some columns in it, are the variables present in the dataframe?
  6. What if anything will you do about missing values?

Typically I have statements I simply cut and paste from one function to the next as needed (feel free to borrow anything of mine you see you like). Some easy examples I’ll pass along to you in this next version of the function. I also find it useful to try the function on different datasets just to make sure I’m not building something that only works on mtcars so I’ve added a plot for ToothGrowth even if it is a bit contrived.

PlotMe <- function(dataframe,x,y){
# error checking
  if (!require(ggplot2)) {
    stop("Can't continue can't load ggplot2")
  }
  theme_set(theme_bw())
  if (!require(dplyr)) {
    stop("Can't continue can't load dplyr")
  }
  dfname <- enquo(dataframe)
  if (length(match.call()) <= 3) {
    stop("Not enough arguments passed... requires a dataframe, plus two variables")
  }
  if (!exists(deparse(substitute(dataframe)))) {
     stop("The first item in your list does not exist")
  }
  if (!is(dataframe, "data.frame")) {
    stop("The first name you passed does not appear to be a data frame")
  }
  if (!deparse(substitute(x)) %in% names(dataframe)) {
    stop(paste0("'", deparse(substitute(x)), "' is not the name of a variable in '",deparse(substitute(dataframe)),"'"))
  }
  if (!deparse(substitute(y)) %in% names(dataframe)) {
    stop(paste0("'", deparse(substitute(y)), "' is not the name of a variable in '",deparse(substitute(dataframe)),"'"), call. = FALSE)
  }
  missing <- apply(is.na(dataframe[,c(deparse(substitute(x)),deparse(substitute(y)))]), 1, any)
  if (any(missing)) {
    warning(paste(sum(missing)), " row(s) not plotted because of missing data")
  }
  aaa <- enquo(x)
  bbb <- enquo(y)
  dataframe %>%
    filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
    mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>%
    group_by(!! aaa,!! bbb) %>%
    count() %>%
    ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
      geom_bar(position="dodge", stat="identity") +
      ggtitle(bquote("Crosstabs"~.(dfname)*.(aaa)~"by"*.(bbb))) +
      ylab("Count") ->p
  plot(p)
}
PlotMe(mtcars,am,cyl)

PlotMe(ToothGrowth,supp,dose)
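
The argument-count check may look off by one at first. The object returned by match.call() includes the function name itself, so a call with all three arguments has length 4, and anything of length 3 or less is missing something. A small sketch, with CountCall as a hypothetical stand-in for PlotMe:

```r
CountCall <- function(dataframe, x, y) {
  # capture the call as the user wrote it; the arguments are not evaluated
  the_call <- match.call()
  length(the_call)
}

CountCall(mtcars)           # function name + 1 argument
CountCall(mtcars, am)       # function name + 2 arguments
CountCall(mtcars, am, cyl)  # function name + 3 arguments
```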

It may seem silly to have more lines of error-checking than code but trust me it’s worth it in the long haul. I like to test and see what happens for at least some common or likely mistakes that I or another user might make. Part of my planning is to try and return the most helpful or useful error or warning message I can. Try them if you like…

PlotMe(mtcars) # too few parameters
# Error in PlotMe(mtcars) : 
#   Not enough arguments passed... requires a dataframe, plus two variables
PlotMe(MtCaRs,am,cyl) # dataframe doesn't exist
# Error in PlotMe(MtCaRs, am, cyl) : 
#  The first item in your list does not exist
PlotMe(PlotMe,am,Cyl) # it exists but it's not a data frame
# Error in PlotMe(PlotMe, am, Cyl) : 
#  The first name you passed does not appear to be a data frame
PlotMe(mtcars,AM,cyl) # one variable doesn't exist
# Error in PlotMe(mtcars, AM, cyl) : 
#  'AM' is not the name of a variable in 'mtcars'
PlotMe(mtcars,am,Cyl) # the other doesn't exist
# Error: 'Cyl' is not the name of a variable in 'mtcars'
## Create a copy of mtcars 
MtCaR <- mtcars
# insert a missing value
MtCaR[1,2] <- NA
PlotMe(MtCaR,am,cyl) # warn about missings
#     Warning message:
#     In PlotMe(MtCaR, am, cyl) : 1 row(s) not plotted because of missing data
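
The row count in that warning comes from checking each row of the two columns for an NA. The sketch below repeats that calculation outside the function on a similar copy of mtcars, and compares it with base R’s complete.cases, which is an alternative shown for comparison and not what PlotMe uses. The names cars_copy, vars, row_has_na, n_missing and n_missing_alt are made up for the example.

```r
cars_copy <- mtcars
cars_copy[1, 2] <- NA  # column 2 is cyl
vars <- c("am", "cyl")

# same approach as inside PlotMe: TRUE for any row with an NA in either column
row_has_na <- apply(is.na(cars_copy[, vars]), 1, any)
n_missing <- sum(row_has_na)

# alternative: complete.cases() is TRUE for rows with no NA at all
n_missing_alt <- sum(!complete.cases(cars_copy[, vars]))

n_missing
n_missing_alt
```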

One of the reasons the function is much longer is that, for my own sanity, I like to make the error checking as explicit, sequential, and thorough as I can. No nested if statements for me, thanks. First I test whether the user passed me a valid R object as the first parameter, and then I test whether it is actually a dataframe.
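
The ordering matters because each check assumes the previous one passed. A compact sketch of that sequence follows. CheckInput is a hypothetical helper that takes the names as character strings to keep the example short, which is a simplification compared with PlotMe.

```r
CheckInput <- function(dfname, varname) {
  # 1. does an object with that name exist at all?
  if (!exists(dfname)) {
    stop("'", dfname, "' does not exist", call. = FALSE)
  }
  dataframe <- get(dfname)
  # 2. only now is it safe to ask whether it is a data frame
  if (!is.data.frame(dataframe)) {
    stop("'", dfname, "' does not appear to be a data frame", call. = FALSE)
  }
  # 3. and only then can we look for the column
  if (!varname %in% names(dataframe)) {
    stop("'", varname, "' is not the name of a variable in '", dfname, "'",
         call. = FALSE)
  }
  invisible(TRUE)
}

# tryCatch lets us collect each message without stopping the script
tryCatch(CheckInput("MtCaRs", "am"), error = function(e) conditionMessage(e))
tryCatch(CheckInput("mean", "am"), error = function(e) conditionMessage(e))
tryCatch(CheckInput("mtcars", "AM"), error = function(e) conditionMessage(e))
```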

All done (not yet!)

This has become a very long post so I’m going to end here. Next post I’ll address letting the user choose which type of plot they’d like, as well as turning our other bunch of code into a proper function.

I hope you’ve found this useful. I am always open to comments, corrections and suggestions.

Chuck (ibecav at gmail dot com)

License

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

  1. Now, I happen to love using dplyr. It is so elegant, and the syntax, plus piping, is just a joy to work with. But the downside is that it was originally designed to be used interactively at the command prompt. It makes heavy use of non-standard evaluation (NSE), which makes it tricky to program functions with. Not impossible, but tricky. Hadley Wickham has written about it extensively and Stack Overflow is full of questions about it, so I’m not sure I’m the person to explain it. But I can show a practical example of how to use it. And if you’re like me, that is sometimes very helpful.
