Writing better R functions part two – April 10, 2018

[This article was first published on Chuck Powell, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last post I started to build two functions that took pairs of variables from a dataset and produced some nice useful ggplot plots from them. We started with the simplest case, plotting counts of how two variables cross-tabulate. Then we worked our way up to being able to automate the process of plotting lots of pairings of variables from the same dataframe. Today we’ll improve our functions and add a feature. For now, we’re going to continue to make use of the built in dataframe known as mtcars. We’re doing that to make sure that whatever we do in this post today, it works on a known start point so we can compare and contrast.

I’m going to output all the plots in a smaller size for the benefit of you the readers. I’m doing that via RMarkdown and it won’t happen automatically for you if you download and use the code. I’ll be using , fig.width=4.5, fig.height=2

Background and catch-up

Some quick setup.

library(dplyr)
library(ggplot2)
theme_set(theme_bw()) # set theme to my personal preference

We started with a simple task in the console. Take two of the mtcars variables, in this case am and cyl, and conduct a cross tabulation and then plot it.

### with dplyr and ggplot manually
mtcars %>%
  filter(!is.na(am), !is.na(cyl))  %>%
  group_by(am,cyl) %>%
  count() %>%
  ggplot(aes(fill=am, y=n, x=cyl)) +
    geom_bar(position="dodge", stat="identity")

By the end of the last post we had accomplished the following tasks:

  1. Our function (called PlotMe) was a lot more robust. It checked for basic user entry errors, added a useful title and generally did what we wanted.
  2. Along the way we learned the “tricks” of working with dplyr and ggplot2 inside of functions.

Here’s the PlotMe function as we left it.

PlotMe <- function(dataframe,x,y){
# error checking
  if (!require(ggplot2)) {
    stop("Can't continue can't load ggplot2")
  }
  theme_set(theme_bw())
  if (!require(dplyr)) {
    stop("Can't continue can't load dplyr")
  }
  dfname <- enquo(dataframe)
  aaa <- enquo(x)
  bbb <- enquo(y)
  if (length(match.call()) <= 3) {
    stop("Not enough arguments passed... requires a dataframe, plus two variables")
  }
  if (!exists(deparse(substitute(dataframe)))) {
     stop("The first item in your list does not exist")
  }
  if (!is(dataframe, "data.frame")) {
    stop("The first name you passed does not appear to be a data frame")
  }
  if (!deparse(substitute(x)) %in% names(dataframe)) {
    stop(paste0("'", deparse(substitute(x)), "' is not the name of a variable in '",deparse(substitute(dataframe)),"'"))
  }
  if (!deparse(substitute(y)) %in% names(dataframe)) {
    stop(paste0("'", deparse(substitute(y)), "' is not the name of a variable in '",deparse(substitute(dataframe)),"'"), call. = FALSE)
  }
  missing <- apply(is.na(dataframe[,c(deparse(substitute(x)),deparse(substitute(y)))]), 1, any)
  if (any(missing)) {
    warning(paste(sum(missing)), " row(s) not plotted because of missing data")
  }
# end of error checking
  dataframe %>%
    filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
    mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>%
    group_by(!! aaa,!! bbb) %>%
    count() %>%
    ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
      geom_bar(position="dodge", stat="identity") +
      ggtitle(bquote("Crosstabs"*.(dfname)*.(aaa)~"by"*.(bbb))) +
      ylab("Count") ->p
  plot(p)
}
PlotMe(mtcars,am,cyl)

Making our function better

Our next step is to allow the user to choose which of three plot types they’d like. We have been focusing on simple side by side grouped bars that are obtained with geom_bar(position="dodge", stat="identity")

But we can also “stack” the bars with by changing to geom_bar(stat="identity")

Or focus on percentages of the category totals with geom_bar(stat="identity", position="fill")

The key is that our entire function is the same, except for one part of one line in the function the geom_bar(). The last thing we want to do is write three different functions, or replicate too much code within one function. Let’s say, for the sake of argument, we’d like to simply have the user specify whether they want type “side”, “stack” or “percent”. (We could choose any names we like but these will do for now). Let’s say, that we’d like to call this variable within our function plottype. We’d make side the default, but allow the user to override that and choose either stack or percent instead. So, at the beginning of the function we’ll add…

PlotMeX <- function(dataframe, x, y, plottype = "side"){

Depending on their choice we’ll set the geom_bar() details. So it’s a kind of a switch hint, hint. Inside of the switch we’ll use the fact that we can pass ggplot list items. Each choice "side", “stack”, or “percent” causes us to set two things:

  1. the parameters to geom_bar and,
  2. the Y axis label ylab (since in two cases it’s a count and in one it’s a percent).
PlotMeX <- function(dataframe, x, y, plottype = "side"){
  switch(plottype,
       side =  list(geom_bar(position="dodge", stat="identity"),
                    ylab("Count")) -> whichbar,
       stack = list(geom_bar(stat="identity"),
                    ylab("Count")) -> whichbar,
       percent = list(geom_bar(stat="identity", position="fill"),
                    ylab("Percent")) -> whichbar
    )

Notice that based upon the value going into the switch we are assigning the proper information to an object called whichbar. Later in our function we’ll insert that into our ggplot flow…

    ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
      whichbar +
      ggtitle(bquote("Crosstabs"~.(dfname)*.(aaa)~"by"*.(bbb))) ->p

Here’s our new version of the function. Temporarily named PlotMeX, and with the error checking removed just for brevity, but it’s what we want. Let’s test it against am by vs just for a variable change of pace.

PlotMeX <- function(dataframe, x, y, plottype = "side"){
  switch(plottype,
       side =  list(geom_bar(position="dodge", stat="identity"),
                    ylab("Count")) -> whichbar,
       stack = list(geom_bar(stat="identity"),
                    ylab("Count")) -> whichbar,
       percent = list(geom_bar(stat="identity", position="fill"),
                    ylab("Percent")) -> whichbar
    )
  
  aaa <- enquo(x)
  bbb <- enquo(y)
  dfname <- enquo(dataframe)
  dataframe %>%
    filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
    mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>%
    group_by(!! aaa,!! bbb) %>%
    count() %>%
    ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
      whichbar +
      ggtitle(bquote("Crosstabs"*.(dfname)*.(aaa)~"by"*.(bbb))) ->p
  plot(p)
}
PlotMeX(mtcars,am,vs)

PlotMeX(mtcars,am,vs, "stack")

PlotMeX(mtcars,am,vs, "percent")

Looks good!

Automating even more

A few posts back we also recognized that if we need to work on multiple pairs of variables in the same dataframe (not uncommon in my field). It would be very tedious to keep running the PlotMeX function over and over again. So we developed a block of code that does something important for us. We feed it the column numbers of the variables we want to pair, accounting for which side of the plot (dependent or independent) we want them on, and the code produces two lists that we can systematically feed to our PlotMeX function and produce lots of plots.

There are any number of ways we could pass the columns, by quoted name, by bare name or as I have done with simple column numbers. I honestly prefer just looking up the numbers or a range of numbers. Far less typing for me and the ability to incorporate lots of variables with the colon : shortcut, e.g. 3:333.

So the process would be to choose some variables for one side of the by, in this case let’s say gear and carb those are columns 10 & 11 so we can write c(10:11) on the other side cyl & am written as c(2,9).

# Build two vectors
xwhich <- c(10:11) # let's put gear and carb in here
ywhich <- c(2,9) # let's put cyl and am in here

Then we create two empty lists that we will insert the variable names into and a counter variable (not strictly necessary but I’ll actually use it in a minute). We could of course rewrite our function to accept column numbers instead of variable names right from the outset. But I’m choosey. When I work in small batches I want to be able to use names, it’s only when I have a lot of names I prefer to operate by number. It is essential to initialize the lists, and if I were ever to intend to have large lists it would be preferable to specifiy a length in advance for performance reasons. Something like indvars <- vector("list", length(xwhich)*length(ywhich)) but for now I’m going to keep it simple.

indvars<-list() # create empty list to add to
depvars<-list() # create empty list to add to
totalcombos <- 1 # keep track of where we are

Now we take xwhich and ywhich and run them through some nested loops and see if we get the four possible pairings we are expecting.

# loop through the vectors and build our lists
for (j in seq_along(xwhich)) {
  for (k in seq_along(ywhich)) {
    depvars[[totalcombos]] <- as.name(colnames(mtcars[xwhich[[j]]]))
    indvars[[totalcombos]] <- as.name(colnames(mtcars[ywhich[[k]]]))
    cat("iteration #", totalcombos, 
        " xwhich is ", xwhich[[j]], " so depvars = ", as.name(colnames(mtcars[xwhich[[j]]])),
        " ywhich is ", ywhich[[k]], " so indvars = ", as.name(colnames(mtcars[ywhich[[k]]])),
        "\n", sep = "")
    totalcombos <- totalcombos +1
  }
}

## iteration #1 xwhich is 10 so depvars = gear ywhich is 2 so indvars = cyl
## iteration #2 xwhich is 10 so depvars = gear ywhich is 9 so indvars = am
## iteration #3 xwhich is 11 so depvars = carb ywhich is 2 so indvars = cyl
## iteration #4 xwhich is 11 so depvars = carb ywhich is 9 so indvars = am

Yes that looks like what we want. It’s a bit trivial for 4 plots but imagine the effort savings if you had 5 variables on each side. Keeping track of 25 possibilities by hand is annoying. So the simplest case of our function is PlotMeX(mtcars, am, cyl). We’ll feed our function and our two lists to mapply and let it do the tedious work for us. N.B. If you don’t wrap it with invisible mapply will actually return two copes of the plots. More on that in a bit.

#PlotMeX(mtcars,am,cyl)
invisible(mapply(PlotMeX, x=indvars, y=depvars, MoreArgs = list(dataframe=mtcars), SIMPLIFY = FALSE, USE.NAMES = FALSE))

Notice that MoreArgs is a list. We are passing it the name of the mtcars dataset but we can also pass it other parameters such as our plottype preference.

invisible(mapply(PlotMeX, x=indvars, y=depvars, MoreArgs = list(dataframe=mtcars, plottype = "percent"), SIMPLIFY = FALSE, USE.NAMES = FALSE))

Now we can turn this working code into a function so we can reuse it. To do that we’ll have the user specify the dataframe, xwhich and ywhich as parameters to the function. Then inside the function we’ll do the work and return the user the two lists that can be used to “feed” mapply. I’m personally a fan of providing some feedback in the console to inform the user of what’s going on, even though the actual output product of the function is just a list with two sublists nested inside. On the other hand we’ll return the list “invisibly” since we don’t really need to print out the list structure.

CrossXYs <- function(dataframe, xwhich, ywhich){
  # Build two vectors
indvars<-list() # create empty list to add to
depvars<-list() # create empty list to add to
totalcombos <- 1 # keep track of where we are
message("Creating the variable pairings...")
for (j in seq_along(xwhich)) {
  for (k in seq_along(ywhich)) {
    depvars[[totalcombos]] <- as.name(colnames(dataframe[xwhich[[j]]]))
    indvars[[totalcombos]] <- as.name(colnames(dataframe[ywhich[[k]]]))
    cat("Pairing #", totalcombos, " ", as.name(colnames(dataframe[xwhich[[j]]])),
        " with ", as.name(colnames(dataframe[ywhich[[k]]])), "\n", sep = "")
    totalcombos <- totalcombos +1
  }
}
return(invisible(list(depvars=depvars,indvars=indvars)))
}
CrossXYs(mtcars,c(10:11),c(2,9))

## Creating the variable pairings...

## Pairing #1 gear with cyl
## Pairing #2 gear with am
## Pairing #3 carb with cyl
## Pairing #4 carb with am

This code produces the two lists with the column names varying in the way we want them. So the next step it to put them somewhere (we’ll call that bothlists)

bothlists <- CrossXYs(mtcars,c(10:11),c(2,9))

## Creating the variable pairings...

## Pairing #1 gear with cyl
## Pairing #2 gear with am
## Pairing #3 carb with cyl
## Pairing #4 carb with am

Then we can use them as part of our mapply call by addressing them as bothlists$indvars & bothlists$depvars, otherwise our command remains the same.

invisible(mapply(PlotMeX, x=bothlists$indvars, y=bothlists$depvars, MoreArgs = list(dataframe=mtcars, plottype = "stack"), SIMPLIFY = FALSE, USE.NAMES = FALSE))

Appears to be working as we want it to. Which means it’s a good time to test it on some other data and see if anything breaks. Let’s try the esoph dataset… First we’ll call PlotMeX with alcgp by tobgp. Then we’ll make CrossXYs give us lists for agegp by alcgp & tobgp. That should reassure us that all is well.

str(esoph)

## 'data.frame':    88 obs. of  5 variables:
##  $ agegp    : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ alcgp    : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ tobgp    : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ ncases   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ncontrols: num  40 10 6 5 27 7 4 7 2 1 ...

PlotMeX(esoph,alcgp,tobgp)

bothlists <- CrossXYs(esoph,1,c(2:3))

## Creating the variable pairings...

## Pairing #1 agegp with alcgp
## Pairing #2 agegp with tobgp

invisible(mapply(PlotMeX, x=bothlists$indvars, y=bothlists$depvars, MoreArgs = list(dataframe=esoph, plottype = "percent"), SIMPLIFY = FALSE, USE.NAMES = FALSE))

Very nice! We seem to have accomplished what we set out to do.

All done (not yet!)

This has become a very long post so I’m going to end here. But the astute among you have probably detected we’re still being inefficient. Be nice if we could integrate these two functions a little better so we didn’t have to think of it as a two step process. Great idea for a next post on a future day.

I hope you’ve found this useful. I am always open to comments, corrections and suggestions.

Chuck (ibecav at gmail dot com)

License

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

To leave a comment for the author, please follow the link and comment on their blog: Chuck Powell.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)