Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In my last four posts I have been working at automating a process, that
I am likely to repeat many times, by turning it into a proper R
function
. In my last post
I overcame some real performance problems, combined two sub-functions
into one and generally had a workable piece of code. In the final post
in this series today I’ll accomplish two more important tasks. I’ll once
again refactor the code to streamline it, and I’ll give the user a lot
more flexibility on how they input their request.
I’m going to output all the plots in a smaller size for the benefit of you the readers. I’m doing that via RMarkdown and it won’t happen automatically for you if you download and use the code. I’ll initially be using, fig.width=4.5, fig.height=2 and later slighly larger than that
Background and catch-up
We originally started with a simple task – using dplyr
and ggplot2
in the console. Take two of the mtcars
variables, in this case am
and cyl
, and conduct a cross tabulation and then plot it. Since it’s
the sort of thing I’m likely to do often seemed like a good candidate to
write a function for. Then I decided that I might very well want the
function to allow me to input more than two variables at a time. It
would be very tedious to execute the command 25 times if I had 5
dependent variables and 5 independent variables and needed to fully
cross them.
So the code grew and you can see below I address those two use cases in the code. I have very deliberately left comments in the code so that you trace the process. Roughly speaking if you follow the code here’s what’s happening.
- Error checking. Do we have the right packages? The right number of arguments passed in? Does the first object even exist and is it a data frame?
- Which type of plot does the user want?
- Some
if
logic that says in effect if both the parameters passed in as bare names are columns (variables) in the dataframe, make the plot and leave the function orreturn
. - If the user has given us two column numbers, or vectors e.g.
c(8:15)
of column numbers as input then run them through somefor-loops
to completely cross them and print the plots. - If you make it that far then exit the function and
return
the list of arguments we were passed in to begin with.
PlotMeX <- function(dataframe, xwhich, ywhich, plottype = "side"){ # error checking if (!require(ggplot2)) { stop("Can't continue can't load ggplot2") } theme_set(theme_bw()) if (!require(dplyr)) { stop("Can't continue can't load dplyr") } if (length(match.call()) <= 3) { stop("Not enough arguments passed... requires a dataframe, plus at least two variables") } argList <- as.list(match.call()[-1]) if (!exists(deparse(substitute(dataframe)))) { stop("The first object in your list does not exist. It should be a dataframe") } if (!is(dataframe, "data.frame")) { stop("The first name you passed does not appear to be a data frame") } switch(plottype, side = list(geom_bar(position="dodge", stat="identity"), ylab("Count")) -> whichbar, stack = list(geom_bar(stat="identity"), ylab("Count")) -> whichbar, percent = list(geom_bar(stat="identity", position="fill"), ylab("Percent")) -> whichbar ) # If both variables are found in the dataframe immediately print the plot if (deparse(substitute(xwhich)) %in% names(dataframe) & deparse(substitute(ywhich)) %in% names(dataframe)) { aaa <- enquo(xwhich) bbb <- enquo(ywhich) xname <- deparse(substitute(xwhich)) yname <- deparse(substitute(ywhich)) dfname <- deparse(substitute(dataframe)) dataframe %>% filter(!is.na(!! aaa), !is.na(!! bbb)) %>% mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>% group_by(!! aaa,!! bbb) %>% count() -> tempdf tempdf %>% ggplot(aes_(fill=aaa, y=~n, x=bbb)) + whichbar + ggtitle(bquote("Crosstabs dataset: "*.(dfname)*" variables "*.(xname)~"by "*.(yname))) -> p return(p) } # If the user has given us integers indicating the column numbers rather than bare variable names # we need to build a list of what is to be plotted and then do the plotting # Build two lists indvars<-list() # create empty list to add to depvars<-list() # create empty list to add to totalcombos <- 1 # keep track of where we are message("Creating the variable pairings...") for (j in seq_along(xwhich)) { for (k in seq_along(ywhich)) { depvarsbare <- as.name(colnames(dataframe[xwhich[[j]]])) indvarsbare <- as.name(colnames(dataframe[ywhich[[k]]])) cat("Pairing #", totalcombos, " ", as.name(colnames(dataframe[xwhich[[j]]])), " with ", as.name(colnames(dataframe[ywhich[[k]]])), "\n", sep = "") aaa <- enquo(depvarsbare) bbb <- enquo(indvarsbare) xname <- deparse(substitute(depvarsbare)) yname <- deparse(substitute(indvarsbare)) dfname <- deparse(substitute(dataframe)) dataframe %>% filter(!is.na(!! aaa), !is.na(!! bbb)) %>% mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>% group_by(!! aaa,!! bbb) %>% count() -> tempdf tempdf %>% ggplot(aes_(fill=aaa, y=~n, x=bbb)) + whichbar + ggtitle(bquote("Crosstabs dataset: "*.(dfname)*" variables "*.(xname)~"by "*.(yname))) -> p print(p) totalcombos <- totalcombos +1 } } return(argList) } PlotMeX(mtcars, am, cyl)
# exact same thing using column numbers rather than names PlotMeX(mtcars, 9, 2) ## Creating the variable pairings... ## Pairing #1 am with cyl
## $dataframe ## mtcars ## ## $xwhich ## [1] 9 ## ## $ywhich ## [1] 2 # a more complex example and using a different plot type PlotMeX(mtcars, c(9, 2), c(8,10), "percent") ## Creating the variable pairings... ## Pairing #1 am with vs
## Pairing #2 am with gear
## Pairing #3 cyl with vs
## Pairing #4 cyl with gear
## $dataframe ## mtcars ## ## $xwhich ## c(9, 2) ## ## $ywhich ## c(8, 10) ## ## $plottype ## [1] "percent"
If you’d like a detailed review of the sub components it’s covered in
earlier posts. In particular, if you’re not familiar with the
“tricks” of working with dplyr
and ggplot2
inside of
functions you may want to look at this
post if you’re not familiar
with enquo
or the !!
notation.
As you can see from the output, it works fine as far as we can tell on
the mtcars
data set for the two cases we have identified so far. It’s
not pretty and there is clearly some repetition but hey it is good
enough for my needs at this point.
Except it isn’t…
Complexity and simplicity and user choices
Turns out I needed more. In my line of work it’s not uncommon to have a
small number of dependent variables and a much larger number of
independent variables that you want to examine. Although it’s a very
contrived example let’s keep using the mtcars
dataset (I promise I’ll
use something more fun in a bit). Let’s say that I’m interested in
plotting the relationship between cyl
(the number of cylinders) and
all of vs
, am
, gear
& carb
. That will be a total of four plots.
My current function would allow me to call PlotMeX(mtcars, 2, c(8:11),
"percent")
which is all well and good. But, I know my working
preferences and I know at some point I will think PlotMeX(mtcars, cyl,
c(8:11), "percent")
. So lets see what happens.
PlotMeX(mtcars, cyl, c(8:11), "percent") # Quitting from lines 32-116 (betterfunctions4git.Rmd) # Error in PlotMeX(mtcars, cyl, c(8:11), "percent") : # object 'cyl' not found # Calls: <Anonymous> ... withCallingHandlers -> withVisible -> eval -> eval -> PlotMeX # Execution halted
Ugh! Not only does it fail but it fails ugly. What the heck does it
mean object cyl not found
? Everyone knows that cyl is a column in
mtcars!
One approach is simply make it clear to the user that she can choose either one way of calling the function (with bare variables) or the other by using integers to proxy for column numbers. But they can’t mix and match like that. That’s a very classic response to this issue. But I actually see it as an opportunity to make the function more flexible and useful.
That means we have four major possibilities not two. So we’re going to have to figure out which of those possible paths we’re on. That’s what I set out to do.
Refactoring again
It didn’t take me long to realize that as I moved from two possibilities to four possibilities if I wasn’t careful I’d wind up repeating myself (by cutting and pasting the same code) over and over again. In particular the central piece of code in the function, the one that does the real work is likely to be the same not just in two cases or four cases but literally in almost any case. That is the very snippet of code we developed several posts ago.
dataframe %>% filter(!is.na(!! aaa), !is.na(!! bbb)) %>% mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>% group_by(!! aaa,!! bbb) %>% count() -> tempdf tempdf %>% ggplot(aes_(fill=aaa, y=~n, x=bbb)) + whichbar + ggtitle(bquote("Crosstabs dataset: "*.(dfname)*" variables "*.(xname)~"by "*.(yname))) -> p
It takes our dataframe (dataframe
) and two variables (aaa
and bbb
)
and does magic to produce the plot (p
). tempdf
is just a temporary
dataframe (well tibble
actually) to hold our cross-tabulation.
dfname
, xname
, and yname
are really just character versions of the
names that we used to make nice labels.
Now, I already told you as I rewrote the code on my first pass I simply
did just cut and paste this chunk a few times. It’s not hard at all in
RStudio
. Could have stayed that way. But here’s the real rub. A week
from now when someone asks me to change the Plot title or make some
other change I’ll have to do the same thing at least four times. That’s
annoying. And prone to error.
In my mind I always new the smart thing to do was to make this a function within the main function and just call it when needed. And that what I did. Here’s what that looks like.
PlotMagic <- function(dataframe,aaa,bbb,whichbar,dfname,xname,yname) { dataframe %>% filter(!is.na(!! aaa), !is.na(!! bbb)) %>% mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>% group_by(!! aaa,!! bbb) %>% count() -> tempdf tempdf %>% ggplot(aes_(fill=aaa, y=~n, x=bbb)) + whichbar + ggtitle(bquote("Crosstabs dataset = "*.(dfname)*" and variables = "*.(xname)~"by "*.(yname))) -> p print(p) }
Now no matter what path I’m on I know what information I have to gather and just pass it along to do the work.
So before I do the big reveal and lay out all the code let’s address some other issues as well.
Back in this post we
introduced a switch
to allow the user to choose a plot type.
switch(plottype, side = list(geom_bar(position="dodge", stat="identity"), ylab("Count")) -> whichbar, stack = list(geom_bar(stat="identity"), ylab("Count")) -> whichbar, percent = list(geom_bar(stat="identity", position="fill"), ylab("Percent")) -> whichbar )
It works reliably, but there is a strong possibility that along the way users will make a typo or some other mistake. So let’s add a fail-safe default to the switch. We do that by adding a final unnamed entry, it could be anything, but in our case we’ll set it so that if all else fails the user will get a side by side plot.
switch(plottype, side = list(geom_bar(position="dodge", stat="identity"), ylab("Count")) -> whichbar, stack = list(geom_bar(stat="identity"), ylab("Count")) -> whichbar, percent = list(geom_bar(stat="identity", position="fill"), ylab("Percent")) -> whichbar, list(geom_bar(position="dodge", stat="identity"), ylab("Count")) -> whichbar )
And as long as we are rewriting we may as well be defensive in our list
building. As I mentioned last time for performance reasons it is a best
practice to specify list lengths in advance rather than constantly
copying the list and appending a new entry. We have several entries
scattered around that look like indvars<-list() # create empty list to
add to
that we’re going to change to the form indvars <-
vector("list", length = length(ywhich))
. While it’s unlikely I’ll
personally ever feed this function a very long vector better safe than
sorry.
The logical tree
On to the main event. We need some logic that generally follows this pattern.
Take the user’s input and parse it into one of four known possibilities and then take appropriate action. The possibilities are:
- If both are
bare
variables and found in the dataframe immediately print the plot - At least one of the variables is
bare
and found in the dataframe (variable x) and the other is one or more column numbers (variable y) - At least one of the variables is
bare
and found in the dataframe (variable y) and the other is one or more column numbers (variable x) - Both the variables were passed to us as numbers. Could be one or more numbers for either variable.
We already have logic in our current version for #1 above and #4 happens by default.
# If both variables are found in the dataframe immediately print the plot if (deparse(substitute(xwhich)) %in% names(dataframe) & deparse(substitute(ywhich)) %in% names(dataframe)) # If the user has given us integers indicating the column numbers rather than bare variable names
My first plan was simply to write a series of if statements
to lay out
these conditions. I like to avoid nested if statements whenever possible
since personally when I go back to look at them later I usually have a
hard time following what becomes relatively complex logic.
I couldn’t manage it. I’d be thrilled is someone comes along and shows me how I should have, but personally I had to resort to some amount of nesting.
Tip:
Rstudio
has a great feature that allows you to fold the code while you’re editing it. Small triangles over by the line numbers. They were immensely helpful! Try them.
I’m not going to try and describe the if
structure in detail. Instead
I am going to show you the code and let you examine it. For your
convenience I have tried to place as many comments within the code as
possible to help you parse the logic yourself. I acknowledge in advance
if I really tried I’m sure there is other duplication or inefficiency
here I could eliminate, but it’s good enough for now. Take a look at the
code and we’ll see how it runs in a minute.
PlotMeX <- function(dataframe, xwhich, ywhich, plottype = "side"){ # error checking if (!require(ggplot2)) { stop("Can't continue can't load ggplot2") } theme_set(theme_bw()) if (!require(dplyr)) { stop("Can't continue can't load dplyr") } if (length(match.call()) <= 3) { stop("Not enough arguments passed... requires a dataframe, plus at least two variables") } argList <- as.list(match.call()[-1]) if (!exists(deparse(substitute(dataframe)))) { stop("The first object in your list does not exist. It should be a dataframe") } if (!is(dataframe, "data.frame")) { stop("The first name you passed does not appear to be a data frame") } # process plottype logic -- default is side anything mispelled or not listed is also side switch(plottype, side = list(geom_bar(position="dodge", stat="identity"), ylab("Count")) -> whichbar, stack = list(geom_bar(stat="identity"), ylab("Count")) -> whichbar, percent = list(geom_bar(stat="identity", position="fill"), ylab("Percent")) -> whichbar, list(geom_bar(position="dodge", stat="identity"), ylab("Count")) -> whichbar ) PlotMagic <- function(dataframe,aaa,bbb,whichbar,dfname,xname,yname) { dataframe %>% filter(!is.na(!! aaa), !is.na(!! bbb)) %>% mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>% group_by(!! aaa,!! bbb) %>% count() -> tempdf tempdf %>% ggplot(aes_(fill=aaa, y=~n, x=bbb)) + whichbar + ggtitle(bquote("Crosstabs dataset = "*.(dfname)*" and variables = "*.(xname)~"by "*.(yname))) -> p print(p) } # If both are bare variables and found in the dataframe immediately print the plot if (deparse(substitute(xwhich)) %in% names(dataframe) & deparse(substitute(ywhich)) %in% names(dataframe)) { # both are names in the dataframe aaa <- enquo(xwhich) bbb <- enquo(ywhich) xname <- deparse(substitute(xwhich)) yname <- deparse(substitute(ywhich)) dfname <- deparse(substitute(dataframe)) PlotMagic(dataframe,aaa,bbb,whichbar,dfname,xname,yname) return(message(paste("Plotted dataset", argList$dataframe, "variables", argList$xwhich, "by", argList$ywhich))) } else { # is at least one in the dataframe? # Is at least one of them a bare variable in the dataframe if (deparse(substitute(xwhich)) %in% names(dataframe)) { # xwhich is in the dataframe aaa <- enquo(xwhich) if (class(try(eval(ywhich))) %in% c("integer","numeric")) { # ywhich is column numbers indvars <- vector("list", length = length(ywhich)) totalcombos <- 1 # keep track of where we are xname <- deparse(substitute(xwhich)) dfname <- deparse(substitute(dataframe)) message("Creating the variable pairings from dataframe ", dfname) for (k in seq_along(ywhich)) { #for loop indvarsbare <- as.name(colnames(dataframe[ywhich[[k]]])) cat("Plot #", totalcombos, " ", xname, " with ", as.name(colnames(dataframe[ywhich[[k]]])), "\n", sep = "") bbb <- enquo(indvarsbare) yname <- deparse(substitute(indvarsbare)) PlotMagic(dataframe,aaa,bbb,whichbar,dfname,xname,yname) totalcombos <- totalcombos +1 } # end of for loop return(message("Plotting complete")) } else { # ywhich is NOT suitable stop("Sorry I don't understand your ywhich variable(s)") } # } else { # xwhich wasn't try ywhich if (deparse(substitute(ywhich)) %in% names(dataframe)) { # yes ywhich is bbb <- enquo(ywhich) if (class(try(eval(xwhich))) %in% c("integer","numeric")) { # then xwhich a suitable number # Build one list two ways depvars <- vector("list", length = length(xwhich)) totalcombos <- 1 # keep track of where we are yname <- deparse(substitute(ywhich)) dfname <- deparse(substitute(dataframe)) message("Creating the variable pairings from dataframe ", dfname) for (j in seq_along(xwhich)) { depvarsbare <- as.name(colnames(dataframe[xwhich[[j]]])) cat("Plot #", totalcombos, " ", as.name(colnames(dataframe[xwhich[[j]]])), " with ", yname, "\n", sep = "") aaa <- enquo(depvarsbare) xname <- deparse(substitute(depvarsbare)) PlotMagic(dataframe,aaa,bbb,whichbar,dfname,xname,yname) totalcombos <- totalcombos +1 } #end of for loop return(message("Plotting complete")) } else { # xwhich is NOT suitable stop("Sorry I don't understand your xwhich variable(s)") } #end of else because xwhich not suitable } #end of if } } # If both variables are numeric print the plot(s) if (class(try(eval(xwhich))) %in% c("integer","numeric") & class(try(eval(ywhich))) %in% c("integer","numeric")) { indvars <- vector("list", length = length(ywhich)) depvars <- vector("list", length = length(xwhich)) dfname <- deparse(substitute(dataframe)) totalcombos <- 1 # keep track of where we are message("Creating the variable pairings from dataframe ", dfname) for (j in seq_along(xwhich)) { for (k in seq_along(ywhich)) { depvarsbare <- as.name(colnames(dataframe[xwhich[[j]]])) indvarsbare <- as.name(colnames(dataframe[ywhich[[k]]])) cat("Plot #", totalcombos, " ", as.name(colnames(dataframe[xwhich[[j]]])), " with ", as.name(colnames(dataframe[ywhich[[k]]])), "\n", sep = "") aaa <- enquo(depvarsbare) bbb <- enquo(indvarsbare) xname <- deparse(substitute(depvarsbare)) yname <- deparse(substitute(indvarsbare)) PlotMagic(dataframe,aaa,bbb,whichbar,dfname,xname,yname) totalcombos <- totalcombos +1 } # end of inner for loop } # end of outer for loop return(message("Plotting complete")) } # end of if case where all are numeric } # end of function
Not mentioned above but clear when we run the code is that I’ve also tried to rationalize the messaging you see in the console and the labeling of the plots themselves.
Are you happy I’m all done?
For most of this series of posts I have focused on using the mtcars
built-in dataset. It’s handy, convenient, and it’s installed by default.
To actually show the function in action I’m going to use a different
dataset. Something that should allow you to better see the value of
making plots of the crosstabs rather than simple tables. It also has the
happy property of being much much larger than mtcars
so we can see if
there are lags in performance due to the number of rows.
Rather than provide my own or make anyone work too hard I selected that
happy
dataset that comes bundled with several R
packages including
productplots
and GGally
. From the description:
The data is a small sample of variables related to happiness from the general social survey (GSS). The GSS is a yearly cross-sectional survey of Americans, run from 1976. We combine data for 25 years to yield 51,020 observations, and of the over 5,000 variables, we select nine related to happiness.
We’ll be focusing on the non numeric variables. I certainly can’t claim to do a detailed analysis here but at least the questions will be fun I hope…
PackageList <- .packages(all.available = TRUE) if ("productplots" %in% PackageList) { data("happy",package = "productplots") } else { stop("Can't load productplots can't use the following examples") } # who's happier by gender PlotMeX(happy,happy,sex) ## Plotted dataset happy variables happy by sex
# same thing using column numbers and a stacked bar PlotMeX(happy,2,5,"stack") ## Creating the variable pairings from dataframe happy ## Plot #1 happy with sex ## Plotting complete
# happiness by a variety of possible factors as a percent PlotMeX(happy, 2, c(5:9), plottype = "percent") ## Creating the variable pairings from dataframe happy ## Plot #1 happy with sex
## Plot #2 happy with marital
## Plot #3 happy with degree
## Plot #4 happy with finrela
## Plot #5 happy with health ## Plotting complete
# turn the numbers around and change them up basically just showing all # the permutations PlotMeX(happy, c(2,5), 9, plottype = "side") ## Creating the variable pairings from dataframe happy ## Plot #1 happy with health
## Plot #2 sex with health ## Plotting complete
PlotMeX(happy, c(2,5), c(6:9), plottype = "percent") ## Creating the variable pairings from dataframe happy ## Plot #1 happy with marital
## Plot #2 happy with degree
## Plot #3 happy with finrela
## Plot #4 happy with health
## Plot #5 sex with marital
## Plot #6 sex with degree
## Plot #7 sex with finrela
## Plot #8 sex with health ## Plotting complete
PlotMeX(happy, happy, c(6,7,9), plottype = "percent") ## Creating the variable pairings from dataframe happy ## Plot #1 happy with marital
## Plot #2 happy with degree
## Plot #3 happy with health ## Plotting complete
PlotMeX(happy, c(6,7,9), happy, plottype = "percent") ## Creating the variable pairings from dataframe happy ## Plot #1 marital with happy
## Plot #2 degree with happy
## Plot #3 health with happy ## Plotting complete
It’s probably the case that no function is ever truly “done” but this one is good enough for now. This has become yet another very long post so I’m going to end here.
I hope you’ve found this useful. I am always open to comments, corrections and suggestions.
Chuck (ibecav at gmail dot com)
License
This
work is licensed under a
Creative
Commons Attribution-ShareAlike 4.0 International License.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.