Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Scatterplots are one of the most common types of data visualizations you will encounter as a biologist. They present the relationship between two continuous variables. We might take them for granted by their simplicity, but we shouldn’t assume the seeming intuition with which we can see and comprehend these figures. They are a powerful tool, but one that I believe merits a bit more attention. Check out this really cool article from the New Yorker about ‘When graphs are a matter of life and death’ for more history on the subject.
All through my grad school years and beyond, I’ve repeatedly come across scatterplots that almost defeat the purpose of helping us easily understand the relationship between two variables. Here’s a typical example of the type of plot I’ve seen one-too-many times:
There are several issues here, but without elaborating, here are the same data after a few visual tweaks:
In other words, while the data may be accurate, the actual visual design of scatterplots is often overlooked and unattended. Unlike a statistical test, the goal of data visualizations is subjective—to help a viewer understand a particular relationship or story. For that reason, it is important that we take a subjective, and dare I say aesthetic, approach towards ensuring scatterplots (and all other plot types, really) are visually appealing and easy to understand on a quick glance.
I have a hunch that the main reason plots such as the first one above are so common is simply due to a lack of knowing how to easily customize plots in R. Unfortunately, even ggplot2—which is commended for the ease with which one can make good quality visualizations—is not so pretty right out of the box.
Hence this blog post 😉
Here is a simple tutorial on how to re-create the nice version of the plot above using the ‘base’ R package. The key is just to include a few additional parameters and functions. In the future I may update this post with how to do this using ggplot2.
First, let’s load the data. In this case we are using the built-in dataset on air quality measurements in New York from May through September in 1973:
# Load the built-in data: data(airquality) help(airquality) head(airquality) ## Ozone Solar.R Wind Temp Month Day ## 1 41 190 7.4 67 5 1 ## 2 36 118 8.0 72 5 2 ## 3 12 149 12.6 74 5 3 ## 4 18 313 11.5 62 5 4 ## 5 NA NA 14.3 56 5 5 ## 6 28 NA 14.9 66 5 6
For this tutorial we are only interested in ozone concentration, which is measured in parts per million (ppm), and wind speed, which is measured in miles per hour (MPH). To get this info, I just ran the ‘help(airquality)’ function to pull up a description of these data.
Next, let’s start with the plot created using the ‘plot()’ function right out of the box:
plot(Ozone ~ Wind, data=airquality)
Next, remove all the axes and tick marks from the plot so that we can start with a clean slate:
plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="")
plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="") # add the wind speed axis: axis(side=1, at=seq(0, max(airquality$Wind, na.rm=T), 2), padj=-0.8) # add the Ozone axis: axis(side=2, at=seq(0, max(airquality$Ozone, na.rm=T), 20), hadj=0.8, las=2)
# This finds the maximum value for wind: max(airquality$Wind, na.rm=T) ## [1] 20.7 # Then use that in the 'seq()' function to create the sequence of places for the tickmarks: seq(from = 0, to = max(airquality$Wind, na.rm=T), by = 2) ## [1] 0 2 4 6 8 10 12 14 16 18 20
The ‘padj’ and ‘hadj’ (perpendicular adjust and horizontal adjust) arguments are used to nudge the axis tickmark label text so that it lines up more neatly. Play around with those values to see what you get. Finally, the ‘las’ argument when set to 2, turns the y axis tick marks horizontally so that they are more easily readable and all fit on the axis neatly.
Next, add in the axis name labels using the ‘mtext’ function. It’s always important to add units to these labels, which I did. The ‘line’ argument is how far from the edge of the plot you want the label to appear. ‘cex’ affects the size of the text, and finally, ‘’ is used to make the text bold. You can set ‘’ to 1, 2, 3, or 4, for normal, bold, italic, or italic + bold respectively. Play around with all those parameters to see how it changes the figure.
plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="") # add the wind speed axis: axis(side=1, at=seq(0, max(airquality$Wind, na.rm=T), 2), padj=-0.8) # add the Ozone axis: axis(side=2, at=seq(0, max(airquality$Ozone, na.rm=T), 20), hadj=0.8, las=2) # add in the labels for each axis: mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, =2) mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, =2)
plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="", pch=16, cex=1.5) # add the wind speed axis: axis(side=1, at=seq(0, max(airquality$Wind, na.rm=T), 2), padj=-0.8) # add the Ozone axis: axis(side=2, at=seq(0, max(airquality$Ozone, na.rm=T), 20), hadj=0.8, las=2) # add in the labels for each axis: mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, =2) mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, =2)
### transparent colors function t_col <- function(color, opacity = 0.5) { rgb.val <- col2rgb(color) t.col <- rgb(rgb.val[1], rgb.val[2], rgb.val[3], max = 255, alpha = (opacity)*255) invisible(t.col) }
This function essentially takes in two arguments: ‘color’ which is the color you want to make transparent, and then ‘opacity’ which goes from 0 to 1, with 0 being totally transparent, and 1 being no transparency. Adding this function to our plot looks like this:
plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="", pch=16, cex=1.5, col=t_col("black",0.6)) # add the wind speed axis: axis(side=1, seq(0, max(airquality$Wind, na.rm=T), 2), padj=-0.8) # add the Ozone axis: axis(side=2, at=seq(0, max(airquality$Ozone, na.rm=T), 20), hadj=0.8, las=2) # add in the labels for each axis: mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, =2) mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, =2)
plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="", pch=16, cex=1.5, col=t_col("black",0.6), ylim=c(0,185), xlim=c(0,22)) # add the wind speed axis: axis(side=1, seq(0, max(airquality$Wind, na.rm=T), 2), padj=-0.8) # add the Ozone axis: axis(side=2, at=seq(0, max(airquality$Ozone, na.rm=T), 20), hadj=0.8, las=2) # add in the labels for each axis: mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, =2) mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, =2)
We’ll end by using the ‘par()’ function to set the margins of the plot. I don’t like how close the Y axis label is to the edge of the figure. The ‘mar’ argument is to set the margins around the edge of the figure. I’m not sure what units those are in, but play around with the numbers until you get something that looks good. The four values in the vector represent the four sides in the same order as the ‘side’ argument used for the axes: bottom, left, top, and right:
Here is the plot again with a background color so that you can see what I mean:
And after we added the ‘par(mar=c(5,5,2,2))’:
par(mar=c(5,5,2,2)) plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="", pch=16, cex=1.5, ylim=c(0,185), xlim=c(0,22), col=t_col("black",0.6)) mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, =2) axis(side=1, seq(0, max(airquality$Wind)+2, 2), padj=-0.8) mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, =2) axis(side=2, at=seq(0, 185, 20), hadj=0.8, las=2)
Finally (and maybe most importantly), when you save your figure by clicking ‘Export’ above the ‘Plots’ pane in R Studio, you’ll have the option to resize the figure dimensions and see a preview of how it looks with different dimensions. Don’t neglect the important step of ensuring the figure dimensions are set to a size that considers the proportion of all the elements in the figure. Just play around with the sizing and you’ll see what I mean. This is what you are going for:
Alternatively, if you want the image size to also remain in the code, create a ‘quartz()’ window (if using a mac) or windows() window (if using a PC). Set ‘height’ and ‘width’ in those functions to the desired size (I believe the units are inches) and run that function before the code that creates the plot. This will open up an external graphics window that is sized to your specifications and you can then go to the file menu at the top of your screen to save the figure as a PDF.
You can also save directly to a graphic window file. Here is the final code for how to do this:
pdf(file="my_scatterplot.pdf",width=7,height=4.5) par(mar=c(5,5,2,2)) plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="", pch=16, cex=1.5, ylim=c(0,185), xlim=c(0,22), col=t_col("black",0.6)) mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, =2) axis(side=1, seq(0, max(airquality$Wind)+2, 2), padj=-0.8) mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, =2) axis(side=2, at=seq(0, 185, 20), hadj=0.8, las=2) dev.off()
Simply use the ‘pdf()’ function to set the name of the file and directory where it will be saved and specify the height and width in inches. You can play around with those measurements until you find something that works. Then run the code that creates the plot. And finally, run ‘dev.off()’ to close that graphic device.
I recommend always saving your figures as PDFs to retain maximum quality. Check out this other excellent blog post by David Smith with more details about how and why to save your figures in particular formats.
Well done! That’s it for now. Do you think this is easier to do with ggplot? I’ll follow up with an update or post on that as well.
If you liked this article, let me know what you might want to see next in the comments down below.
Also be sure to check out R-bloggers for other great tutorials on learning R
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.