Using R for Introductory Statistics, Chapter 4, Model Formulae

[This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers].

Several R functions take model formulae as parameters. Model formulae are symbolic expressions. They define a relationship between variables rather than an arithmetic expression to be evaluated immediately. Model formulae are defined with the tilde operator. A simple model formula looks like this:

response ~ predictor

Functions that accept formulae typically also take a data argument to specify a data frame in which to look up model variables and a subset argument to select certain rows in the data frame.
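To make that concrete, here is a minimal sketch using the built-in mtcars data set. That data set is my own choice for illustration, not one used in the book. It shows that a formula is just an unevaluated object, and how the data and subset arguments control where the variables come from.

```r
# A formula is an object in its own right; nothing is computed yet.
f <- mpg ~ wt
class(f)       # "formula"
all.vars(f)    # "mpg" "wt"

# data says where to look up mpg and wt;
# subset keeps only the rows where the condition is TRUE.
fit.all  <- lm(f, data = mtcars)
fit.4cyl <- lm(f, data = mtcars, subset = cyl == 4)

nrow(model.frame(fit.all))    # 32 rows used
nrow(model.frame(fit.4cyl))   # 11 rows used
```

Note that cyl in the subset expression is also looked up in the data frame, so there is no need to write mtcars$cyl.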

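We’ve already seen model formulae used for simple linear regression and with plot and boxplot, to show that American cars are heavy gas guzzlers. Two common uses of formulae are a numeric response against a numeric predictor (regression and scatterplots) and a numeric response against a factor (boxplots).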
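As a quick reminder of how formulae look in base graphics and regression, here is a small sketch. It uses the built-in mtcars data as a stand-in, which is an assumption on my part rather than the data from the earlier chapters.

```r
# numeric ~ numeric: scatterplot with a fitted regression line
plot(mpg ~ wt, data = mtcars)
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit)

# numeric ~ factor: one box per level of the grouping variable
boxplot(mpg ~ factor(cyl), data = mtcars)
```

The same formula syntax drives all three functions; only the type of the right-hand side changes what gets drawn.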

The lattice graphics package accepts more complicated model formulae of this form:

response ~ predictor | condition

We’ll try this out with a dataset called kid.weights from the UsingR package. It records age (in months), weight, height and gender for 250 kids ranging from 3 months to 12 years old.

library(UsingR)
library(lattice)
dim(kid.weights)
[1] 250   4

We expect weight and height to be related, but we’re wondering if this relationship changes as kids grow. Often, when we want to condition on a quantitative variable (like age), we turn it into a categorical variable by binning. Here, we’ll create 4 bins by taking age in 3-year intervals.

# age is in months: convert to years, then cut at 0, 3, 6, 9 and 12
age.classes = cut(kid.weights$age/12, 3*(0:4))
unique(age.classes)
[1] (3,6]  (6,9]  (9,12] (0,3] 
Levels: (0,3] (3,6] (6,9] (9,12]
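To see how cut assigns values to bins, here is a small sketch with a handful of made-up ages (hypothetical values, not taken from kid.weights). By default the intervals are closed on the right, so a child of exactly 36 months lands in (0,3], and a value equal to the lowest break falls outside every bin.

```r
# Hypothetical ages in months, made up for illustration
age.months <- c(3, 36, 37, 72, 100, 144)

# Same breaks as above: 0, 3, 6, 9, 12 (years)
bins <- cut(age.months / 12, breaks = 3 * (0:4))
as.character(bins)
# "(0,3]" "(0,3]" "(3,6]" "(3,6]" "(6,9]" "(9,12]"

table(bins)

# The lowest break is excluded unless include.lowest = TRUE
is.na(cut(0, breaks = 3 * (0:4)))   # TRUE
```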

With age as a factor, we can express our question as the model formula:

height ~ weight | age.classes

The lattice graphics function xyplot accepts this kind of formula and draws a panel for each level of the conditioning variable. The panels contain scatterplots of the response and predictor, in this case height and weight, divided into subsets by the conditioning variable. The book shows a little trick that lets us customize xyplot, adding a regression line to each scatterplot.

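# Custom panel function: x and y are the data for one panel only
plot.regression = function(x,y) {
  panel.xyplot(x,y)         # draw the points
  panel.abline(lm(y~x))     # add the least-squares line for this panel
}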

We pass the helper function plot.regression as a custom panel function in xyplot.

xyplot( height ~ weight | age.classes, data=kid.weights, panel=plot.regression)
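For comparison, lattice can also draw a per-panel regression line without a custom panel function, through the type argument of xyplot. The sketch below uses the built-in mtcars data as a stand-in so that it runs without UsingR, which is an assumption on my part and not the book's example.

```r
library(lattice)

# Stand-in data: built-in mtcars, conditioned on number of cylinders
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            type = c("p", "r"))   # "p" = points, "r" = regression line

class(p)   # "trellis"
print(p)   # explicit print is needed inside functions or sourced scripts
```

Unlike base graphics, xyplot returns a trellis object rather than drawing directly. At the interactive prompt it is printed automatically, which is why the call above works as written there.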

There’s quite a bit more to model formulae, but that’s all I’ve figured out so far.

