Grouped means (again)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
So, the post I did yesterday on aggregate seemed to go down well.
One of the comments suggested I add an example. Other comments had other useful hints which I thought I’d pass on more formally. So here goes…
The mtcars dataset in base has data on various aspects of cars – miles per gallon, number of cylinders etc etc etc. This makes it a nice example dataset for aggregate as some of the variables make for suitable grouping factors.
First we get hold of the dataset. Thats easy as its a base dataset:
data(mtcars)
If we look at the help file (?mtcars) for the dataset we notice that the last column is the number of carburettors (“carb”), which could be an good grouping variable. Another is whether the car is an automatic or not (“am”). As our dependant variable we’ll use the horsepower (“hp”)
To make the group means we simply use aggregate as I mentioned yesterday.
aggregate(mtcars$hp, by=list(carb=mtcars$carb, am=mtcars$am), mean) # carb am x #1 1 0 104.0000 #2 2 0 134.5000 #3 3 0 180.0000 #4 4 0 198.0000 #5 1 1 72.5000 #6 2 1 91.2500 #7 4 1 161.3333 #8 6 1 175.0000 #9 8 1 335.0000
We can make this code a bit shorter though by using the formula interface:
aggregate(hp ~ carb + am, mtcars, mean)
Easy huh? Of course in order to use that table in further procedures (plotting or whatever) you should give it a name:
mtcarsagg <- aggregate(hp ~ carb + am, mtcars, mean)
We can also use other functions in aggregate. Min, max, range for example. A natural one that comes to mind is the standard error. First we’ll define a function to calculate standard error*, then apply it to the example above:
st.err <- function(x) { sd(x)/sqrt(length(x)) } SE <- aggregate(hp ~ carb+am, mtcars, st.err) # carb am hp #1 1 0 3.785939 #2 2 0 18.777202 #3 3 0 0.000000 #4 4 0 20.136439 #5 1 1 6.837397 #6 2 1 13.930632 #7 4 1 51.333333 #8 6 1 NA #9 8 1 NA
It could be that you come up with NA values in the table if you have NAs in the dependant variable or is for st.err for example there is only a single value like for automatic cars with 6 or 8 carburettors. If so, adding the na.rm=TRUE argument to the end of the argument list should clear it if theres NAs**. If theres only a single value, then mean and SE are probably not so useful anyway.
We can of course bind them together:
mtcarsagg <- cbind(mtcarsagg, SE[,3]) mtcarsagg # carb am hp SE$hp #1 1 0 104.0000 3.785939 #2 2 0 134.5000 18.777202 #3 3 0 180.0000 0.000000 #4 4 0 198.0000 20.136439 #5 1 1 72.5000 6.837397 #6 2 1 91.2500 13.930632 #7 4 1 161.3333 51.333333 #8 6 1 175.0000 NA #9 8 1 335.0000 NA
A note about using aggregate with the by argument – it doesnt give the dependant variable column a name, so you have to do that manually
(names(mtcarsagg)[3] <- “hp” for instance). Using the formula interface does name the column automatically though, and as its shorter, its probably easier to use anyway!
As with anything in R, theres multiple ways to do everything. Other methods for the above include the ave function, Hadley Wickham’s plyr package and aggregate(data["Y"], data[c("trt", "alt")], mean)
Hope that helps!!!
* A slightly more advanced standard error function with NA removal support is:
st.err <- function(x, na.rm=FALSE) { if(na.rm==TRUE) x <- na.omit(x) sd(x)/sqrt(length(x)) }
**I attempted to find a dataset with NAs to provide an example, but it ignored the NAs anyway, despite mean giving an NA result (which is weird). Sorry about that.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.