How to Summarize a Data Frame by Groups in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Sometimes, when you’re analyzing a data set and want a complete picture of it, you calculate metrics across all the observations for each variable. Let’s say, for example, that you run a small zoo and want to inventory the cost of all your animals. To calculate this in a spreadsheet, you simply sum all the rows in the “cost” column.
But what do you do if you want to know the cost broken down by each category of animal? In a spreadsheet, you would subtotal the “cost” column by the column referencing the animal’s class. The result would be something like below, with the original data set on the left and the subtotaled data set on the right.
If you’re working with databases, it’s even easier to achieve the result with a simple SQL query. If the table you’re working in is called “animals,” the query would look something like this:
SELECT Class, SUM(Cost) FROM animals GROUP BY Class;
And the result is pretty much the same. So, the question is, if you can do this in spreadsheets and databases, can you do it in R? You bet you can.
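As a preview, here is a minimal sketch of the zoo subtotal in R, using the dplyr package that is introduced below. The `animals` data frame, its column names, and every value in it are invented for illustration; the original inventory is not reproduced here.

```r
library(dplyr)

# Hypothetical inventory: animals, classes, and costs are made up
animals <- data.frame(
  Animal = c("Lion", "Zebra", "Parrot", "Flamingo", "Python"),
  Class  = c("Mammal", "Mammal", "Bird", "Bird", "Reptile"),
  Cost   = c(5000, 3000, 400, 600, 250)
)

# Grand total, like summing the whole "cost" column in a spreadsheet
total_cost <- sum(animals$Cost)

# Subtotal per class, like SELECT Class, SUM(Cost) ... GROUP BY Class
cost_by_class <- animals %>%
  group_by(Class) %>%
  summarise(Cost = sum(Cost))

print(total_cost)
print(cost_by_class)
```

With these made-up numbers the grand total is 9250, and the subtotals come back one row per class: 1000 for Bird, 8000 for Mammal, and 250 for Reptile.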
In the dplyr package, you can create subtotals by combining the group_by() function and the summarise() function. Let’s start with an example. Below is the first part of the mtcars data frame, which is provided in the datasets package that ships with R.
Now, suppose we’re interested in purchasing a car. We’re interested in 3 things regarding the car we’re seeking to purchase: the fuel economy, the power, and the speed. However, we’re unsure how the number of cylinders relates to these variables. How can we view the averages by cylinder? Let’s look at some code.
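If you want to look at those first rows in your own session, a short sketch like this should do it. It needs no extra packages, since mtcars is available in a standard R installation.

```r
# mtcars is available by default; no library() call is needed
first_rows <- head(mtcars)   # first six rows
print(first_rows)

# Check the overall shape: 32 cars (rows) and 11 variables (columns)
print(dim(mtcars))

# The four columns used in the rest of this post
print(names(mtcars)[names(mtcars) %in% c("cyl", "mpg", "hp", "qsec")])
```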
### group mtcars by cylinders and return some averages
### dplyr provides %>%, select(), group_by() and summarise()
library(dplyr)

cars <- mtcars %>%
  select(cyl, mpg, hp, qsec) %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg), hp = mean(hp), qsec = mean(qsec))
Here, we’ve used piping with dplyr functions to create a data set showing us the average mpg, hp, and qsec (seconds it takes to go 1/4 of a mile) for each number of cylinders. When we print the new “cars” data frame, here’s what we get:
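For reference, here is a self-contained sketch that rebuilds the summary and prints it rounded to two decimals. The rounding and the conversion to a plain data frame are additions for readability, not part of the code above.

```r
library(dplyr)

# Rebuild the per-cylinder averages
cars <- mtcars %>%
  select(cyl, mpg, hp, qsec) %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg), hp = mean(hp), qsec = mean(qsec))

# Convert to a plain data frame and round so the printout is compact
cars_rounded <- round(as.data.frame(cars), 2)
print(cars_rounded)
#   cyl   mpg     hp  qsec
# 1   4 26.66  82.64 19.14
# 2   6 19.74 122.29 17.98
# 3   8 15.10 209.21 16.77
```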
This gives us a summary that can help us more easily make a decision. If we want to see this plotted on graphs, we can run this code…
### draw plots of cylinders vs other variables
### cylinders go on the x-axis, the averaged variable on the y-axis
par(mfrow = c(1, 3), lwd = 4)
plot(cars$cyl, cars$mpg, main = "Cylinders vs Miles Per Gallon",
     xlab = "Number of Cylinders", ylab = "Miles Per Gallon",
     type = "h", col = "Red")
plot(cars$cyl, cars$hp, main = "Cylinders vs Horsepower",
     xlab = "Number of Cylinders", ylab = "Horsepower",
     type = "h", col = "Blue")
plot(cars$cyl, cars$qsec, main = "Cylinders vs Quarter Mile Time",
     xlab = "Number of Cylinders", ylab = "Quarter Mile Time in Seconds",
     type = "h", col = "Green")
And here’s what we end up with…
It looks like more cylinders go along with lower fuel economy, but also with more power and more speed (a shorter quarter-mile time).
Also worth noting, the mpg and qsec change at roughly the same rate going from 6 to 8 cylinders as they do going from 4 to 6 cylinders. However, the horsepower jumps up more dramatically when going from 6 to 8 cylinders than when going from 4 to 6 cylinders. So, if power is relatively more important to you, you may want to go ahead and go with 8 cylinders.
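To put numbers on those step-to-step changes, one option is to take differences between consecutive rows of the summary with base R's diff(). This is a sketch added for illustration; the `steps` data frame and its labels are names chosen here, not part of the original analysis.

```r
library(dplyr)

# Rebuild the per-cylinder averages (rows come back ordered 4, 6, 8)
cars <- mtcars %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg), hp = mean(hp), qsec = mean(qsec))

# diff() subtracts each value from the next one:
# the first result is the 4 -> 6 step, the second is the 6 -> 8 step
steps <- data.frame(
  step = c("4 to 6", "6 to 8"),
  mpg  = round(diff(cars$mpg), 2),
  hp   = round(diff(cars$hp), 2),
  qsec = round(diff(cars$qsec), 2)
)
print(steps)
```

On mtcars, average horsepower rises by about 40 from 4 to 6 cylinders and by about 87 from 6 to 8. Quarter-mile time falls by about 1.2 seconds at each step, while mpg drops by about 6.9 and then about 4.6.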