Functions ddply and melt make plotting summary stats in R more tolerable
The main reason I have usually chosen Excel to make my plots at work is that I had difficulty feeding summary stats computed in R into a plotting function. One thing I learned this week is how to turn summary stats into a data frame suitable for plotting, which makes the whole process of plotting in R more tolerable for me. Below I show the process using the ever-popular iris dataset. I use the function ddply (from the plyr package) to summarize the data and melt (from the reshape2 package) to restructure it into a form amenable to plotting.
> library(plyr)      # provides ddply
> library(reshape2)  # provides melt
> length.by.species = ddply(iris, "Species", function (x) quantile(x$Sepal.Length, c(.25,.5,.75)))
> length.by.species
     Species   25% 50% 75%
1     setosa 4.800 5.0 5.2
2 versicolor 5.600 5.9 6.3
3  virginica 6.225 6.5 6.9
> length.by.species = melt(length.by.species, variable.name="Quantile",value.name="Sepal.Length")
> length.by.species
     Species Quantile Sepal.Length
1     setosa      25%        4.800
2 versicolor      25%        5.600
3  virginica      25%        6.225
4     setosa      50%        5.000
5 versicolor      50%        5.900
6  virginica      50%        6.500
7     setosa      75%        5.200
8 versicolor      75%        6.300
9  virginica      75%        6.900
One thing you can see in my call to ddply is that the grouping variable, whose values are used to split the data frame into subsets, is referred to using quotes. Somehow I find that a bit weird (I’m used to referring to variables without quotes, I suppose!). Other than that, the syntax of ddply is similar enough to that of the apply family of functions, so no more complaints here. You can also see that the call returns a nice neat data frame in which the quantiles I asked for are columns and each value of the Species variable (that is, each subset of the data frame) gets its own row.
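For anyone who shares the unease about quoting the variable name, plyr also accepts the grouping variable unquoted, either wrapped in its `.()` helper or written as a one-sided formula. The sketch below compares the three spellings on the same iris summary; the helper name `quartiles` is made up here for illustration and is not part of plyr.

```r
library(plyr)

# Hypothetical helper (name chosen for this example): the same summary
# function used in the post, returning the quartiles of Sepal.Length.
quartiles <- function(x) quantile(x$Sepal.Length, c(.25, .5, .75))

by.string  <- ddply(iris, "Species", quartiles)   # quoted name, as in the post
by.dot     <- ddply(iris, .(Species), quartiles)  # unquoted, via plyr's .() helper
by.formula <- ddply(iris, ~ Species, quartiles)   # unquoted, as a one-sided formula

# Check that the three calls agree with each other
all.equal(by.string, by.dot)
all.equal(by.string, by.formula)
```

Which spelling to use is mostly a matter of taste; the quoted form is handy when the column name is stored in a variable.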
The melt command is easy enough: it simply wants to know what to call the column that will hold the old column titles (Quantile!) and what to call the column of numeric values taken from those columns (Sepal.Length).
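When melt is called without saying which columns identify the rows, reshape2 guesses them (here it picks Species) and prints a message saying so. A minimal sketch of the same reshaping with the identifier column spelled out through the `id.vars` argument is given below; the object names `wide` and `long` are chosen just for this example.

```r
library(plyr)
library(reshape2)

# Wide form: one row per species, one column per quantile
wide <- ddply(iris, "Species",
              function(x) quantile(x$Sepal.Length, c(.25, .5, .75)))

# Long form: one row per species/quantile pair
long <- melt(wide,
             id.vars       = "Species",       # column(s) that identify each row
             variable.name = "Quantile",      # holds the old column titles
             value.name    = "Sepal.Length")  # holds the numeric values

str(long)  # inspect the resulting columns and their types
```

Naming `id.vars` explicitly makes no difference to the result here, but it avoids surprises when a data frame has more than one non-numeric column.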
Now that the summary stats are in a “long” form data frame, with one column holding the numbers and two columns holding labels, it’s just a simple one-liner to create a graph (here done in ggplot). Below I show one line to create a dodged bar graph, and another line to create a dot plot, both showing the 1st to 3rd quartiles of Sepal.Length by Species.
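library(ggplot2)
# Dodged bar graph: stat="identity" belongs in geom_bar(), not inside aes()
ggplot(length.by.species, aes(y=Sepal.Length, x=Species, fill=Quantile)) + geom_bar(stat="identity", position="dodge")
# Dot plot: geom_point() plots the values as given, so no stat is needed
ggplot(length.by.species, aes(x=Sepal.Length, y=Species, colour=Quantile)) + geom_point(size=4)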
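For bars whose heights are values already sitting in the data frame, recent versions of ggplot2 (2.2.0 and later) also offer geom_col(), which draws the y values as they are without counting rows. A small self-contained sketch is below; the data frame is typed in by hand from the quantiles printed earlier, and the object name `p` is arbitrary.

```r
library(ggplot2)

# Long-form summary table, entered by hand from the values shown in the post
length.by.species <- data.frame(
  Species      = rep(c("setosa", "versicolor", "virginica"), times = 3),
  Quantile     = rep(c("25%", "50%", "75%"), each = 3),
  Sepal.Length = c(4.800, 5.600, 6.225, 5.0, 5.9, 6.5, 5.2, 6.3, 6.9)
)

# geom_col() uses the y values directly, so no stat argument is required
p <- ggplot(length.by.species,
            aes(x = Species, y = Sepal.Length, fill = Quantile)) +
  geom_col(position = "dodge")

print(p)
```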
Thank you ddply and melt!