[This article was first published on Shifting sands, and kindly contributed to R-bloggers].
This is a follow-on from the post Using apply sapply and lapply in R.
The dataset we are using was created like so:
m <- matrix(data=cbind(rnorm(30, 0), rnorm(30, 2), rnorm(30, 5)), nrow=30, ncol=3)
Three columns of 30 observations, normally distributed with means of 0, 2 and 5. We want a density plot to compare the distributions of the three columns using ggplot.
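Because rnorm draws fresh random numbers on every call, the values shown in this post will not match a new run. A minimal sketch of a reproducible version follows; the seed value is an arbitrary assumption and is not part of the original post. Note that cbind already returns a matrix, so the outer matrix() call is not strictly needed.

```r
# Fix the random seed so the simulated values are the same on every run.
# The seed value 42 is an arbitrary choice (an assumption, not from the post).
set.seed(42)

# cbind() on three numeric vectors already yields a 30 x 3 matrix.
m <- cbind(rnorm(30, mean = 0), rnorm(30, mean = 2), rnorm(30, mean = 5))

dim(m)        # number of rows and columns
colMeans(m)   # sample means, which should land near 0, 2 and 5
```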
First let’s give our matrix some column names:
colnames(m) <- c('method1', 'method2', 'method3')
head(m)
# method1 method2 method3
#[1,] 0.06288358 2.7413567 4.420209
#[2,] -0.11240501 3.4126550 4.827725
#[3,] 0.02467713 1.0868087 4.044101
ggplot has a nice function that displays just what we are after: geom_density, along with its counterpart stat_density, whose documentation has more examples.
ggplot likes to work on data frames and we have a matrix, so let’s fix that first:
df <- as.data.frame(m)
df
# method1 method2 method3
#1 0.06288358 2.7413567 4.420209
#2 -0.11240501 3.4126550 4.827725
#3 0.02467713 1.0868087 4.044101
#4 -0.73854932 -0.4618973 3.668004
Enter stack
What we would really like is to have our data in 2 columns, where the first column contains the data values, and the second column contains the method name.
Enter the base function stack, which is a great little function giving just what we need:
dfs <- stack(df)
dfs
# values ind
#1 0.06288358 method1
#2 -0.11240501 method1
#…
#88 5.55704736 method3
#89 6.40128267 method3
#90 3.18269138 method3
We can see the values are in one column named values, and the method names (the previous column names) are in the second column named ind. We can confirm they have been turned into a factor as well:
is.factor(dfs[,2])
#[1] TRUE
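To make it clearer what stack is doing, here is a rough sketch that builds the same two-column layout by hand. The small data frame called toy and its values are made up for illustration and are not from the original post.

```r
# A tiny made-up data frame (hypothetical values, for illustration only).
toy <- data.frame(method1 = c(0.1, -0.2),
                  method2 = c(2.3, 1.9),
                  method3 = c(5.0, 4.6))

# Build the long format by hand: all values laid end to end, plus a factor
# recording which column each value came from.
manual <- data.frame(
  values = unlist(toy, use.names = FALSE),
  ind    = factor(rep(names(toy), each = nrow(toy)), levels = names(toy))
)

# Compare the hand-built version with what stack() produces.
all.equal(manual, stack(toy))
```

In practice stack is the shorter and safer choice; the manual version is only meant to show where the values and ind columns come from.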
stack has a partner in crime, unstack, which does the opposite:
unstack(dfs)
# method1 method2 method3
#1 0.06288358 2.7413567 4.420209
#2 -0.11240501 3.4126550 4.827725
#3 0.02467713 1.0868087 4.044101
#4 -0.73854932 -0.4618973 3.668004
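As a quick check that the two functions really are inverses here, the following sketch stacks and then unstacks a small made-up data frame (toy is a hypothetical name, not from the original post). When no formula is given, unstack falls back on the formula implied by the data frame, which for stacked data is values ~ ind; it can also be spelled out explicitly.

```r
# A tiny made-up data frame (hypothetical values, for illustration only).
toy <- data.frame(method1 = c(0.1, -0.2, 0.3),
                  method2 = c(2.3, 1.9, 2.1))

long  <- stack(toy)                   # wide -> long
wide  <- unstack(long)                # long -> wide, default formula
wide2 <- unstack(long, values ~ ind)  # the same, with the formula written out

# Check that the round trip gives back the original columns.
all.equal(wide, toy)
```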
Back to ggplot
So, let’s try plotting our densities with ggplot:
library(ggplot2)  # load ggplot2 if it is not already attached
ggplot(dfs, aes(x=values)) + geom_density()
The first argument is our stacked data frame, and the second is a call to the aes function which tells ggplot the ‘values’ column should be used on the x-axis.
However, the plot is not quite what we want: it shows a single density for all 90 values pooled together, rather than one curve per method.
Hmm.
We want to group the values by each method used. To do this we will use the ‘ind’ column, and we tell ggplot about this by using aes in the geom_density call:
ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind))
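Grouping means that a separate kernel density is estimated for each level of ind. As a rough illustration of the same idea in base R, the sketch below splits the values by method and calls density on each piece. The seed and the regenerated data are assumptions added so the example runs on its own.

```r
# Recreate a stacked data frame so the example is self-contained.
# The seed value is arbitrary (an assumption, not from the original post).
set.seed(1)
dfs <- stack(data.frame(method1 = rnorm(30, 0),
                        method2 = rnorm(30, 2),
                        method3 = rnorm(30, 5)))

# Split the values by method and estimate one kernel density per group.
dens <- lapply(split(dfs$values, dfs$ind), density)

names(dens)   # one density estimate per level of ind

# The x position where each estimated curve peaks.
sapply(dens, function(d) d$x[which.max(d$y)])
```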
This is getting closer, but it’s not easy to tell the curves apart. Let’s try colouring the different methods, based on the ind column in our data frame.
ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind, colour=ind))
Looking better. I’d like the density regions to stand out some more, so I will use fill, with an alpha value of 0.3 to make them transparent.
ggplot(dfs, aes(x=values)) + geom_density(aes(group=ind, colour=ind, fill=ind), alpha=0.3)
That is much more in line with what I wanted to see. Note that the alpha argument is passed to geom_density() rather than aes().
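Pulling the steps together, here is a self-contained sketch of the whole workflow. It assumes the ggplot2 package is installed; the library call and the seed value are additions that the original post does not show.

```r
library(ggplot2)  # assumed to be installed; not shown in the original post

set.seed(42)      # arbitrary seed, an assumption for reproducibility

# Simulate three methods and reshape to the long format ggplot expects.
m <- cbind(method1 = rnorm(30, 0),
           method2 = rnorm(30, 2),
           method3 = rnorm(30, 5))
dfs <- stack(as.data.frame(m))

# colour and fill vary with the data, so they are mapped inside aes();
# alpha is a single fixed value, so it is set outside aes().
p <- ggplot(dfs, aes(x = values)) +
  geom_density(aes(group = ind, colour = ind, fill = ind), alpha = 0.3)

print(p)
```

Storing the plot in p before printing makes it easy to add further layers or save it later.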
That’s all for now.