Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In my last post I described some of my commonly done ggplot2
graphs. It seems as though some people are interested in these, so I was going to follow this up with other plots I make frequently.
Scatterplot colored by continuous variable
The setup of the data for the scatterplots will be the same as the previous post, one x
variable and one y
variable.
library(ggplot2) set.seed(20141106) data = data.frame(x = rnorm(1000, mean=6), batch = factor(rbinom(1000, size=4, prob = 0.5))) data$group1 = 1-rbeta(1000, 10, 2) mat = model.matrix(~ batch, data=data) mat = mat[, !colnames(mat) %in% "(Intercept)"] betas = rbinom(ncol(mat), size=20, prob = 0.5) data$quality = rowSums(t(t(mat) * sample(-2:2))) data$dec.quality = cut(data$quality, breaks = unique(quantile(data$quality, probs = seq(0, 1, by=0.1))), include.lowest = TRUE) batch.effect = t(t(mat) * betas) batch.effect = rowSums(batch.effect) data$y = data$x * 5 + rnorm(1000) + batch.effect + data$quality * rnorm(1000, sd = 2) data$group2 = runif(1000)
I have added 2 important new variables, quality
and batch
. The motivation for these variables is akin to an RNAseq analysis set where you have a quality measure like read depth, and where the data were processed in different batches. The y
variable is based both on the batch effect and the quality.
We construct the ggplot2
object for plotting x
against y
as follows:
g = ggplot(data, aes(x = x, y=y)) + geom_point() print(g)
Coloring by a 3rd Variable (Discrete)
Let's plot the x
and y
data by the different batches:
print({ g + aes(colour=batch)})
We can see from this example how to color by another third discrete variable. In this example, we see that the relationship is different by each batch (each are shifted).
Coloring by a 3rd Variable (Continuous)
Let's color by quality
, which is continuous:
print({ gcol = g + aes(colour=quality)})
The default option is to use black as a low value and blue to be a high value. I don't always want this option, as I cannot always see differences clearly. Let's change the gradient of low to high values using scale_colour_gradient
:
print({ gcol + scale_colour_gradient(low = "red", high="blue") })
This isn't much better. Let's call the middle quality
gray and see if we can see better separation:
print({ gcol_grad = gcol + scale_colour_gradient2(low = "red", mid = "gray", high="blue") })
Scatterplot with Coloring by a 3rd Variable (Continuous broken into Discrete)
Another option is to break the quality
into deciles (before plotting) and then coloring by these as a discrete variable:
print({ gcol_dec = g + aes(colour=dec.quality) })
Scatterplot with Coloring by 3rd Continuous Variable Faceted by a 4th Discrete Variable
We can combine these to show each batch
in different facets and coloring by quality
:
print({ gcol_grad + facet_wrap(~ batch )})
We can compound all these operations by passing transformations to scale_colour_gradient
such as scale_colour_gradient(trans = "sqrt")
. But enough with scatterplots.
Distributions! Lots of them.
One of the gaping holes in my last post was that I did not do any plots of distributions/densities of data. I ran the same code from the last post to get the longitudinal data set named dat
.
Histograms
Let's say I want a distribution of my yij
variable for each person across times:
library(plyr) g = ggplot(data=dat, aes(x=yij, fill=factor(id))) + guides(fill=FALSE) ghist = g+ geom_histogram(binwidth = 3) print(ghist)
Hmm, that's not too informative. By default, the histograms stack on top of each other. We can change this by setting position
to be identity
:
ghist = g+ geom_histogram(binwidth = 3, position ="identity") print(ghist)
There are still too many histograms. Let's plot a subset.
Aside: Using the %+% operator
The %+%
operator allows you to reset what dataset is in the ggplot2
object. The data must have the same components (e.g. variable names); I think this is most useful for plotting subsets of data.
nobs = 10 npick = 5
Let's plot the density of (5) people people with (10) or more observations both using geom_density
and geom_line(stat = "density")
. We will also change the binwidth:
tab = table(dat$id) ids = names(tab)[tab >= nobs] ids = sample(ids, npick) sub = dat[ dat$id %in% ids, ] ghist = g+ geom_histogram(binwidth = 5, position ="identity") ghist %+% sub
Overlaid Histograms with Alpha Blending
Let's alpha blend these histograms to see the differences:
ggroup = ggplot(data=sub, aes(x=yij, fill=factor(id))) + guides(fill=FALSE) grouphist = ggroup+ geom_histogram(binwidth = 5, position ="identity", alpha = 0.33) grouphist
Similarly, we can plot over the 3 groups in our data:
ggroup = ggplot(data=dat, aes(x=yij, fill=factor(group))) + guides(fill=FALSE) grouphist = ggroup+ geom_histogram(binwidth = 5, position ="identity", alpha = 0.33) grouphist
These histograms are something I commonly do when I want overlay the data in some way. More commonly though, espeically with MANY distributions, I plot densities.
Densities
We can again plot the distribution of (y_{ij}) for each person by using kernel density estimates, filled a different color for each person:
g = ggplot(data=dat, aes(x=yij, fill=factor(id))) + guides(fill=FALSE) print({gdens = g+ geom_density() })
As the filling overlaps a lot and blocks out other densities, we can use just different colors per person/id/group:
g = ggplot(data=dat, aes(x=yij, colour=factor(id))) + guides(colour=FALSE) print({gdens = g+ geom_density() })
I'm not a fan that the default for stat_density
is that the geom = "area"
. This creates a line on the x-axis that closes the object. This is very important if you want to fill the density with different colors. Most times though, I want simply the line of the density with no bottom line. We can achieve this with:
print({gdens2 = g+ geom_line(stat = "density")})
What if we set the option to fill
the lines now? Well lines don't take fill, so it will not colour each line differently.
gdens3 = ggplot(data=dat, aes(x=yij, fill=factor(id))) + geom_line(stat = "density") + guides(colour=FALSE) print({gdens3})
Now, regardless of the coloring, you can't really see the difference in people since there are so many. What if we want to do the plot with a subset of the data and the object is already constructed? Again, use the %+%
operator.
Overlaid Densities with Alpha Blending
Let's take just different subsets of groups, not people, and plot the densities, with alpha blending:
print({ group_dens = ggroup+ geom_density(alpha = 0.3) })
That looks much better than the histogram example for groups. Now let's show these with lines:
print({group_dens2 = ggroup+ geom_line(stat = "density")})
What happened? Again, lines don't take fill
, they take colour
:
print({group_dens2 = ggroup+ geom_line(aes(colour=group), stat = "density")})
I'm a firm believer of legends begin IN plots, so let's push that in there and make it blend in:
print({ group_dens3 = group_dens2 + theme(legend.position = c(.75, .75), legend.background = element_rect(fill="transparent"), legend.key = element_rect(fill="transparent", color="transparent")) })
Lastly, I'll create a dataset of the means of the datasets and put vertical lines for the mean:
gmeans = ddply(dat, .(group), summarise, mean = mean(yij)) group_dens3 + geom_vline(data=gmeans, aes(colour = group, xintercept = mean))
Conclusion
Overall, this post should give you a few ways to play around with densities and such for plotting. All the same changes as the previous examples with scatterplots, such as facetting, can be used with these distribution plots. Many times, you want to look at the data in very different ways. Histograms can allow you to see outliers in some ways that densities do not because they smooth over the data. Either way, the mixture of alpha blending, coloring, and filling (though less useful for many distributions) can give you a nice description of what's going on a distributional level in your data.
PS: Boxplots
You can also do boxplots for each group, but these tend to separate well and colour relatively well using defaults, so I wil not discuss them here. My only note is that you can (and should) overlay points on the boxplot rather than just plot the histogram. You may need to jitter the points, alpha blend them, subsample the number of points, or clean it up a bit, but I try to display more DATA whenever possible.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.