Step By Step: recreating a volcano plot in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We have an analysis routine for proteomics data written for IgorPro. One output is a volcano plot. These plots show the fold change in one sample compared to another and plot that against a p-value to estimate how reproducible any changes observed are. This post is not about that software, but on the topic of how we can recreate this plot in R. What steps need to be considered?
Quick note about volcano plots in R
There are plenty of ways to make volcano plots in R. One of the best is EnhancedVolcano which is available in Bioconductor. There is also a shiny app VolcaNoseR by Joachim Goedhart. If you want to make a volcano plot with your own data, start with one of those. The idea of this post is to walkthrough the steps to get a desired result.
The original plot
An example output from VolcanoPlot is shown below. Each point represents a protein detected by mass spectrometry. The co-ordinates come from a Log2 representation of the fold-change on the x-axis, and on the y, the p-value from a t-test (typically there are three replicates for the test and for the control).
Some things to note about the plot before we try to recreate it in R.
- Points are coloured grey if the fold change is less than 2 and greater than 0.5 (i.e. 1 and -1 on a Log2 scale), not of interest.
- If the p-value is less than 0.05, the points are either blue or pink. Blue, the fold change is lower than 0.5 (enriched in the control); pink, the fold change is greater than 2 (enriched in the test condition). Changes greater than 2 but below the significance threshold are coloured red.
- Vertical dotted line shows the boundary between enrichment in test or control
- Horizontal dotted line shows the significance threshold.
- y axis is decreasing and log10-scaled (1 to 1e-4)
- x axis is on a log2 scale, symmetrical (-7 to 7) with respect to dotted line and labels show the exponent not the real value.
Aesthetically, we have 0.5 alpha on the points to help with visualisation. No stroke on the points. Helvetica throughout. There’s plenty of space on the plot, and it is tall. This is to aid labelling of proteins, none of which are shown here.
It also looks “Igor-like”: there’s a standoff on the axes, engineering style ticks on y to help show the log scale, open axes (no mirroring), no grid lines.
Our mission is to recreate this plot in R
I’ll go through the major steps that need to be considered and worked through to get something close to the plot above (using ggplot).
We start with a tab-separated text file from Igor. This file has several columns including the names of the proteins, the p-value, the ratio of test vs control, and a number corresponding to the colouring. Let’s go!
library(tidyverse) df <- read.delim("path/to/file.txt") # so_ratioWave and so_allTWave contain the ratios and p-values ggplot(df, aes(x = so_ratioWave, y = so_allTWave)) + geom_point()
This doesn’t look good! The p-values and ratios are their true values and we would like to represent them differently. In Igor, this is a simple option to change the axis type, set limits and you’re done. With ggplot, it is more tricky. We can try to play with the scales using scale_y_log10
and so on. The simplest approach though is to make new columns in our data frame with the converted versions of these data. We need to do a -log10()
transformation on the p-values and a log2()
transformation on the ratios.
df$minus_log10_pvalue <- -log10(df$so_allTWave) df$log2_fc <- log2(df$so_ratioWave) ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) + geom_point()
Let’s make it look a tad more presentable. Add some labels and change the theme. Give the points an alpha of 0.5 like in the original plot. We can use shape = 16
to give stroke-less points.
ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) + geom_point(shape = 16, alpha = 0.5) + labs(x = "Fold change (log2)", y = "P-value") + theme_bw(9) + theme(legend.position = "none")
We’ll stick with this basic theme, although the Igor version had open axes and no gridlines. But we now have our basic plot sorted out.
Next: How do we do the colours?
The dataframe has a column called so_colorWave which has 0, 1, 2, 3 to designate the grey, red, blue and pink colouring of the points.
We use aes(colour = as.factor(so_colorWave))
to specify that we want to use this column to colour the points. I call as.factor()
because when the column was imported it is an integer type.
Then we can use scale_colour_manual()
to specify the colours. I use hex representation for each (matching the original). The values
argument is supplied with a list of these values. Note that this is a named list. Without specifying which factor is coloured by which hex colour, we would get the same result in this case. However, if the dataframe did not have any 2s (blue points) for example, without a named list, the colours change order. So this is a more robust way to colour the points.
ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) + geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) + scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) + labs(x = "Fold change (log2)", y = "P-value") + theme_bw(9) + theme(legend.position = "none")
Let’s add the dashed lines. We’ll put them behind the points rather than in front as the original has it. Remember that ggplot adds things to the plot in the order you add them!
ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) + geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") + geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") + geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) + scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) + labs(x = "Fold change (log2)", y = "P-value") + theme_bw(9) + theme(legend.position = "none")
Getting closer… now let’s pay the axes some attention
Next: scaling and labelling of axes
We would like symmetrical scaling about 0 on the x-axis. To do this we need to find the largest or smallest value in the dataset and use that to set the limits.
max(abs(df$log2_fc))
will give us this value, we use abs()
so that negative values become positive and we then find the maximum of all values in one direction. We could set this as a value and use that, or we can get ggplot to calculate it on the fly, like this:
ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) + geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") + geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") + geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) + scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) + scale_x_continuous(limits = c(-max(abs(df$log2_fc)),max(abs(df$log2_fc)))) + labs(x = "Fold change (log2)", y = "P-value") + theme_bw(9) + theme(legend.position = "none")
Now for the y-axis. We can fix the labelling like this:
ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) + geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") + geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") + geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) + scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) + scale_x_continuous(limits = c(-max(abs(df$log2_fc)),max(abs(df$log2_fc)))) + scale_y_continuous(limits = c(0,NA), labels = function(i) 10^-i) + labs(x = "Fold change (log2)", y = "P-value") + theme_bw(9) + theme(legend.position = "none")
Adding the log ticks is difficult. We can use annotation_logticks()
and specify sides = "l", outside = TRUE
to give us log ticks in the same place as the original. As default the plot clipping means that they are not visible, so we can add coord_cartesian(clip = "off")
to give us this:
Now we have three problems. The log ticks are the wrong way around (and don’t match the axis), they overlap with the tick labels, and there are ticks beyond 1 which doesn’t make sense. There is no reverse option for annotation_logticks()
, although an issue was filed on GitHub some time ago.
Decision time: how do we solve the y axis problem?
Can we fix this problem? We can go back to the original y data, plot that (rather than the transform) and transform the axis.
library(scales) ggplot(df, aes(x = log2_fc, y = so_allTWave)) + geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") + geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") + geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) + scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) + scale_x_continuous(limits = c(-max(abs(df$log2_fc)),max(abs(df$log2_fc)))) + scale_y_continuous(trans = c("log10", "reverse")) + annotation_logticks(sides = "l", outside = TRUE) + coord_cartesian(clip = "off") + labs(x = "Fold change (log2)", y = "P-value") + theme_bw(9) + theme(legend.position = "none")
Use of scale_y_continuous(trans = c("log10", "reverse"))
works to transform the data correctly. However, annotation_logticks()
is not aware of this transformation so this doesn’t solve the problem. Note that the horizontal line is now misplaced and that would need fixing too.
So, decision time: I will drop the log ticks since it will be too much effort to fix this.
Since I failed with the ticks, let’s fix the gridlines a bit since it doesn’t make sense to have the minor gridlines at a non-integer location. And let’s mimic the Igor style a little closer.
ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) + geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") + geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") + geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) + scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) + scale_x_continuous(limits = c(-max(abs(df$log2_fc)),max(abs(df$log2_fc)))) + scale_y_continuous(limits = c(0,NA), labels = function(i) 10^-i) + labs(x = "Fold change (log2)", y = "P-value") + theme_classic(9) + theme(legend.position = "none")
I will change the scales so they match the original better. Oh, and the x-axis title needs fixing.
ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) + geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") + geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") + geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) + scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) + scale_x_continuous(limits = c(-7.1,7.1), breaks = seq(-6,6,2)) + scale_y_continuous(limits = c(0,4), labels = function(i) 10^-i) + xlab(bquote(Test / Control (Log[2]))) + ylab("P-value") + theme_classic(9) + theme(legend.position = "none")
A final side-by-side
Here is my attempt at a recreation, with the original. It’s reasonably close. Drag the slider to compare.
The aim here was to show the steps needed to get a desired plot result. We were not going for an exact recreation of the original plot, just something close. I wanted to highlight how sometimes things don’t work out and you have to make some choices along the way – the important thing is a) to get a plot, and b) that you are happy with it.
—
The post title comes from “Step By Step” by The Alan Parsons Project from “Eye in the Sky”. Sorry to disappoint anyone who thought I would pick New Kids on The Block or Whitney Houston.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.