A guide to designs and contrasts in DESeq2
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Many statistical analysis packages in R utilize design matrices for setting up comparisons between data subsets. The two Bioconductor packages most commonly used for transcriptomics data analysis, DESeq2 and limma, are no exception. Although the design matrices and contrasts are intuitive to understand for simple cases, things can get confusing when more complex multi-factorial studies are involved. You can find dozens of questions asked on Biostars and Bioconductor forums seeking for help in this matter.
To be fair, the DESeq2 and limma vignettes have dedicated sections explaining designs and contrasts, but I found these not very easy to follow the first time I saw them. In this note-to-self (and to-my-students) post, I intend to explain how to construct designs in various study contexts and access specific comparisons of interest using DESeq2. I benefited greatly from the DESeq2 vignette itself and the tutorial written by Hugo Tavares. The visuals embedded in this post are adapted from the accompanying slide show to Hugo Tavarez’s tutorial.
I will start with the simplest case where there are only two groups to be compared (i.e. one factor, two levels). Here, I will discuss different ways of constructing the design matrix and accessing the results. Then, I will exemplify a scenario with three groups (i.e. one factor, three levels) apply the approaches we discussed in the first case. Lastly, we will see an example involving 4 groups (i.e. two factors, two levels each). So, here goes nothin’…
1 One experimental condition, two groups
This is a classic case where you might have a treatment group and a control group, and you are interested in figuring out how the treatment affects the gene expression. This type of data is referred to in vignettes as “one factor two levels”.
1.1 Simulate data
Code
# for reproducibility set.seed(1) suppressPackageStartupMessages(library(DESeq2)) # simulate data dds1 <- makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 2) dds1$condition <- factor(rep(c("treatment", "control"), each = 3)) # observe the samples and groups colData(dds1)
DataFrame with 6 rows and 1 column condition <factor> sample1 treatment sample2 treatment sample3 treatment sample4 control sample5 control sample6 control
1.2 Contrasts from design with intercept
When we use the formula ~ condition
the model matrix is set up with an intercept, that is, the first level of the condition
is considered as the reference group. Since R orders factor levels based on the alphabetical order, in our case, control will be the reference. The order can be explicitly changed using relevel()
or factor(..., levels=c(...))
, if needed.
Here, the gene expression is modeled as seen below where is the intercept and is the coefficient for the treatment’s effect:
We can directly get the results we are interested in using the contrasts
argument as seen below. contrasts
argument can accept a number of inputs:
A character vector with exactly three elements: the name of a factor in design formula (eg. “condition” in our case), the name of the numerator (ie. group of interest, “treatment” in our case), and the name of the denominator (ie. reference group, “control” in our case). This would look like this for our example:
c("condition", "treatment", "control"))
A list of two character vectors: the names of the fold changes for the numerator, and the names of fold changes for the denominator. This approach allows you to group comparisons together in the numerator and denominator for extracting specific comparisons (more below). In our case we only have one comparison, which is
condition_treatment_vs_control
, so we will provide it alone aslist(condition_treatment_vs_control)
. Note that these names came from thecolData
indds
object automatically.
Code
# design with intercept (control group is the intercept, or the reference) design(dds1) <- ~ condition # note the zeros for under treatment for samples 4-6 model.matrix(~condition, colData(dds1))
(Intercept) conditiontreatment sample1 1 1 sample2 1 1 sample3 1 1 sample4 1 0 sample5 1 0 sample6 1 0 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$condition [1] "contr.treatment"
Code
# differential expression analysis dds1 <- DESeq(dds1) # see the comparisons resultsNames(dds1)
[1] "Intercept" "condition_treatment_vs_control"
Code
res <- results(dds1, contrast = list("condition_treatment_vs_control")) # res <- results(dds1, contrast = c("condition", "treatment", "control")) # alternatively # res
1.3 Contrasts from design without intercept
Sometimes it is easier to work with no-reference designs. You can tell the algorithm not to have an intercept (ie. control group) by using ~ 0 + condition
formula. The way you specify the contrast
argument will change, but the results will be identical with the first case. You may not see the benefit of using this approach in this simple example, but when multiple factors and factor levels are involved, it may be easier to construct no-intercept designs.
This is how a zero-intercept model looks like:
NOTE: When you have two factors in your data and use no-intercept design, you will have no reference group for the first factor (as seen here), but for the second factor a reference group is still defined. Although this is sensible, I find it somewhat confusing when it comes to specifying contrasts for specific pairwise comparisons. But the next section will help out with this.
Code
dds2 <- dds1 # design without intercept (no reference) design(dds2) <- ~ 0 + condition # note the zeros for under treatment for samples 4-6 model.matrix(~0+condition, colData(dds2))
conditioncontrol conditiontreatment sample1 0 1 sample2 0 1 sample3 0 1 sample4 1 0 sample5 1 0 sample6 1 0 attr(,"assign") [1] 1 1 attr(,"contrasts") attr(,"contrasts")$condition [1] "contr.treatment"
Code
# differential expression analysis dds2 <- DESeq(dds2) # see the comparisons resultsNames(dds2)
[1] "conditioncontrol" "conditiontreatment"
Code
res <- results(dds2, contrast = list("conditiontreatment", "conditioncontrol")) # res
Above you see that we are providing list("conditiontreatment", "conditioncontrol")
to contrasts
argument. This way, we are telling the algorithm to take fold changes of differentially expressed genes in the conditiontreatment
(differential expression is measured against the hypothetical zero) and divide that by the log fold changes of conditioncontrol
. Thus you get the treatment vs control:
The results should be equivalent between the first the second design. Conceptually, the zero intercept model in DESeq2 does not compare conditions against each other, instead, each condition’s coefficient represents its own log2-fold expression estimate relative to zero.
1.4 A new approach for defining contrasts
It was easy to specify the contrasts in the examples above, but Mike Love and Hugo Tavares shares an interesting way of generalizing this concept, which is especially useful for more complex designs. This approach works for designs with or without an intercept. This approach involves three steps:
- Getting the model matrix
- Subsetting the matrix for each group of interest and calculate its column means - this results in a vector of coefficients for each group
- Subtracting the group vectors from each other according to the comparison we’re interested in
Code
# get the model matrix mod_mat <- model.matrix(design(dds1), colData(dds1)) mod_mat
(Intercept) conditiontreatment sample1 1 1 sample2 1 1 sample3 1 1 sample4 1 0 sample5 1 0 sample6 1 0 attr(,"assign") [1] 0 1 attr(,"contrasts") attr(,"contrasts")$condition [1] "contr.treatment"
Code
control <- colMeans(mod_mat[dds1$condition == "control", ]) control
(Intercept) conditiontreatment 1 0
Code
treatment <- colMeans(mod_mat[dds1$condition == "treatment", ]) treatment
(Intercept) conditiontreatment 1 1
Code
# The comparison of interest treatment - control
(Intercept) conditiontreatment 0 1
Code
res <- results(dds1, contrast = treatment - control)
The code may look overly complex to get a simple numeric vector to specify groups to compare. In this example, it is maybe so. But when there is a more complex experimental setup and you are comparing various subgroups, this can be helpful. Hugo Tavares created a function in a GitHub Gist to facilitate contrasts. But here I wanted to improve on that to avoid creation of unneccessary objects in the global environment and allow more complex contrasts.
The contraster()
function below allows the user to extract contrasts from the DESeq object after specifying custom groups. group1
and group2
arguments can take lists of character vectors each having 2 or more items. The first item of the character vector should point to one of the columns in the colData(dds)
, that is the grouping variable. The second item of the character vector (and onwards) should be the sample subgroups to be included in the analysis. The differences are calculating subtracting group2 from group1. Read on for a discussion on the weighted
argument.
Code
contraster <- function(dds, # should contain colData and design group1, # list of character vectors each with 2 or more items group2, # list of character vectors each with 2 or more items weighted = F ){ mod_mat <- model.matrix(design(dds), colData(dds)) grp1_rows <- list() grp2_rows <- list() for(i in 1:length(group1)){ grp1_rows[[i]] <- colData(dds)[[group1[[i]][1]]] %in% group1[[i]][2:length(group1[[i]])] } for(i in 1:length(group2)){ grp2_rows[[i]] <- colData(dds)[[group2[[i]][1]]] %in% group2[[i]][2:length(group2[[i]])] } grp1_rows <- Reduce(function(x, y) x & y, grp1_rows) grp2_rows <- Reduce(function(x, y) x & y, grp2_rows) mod_mat1 <- mod_mat[grp1_rows, ,drop=F] mod_mat2 <- mod_mat[grp2_rows, ,drop=F] if(!weighted){ mod_mat1 <- mod_mat1[!duplicated(mod_mat1),,drop=F] mod_mat2 <- mod_mat2[!duplicated(mod_mat2),,drop=F] } return(colMeans(mod_mat1)-colMeans(mod_mat2)) }
We can use this new function to get the exact results from before:
Code
res <- results(dds1, contrast = contraster(dds1, group1 = list(c("condition", "treatment")), group2 = list(c("condition", "control"))))
So far we have learned a simple two-group analysis. We will next use these in more complex experimental contexts.
2 One experimental condition, three levels
This is a classic case when you have a control group and two treatment groups (e.g. treatmentA and treatmentB). From now on, I won’t be Let’s see how things work.
2.1 Simulate data
In this design the gene expression is modeled as follows where is the intercept, is the coefficient (effect) of treatmentA, and the is the coefficient of treatmentB:
Code
# simulate data dds3 <- makeExampleDESeqDataSet(n = 1000, m = 9, betaSD = 2) dds3$condition <- factor(rep(c("control", "treatmentA", "treatmentB"), each = 3)) design(dds3) <- ~ condition dds3 <- DESeq(dds3) # check the coefficients estimated by DEseq resultsNames(dds3)
[1] "Intercept" "condition_treatmentA_vs_control" [3] "condition_treatmentB_vs_control"
2.2 Getting contrasts
Now we have three coefficients: 1) control group as intercept, 2) TreatmentA vs control, 3) TreatmentB vs control. We can get different results as follows:
Code
# treatmentA vs control res1 <- results(dds3, contrast = list("condition_treatmentA_vs_control")) # treatmentB vs control res2 <- results(dds3, contrast = list("condition_treatmentB_vs_control")) # treatmentA vs treatmentB res3 <- results(dds3, contrast= list("condition_treatmentA_vs_control", "condition_treatmentB_vs_control"))
Or we can alternatively use our new function as well:
Code
# Alternatively we can use our new function res1 <- results(dds3, contrast = contraster(dds3, group1 = list(c("condition","treatmentA")), group2 = list(c("condition", "control")))) res2 <- results(dds3, contrast = contraster(dds3, group1 = list(c("condition","treatmentB")), group2 = list(c("condition", "control")))) res3 <- results(dds3, contrast = contraster(dds3, group1 = list(c("condition","treatmentA")), group2 = list(c("condition", "treatmentB"))))
This function also allows us to perform more customized analyses. For instance, if you are trying to compare treatment groups as a whole (treatmentA and treatmentB combined) against reference, you would need to write group1 = list(c("condition", "treatmentA", "treatmentB"))
and group2 = list(c("condition", "control")
. The analysis will be performed by subtracting group2 from group1, that is you are comparing the average of the treatment groups to the control.
Code
res4 <- results(dds3, contrast = contraster(dds3, group1 = list(c("condition", "treatmentA", "treatmentB")), group2 = list(c("condition", "control"))))
Let’s see how the contrast is set up in this approach. As you can see below, each treatment group has a weight of 0.5, that is, their average is being compared against the control group.
Code
contraster(dds3, group1 = list(c("condition", "treatmentA", "treatmentB")), group2 = list(c("condition", "control")))
(Intercept) conditiontreatmentA conditiontreatmentB 0.0 0.5 0.5
If the groups were unbalanced, by setting weighted = TRUE
, you can assign weights according to group sizes. However, there are GitHub issues on the matter (1, 2) pointing out that this approach may be flawed where replicates don’t occur in all subgroups and numbers of replicates are different. You can see the behavior of contraster function by creating an unbalanced data set:
Code
# Convert the last control to treatmentA dds3$condition[3] <- "treatmentA" contraster(dds3, group1 = list(c("condition", "treatmentA", "treatmentB")), group2 = list(c("condition", "control")), weighted = TRUE)
(Intercept) conditiontreatmentA conditiontreatmentB 0.0000000 0.5714286 0.4285714
3 Two factors with interaction
In some experiments grouping variables may interact with one another. An example for this can be that a treatment affects males and females differently. In this case we are looking at something like this:
3.1 Simulate data
Code
dds4 <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 2) dds4$sex <- factor(rep(c("female", "male"), each = 6)) dds4$condition <- factor(rep(c("treatment", "control"), 6)) dds4 <- dds4[, order(dds4$sex, dds4$condition)] colnames(dds4) <- paste0("sample", 1:ncol(dds4)) design(dds4) <- ~ sex + condition + sex:condition dds4 <- DESeq(dds4) resultsNames(dds4)
[1] "Intercept" "sex_male_vs_female" [3] "condition_treatment_vs_control" "sexmale.conditiontreatment"
3.2 Getting contrasts
3.2.1 Male vs female (in the control):
Code
res1 <- results(dds4, contrast = contraster(dds4, group1 = list(c("sex", "male"), c("condition", "control")), group2 = list(c("sex", "female"), c("condition", "control")))) # or equivalently res2 <- results(dds4, contrast = list("sex_male_vs_female")) # head(res1,3); head(res2, 3)
3.2.2 Male vs female (in the treatment):
Code
res1 <- results(dds4, contrast = contraster(dds4, group1 = list(c("sex", "male"), c("condition", "treatment")), group2 = list(c("sex", "female"), c("condition", "treatment")))) # or equivalently res2 <- results(dds4, contrast = list(c("sex_male_vs_female", "sexmale.conditiontreatment"))) # head(res1,3); head(res2, 3)
3.2.3 Treatment vs control (for females):
Code
res1 <- results(dds4, contrast = contraster(dds4, group1 = list(c("sex", "female"), c("condition", "treatment")), group2 = list(c("sex", "female"), c("condition", "control")))) # or equivalently res2 <- results(dds4, contrast = list(c("condition_treatment_vs_control"))) # head(res1,3); head(res2, 3)
3.2.4 Treatment vs control (for males):
Code
res1 <- results(dds4, contrast = contraster(dds4, group1 = list(c("sex", "male"), c("condition", "treatment")), group2 = list(c("sex", "male"), c("condition", "control")))) # or equivalently res2 <- results(dds4, contrast = list(c("condition_treatment_vs_control", "sexmale.conditiontreatment"))) head(res1,3); head(res2, 3)
log2 fold change (MLE): 0,0,+1,+1 Wald test p-value: 0,0,+1,+1 DataFrame with 3 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> gene1 0.381213 -1.489653 4.130413 -0.360655 0.718358 0.999452 gene2 17.028473 -0.271556 0.771766 -0.351863 0.724941 0.999452 gene3 84.349943 0.205398 0.685823 0.299491 0.764566 0.999452
log2 fold change (MLE): condition_treatment_vs_control+sexmale.conditiontreatment effect Wald test p-value: condition_treatment_vs_control+sexmale.conditiontreatment effect DataFrame with 3 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> gene1 0.381213 -1.489653 4.130413 -0.360655 0.718358 0.999452 gene2 17.028473 -0.271556 0.771766 -0.351863 0.724941 0.999452 gene3 84.349943 0.205398 0.685823 0.299491 0.764566 0.999452
3.2.5 Interaction between sex and condition (i.e. do males and females respond differently to the treatment?)
This will be a little more wordy, because we are trying to get the “difference of the differences”. For this analysis, we are interested in genes that are differentially expressed in males in treatment condition, but not in the control condition (ie. genes uniquely changing in males due to treatment). In other words, we are isolating by doing the following:
Code
# male_vs_female in the treatment condition diff1 <- contraster(dds4, group1 = list(c("sex", "male"), c("condition", "treatment")), group2 = list(c("sex", "female"), c("condition", "treatment"))) # male_vs_female in the control condition diff2 <- contraster(dds4, group1 = list(c("sex", "male"), c("condition", "control")), group2 = list(c("sex", "female"), c("condition", "control"))) res1 <- results(dds4, contrast = diff1-diff2) # or equivalently res2 <- results(dds4, contrast = list("sexmale.conditiontreatment")) # head(res1,3) # head(res2, 3)
4 Multiple factors multiple levels
Below is a more complex example involving an experimental condition (control, treatmentA, treatmentB), two sexes, and two batches.
4.1 Simulate data
Code
# simulate data dds <- makeExampleDESeqDataSet(n = 1000, m = 24, betaSD = 2) dds$condition <- factor(rep(c("control", "treatmentA", "treatmentB"), each = 8)) dds$sex <- factor(rep(c("male","male","male", "female", "female", "female"), 4)) dds$batch <- factor(rep(c(1,2), 12)) design(dds) <- ~ condition + sex + batch dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Code
# check the coefficients estimated by DEseq resultsNames(dds)
[1] "Intercept" "condition_treatmentA_vs_control" [3] "condition_treatmentB_vs_control" "sex_male_vs_female" [5] "batch_2_vs_1"
Code
contraster(dds, group1=list(c("batch", 2), c("sex", "male"), c("condition", "treatmentB")), group2=list(c("batch", 1), c("sex", "female"), c("condition", "treatmentA")))
(Intercept) conditiontreatmentA conditiontreatmentB sexmale 0 -1 1 1 batch2 1
Code
contraster(dds, group1=list(c("batch", 1), c("sex", "male"), c("condition", "treatmentB", "treatmentA")), group2=list(c("batch", 1), c("sex", "male"), c("condition", "control")))
(Intercept) conditiontreatmentA conditiontreatmentB sexmale 0.0 0.5 0.5 0.0 batch2 0.0
Code
contraster(dds, group1=list(c("batch", 1), c("sex", "male"), c("condition", "treatmentB", "treatmentA")), group2=list(c("batch", 1), c("sex", "male"), c("condition", "control")), weighted = T)
(Intercept) conditiontreatmentA conditiontreatmentB sexmale 0.0 0.6 0.4 0.0 batch2 0.0
5 Further reading
- https://www.biostars.org/p/395926/
- https://genomicsclass.github.io/book/pages/expressing_design_formula.html
- https://genomicsclass.github.io/book/pages/interactions_and_contrasts.html
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7873980/
- https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/designmatrices.html
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.