[This article was first published on Gloor lab musings, and kindly contributed to R-bloggers.]
I’m in the throes of submitting a paper on effect sizes in ALDEx2, and so I thought I would take a stab at a nice concise blog post to summarize and simplify.
Effect sizes are standardized mean differences between groups, and are designed to convey what the experimentalist actually wants to know: how different are my groups given the treatment? In contrast, on a very un-nuanced interpretation, P-values measure only whether the observed difference is plausibly due to chance.
More formally, effect sizes measure the standardized mean difference between groups, and are equivalent to a Z-score: how many standard deviations separate the mean of group 1 from the mean of group 2? Not surprisingly, effect size measures are very sensitive to the underlying distribution of the data. In datasets where normality cannot be assumed, effect size measures are very difficult to generate and interpret.
High-throughput sequencing data are often not Normally distributed, as shown in the ‘Group Distribution’ plot, which shows the density of the S:E:W:E feature in the ALDEx2 selex test dataset. The problem is how to find a sensible effect size for these two ugly data distributions.
Enter the ALDEx2 effect size measure, which gives an estimate of the median standardized difference between groups. Note the wording: we take the median of the standardized group differences. How does this work?
If we have two distributions contained in two vectors
a = [a1, a2, … an]
b = [b1, b2, … bm]
then the difference between the two groups is simply a new vector,
diff = a - b.
If the two vectors are different sizes, R will simply recycle the shorter vector. The distribution of the difference vector is shown in the ‘Btw Grp Difference’ density plot.
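As a minimal base-R sketch (with made-up Normal samples standing in for the two groups, not the selex data; the means and standard deviations are arbitrary choices for illustration):

```r
set.seed(1)
# hypothetical stand-ins for the two groups (not the selex S:E:W:E feature)
a <- rnorm(1000, mean = 1, sd = 2)
b <- rnorm(1000, mean = 3, sd = 2)

# equal lengths here, so R's recycling rule is not triggered
diff <- a - b
median(diff)  # non-parametric between-group difference, near -2 for these toy data
```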
Taking the median of diff gives us a non-parametric measure of the between-group difference, and here it is about -2. Note that this is different from asking for the difference between the medians of the two groups: the two measures coincide for symmetric distributions, but for skewed data they can differ even in very large datasets.
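The distinction matters for skewed data. A small sketch with hypothetical exponential samples (rates chosen only for illustration) shows that the median of the differences and the difference of the medians need not agree:

```r
set.seed(7)
# hypothetical skewed groups; parameters are arbitrary illustration values
a <- rexp(20000, rate = 1)    # population median log(2), about 0.69
b <- rexp(20000, rate = 0.5)  # population median about 1.39

median(a - b)          # median of the elementwise differences
median(a) - median(b)  # difference of the group medians; noticeably different here
```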
How do we scale this? One way would be to estimate a summary statistic of dispersion, such as the pooled median absolute deviation, but this would not use all of the information in the distributions and would make additional assumptions. So what we do in ALDEx2, of course, is to generate a dispersion vector,
disp = max(a-Pa, b-Pb),
where P denotes a random permutation of vector a or b. The distribution of the dispersion is shown in the ‘Win Grp Dispersion’ density plot. Taking the median of disp gives a non-parametric measure of the dispersion, here just over 2. If the dispersions of the two vectors are roughly equal, then the estimate is derived from both jointly; if they are not, the estimate is skewed toward being drawn from the more disperse vector.
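A hedged base-R sketch of this step, on toy Normal samples rather than the selex data: here max is read as the elementwise maximum (pmax) and P as a random shuffle via sample(), which is one plausible reading of the formula; the details inside ALDEx2's implementation may differ.

```r
set.seed(1)
# hypothetical groups for illustration only
a <- rnorm(1000, mean = 1, sd = 2)
b <- rnorm(1000, mean = 3, sd = 2)

# sample(x) returns a random permutation of x;
# pmax takes the elementwise maximum of the two within-group difference vectors
disp <- pmax(a - sample(a), b - sample(b))
median(disp)  # non-parametric within-group dispersion
```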
So now we have what we need.
eff = median(diff/disp),
and we have a nice clean summary effect size statistic that makes no assumptions about the distribution of the data. The distribution of effect sizes is shown in the ‘Effect size’ density plot; the median value is about 0.9 in this example.
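Putting the pieces together as a self-contained base-R sketch on toy Normal data (hypothetical values; max read as elementwise pmax and P as sample() — this is an illustration of the formulas above, not the actual ALDEx2 implementation, which works per feature across Monte Carlo instances):

```r
set.seed(1)
a <- rnorm(1000, mean = 1, sd = 2)  # toy group 1
b <- rnorm(1000, mean = 3, sd = 2)  # toy group 2

diff <- a - b                               # between-group difference
disp <- pmax(a - sample(a), b - sample(b))  # within-group dispersion
eff  <- median(diff / disp)                 # standardized effect size
eff                                         # negative: group 1 sits below group 2
```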
Modelling shows that eff is approximately 1.4 times Cohen’s d, as expected for a well-behaved non-parametric estimator, and importantly it gives sensible answers when the underlying data are drawn from Normal, uniform, skewed, Cauchy, or multimodal distributions.
These measures are what you see when you look at the aldex.effect output:
diff.btw = median(diff)
diff.win = median(disp)
effect = eff
ALDEx2 will even give you the 95% confidence interval of the effect size estimate if you ask it; try aldex.effect(x, CI=TRUE).