Site icon R-bloggers

Randomization thoughts

[This article was first published on distributed ecology, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Le Grand Casino of Monte Carlo
On Monday I’m going to be leading a little stats workshop on randomization tests and null models. In preparation for this I wrote up code for null model examples I wanted to write a post that introduced the basics of these models (Null models, bootstrapping, jack-knifing etc…) that are all specific classes of a general method known as Monte Carlo methods. Put simply, a Monte Carlo method is any approach that generates random numbers and then seeing how different fractions of them behave. Its a powerful method that can be used for a wide variety of situations, and its commonly used for solving complex integrals among other things.

A simple integration example

Let’s start with a trivial example, integrating a function we use all the time as ecologists, the normal distribution. Maybe you want to integrate the normal probability density function (PDF) from -1 to 1, because you’re curious about how likely an event within 1 standard deviation is. To get the area under the curve we simply integrate the PDF from -1 to 1.
MC integration of the normal
PDF between -1 and 1
In this case its silly because we already have an analytical solution, but it can be necessary for more complicated integrals. The simplest method is “hit or miss” integration where we create an x,y grid and sample randomly from it and ask: “Is this random point under my curve or not. To approximate the integral we multiply the fraction of our samples under the curve (fc) by the total area we sampled, A. Using R we can do both the actual integral and the Monte Carlo version easily. The actual answer is 0.682, and the approximate answer I got was 0.688, so pretty close. You can see the full code at this gist.

A simple statistical example
Another place we can use these methods in statistical hypothesis testing. The simplest case is as an alternative to a t-test. Imagine you have a data set with measurements of plant height for the same species in shaded and unshaded conditions. Your data might look like this:
Shaded13.3
Shaded12.1
Shaded 14.7 
Shaded12.8 
Unshaded17.8
Unshaded19.4  
Unshaded18.5 
Unshaded18.5 
An easy way to test this is with a parametric t-test. The Monte Carlo approach involves a several basic steps that are the architecture of any randomization test no matter how complicated. First calculate the true value of your hypothesis. In this case we are interested in if there is a difference between these two groups. We can calculate the mean of the shaded and unshaded group and subtract them. We want to know if this value is significantly different from 0. To determine this we need to construct a null distribution and here is where the Monte Carlo method comes in. If there is no difference between the groups, then which group a given height is associated with shouldn’t matter. So we reshuffle the values of the labels from the existing data set and each time we calculate our test statistic, the difference in the means and store it. This will create a distribution of test statistic values. We then compare the true value to the null distribution. Usually the test being performed is two-sided, so we check against the 95% confidence interval of the null distribution. If the value is within that interval then we will fail to reject the null hypothesis that there is a difference in the two means. Here’s a gist with all the relevant code and some pretty figures in ggplot2.

“All that code for a t-test?”

That must be what you’re thinking, and you’re right, its certainly unwieldy to write all that code for something so simple. But its a good starting point for when we begin talking about null models. You may not have realized it but in the previous example there’s two assumptions behind our inference. The first is that some process or mechanism has caused a difference between our groups. The fact that plants grow to different heights in shaded and unshaded conditions says something about the way plants use light, or the way they compete for light, or maybe some other mechanism I haven’t thought about. The second is that by randomizing our existing data, we can simulate a situation where we have collected the data under completely stochastic conditions, e.g. the process causing plant height is random. So there are our two assumptions: A.) Our data set represents the outcome of some process and B). by randomizing we can create a null model without process to test our own data against. Here’s where things become a bit more gray. In our example above the the null hypothesis is pretty clear, and we can all agree on it, but problems arise with more complicated questions. Traditionally null models have been used to make inferences about community assembly rules. The use of these models was prevalent during the “battles” constituting what is tongue-in-cheek called the “null model wars”. I won’t take up any space rehashing the null model wars of the 70’s and 80’s but links to good resources can be found here on Jeremy Fox’s Oikos blog and his post about refighting the null model wars. Suffice to say careful attention needs to be paid to the selection of the proper null model. Nick Gotelli has lots of good papers about null models if you peruse his work . I’ve worked up several examples from his 2000 and 2010 papers, and sometimes the algorithms can be challenging. I’ll cover more advanced methods in a future post going over some methods from Nick’s papers.
To leave a comment for the author, please follow the link and comment on their blog: distributed ecology.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.