Nathan Danneman (a co-author and one of my graduate students from Emory) recently sent me a New Yorker article from 2010 about the “decline effect,” the tendency for initially promising scientific results to get smaller upon replication. Wikipedia can summarize the phenomenon as well as I can:
In his article, Lehrer gives several examples where the decline effect allegedly shows up. The first concerns the development of second-generation anti-psychotic drugs: the initial tests demonstrated a dramatic decrease in subjects’ psychiatric symptoms, but as the tests were repeated the effect declined, and in the end it was not possible to document that these drugs worked any better than first-generation anti-psychotics.
Experiments by Jonathan Schooler sought to show that people who described their memories were less able to recall them than people who did not. His first experiments were positive, supporting his theory of verbal overshadowing, but repeated studies showed a significant decline in the effect.
In 1991, Danish zoologist Anders Møller discovered a connection between symmetry and the sexual preferences of females in nature. This sparked huge interest in the topic, and a lot of follow-up research was published. In the three years following the original discovery, 90% of studies confirmed Møller’s hypothesis. However, the same outcome was reported in just four of eight research papers in 1995, and in only a third of studies over the next three years.
Why would a treatment that shows a huge causal effect in an experiment seem to get weaker when that experiment is repeated later on? “‘This was profoundly frustrating,’ he [Schooler] says. ‘It was as if nature gave me this great result and then tried to take it back.’”
The cosmos may be indifferent to our plight, but I don’t think it’s actually vindictive (or at least does not express its malice through toying with our experimental results). PZ Myers proposes multiple, less vindictive explanations; two of them make a great deal of sense to me.
Regression to the mean: As the number of data points increases, we expect the average values to regress to the true mean…and since often the initial work is done on the basis of promising early results, we expect more data to even out a fortuitously significant early outcome.
The file drawer effect: Results that are not significant are hard to publish, and end up stashed away in a cabinet. However, as a result becomes established, contrary results become more interesting and publishable.
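To see the first of these mechanisms in miniature, here is a quick R sketch (my own illustration, not Myers’ or Lehrer’s): if only the studies with unusually promising pilot estimates get follow-ups, the follow-up estimates will, on average, fall back toward the true effect. The effect size, sample sizes, and “promising” cutoff below are arbitrary choices for illustration.
# a toy illustration of regression to the mean under selection on early results
# (effect size, sample sizes, and the "promising" cutoff are arbitrary)
set.seed(101)
true.effect <- 0.5
pilot <- replicate(2000, mean(rnorm(20, mean=true.effect, sd=4)))      # small pilot studies
followup <- replicate(2000, mean(rnorm(200, mean=true.effect, sd=4)))  # larger follow-up studies
promising <- pilot > 1.5         # only "exciting" pilot results get followed up
mean(pilot[promising])           # inflated initial estimate
mean(followup[promising])        # follow-up regresses toward the true effect of 0.5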
These are common, well-known, and well-understood phenomena. But as far as I know, no one has really tried to formally assess their impact or to propose any kind of diagnostic for how susceptible a particular result is to these threats to inference.
Let’s start with a simple example. Suppose that the data generating process is $y = 0.5x + \epsilon$, where $x \sim \text{Uniform}(0, 1)$ and $\epsilon \sim N(0, 4^2)$. If we repeatedly generate data sets of size 1000 out of this DGP, run the appropriate linear model $y = \alpha + \beta x + \epsilon$ on each one, and save only those estimated coefficients $\hat{\beta}$ that are statistically significant in a one-tailed test at the 0.05 level, $\hat{\beta}/\widehat{\text{se}}(\hat{\beta}) > 1.645$, what does the distribution of those saved coefficients look like?
# what the distribution of statistically significant results looks like
set.seed(23409)
all.beta <- rep(NA, 5000)   # every estimated slope
sig.beta <- rep(NA, 5000)   # only the statistically significant slopes
for(i in 1:5000){
  x <- runif(1000)
  y <- 0.5*x + rnorm(1000, mean=0, sd=4)
  coefs <- summary(lm(y ~ x))$coefficients
  all.beta[i] <- coefs[2, 1]   # estimated slope
  # keep the slope only if its t-statistic clears the one-tailed 0.05 threshold
  sig.beta[i] <- ifelse(coefs[2, 3] > qnorm(0.95), coefs[2, 1], NA)
}
hist(sig.beta, xlim=c(0, 2.5), ylim=c(0, 400),
     xlab=expression(paste("Estimated Coefficient ", hat(beta))),
     main=c("Distribution of Statistically", "Significant Coefficients, beta = 0.5"))
abline(v=0.5, lty=2)   # the true coefficient
mean(sig.beta, na.rm=TRUE)
And what do we get?
In short, we find that none of the statistically significant results are anywhere near the actual coefficient of 0.5. In fact, the statistically significant coefficients are biased upward (their mean is 1.008 in this simulation). This makes sense: only the largest estimated slopes are capable of overcoming the intrinsic noise in the DGP and reaching statistical significance at this sample size (1000).
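A rough back-of-the-envelope calculation (mine, not from the original post) shows where the cutoff lies: with a Uniform(0, 1) regressor and an error standard deviation of 4, the standard error of the estimated slope at n = 1000 is about 0.44, so only estimates above roughly 1.645 × 0.44 ≈ 0.72 can clear the one-tailed significance bar.
# approximate standard error of the slope and the implied significance cutoff
n <- 1000
sigma <- 4                               # error sd in the DGP
sd.x <- sqrt(1/12)                       # sd of a Uniform(0,1) regressor
se.beta <- sigma / (sd.x * sqrt(n - 1))  # approximate standard error of beta-hat
se.beta                                  # about 0.44
qnorm(0.95) * se.beta                    # about 0.72, the smallest significant slope estimate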
What does this mean? Well… the estimator is not itself intrinsically biased: if you plotted all the coefficients from our 5000 simulated samples, they would be normally distributed around the true value of 0.5 with the appropriate variance. But we’re not talking about the distribution of an estimator given a true value, $\hat{\beta} \mid \beta = 0.5$; we’re talking about the distribution of scientifically notable, publishable results, $\hat{\beta} \mid \beta = 0.5,\ \hat{\beta}/\widehat{\text{se}}(\hat{\beta}) > 1.645$. This is the distribution of results we expect to see in journal articles and in the media. And that distribution is biased, because the scientific review process requires that results reach a certain signal-to-noise ratio (viz., a p-value smaller than 0.05) before they deserve scientific attention: