I wrote a short post yesterday about a huge attempt to answer the question, “What proportion of results reported in psychology journals can be repeated?” This was a massive study in which dozens of psychology researchers simply went and repeated 100 studies published in three respectable experimental psychology journals: Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition. The full paper, along with a one-page summary, is published in Science (see reference and free download below); the authors call themselves the “Open Science Collaboration” (OSC). There’s also a summary piece in the New York Times, a sub-article highlighting three famous studies that couldn’t be repeated by the OSC (including one on free will, which I wrote about yesterday), and a newer op-ed in the Times arguing that this failure to replicate doesn’t constitute a scientific crisis, but simply shows science behaving as it should: always scrutinizing whether published results are reliable.
Even before this paper was published, I argued that people should do in biology what these folks did in psychology: test experimental results that are impressive but rarely repeated. In psychology, as in evolutionary biology and ecology, significant findings aren’t often repeated, for doing so takes hard-to-come-by money and a concerted effort— an effort that isn’t rewarded. (You don’t get much naches or professional advancement by simply repeating someone else’s work.) Further, in organismal biology (and presumably in experimental psychology), work isn’t often repeated as the normal by-product of building on previous results, as it is in molecular biology: if you want to use new gene-replacement methods, for example, you are obliged to indirectly replicate other people’s protocols before you can begin to insert your own favorite gene.
It’s thus been my contention that about half of published studies in my own field (I include ecology along with evolution) would probably not yield the same results if they were replicated. I’m excluding those studies that use genetics, as genetic work is easily repeated, particularly if it involves sequencing DNA.
Failures to repeat a published result don’t mean that the experimenters cheated, or even that the work was faulty. They could mean, for instance, that the results are peculiar to a particular location, time, or experimental setup, or that there’s a publication bias towards impressive results, so that only studies with highly statistically significant results get published. Finally, given the conventional probability cutoff of 0.05, 5% of all experiments testing a true null hypothesis will nevertheless yield a significant deviation from chance, leading researchers to reject that null hypothesis even though it’s true.
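That last point is easy to see with a quick simulation. Here’s a minimal sketch in Python, using a fair coin as a stand-in for an experiment in which the null hypothesis (“no effect”) is actually true; the sample size and number of trials are arbitrary choices of mine:

```python
import random

random.seed(1)

def experiment(n=1000):
    """Flip a fair coin n times; return True if the deviation from
    50/50 looks 'significant' at the two-sided 5% level (|z| > 1.96),
    using the normal approximation to the binomial."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    z = (heads - n * 0.5) / (n * 0.25) ** 0.5
    return abs(z) > 1.96

trials = 10_000
false_positives = sum(experiment() for _ in range(trials))
print(false_positives / trials)  # roughly 0.05
```

Even though the coin is fair in every trial, about one experiment in twenty crosses the significance threshold, which is exactly what the 0.05 cutoff means.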
On to the experiment. The OSC decided to finally test reproducibility in a quite rigorous way. A whole group of people agreed to test a passel of papers taken from three prominent journals, winding up with exactly 100 replicated experiments. To enforce rigor, they replicated only the last study in each paper (so that they weren’t just replicating preliminary results, which are often reported first), and then did each replication, as far as they could, in a way identical to the original study—with the exception that sometimes they had higher sample sizes, giving them even greater power to detect effects.
To the credit of the original authors, they provided the OSC team with complete data and details of their experiments, ensuring that the replications were as close as possible in design to the original studies. There were many other controls as well, including the use of statisticians to independently verify the probability values for the replication experiments.
All the original studies had results that were statistically significant, with p values (i.e., the chance of getting the observed effect as a mere statistical outlier when there was no real effect) below 5% (a few were just a tad higher). When the chance of getting a false positive is 0.05 or less, researchers generally consider the result “statistically significant,” which is a key to getting your paper published. That cutoff, of course, is arbitrary, and is far lower in areas like physics: claiming detection of the Higgs boson required meeting the “five sigma” standard, a p value of roughly 3 × 10⁻⁷.
So what happened when those 100 psychology studies were replicated? The upshot was that most of the significant results became nonsignificant, and the effects that were found, even when in the same direction, dropped to about half the size of the effects reported in the original papers. Here are the salient results:
- Only 35 of the original 100 experiments produced statistically significant results upon replication (62 did not, and three were excluded). In other words, under replication with near-identical conditions and often larger samples, only 35% of the original findings were judged significant.
- That said, many (but not nearly all) of the results were in the same direction as those seen in the original studies, but weren’t large enough to achieve statistical significance. If the replications had been the original papers, most of them probably wouldn’t have been published.
Here’s a chart comparing the effect sizes in the original papers with those in the replicates. Each dot plots the size of the effect seen in the replicate (Y axis) against the effect size for the same study in the original paper (X axis). If a dot is green, the replicate was also statistically significant (as were virtually all results in the original studies). Pink dots mean that the replicate study did not yield statistically significant results. This shows that effect sizes were generally lower than those of the original studies (most points fall below the diagonal line), and most of the replicates (62%, to be precise) did not show significant effects.
The chart also shows that the larger the effects observed in the original study, the more likely they were to replicate, for the pink dots are clustered on the left side of the graph, where the original effect sizes (normalized) are small. This goes along with the investigators’ findings that the lower the p value seen in the original experiment, and thus the more significant the result, the more likely it was to also be significant in the replicate.
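That relationship between the original p value and the chance of replicating is what you’d expect if true effect sizes vary across studies, and it can be illustrated with a toy simulation. Everything specific here—the sample size of 50, the exponential spread of true effects, the z-test—is my own assumption for illustration, not anything from the OSC study:

```python
import math
import random

random.seed(3)

def z_test(true_effect, n=50):
    """One-sample z statistic for n unit-variance observations
    drawn around true_effect."""
    mean = sum(random.gauss(true_effect, 1.0) for _ in range(n)) / n
    return mean * math.sqrt(n)

published = []  # (original z statistic, did the replication reach z > 1.96?)
while len(published) < 2000:
    effect = random.expovariate(5.0)   # hypothetical spread of true effect sizes
    z_orig = z_test(effect)
    if z_orig > 1.96:                  # only 'significant' originals get published
        z_rep = z_test(effect)         # independent replication, same design
        published.append((z_orig, z_rep > 1.96))

# Replication rates for strongly vs. marginally significant originals
strong   = [rep for z, rep in published if z > 3.0]
marginal = [rep for z, rep in published if z <= 3.0]
print(sum(strong) / len(strong), sum(marginal) / len(marginal))
```

In this sketch, originals that cleared the significance bar by a wide margin tend to have larger true effects, and so replicate at a much higher rate than originals that just scraped past the cutoff—mirroring the pattern the OSC reported.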
- While most of the results of replications were in the same direction as the original study, an appreciable number (I count about 20%) were close to showing either the opposite direction or no effect at all. And remember, even if there is no real effect in the original study, half of the replications will, by chance alone, be in the same direction as in the original study.
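That “same direction by chance alone” claim is easy to check with a quick simulation: when the true effect is zero, the original estimate and the replicate estimate are independent noise, so their signs agree about half the time.

```python
import random

random.seed(2)

def noisy_effect():
    """Estimated effect size when the true effect is zero: pure noise."""
    return random.gauss(0.0, 1.0)

trials = 10_000
# Does an independent 'replicate' point the same way as the 'original'?
same_direction = sum(
    (noisy_effect() > 0) == (noisy_effect() > 0) for _ in range(trials)
)
print(same_direction / trials)  # close to 0.5
```

So directional agreement alone is weak evidence of a real effect; it’s the size and significance of the replicated effect that matter.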
- The OSC team also asked each team doing a replication whether they considered that their results actually replicated those of the original paper. This assessment was subjective, but mirrored the results based on p-value significance: only 39% of investigators concluded that their results replicated those of the original study.
- Finally, it’s possible that many of the p values in replications came close to the magic p = 0.05 cutoff point, which of course is more or less an arbitrary threshold for significance. To see if that was the case, the authors did a density plot of p values in the original papers versus those found in the replicates. Here are the results, with p values from original studies on the left and from the replicates on the right.
As you can see, the p values for replications were distributed widely, and so were not hovering somewhere near the magic cutoff value for significance (0.05). Of course, all the p values in the original studies (left) were at or below that level of significance, or they wouldn’t have been published.
What does it all mean?
There are two diametric views about how to take this general failure to replicate. The first is to celebrate this as a victory for science. After all, science is about continually testing its own conclusions, and you can only do that by trying to see if what other people found out is really right. This, in fact, is the conclusion the authors come to. I quote from their paper:
Scientific progress is a cumulative process of uncertainty reduction that can only succeed if science itself remains the greatest skeptic of its explanatory claims.
The present results suggest that there is room to improve reproducibility in psychology. Any temptation to interpret these results as a defeat for psychology, or science more generally, must contend with the fact that this project demonstrates science behaving as it should. Hypotheses abound that the present culture in science may be negatively affecting the reproducibility of findings. An ideological response would discount the arguments, discredit the sources, and proceed merrily along. The scientific process is not ideological. Science does not always provide comfort for what we wish to be; it confronts us with what is. Moreover, as illustrated by the Transparency and Openness Promotion (TOP) Guidelines, the research community is taking action already to improve the quality and credibility of the scientific literature.
We conducted this project because we care deeply about the health of our discipline and believe in its promise for accumulating knowledge about human behavior that can advance the quality of the human condition. Reproducibility is central to that aim. Accumulating evidence is the scientific community’s method of self-correction and is the best available option for achieving that ultimate goal: truth.
The “all is well in science” interpretation is also that pushed by Lisa Feldman Barrett in her new NYT op-ed about the study, “Psychology is not in crisis.” (Barrett is a professor of psychology at Northeastern University.) But her piece is a mess, comparing failure of psychology-study replication to changing the environment in which a gene is expressed. In some environments, she says, a gene producing curly wings makes the wings less curly, a common phenomenon that we geneticists call “variable expressivity”. And that’s indeed the case, but it doesn’t mean that the “Curly” mutation doesn’t cause the wings to become curled—something she implies. Variable expressivity is not a failure to replicate the finding that a particular genic lesion is responsible for curly wings.
Barrett also compares the OSC study’s failure to replicate to other studies in which failure to replicate depends on “context” (e.g., mice given shocks when they hear a sound develop a Pavlovian response), so that one doesn’t see the same results under different conditions (mice won’t develop the Pavlovian response if they’re strapped down when shocked). But that, like the curly-wing result, is irrelevant to the OSC’s efforts, which tried to ensure that the context and experimental conditions were as close as possible to those of the original studies. In other words, the OSC tried to eliminate context-specific effects. In Barrett’s eagerness to defend and exculpate her field, and affirm the strength of science, she makes arguments based on false analogies.
One thing that we can all agree on—the middle ground, so to speak—is that there’s a problem with the culture of science, which always favors big and impressive positive results over negative results, and favors publication of novel results while largely ignoring attempts to replicate. (Sometimes a failure to replicate isn’t even accepted by scientific journals!) That’s even more true of the popular press, which is quick to tout findings of stuff like a “gay gene,” but can’t be bothered to publish a caveat when such a study fails to replicate, as the “gay gene” study did. This problem, at least in the scientific culture, can be somewhat repaired. Most important, we need more studies like that of the OSC, but with replications applied to other fields, especially biology.
And that brings me to my final point, which gives a less positive view of the results. As I said above, I think many studies in biology—particularly organismal biology—aren’t often replicated, especially if they involve field work. So such studies remain in the literature without ever having been checked, and often become iconic work that finds its way into textbooks.
In this way biology resembles psychology, although molecular and cell biology studies are often replicated as part of the continuing progress of the field. I think, then, that it’s not as kosher to claim that ecology and evolution experience the same degree of self-checking as, say, physics and chemistry. Yes, all work should in principle be checked, but you find precious few dollars handed out by the National Institutes of Health or the National Science Foundation to replicate work in biology. (That’s because there isn’t that much money to hand out at all!) In my field of organismal biology, then, the self-correcting mechanism of science, while operative at some level, isn’t nearly as strong as it is in other fields like molecular and cell biology.
My main conclusion, then, is that we need an OSC for ecology and evolutionary biology. But it will be a cold day in July (in Arizona) when that happens!
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349:aac4716. DOI: 10.1126/science.aac4716