On September 3 of last year, I described a paper by the Open Science Collaboration (OSC; reference and link below) that tried to estimate the reproducibility of studies published in high-quality psychology journals. It was a complicated paper, but its results were deemed sufficiently important to be published in Science. And those results were, in the main, disheartening: only about 35% of the replications of the 100 chosen studies showed significant effects in the same direction as the original work, and only 39% of the scientists doing the replications agreed that their results really replicated those of the original “model” studies. (Depending on the criteria, “replicability” could be as high as 47%, but that’s still not great.)
This was especially worrisome because the OSC authors said they had made great efforts to reproduce the original methodology and analysis of the 100 “model” studies. The paper prompted widespread concern about the reliability of scientific results, at least in psychology.
Now, however, a group of four researchers, three from Harvard (including first author Daniel T. Gilbert) and one from the University of Virginia, have published a “technical comment” in Science severely questioning the results of the original OSC paper (reference and free link below). Their point is largely this: there were real differences between the “model” studies and their “replications” (chiefly differences in experimental populations and conditions) that would be expected to produce divergent results. The lack of “replicability” might therefore reflect not a problem with the original studies’ conclusions, but merely an expected difference arising from divergent subjects and experimental methods. Perhaps, then, the reliability of psychological studies isn’t as bad as the OSC implied.
Gilbert et al. make three points about the OSC study:
- The replications used different experimental populations, in some cases so different that it would be surprising if the replication got the same results. In Gilbert et al.’s words:
“An original study that measured Americans’ attitudes toward African-Americans (3) was replicated with Italians, who do not share the same stereotypes; an original study that asked college students to imagine being called on by a professor (4) was replicated with participants who had never been to college; and an original study that asked students who commute to school to choose between apartments that were short and long drives from campus (5) was replicated with students who do not commute to school. What’s more, many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways: An original study that asked Israelis to imagine the consequences of military service (6) was replicated by asking Americans to imagine the consequences of a honeymoon; an original study that gave younger children the difficult task of locating targets on a large screen (7) was replicated by giving older children the easier task of locating targets on a small screen; an original study that showed how a change in the wording of a charitable appeal sent by mail to Koreans could boost response rates (8) was replicated by sending 771,408 e-mail messages to people all over the world (which produced a response rate of essentially zero in all conditions).”
The point is that these differences could produce a lack of replicability simply because the studies weren’t really replications! Gilbert et al. in fact cite another set of replication studies, the “Many Labs” project (MLP), in which some replications were faithful to the original populations and procedures while others were not. This “infidelity” effect, caused by differences in populations and procedures, accounted for 34.5% of the failures to replicate, apart from sampling error.
- The OSC study involved only a single attempt to replicate each model study.
Gilbert et al. note that the MLP tried to replicate each study 35–36 times and then pooled the data. This produced a replication rate much higher than the OSC’s: 85%. But if only single studies (as in the OSC) had been used instead of pooled data, the MLP would have gotten a replication rate of 35%, close to that of the OSC study. As Gilbert et al. note, “Clearly, OSC used a method that severely underestimates the rate of replication.” (A rough simulation sketch after this list illustrates why pooling makes such a difference.)
- The OSC study has internal data suggesting that differences in design and subject population (“infidelities” of replication) were responsible for some of the lack of replication.
In the original OSC study, authors of the “model” papers were asked whether they endorsed the replication as a methodologically sound, genuine replication of their work; 69% of them said “yes.” And when Gilbert et al. looked at whether this endorsement made a difference to whether a study was replicated, it did: a huge difference. As the authors write:
“This strongly suggests that the infidelities did not just introduce random error but instead biased the replication studies toward failure. If OSC had limited their analyses to endorsed studies, they would have found that 59.7% [95% confidence interval (CI): 47.5%, 70.9%] were replicated successfully. In fact, we estimate that if all the replication studies had been high enough in fidelity to earn the endorsement of the original authors, then the rate of successful replication would have been 58.6% (95% CI: 47.0%, 69.5%) when controlling for relevant covariates. Remarkably, the CIs of these estimates actually overlap the 65.5% replication rate that one would expect if every one of the original studies had reported a true effect. Although that seems rather unlikely, OSC’s data clearly provide no evidence for a “replication crisis” in psychological science.”
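To get an intuition for why a single, modestly sized replication attempt can often fail even when the original effect is perfectly real, and why pooling dozens of attempts changes the picture, here is a minimal simulation sketch in Python (using NumPy and SciPy). The effect size, per-lab sample size, and number of labs are arbitrary values chosen for illustration; they are not taken from the OSC or MLP papers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative assumptions, NOT values taken from the OSC or MLP papers:
TRUE_EFFECT = 0.3   # a real but modest standardized effect in the population
N_PER_LAB = 30      # participants in each single replication attempt
N_LABS = 35         # number of labs whose raw data get pooled (MLP-style)
N_SIMS = 5000       # number of simulated "original findings"

single_successes = 0
pooled_successes = 0
for _ in range(N_SIMS):
    # Every lab samples from a population where the effect genuinely exists.
    labs = [rng.normal(TRUE_EFFECT, 1.0, N_PER_LAB) for _ in range(N_LABS)]

    # One lab replicates alone (roughly the OSC situation): does it get a
    # significant effect in the same direction?
    t, p = stats.ttest_1samp(labs[0], 0.0)
    if p < 0.05 and t > 0:
        single_successes += 1

    # All labs pool their raw data before testing (roughly the MLP analysis).
    t, p = stats.ttest_1samp(np.concatenate(labs), 0.0)
    if p < 0.05 and t > 0:
        pooled_successes += 1

print(f"Single-lab 'replication' rate: {single_successes / N_SIMS:.0%}")
print(f"Pooled 'replication' rate:     {pooled_successes / N_SIMS:.0%}")
```

With these made-up parameters, the lone lab finds the (perfectly real) effect only about a third of the time, while the pooled analysis finds it essentially every time. That is the gist of Gilbert et al.’s second point, and it also illustrates, in a general way, why a replication rate well below 100% can be expected even if every original study had reported a true effect.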
So perhaps things aren’t as bad as I, and many other commenters, represented. To be fair, none of us knew of the MLP (surprisingly, that study also appears to have been conducted by the Open Science Collaboration, but its results weren’t mentioned in their 2015 paper!). Further, few of us (certainly not I) compared the conditions of the “model” studies with those of the replication studies.
Now the authors of the OSC study have replied to Gilbert et al. in their own comment (Anderson et al., reference and link below), but I have to admit that their response involves arcane statistical points that are above my pay grade. I urge readers, especially those who are statistically savvy, to read both Gilbert et al. and Anderson et al. But the latter authors do seem to admit that Gilbert et al. had a point:
“That said, Gilbert et al.’s analysis demonstrates that differences between laboratories and sample populations reduce reproducibility according to the CI measure.”
In their response in Science, the OSC authors go on to deny the contentions of Gilbert et al., and respond this way in a piece in The Daily Progress:
The 2015 study took four years and 270 scientists to conduct and was led by Brian Nosek, director of the Center for Open Science and a UVa psychology researcher.
Nosek, who took part in the new investigation, said Thursday night that the bottom-line message of the original undertaking was not that 60 percent of studies were wrong “but that 40 percent were reproduced, and that’s the starting point.”
As for the follow-up critique, it’s another way of looking at the data, he said. Its authors “came to an explanation that the problems were in the replication. Our explanation is that the data is inconclusive.”
Well, “data” is a plural word, but setting that aside, this doesn’t sound like a very strong defense.
Judge for yourself; the references are below. I do note, not to denigrate the OSC study in the least but merely as a point of interest, that the response of Anderson et al. (though not the original paper) acknowledged some curious funding: support from the Templeton Foundation.
I find it strange that a response to a critique, but not the original paper that was criticized, would be funded by the Templeton Foundation.
h/t: D. Taylor
_______
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349:943. DOI: 10.1126/science.aac4716
Gilbert, D. T. et al. 2016. Comment on “Estimating the reproducibility of psychological science.” Science 351:1037. DOI: 10.1126/science.aad7243
Anderson, C. J. et al. 2016. Response to Comment on “Estimating the reproducibility of psychological science.” Science 351:1037. DOI: 10.1126/science.aad9163