On September 3 of last year, I described a paper by the Open Science Collaboration (OSC; reference and link below) that tried to estimate the reproducibility of studies published in high-quality psychology journals. It was a complicated paper, but its results were deemed sufficiently important to be published in Science. And those results were in the main disheartening: only about 35% of the replications of the 100 chosen studies showed significant effects in the same direction as the original work, and only 39% of the scientists doing the replications agreed that their results really replicated those of the original “model” studies. (Depending on the criteria, “replicability” could be as high as 47%, but that’s still not great.)
This was especially worrisome because the OSC authors said they had made huge efforts to duplicate the original methodology and analysis of the 100 “model” studies. The paper thus raised serious doubts about the reliability of scientific results, at least in psychology.
Now, however, a group of four researchers, three from Harvard (including first author Daniel T. Gilbert) and one from the University of Virginia, have published a “technical comment” in Science severely questioning the results of the original OSC paper (reference and free link below). Their point is largely this: there are real differences between the “model studies” and the “replications”—chiefly differences in experimental populations and conditions—that would lead to a divergence in results between the two. Thus the lack of “replicability” might not reflect a problem with the original studies’ conclusions, but merely an expected difference arising from divergent subjects and experimental methods. If so, the reliability of psychological studies isn’t as bad as the OSC implied.
Gilbert et al. make three points about the OSC study:
- The replications used different experimental populations, in some cases so different that it would be surprising if the replication got the same results. In Gilbert et al.’s words:
“An original study that measured Americans’ attitudes toward African-Americans (3) was replicated with Italians, who do not share the same stereotypes; an original study that asked college students to imagine being called on by a professor (4) was replicated with participants who had never been to college; and an original study that asked students who commute to school to choose between apartments that were short and long drives from campus (5) was replicated with students who do not commute to school. What’s more, many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways: An original study that asked Israelis to imagine the consequences of military service (6) was replicated by asking Americans to imagine the consequences of a honeymoon; an original study that gave younger children the difficult task of locating targets on a large screen (7) was replicated by giving older children the easier task of locating targets on a small screen; an original study that showed how a change in the wording of a charitable appeal sent by mail to Koreans could boost response rates (8) was replicated by sending 771,408 e-mail messages to people all over the world (which produced a response rate of essentially zero in all conditions).”
The point is that these differences could produce a lack of replicability simply because the studies weren’t really replications! Gilbert et al. in fact cite another set of replication studies, the “Many Labs” project (MLP), in which some were faithful replications while others used different populations and procedures. They estimate that this “infidelity” effect, caused by differences in populations and procedures, was responsible for 34.5% of the failures to replicate, over and above sampling error.
- The OSC study involved only a single attempt to replicate each model study.
Gilbert et al. note that the MLP tried to replicate each study 35–36 times and then pooled the data. This produced a replication rate much higher than the OSC’s: 85%. But if only single studies (as in the OSC) had been used instead of pooled data, the MLP would have gotten a replication rate of 35%, close to that of the OSC study. As Gilbert et al. note, “Clearly, OSC used a method that severely underestimates the rate of replication.” (A simulation sketch after these three points shows why pooling matters so much.)
- The OSC study has internal data suggesting that differences in design and subject population (“infidelities” of replication) were responsible for some of the lack of replication.
In the original OSC project, authors of the “model” papers were asked whether they endorsed the replication of their study as methodologically sound (a genuine replication); 69% of them said “yes.” When Gilbert et al. looked at whether this endorsement made a difference to whether a study was replicated, it did: a huge one. As the authors said:
“This strongly suggests that the infidelities did not just introduce random error but instead biased the replication studies toward failure. If OSC had limited their analyses to endorsed studies, they would have found that 59.7% [95% confidence interval (CI): 47.5%, 70.9%] were replicated successfully. In fact, we estimate that if all the replication studies had been high enough in fidelity to earn the endorsement of the original authors, then the rate of successful replication would have been 58.6% (95% CI: 47.0%, 69.5%) when controlling for relevant covariates. Remarkably, the CIs of these estimates actually overlap the 65.5% replication rate that one would expect if every one of the original studies had reported a true effect. Although that seems rather unlikely, OSC’s data clearly provide no evidence for a “replication crisis” in psychological science.”
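To see why a single replication per study can understate replicability even when every effect is real, here is a minimal simulation sketch in Python. All the numbers in it (a true standardized effect of 0.4, 50 subjects per group, 35 labs) are invented for illustration and are not taken from any of the papers:

```python
# Minimal sketch: why one replication per study understates replicability,
# even if every original effect is real. All numbers here are invented
# (true effect d = 0.4, n = 50 per group, 35 labs), not from the papers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, alpha, n_labs = 0.4, 50, 0.05, 35

def replicates(n_per_group):
    """Run one two-sample experiment on a true effect; 'success' = p < alpha."""
    treated = rng.normal(d, 1, n_per_group)
    control = rng.normal(0, 1, n_per_group)
    return stats.ttest_ind(treated, control).pvalue < alpha

trials = 2000
single = np.mean([replicates(n) for _ in range(trials)])
pooled = np.mean([replicates(n * n_labs) for _ in range(trials)])  # 35 labs pooled

print(f"single-lab replication succeeds {single:.0%} of the time")
print(f"pooled {n_labs}-lab replication succeeds {pooled:.0%} of the time")
```

With these made-up numbers, a real effect “replicates” only about half the time in a single modestly sized study, but essentially always when 35 labs’ worth of data are pooled, which is roughly the gap between the MLP’s pooled 85% and its single-study 35%.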
So perhaps things aren’t as bad as I, and many other commenters, represented. To be fair, none of us knew of the MLP (surprisingly, that study also appears to have been conducted by the Open Science Collaboration, but its results weren’t mentioned in their 2015 paper!). Further, few of us (certainly not I) compared the conditions of the “model” studies with those of the replication studies.
Now the authors of the OSC study have replied to Gilbert et al. in their own comment (Anderson et al., reference and link below), but I have to admit that their response involves arcane statistical points that are above my pay grade. I urge readers, especially those who are statistically savvy, to read both Gilbert et al. and Anderson et al. But the latter authors do seem to admit that Gilbert et al. had a point:
“That said, Gilbert et al.’s analysis demonstrates that differences between laboratories and sample populations reduce reproducibility according to the CI measure.”
In their response in Science, the OSC authors go on to deny the contentions of Gilbert et al., and respond this way in a piece in The Daily Progress:
The 2015 study took four years and 270 scientists to conduct and was led by Brian Nosek, director of the Center for Open Science and a UVa psychology researcher.
Nosek, who took part in the new investigation, said Thursday night that the bottom-line message of the original undertaking was not that 60 percent of studies were wrong “but that 40 percent were reproduced, and that’s the starting point.”
As for the follow-up critique, it’s another way of looking at the data, he said. Its authors “came to an explanation that the problems were in the replication. Our explanation is that the data is inconclusive.”
Well, “data” is a plural word, but neglecting that, this doesn’t sound like a very strong defense.
Judge for yourself; the references are below. I do note, and this is not to denigrate the OSC study in the least, but merely as a point of interest, that the response of Anderson et al. (but not the original paper) had some curious funding.
I find it strange that a response to a critique, but not the original paper that was criticized, would be funded by the Templeton Foundation.
h/t: D. Taylor
_______
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349:943. DOI: 10.1126/science.aac4716
Gilbert, D. T. et al. 2016. Comment on “Estimating the reproducibility of psychological science.” Science 351:1037. DOI: 10.1126/science.aad7243
Anderson, C. J. et al. 2016. Response to Comment on “Estimating the reproducibility of psychological science.” Science 351:1037. DOI: 10.1126/science.aad9163
So it looks like a study that found that studies could not be replicated… could not be replicated.
Doesn’t that actually validate the original paper? We’re getting into ‘this statement is false’ territory here.
“So it looks like a study that found that studies could not be replicated… could not be replicated.”
Or rather, a study that purported to replicate other studies, in some cases, didn’t even try:
“An original study that asked Israelis to imagine the consequences of military service (6) was replicated by asking Americans to imagine the consequences of a honeymoon”
WTF?
Depending on where the American honeymooners are based, both may involve automatic weapons and paranoia…
“An original study that asked Israelis to imagine the consequences of military service (6) was replicated by asking Americans to imagine the consequences of a honeymoon”
I will not suggest that American brides have anything in common with Palestinian guerrillas.
😉
cr
My gut feeling is that the original paper is closer to the truth. Psych studies have so many hidden variables that even scrupulous researchers come up with wildly varying conclusions (and conclusions often seem to confirm a currently promoted ideology).
Reducing a test of human reaction to a form of game theory, and then using that to make statements about what people actually think, is bound to go wrong often.
Every university has both students who commute and students who do not commute, so the 2nd study in that case could very well have replicated the conditions.
I was rather surprised at the notion that researchers actually try to replicate others’ studies. I thought they made their professional reputations (i.e., tenure) based on being the first one to look at something.
Gilbert et al. make some decent points that hold within a very particular context, but they miss the bigger picture of what the OSC has done. They interpret “reproducibility/replicability” only as “how many repeated experiments have effect sizes that fall within the 95% CI of the original study’s effect size.” This is a very incomplete measure of replicability.
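To make that criterion concrete, here is a minimal sketch of one version of it, checking whether a replication’s point estimate lands inside the original study’s 95% CI. The numbers are invented and effects are assumed to be on the correlation scale; the published analyses are more involved:

```python
# Sketch of the CI criterion described above (invented numbers; the
# published analyses are more involved). Effects are correlations.
import math

def fisher_ci_95(r, n):
    """95% CI for a correlation r from a sample of size n, via Fisher's z."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - 1.96 * se), math.tanh(z + 1.96 * se)

r_orig, n_orig = 0.35, 80   # hypothetical original study
r_rep = 0.10                # hypothetical replication estimate

lo, hi = fisher_ci_95(r_orig, n_orig)
print(f"original 95% CI: ({lo:.2f}, {hi:.2f})")
print("counts as a successful replication:", lo <= r_rep <= hi)
```

A binary pass/fail like this is all the 95%-CI measure records, which is why it is such an incomplete summary of replicability.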
First and foremost, there is nothing special about the first study. Why compare the replication to the original study when you can just as easily do the reverse? There is a tendency to believe the original study over any replication, but there is no good statistical reason for this. Andrew Gelman calls this the “time-reversal heuristic” and describes it well on his blog:
“One helpful (I think) way to think about such an episode is to turn things around. Suppose the attempted replication experiment, with its null finding, had come first. A large study finding no effect. And then someone else runs a replication under slightly different conditions with a much smaller sample size and found statistical significance under non-preregistered conditions. Would we be inclined to believe it? I don’t think so. At the very least, we’d have to conclude that any such phenomenon is fragile.”
Secondly, Gilbert et al.’s analysis implies something very misleading about effect sizes. By focusing on the 95% CI measure, they essentially restrict themselves to talking about replicability in terms of the outcome of a hypothesis test for the equality of two means. This misses the point entirely that *many (most?) psychology studies are studying effects that are extremely noisy and extremely small, if they exist at all*. Under such a setup, we know mathematically that observed effect sizes are quite likely to be severely inflated in any underpowered study. It has been a contention among many statisticians in the social sciences for a while that this is exactly what is going on. The OSC study confirms this contention. In this setting, high-powered replications will necessarily lead to seriously diminished effects, and this is exactly what was reported – “just 5% of OSC replications had replication CIs exceeding the original study effect sizes” (Anderson et al.). Replicability is about far more than whether or not you can recover a p-value; it’s about what effects we can believe.
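The inflation point is easy to verify with a small simulation. This is a sketch with invented numbers (a true standardized effect of 0.15 and 30 subjects per group, i.e., a badly underpowered design), not the OSC’s actual analysis:

```python
# Sketch of effect-size inflation in underpowered studies (invented numbers).
# Conditioning on p < 0.05 selects the studies that overestimated the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d_true, n, trials = 0.15, 30, 5000   # small true effect, small samples

d_hat = np.empty(trials)
sig = np.empty(trials, dtype=bool)
for i in range(trials):
    a = rng.normal(d_true, 1, n)
    b = rng.normal(0, 1, n)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d_hat[i] = (a.mean() - b.mean()) / pooled_sd
    sig[i] = stats.ttest_ind(a, b).pvalue < 0.05

print(f"true effect:                        {d_true}")
print(f"mean estimate, all studies:         {d_hat.mean():.2f}")
print(f"mean estimate, significant studies: {d_hat[sig].mean():.2f}")
```

With these numbers, the studies that happen to clear p < 0.05 report an effect several times larger than the true one, so a high-powered replication is almost guaranteed to find something smaller, exactly the shrinkage the OSC reported.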
There are other things I could pick on, but these are two of the biggest in my opinion. The OSC is a valuable study and the continued discussion around it is very useful. Among other things, it shows that there is still a sizeable disconnect between the communities of active social scientist researchers and statisticians. Ultimately though, I think this discussion is improving how social science is done.
There’s a bootstrapping problem here. If we knew more general theories in psychology (particularly social psych, I think) we could know more about what counts as a “relevant difference”. Alas it seems that practically *everything* is. A difficult science indeed!
Even when studies are ‘true’ replicates, the second study, conducted to corroborate a significant result, may not attain significance merely through statistical chance. By the same token, a second study may attain significance, whereas the original one did not. For this reason it is unwise to reject in principle a fact or relationship merely because one or a few studies fail to detect it. The true believer should forge on until belief is quelled by nasty facts, or p<0.001.
For further responses to the Gilbert et al. critique, see “Evaluating a new critique of the Reproducibility Project,” “The statistical conclusions in Gilbert et al (2016) are completely invalid,” and “Let’s not mischaracterize replication studies.”
First of all, I don’t have much experience with psychology except for the few courses I had on it during medical school.
I can imagine that having good reproducible studies on the subject is difficult when considering the variations between cultures and between people in general.
Even basic anatomy varies between people (with some having different coronary arteries for instance) but at least there are some similarities to work on in somatic medicine.
When it comes to behaviour though, there are so many outliers that it is difficult to find the trend.
For example, I disagree with many of the stereotypes surrounding the behaviour of women and men. I would say my behaviour and personality are much closer to those of the atheist liberal man next door than to those of a fundamentalist Muslim woman in Somalia.
I wouldn’t want a research project to take the findings from Somalia and make some assumptions about the behaviour of all men and women from that without taking into account the culture and socioeconomic situation.
Research projects within somatic medicine also have to take into account differences between population groups. Still, if making a study about, for instance, surgery vs. conservative treatment in one type of cancer, you could use all types of cultural/ethnic groups as long as you correct for external factors like hospital hygiene and the education of surgeons.
Correcting for all the cultural/external influences on human behaviour seems to me a lot more difficult.
I might be wrong, however; hopefully psychologists will find a way to circumvent this. Until then, I will hold onto my assumption that human behaviour varies too much to allow these kinds of studies to be universally reproducible.
“…human behaviour varies too much to allow these kinds of studies to be universally reproducible.”
And if they’re not, just how valuable are they in the first place?
Here is an article which claims that the crisis in psychology is still here and that the new critique is itself flawed:
“This isn’t the first time that an idea in psychology has been challenged—not by a long shot. A “reproducibility crisis” in psychology, and in many other fields, has now been well-established. A study out last summer tried to replicate 100 psychology experiments one-for-one and found that just 40 percent of those replications were successful. A critique of that study just appeared last week, claiming that the original authors made statistical errors—but that critique has itself been attacked for misconstruing facts, ignoring evidence, and indulging in some wishful thinking.”
Source: http://www.slate.com/articles/health_and_science/cover_story/2016/03/ego_depletion_an_influential_theory_in_psychology_may_have_just_been_debunked.html