We hear a lot about the “replication crisis” in science, and it’s often cited to imply that science is largely untrustworthy, perhaps just as fallible a “way of knowing” as, say, religion. And indeed, a number of prominent results in psychology and other fields have not been replicated by others. What is not mentioned in such criticisms is the huge number of studies in “hard” science that have been replicated. As far as I know, DNA is still a double helix, Jupiter is larger than Earth, benzene has six carbon atoms, and the continents are moving about on tectonic plates. Nobody, of course, has totted up the proportion of all results in any field that have been replicated. Still, failures of replication are concerning, but they are also inevitable, since science is an ongoing process. They also give us a way of adding credibility to a hypothesis or subtracting credibility from it.
A list of “replication failures” does serve to remind us that science is fallible, an ongoing enterprise that is subject to revision. Nothing is “proven” in science; the concept of “proof” is for mathematics, where there’s no “replication crisis.” Science is a Bayesian enterprise, in which accumulating evidence combines to give us more or less confidence in a hypothesis. But remember, too, that many scientific “facts” are very unlikely to be overturned, and, using any reasonable layperson’s notion of “proof”, have been proved. A molecule of normal water has two hydrogen atoms and one oxygen atom, the normal form of DNA is a double helix, the speed of light in a vacuum is 299,792,458 metres per second (roughly 186,000 miles per second), and so on.
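To make the Bayesian picture concrete, here is a minimal sketch of how a significant result, and then a replication that finds nothing, would shift confidence in a hypothesis. The prior (10%), the statistical power (80%), and the 5% significance level are illustrative assumptions rather than figures from any real study, and `update` is a hypothetical helper name.

```python
def update(prior, p_result_if_true, p_result_if_false):
    """Bayes' rule: probability the hypothesis is true after seeing one result."""
    numerator = prior * p_result_if_true
    return numerator / (numerator + (1 - prior) * p_result_if_false)

# Illustrative assumptions, not estimates from real studies.
PRIOR = 0.10   # credence in the hypothesis before any data
POWER = 0.80   # chance of a significant result if the hypothesis is true
ALPHA = 0.05   # chance of a significant result if the hypothesis is false

# The original study reports a significant effect.
after_original = update(PRIOR, POWER, ALPHA)

# An equally powered replication then finds nothing significant.
after_failed_replication = update(after_original, 1 - POWER, 1 - ALPHA)

# A second replication does find the effect.
after_second_success = update(after_failed_replication, POWER, ALPHA)

print(f"after original study:      {after_original:.2f}")
print(f"after failed replication:  {after_failed_replication:.2f}")
print(f"after second, successful:  {after_second_success:.2f}")
```

Under these assumed numbers, confidence rises from 0.10 to 0.64 after the first study, falls to about 0.27 after the failed replication, and climbs again after a successful one. The failed replication weakens the case without establishing the opposite hypothesis.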
The list of “reversals” below is limited to psychology and is 18 months old. It comes from the site argmin gravitas, and was compiled by “Gavin, a PhD candidate in AI at Bristol.”
The caveats are given by Gavin below, and the most important one is that a “failure of replication” does not mean either that the original result was wrong or that somebody cheated. Psychological studies often use different samples from different places; the statistical power of tests to detect effects depends on sample size, which varies among studies; different statistical tests can give different results; and, of course, there could be confirmation bias in whether you accept a result. And if you use the 5% level of significance, roughly 1 in 20 tests of a true null hypothesis (that is, of an effect that doesn’t exist) will yield a “false positive.” As Gavin says, “failed replications (or proofs of fraud) usually just challenge the evidence for a hypothesis, rather than affirm the opposite hypothesis.” Here are his caveats:
A medical reversal is when an existing treatment is found to actually be useless or harmful. Psychology has in recent years been racking up reversals: in fact only 40-65% of its classic social results were replicated, in the weakest sense of finding ‘significant’ results in the same direction. (Even in those that replicated, the average effect found was half the originally reported effect.) Such errors are far less costly to society than medical errors, but it’s still pollution, so here’s the cleanup.
Psychology is not alone: medicine, cancer biology, and economics all have many irreplicable results. It’d be wrong to write off psychology: we know about most of the problems here because of psychologists, and its subfields differ a lot by replication rate and effect-size shrinkage.
One reason psychology reversals are so prominent is that it’s an unusually ‘open’ field in terms of code and data sharing. A less scientific field would never have caught its own bullshit.
The following are empirical findings about empirical findings; they’re all open to re-reversal. Also it’s not that “we know these claims are false”: failed replications (or proofs of fraud) usually just challenge the evidence for a hypothesis, rather than affirm the opposite hypothesis. I’ve tried to ban myself from saying “successful” or “failed” replication, and to report the best-guess effect size rather than play the bad old Yes/No science game.
Figures correct as of March 2020; I will put some effort into keeping this current, but not that much.
Code for converting means to Cohen’s d and Hedges’ g here.
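Gavin’s code is not reproduced here, but the standard conversion can be sketched as follows. This is a minimal sketch using the usual pooled-standard-deviation formula for Cohen’s d and the common small-sample approximation for Hedges’ g; it is not necessarily what the linked code does, and the function names and the summary statistics in the example are hypothetical.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d multiplied by the approximate small-sample correction."""
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return correction * cohens_d(mean1, sd1, n1, mean2, sd2, n2)

# Made-up summary statistics for two groups of 20 people each.
d = cohens_d(5.5, 1.0, 20, 5.0, 1.0, 20)
g = hedges_g(5.5, 1.0, 20, 5.0, 1.0, 20)
print(f"d = {d:.3f}, g = {g:.3f}")  # d = 0.500, g = 0.490
```

The correction matters most for small samples: with 20 per group it shrinks the estimate by about 2%, and it becomes negligible as the groups grow.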
Click on the screenshot to see the “reversals”.
I’ll mention only one example from each of 13 branches of psychology discussed by Gavin; these are experiments that seem to be fairly well known or whose failure to replicate interested me. Go to the site to see the statistics from the original papers and then from attempts to replicate. A lot of other papers are cited there as well.
Gavin’s words are indented.
- No good evidence of anything from the Stanford prison ‘experiment’. It was not an experiment; ‘demand characteristics’ and scripting of the abuse; constant experimenter intervention; faked reactions from participants; as Zimbardo concedes, they began with a complete “absence of specific hypotheses”.
- No good evidence for facial-feedback (that smiling causes good mood and pouting bad mood).
- Questionable evidence for (some readings of) the Dunning-Kruger effect.
- “Expertise attained after 10,000 hours practice” (Gladwell). Disowned by the supposed proponents.
- Anything by Hans Eysenck should be considered suspect, but in particular these 26 ‘unsafe’ papers (including the one which says that reading prevents cancer).
- The effect of “nudges” (clever design of defaults) may be exaggerated in general. One big review found average effects were six times smaller than billed. (Not saying there are no big effects.)
- Brian Wansink accidentally admitted gross malpractice; fatal errors were found in 50 of his lab’s papers. These include flashy results about increased portion size massively reducing satiety.
- Readiness potentials seem to be actually causal, not diagnostic. So Libet’s studies also do not show what they purport to. We still don’t have free will (since random circuit noise can tip us when the evidence is weak), but in a different way.
I’ve read the references about “failure to replicate Libet”, and they don’t show that conscious will is involved in decisions; they show that neural inputs, either random or non-random (i.e., derived from sensory input), influence decisions, and that brain activity can predict behaviors before the subject is conscious of having “decided”. But I have no quarrel with that. Free will, if it means anything, especially to dualists, has to involve the causation of an action by a conscious decision that could have been otherwise. And the Libet experiment, and many others since, show a genuine decoupling between brain activity that can predict an action and consciousness of having “decided” to perform that action. That in itself is a sword in the heart of dualistic free will, though of course not of compatibilist free will, as nearly all of its adherents accept physical determinism and reject dualism.
- At most extremely weak evidence that psychiatric hospitals (of the 1970s) could not detect sane patients in the absence of deception.
- No good evidence for precognition, undergraduates improving memory test performance by studying after the test. This one is fun because Bem’s statistical methods were “impeccable” in the sense that they were what everyone else was using. He is Patient Zero in the replication crisis, and has done us all a great service. (Heavily reliant on a flat / frequentist prior; evidence of optional stopping; forking paths analysis.)
- Questionable evidence for the menstrual cycle version of the dual-mating-strategy hypothesis (that “heterosexual women show stronger preferences for uncommitted sexual relationships [with more masculine men]… during the high-fertility ovulatory phase of the menstrual cycle, while preferring long-term relationships at other points”). Studies are usually tiny (median n=34, mostly over one cycle). Funnel plot looks ok though.
- At most very weak evidence that sympathetic nervous system activity predicts political ideology in a simple fashion. In particular, subjects’ skin conductance reaction to threatening or disgusting visual prompts – a noisy and questionable measure.
- Be very suspicious of any such “candidate gene” finding (post-hoc data mining showing large >1% contributions from a single allele). 0/18 replications in candidate genes for depression. 73% of candidates failed to replicate in psychiatry in general. One big journal won’t publish them anymore without several accompanying replications. A huge GWAS, n=1 million: “We find no evidence of enrichment for genes previously hypothesized to relate to risk tolerance.”
15 thoughts on “Failures of replication in psychology”
There is an explanation for this and it is called “Human Sub-Set Theory”. It is dangerous to try to condense a complex hypothesis, but here goes.
Human beings are not homogeneous; they are in Groups…Not tribal, that refers to countries; but in discernible Groups, spread internationally, such as farmers, accountants, actors, biologists…having more to do with each other even when thousands of miles apart, than with their neighbours. Those Groups are formed by a little-credited social mechanism called ‘Social Self-Selection’ or SSS. People become what they are genetically programmed to do. These many Groups are differentiated by their differing “Brain Operating System” or BOS. A BOS is an account of ‘Reality’. It is the vast complex of interwoven beliefs that explains the experience of reality to themselves. A notable account of reality is shared by all religious people, for example, who come to believe at an early age that we live in an ‘Intentional Universe’. And it is chastening to observe that all religions flow logically from their mistaken BOS.
Psychologists worldwide share a BOS: they all believe that human beings are moral creatures who are manipulated by those around them. For example, they talk of religious youngsters as victims of indoctrination.
But there is no Free Will; a genetically-installed BOS will lead all psychologists in their wonderful variety to certain worthless conclusions, by means of logic. They never know that their fundamental take on reality is faulty to begin with. They are trapped by their BOS. And so, too, are all sociologists, anthropologists and so forth. The Social Sciences are not sciences at all, because they are all based upon faulty Brain Operating Systems…. or, a faulty account of reality that engenders all subsequent beliefs.
I once wrote this hypothesis out with the evidence and it ran to nearly a million words, which is ten times the size of a large novel. Frankly, I think we are a hundred years from people coming to realise the Group Theory of human belief and behaviour. The psychology BOS is so widespread and so entrenched that all this talk of the lack of reproducibility may just be the first drops of a future hurricane that may well sweep away a century of error.
Thanks for listening… George
“I once wrote this hypothesis out with the evidence and it ran to nearly a million words, which is ten times the size of a large novel.”
Can you provide at least a couple of links to evidence? Not trying to be snarky; just interested in where you’re getting all these theories from…
This does sound interesting, and I, like Carbon Copy would be interested in links, because it sounds like a fascinating topic to explore.
A bit of an aside, but I’m watching Wimbledon right now and the announcers are talking about how great “cupping” is. “We Baltimoreians are very forward-thinking!” Yeah, thanks for pushing pseudoscience to the audience. Fantastic. Maybe they should tell us about the benefits of homeopathy next…
Sorry, it seems I left the typing of my handle unfinished. This is your favorite commenter, Carbon Copy 😛
Hmmm – the Urban Dictionary defines ‘cupping’ differently.
Now there’s a Wimbledon where the mixed doubles is the premier event!
(Look, it’s early here, and before my first espresso I have no capacity for higher order thinking…apologies)
Among these, I have to admit that I am sorely disappointed that the Dunning-Kruger effect appears to be not a real thing. It is hard to let that one go!
Well, he did write “some readings of” it, so…
Me too! Brian Leiter has a whole collection of posts with the tag “The Less They Know, the Less They Know It” which are generally hilarious, though often also scary.
Here is a group that tackles the same problems in the fields of Evolution & Ecology:
In my country, psychology is majored not at faculties of science or medicine but at those of social science. I wonder how widespread this practice is, and whether it contributes to the less rigorous research standards.
Unfortunately, if you use a 5% level of significance, you will get a false positive a lot more than 1 time in 20.
I thought Libet was open to the possibility that readiness potentials are causal, though I’m glad to hear that they are, since that is what I figured. And of course, you’re not going to find conscious will at the source of a subject’s movement in the Libet experiment, since Libet’s instructions to subjects effectively exclude it, by excluding planning and asking them to focus on an “urge to move” instead.
If Robert Plomin is right (Blueprint), psychology is about to get a good serve from polygenic scores (PS) on its way to becoming a more precise science. PS based on single nucleotide polymorphisms should shake our mindset and take us further ahead than we have ever been.
GWAS studies have changed the game! Incredibly powerful method of “big data” genetic association and determination.