We hear a lot about the “replication crisis” in science, and it’s often cited to imply that science is largely untrustworthy, perhaps just as fallible a “way of knowing” as, say, religion. And indeed, a number of prominent results in psychology and other fields have not been replicated by others. What such criticisms fail to mention is the huge number of studies in “hard” science that have been replicated. As far as I know, DNA is still a double helix, Jupiter is larger than Earth, benzene has six carbon atoms, and the continents are moving about on tectonic plates. Nobody, of course, has totted up the proportion of all results in any field that have been replicated. Still, failures of replication are concerning, but also inevitable, since science is an ongoing process. And they give us a way of adding credibility to a hypothesis or subtracting it.
A list of “replication failures” does serve to remind us that science is fallible, an ongoing enterprise that is subject to revision. Nothing is “proven” in science; the concept of “proof” is for mathematics, where there’s no “replication crisis.” Science is a Bayesian enterprise, in which accumulating evidence combines to give us more or less confidence in a hypothesis. But remember, too, that many scientific “facts” are very unlikely to be overturned, and, using any reasonable layperson’s notion of “proof”, have been proved. A molecule of normal water has two hydrogen atoms and one oxygen atom; the normal form of DNA is a double helix; the speed of light in a vacuum is 299,792,458 metres per second (roughly 186,000 miles per second); and so on.
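The Bayesian point can be made concrete with a toy calculation. The numbers below are hypothetical, invented purely for illustration: if we start agnostic about a hypothesis and assume a replication is much likelier to succeed when the hypothesis is true than when it is false, a few independent replications drive our confidence close to certainty, while a failed replication would pull it back down.

```python
# Toy Bayesian updating: how replications shift confidence in a hypothesis.
# All probabilities here are made-up figures for illustration only.

def update(prior, p_data_given_h, p_data_given_not_h):
    """Posterior probability of H after observing one result (Bayes' rule)."""
    numerator = prior * p_data_given_h
    return numerator / (numerator + (1 - prior) * p_data_given_not_h)

p = 0.50                 # start agnostic about the hypothesis
for _ in range(3):       # three independent successful replications
    # assume a replication succeeds 80% of the time if H is true,
    # but only 10% of the time if H is false
    p = update(p, 0.80, 0.10)

print(round(p, 3))       # confidence after three replications
```

No single study “proves” anything here; each one just nudges the posterior, which is exactly the sense in which accumulated replications make a result like the structure of DNA effectively settled.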
The caveats are given by Gavin below, and the most important one is that a “failure of replication” does not mean either that the original result was wrong or that somebody cheated. Psychological studies often use different samples from different places; the statistical power of tests to detect effects depends on sample size, which varies among studies; different statistical tests can give different results; and, of course, there could be confirmation bias in whether you accept a result. And if you use the 5% level of significance, roughly 1 in 20 tests will yield a “false positive.” As Gavin says, “failed replications (or proofs of fraud) usually just challenge the evidence for a hypothesis, rather than affirm the opposite hypothesis.” Here are his caveats:
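The “1 in 20” point is easy to verify by simulation: when the null hypothesis is true, p-values are uniformly distributed, so a 5% significance cutoff flags about 5% of tests as “significant” purely by chance. A minimal sketch:

```python
# Simulate the false-positive rate at the 5% significance level.
# Under the null hypothesis, each test's p-value is a uniform draw on [0, 1],
# so the fraction falling below alpha = 0.05 is about 1 in 20.
import random

random.seed(42)
ALPHA = 0.05
n_tests = 100_000

false_positives = sum(random.random() < ALPHA for _ in range(n_tests))
print(false_positives / n_tests)   # close to 0.05
```

This is why a single “significant” result, especially from a small sample, carries much less weight than a replicated one.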
A medical reversal is when an existing treatment is found to actually be useless or harmful. Psychology has in recent years been racking up reversals: in fact only 40-65% of its classic social results were replicated, in the weakest sense of finding ‘significant’ results in the same direction. (Even in those that replicated, the average effect found was half the originally reported effect.) Such errors are far less costly to society than medical errors, but it’s still pollution, so here’s the cleanup.
Psychology is not alone: medicine, cancer biology, and economics all have many irreplicable results. It’d be wrong to write off psychology: we know about most of the problems here because of psychologists, and its subfields differ a lot by replication rate and effect-size shrinkage.
One reason psychology reversals are so prominent is that it’s an unusually ‘open’ field in terms of code and data sharing. A less scientific field would never have caught its own bullshit.
The following are empirical findings about empirical findings; they’re all open to re-reversal. Also it’s not that “we know these claims are false”: failed replications (or proofs of fraud) usually just challenge the evidence for a hypothesis, rather than affirm the opposite hypothesis. I’ve tried to ban myself from saying “successful” or “failed” replication, and to report the best-guess effect size rather than play the bad old Yes/No science game.
Figures correct as of March 2020; I will put some effort into keeping this current, but not that much.
Code for converting means to Cohen’s d and Hedges’ g here.
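For readers who want the gist without following the link, here is an independent sketch of the standard formulas (this is my own illustration, not Gavin’s code): Cohen’s d divides the difference in group means by the pooled standard deviation, and Hedges’ g applies a small-sample bias correction to d.

```python
# Standard-formula sketch: effect sizes from two groups' means, SDs, and sizes.
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Hedges' g: Cohen's d with a small-sample bias correction."""
    d = cohens_d(mean1, sd1, n1, mean2, sd2, n2)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)   # common approximation
    return d * correction

# e.g. two groups of 20, means 10 vs 8, both SDs = 4
print(round(cohens_d(10, 4, 20, 8, 4, 20), 3))  # 0.5
print(round(hedges_g(10, 4, 20, 8, 4, 20), 3))  # slightly smaller than d
```

Effect sizes like these, rather than a Yes/No significance verdict, are what Gavin reports throughout the list, which is why “the average effect found was half the originally reported effect” is a meaningful statement.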
Click on the screenshot to see the “reversals”.
I’ll mention only one example from each of the 13 branches of psychology discussed by Gavin; these are experiments that seem to be fairly well known or whose failure to replicate interested me. Go to the site to see the statistics from the original papers and then from attempts to replicate. A lot of other papers are cited as well.
Gavin’s words are indented.
- No good evidence of anything from the Stanford prison ‘experiment’. It was not an experiment; ‘demand characteristics’ and scripting of the abuse; constant experimenter intervention; faked reactions from participants; as Zimbardo concedes, they began with a complete “absence of specific hypotheses”.
- No good evidence for facial-feedback (that smiling causes good mood and pouting bad mood).
- Questionable evidence for (some readings of) the Dunning-Kruger effect.
- “Expertise attained after 10,000 hours practice” (Gladwell). Disowned by the supposed proponents.
- Anything by Hans Eysenck should be considered suspect, but in particular these 26 ‘unsafe’ papers (including the one which says that reading prevents cancer).
- The effect of “nudges” (clever design of defaults) may be exaggerated in general. One big review found average effects were six times smaller than billed. (Not saying there are no big effects.)
- Brian Wansink accidentally admitted gross malpractice; fatal errors were found in 50 of his lab’s papers. These include flashy results about increased portion size massively reducing satiety.
- Readiness potentials seem to be actually causal, not diagnostic. So Libet’s studies also do not show what they purport to. We still don’t have free will (since random circuit noise can tip us when the evidence is weak), but in a different way.
I’ve read the references about “failure to replicate Libet”, and they don’t show that conscious will is involved in decisions; they show that neural inputs, either random or non-random (i.e., derived from sensory input), influence decisions, and that brain activity can predict behaviors before the subject is conscious of having “decided”. But I have no quarrel about that. Free will, if it means anything, especially to dualists, has to involve the causation of an action by a conscious decision that could have been otherwise. And the Libet experiment, and many others since, show a genuine decoupling between brain activity that can predict an action and consciousness of having “decided” to perform that action. That in itself is a sword in the heart of dualistic free will, though of course not of compatibilist free will, as nearly all of its adherents accept physical determinism and reject dualism.
- At most extremely weak evidence that psychiatric hospitals (of the 1970s) could not detect sane patients in the absence of deception.
- No good evidence for precognition, undergraduates improving memory test performance by studying after the test. This one is fun because Bem’s statistical methods were “impeccable” in the sense that they were what everyone else was using. He is Patient Zero in the replication crisis, and has done us all a great service. (Heavily reliant on a flat / frequentist prior; evidence of optional stopping; forking paths analysis.)
- Questionable evidence for the menstrual cycle version of the dual-mating-strategy hypothesis (that “heterosexual women show stronger preferences for uncommitted sexual relationships [with more masculine men]… during the high-fertility ovulatory phase of the menstrual cycle, while preferring long-term relationships at other points”). Studies are usually tiny (median n=34, mostly over one cycle). Funnel plot looks ok though.
- At most very weak evidence that sympathetic nervous system activity predicts political ideology in a simple fashion. In particular, subjects’ skin conductance reaction to threatening or disgusting visual prompts – a noisy and questionable measure.
- Be very suspicious of any such “candidate gene” finding (post-hoc data mining showing large >1% contributions from a single allele). 0/18 replications in candidate genes for depression. 73% of candidates failed to replicate in psychiatry in general. One big journal won’t publish them anymore without several accompanying replications. A huge GWAS, n=1 million: “We find no evidence of enrichment for genes previously hypothesized to relate to risk tolerance.”