At the beginning of September I wrote about a paper in Science produced by a large group called “The Open Science Collaboration.” That paper reported the repeatability of 100 papers whose results were published in three prestigious psychology journals. My brief summary of the conclusions is below, though my original post gave a lot more data:
- Only 35 of the original 100 experiments produced statistically significant results upon replication (62 did not, and three were excluded). In other words, under replication with near-identical conditions and often larger samples, only 35% of the original findings were judged significant.
- That said, many (but not nearly all) of the results were in the same direction as those seen in the original studies, but weren’t large enough to achieve statistical significance. If the replications had been the original papers, most of them probably wouldn’t have been published.
Now this doesn’t mean that the original studies were wrong, for of course the replications could have produced the wrong answer. But given that the replication studies generally used larger sample sizes than did the original work, it suggests that there are endemic problems with the way science is adjudicated and published. My suspicion is that the main cause is a bias toward publishing positive rather than negative results, combined with “p-hacking”: hunting among different statistical analyses for one that gives a probability value lower than the cutoff needed to reject the null hypothesis, or collecting data only up to the point where your “p” value becomes significant, and then stopping and writing a paper.
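To see why that second form of p-hacking (often called “optional stopping”) is so insidious, here’s a minimal simulation I’ve sketched in Python. It isn’t from Nuzzo’s article, and the sample sizes and checking intervals are arbitrary, but it shows how peeking at the p value after every batch of data inflates the false-positive rate well above the nominal 5% even when there is no real effect.

```python
# Sketch of "optional stopping" under a true null hypothesis (no real effect):
# keep adding observations and re-testing until p < 0.05 or the data run out.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def optional_stopping_trial(start_n=10, step=5, max_n=100, alpha=0.05):
    """Simulate one experiment whose data come from a null distribution (mean 0)."""
    data = list(rng.normal(loc=0.0, scale=1.0, size=start_n))
    while len(data) <= max_n:
        p = stats.ttest_1samp(data, popmean=0.0).pvalue
        if p < alpha:
            return True      # "significant": stop collecting and write the paper
        data.extend(rng.normal(loc=0.0, scale=1.0, size=step))
    return False             # gave up without ever reaching significance

n_experiments = 2000
false_positives = sum(optional_stopping_trial() for _ in range(n_experiments))
print("Nominal alpha: 0.05")
print(f"Observed false-positive rate: {false_positives / n_experiments:.2f}")
```

The exact inflation depends on how often you peek and when you give up, but the qualitative lesson is the same: if the stopping rule depends on the p value, the nominal significance level no longer means what it claims to mean.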
The Science paper set off a flurry of self-scrutiny and self-recrimination as scientists began to worry that, even if they’re not psychologists, the problem may cut across disciplines, and it probably does. I’ve long thought that studies in ecology and evolutionary biology, particularly field work or experimental work that doesn’t involve DNA sequencing (sequencing data are easily replicated), may also be largely unrepeatable, both for the reasons given above and because field and lab results may be particularly sensitive to experimental or environmental conditions. Next month I am in fact going to a meeting of biologists and editors to address the issue of replication in my field.
In the meantime, lots of articles have come out highlighting the problem. Before people conclude that the replication problem is a big problem for all of science, I’d suggest that some fields, particularly molecular biology, may be largely immune, because (a) their results are easily replicated and (b) those results are very often the building blocks for future work, so researchers not only have the incentive to get their results right, but their studies will automatically be replicated as a first step in other people’s follow-up work. Deciphering the DNA code, the subject of Matthew Cobb’s new book, for instance, automatically involved other people calibrating their systems using the codons worked out by earlier researchers. Science journalists should realize this before sounding a general alarm.
That alarm, however, was sounded by writer Regina Nuzzo in an article in Nature about replication called “How scientists fool themselves, and how they can stop.”
She summarizes the problems and the possible solutions in a diagram in her article.

Here’s my take on Nuzzo’s analysis, which is by and large pretty good, and on her solutions, which are somewhat problematic but still worth considering.
THE PROBLEMS
The first line, “cognitive fallacies,” is pretty self-explanatory. It’s simply doing experiments that are contaminated by confirmation bias: neglecting alternative hypotheses and data inimical to your favored hypothesis. That is the epistemic method of most religions.
The “Texas sharpshooter” problem involves, among other things, p-hacking, and also doing a gazillion different tests on a diversity of data, some of which will be significant by chance alone, and then seizing on those as your publishable results.
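A quick illustration of that last point (my own toy example, not one from Nuzzo’s piece): if you run 100 independent tests on pure noise at the conventional p < 0.05 threshold, you should expect about five “significant” results by chance alone, which is why corrections for multiple comparisons exist.

```python
# Toy demonstration of the multiple-comparisons side of the Texas sharpshooter
# problem: many tests on pure noise, and a handful come out "significant" by luck.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, alpha = 100, 0.05

# 100 two-group comparisons where both groups come from the SAME distribution.
p_values = [
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(n_tests)
]

hits = sum(p < alpha for p in p_values)
print(f"{hits} of {n_tests} tests came out 'significant' with no real effect")

# A simple (if conservative) remedy: the Bonferroni correction divides alpha
# by the number of tests performed.
corrected_hits = sum(p < alpha / n_tests for p in p_values)
print(f"After Bonferroni correction: {corrected_hits} 'significant' results")
```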
“Asymmetric attention” is self-explanatory. Nuzzo gives two examples:
A 2004 study observed the discussions of researchers from 3 leading molecular-biology laboratories as they worked through 165 different lab experiments. In 88% of cases in which results did not align with expectations, the scientists blamed the inconsistencies on how the experiments were conducted, rather than on their own theories. Consistent results, by contrast, were given little to no scrutiny.
In 2011, an analysis of over 250 psychology papers found that more than 1 in 10 of the p-values was incorrect — and that when the errors were big enough to change the statistical significance of the result, more than 90% of the mistakes were in favour of the researchers’ expectations, making a non-significant finding significant.
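For what it’s worth, errors of that kind are easy to catch mechanically, since a reported test statistic and its degrees of freedom determine the p value. Here’s a hypothetical check in Python (the numbers are invented, and this is only a sketch of the sort of recomputation that automated tools like statcheck perform):

```python
# Recompute a two-tailed p value from a reported t statistic and its degrees of
# freedom, then compare it with the p value the paper reports.
# All numbers below are made up for illustration.
from scipy import stats

reported_t = 1.90    # hypothetical t statistic from a paper
df = 28              # hypothetical degrees of freedom
reported_p = 0.03    # the p value the paper claims

recomputed_p = 2 * stats.t.sf(abs(reported_t), df)   # two-tailed p
print(f"Reported p = {reported_p}, recomputed p = {recomputed_p:.3f}")

if (reported_p < 0.05) != (recomputed_p < 0.05):
    print("This error changes whether the result counts as 'significant'.")
```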
“Just-so storytelling”, in which you make up a story post facto to explain your results, seems to me less of a problem. If your “story” is simply something you say in the discussion to rationalize a result you didn’t expect, well, other readers (and presumably the reviewers of the paper) should catch that, and realize that it’s just a rationalization. Things become more serious if you pretend that that result was your initial hypothesis, and then confirm it with the data, which is an inversion of your scientific history and basically dishonest. But while I see a lot of the former tactic in evolutionary biology, it’s not a big problem, for we all know when somebody’s grasping at straws or rationalizing. The latter problem I haven’t seen; but then we wouldn’t see it anyway unless the “hypothesis” being tested were one that nobody would plausibly have proposed a priori.
THE SOLUTIONS
Nuzzo offers four solutions, one of which is already in play and another that seems unrealistic. Two are feasible.
“Devil’s advocacy,” considering and testing alternative hypotheses, is part of all science, and should be ingrained in every researcher. “How might I have gone wrong?” is a question all good scientists ask themselves, and then we test to see if we’ve erred. Now some people don’t do that, but they’re often caught by reviewers of their papers or grants, who ferret out the hypotheses that were neglected. That’s why every paper should have at least two good reviewers familiar with the field, and grants should have at least four people who scrutinize the proposed research. This issue doesn’t seem to be a big problem unless journals and funding agencies do a sloppy job of reviewing papers and proposals. At least in the US, the two major granting agencies (NSF and NIH) are very careful about vetting proposals, using in part a “devil’s advocacy” approach.
“Team of rivals”, getting your scientific opponents to collaborate with you in hopes that opposing views will help bring out the truth, is in principle a good idea but will rarely work in practice, at least in my field. For one thing, there would be authorship fights: who gets the credit? Also, who wants to drop their research to work on somebody else’s problems? Anyway, Nuzzo gives one example of such a collaboration in psychology, which didn’t appear to work so well.
In “blind data analysis”, you shift your real data around or even add made-up data, and then do all the analysis on several “blind” data sets. These aren’t really blind, since most researchers know their data, and the procedure also involves removing outliers, which you can often recognize as well. Then, when you’re satisfied that you’ve done the analysis the way you wanted, you lift the blind. (This is sort of like John Rawls’s “veil of ignorance,” where you make up moral rules for society without knowing which position you’ll eventually occupy in that society.) This method will work in some situations but not others, for, as I said, you’re often familiar with your real data. At any rate, it did work in one study:
[Astrophysicist Saul] Perlmutter used this method for his team’s work on the Supernova Cosmology Project in the mid-2000s. He knew that the potential for the researchers to fool themselves was huge. They were using new techniques to replicate estimates of two crucial quantities in cosmology — the relative abundances of matter and of dark energy — which together reveal whether the Universe will expand forever or eventually collapse into a Big Crunch. So their data were shifted by an amount known only to the computer, leaving them with no idea what their findings implied until everyone agreed on the analyses and the blind could be safely lifted. After the big reveal, not only were the researchers pleased to confirm earlier findings of an expanding Universe, Perlmutter says, but they could be more confident in their conclusions. “It’s a lot more work in some sense, but I think it leaves you feeling much safer as you do your analysis,” he says. He calls blind data analysis “intellectual hygiene, like washing your hands”.
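To make the idea concrete, here’s a bare-bones sketch of offset blinding in Python. It’s my own illustration rather than anything from the Supernova Cosmology Project: the data are invented and the “secret” is just a seed held by the computer, but the workflow (analyze shifted data, freeze the pipeline, then unblind) is the one Perlmutter describes.

```python
# Minimal sketch of offset blinding: analysts work on data shifted by a hidden
# amount and only learn the true values after the analysis is frozen.
# Everything here (data, seed, offset range) is invented for illustration.
import numpy as np

def hidden_offset(seed):
    """A shift known only to the computer, reproducible from a stored seed."""
    return np.random.default_rng(seed).uniform(-10, 10)

def blind(data, seed):
    return data + hidden_offset(seed)

def unblind(blinded_estimate, seed):
    return blinded_estimate - hidden_offset(seed)

rng = np.random.default_rng(7)
true_measurements = rng.normal(loc=3.2, scale=0.5, size=50)   # fake "raw data"
secret_seed = 123456   # written to disk, never inspected by the analysts

blinded_data = blind(true_measurements, secret_seed)
# ... all cuts, outlier decisions, and model choices happen on blinded_data ...
blinded_estimate = blinded_data.mean()

# Only once everyone agrees the analysis is final:
final_estimate = unblind(blinded_estimate, secret_seed)
print(f"Blinded estimate: {blinded_estimate:.2f}")
print(f"Unblinded estimate: {final_estimate:.2f}")   # recovers roughly 3.2
```

The point is simply that the decision about when the analysis is finished gets made before anyone knows whether the answer is the one they were hoping for.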
The final method, transparency, has promise but also problems. It comes in two forms:
[Form one] Another solution that has been gaining traction is open science. Under this philosophy, researchers share their methods, data, computer code and results in central repositories, such as the Center for Open Science’s Open Science Framework, where they can choose to make various parts of the project subject to outside scrutiny. Normally, explains Nosek, “I have enormous flexibility in how I analyse my data and what I choose to report. This creates a conflict of interest. The only way to avoid this is for me to tie my hands in advance. Precommitment to my analysis and reporting plan mitigates the influence of these cognitive biases.”
I’m fully in favor of this: all the data used in a paper, along with the methods of analysis and the experimental protocols, should be available to researchers INSTANTLY after a paper is published. This has long been the custom in the Drosophila community: it’s unthinkable not to share data, or even laboriously constructed genetic stocks, with colleagues and rivals. Many scientists already practice this kind of sharing, and some journals mandate it. Other journals, however, either don’t require such data storage or impose a year’s moratorium on it so you can milk your data for more papers before others get their hands on it. I don’t favor the moratorium; it’s just too bad if other people use your published data to their own ends.
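Nosek’s “tie my hands in advance” can be done in a very low-tech way. Here’s one hypothetical scheme, sketched purely as an illustration (the filename is a placeholder and this is not a feature of the Open Science Framework): record a cryptographic fingerprint of your analysis plan before the data come in, so that any later change to the plan is detectable.

```python
# Hypothetical precommitment sketch: hash the analysis plan before data
# collection; re-hashing it later shows whether the plan was changed.
# "analysis_plan.py" is a placeholder filename.
import hashlib
from pathlib import Path

def fingerprint(path: str) -> str:
    """SHA-256 hash of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Before collecting data: record (or register publicly) this hash.
preregistered_hash = fingerprint("analysis_plan.py")
print("Preregistered hash:", preregistered_hash)

# After the study: confirm the analysis that was run matches the registered plan.
assert fingerprint("analysis_plan.py") == preregistered_hash, "plan was modified"
```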
Form one could, however, be construed as sharing your aims, methods, and analyses BEFORE you get your data, and that’s just not on. It makes you vulnerable to intellectual theft, and no researcher wants to do that. This is why grant proposals are strictly confidential, and why no grant reviewer is allowed to lift ideas from grants they’ve reviewed. It’s also why reviewing papers is confidential.
[Form two] An even more radical extension of this idea is the introduction of registered reports: publications in which scientists present their research plans for peer review before they even do the experiment. If the plan is approved, the researchers get an ‘in-principle’ guarantee of publication, no matter how strong or weak the results turn out to be. This should reduce the unconscious temptation to warp the data analysis, says Pashler. At the same time, he adds, it should keep peer reviewers from discounting a study’s results or complaining after results are known. “People are evaluating methods without knowing whether they’re going to find the results congenial or not,” he says. “It should create a much higher level of honesty among referees.” More than 20 journals are offering or plan to offer some format of registered reports.
This method, in which the aims and analysis plan are reviewed confidentially before the work is done, seems to be gaining popularity. It’s like giving a grant proposal to a journal before you do the experiment: if they approve of the analysis and methods, they’ll accept the paper no matter how the results turn out. That’s fine, but there are some caveats. First, during a study new experiments often crop up that you haven’t planned, and often those are the most exciting ones. (This has often happened to me.) How do you deal with those? I don’t see how you could under a registered-report scheme.
Second, this deals only with the analysis of data; it doesn’t deal with the importance of the results. But no journal will publish every paper that’s soundly executed, regardless of the results: every scientist knows that there are some highly visible “top-tier” journals where publication can make your career (e.g., Science and Nature), and journals like that aren’t going to publish on just any subject, as they specialize in “important” results. (Journals like PLOS One, which consider only methods and not importance, are venues for all sorts of work, important and not-so-important.) For regular journals this means that there has to be pre-vetting not just of the methods of research and how data are analyzed, but of whether the problem is interesting. This doesn’t seem to be considered in any such proposals for “transparency,” but I haven’t read them all.
Scientists will be chewing over this problem in detail over the next year or so, and it is indeed a problem, though more so for some areas than for others. The solutions, however, vary in quality and depend on the field. The “transparency” method seems to be gaining popularity, but it strikes me as the least practical of all the solutions.
h/t: Ben Goren