There are two pieces in the latest Economist that are must-reads not just for scientists but for science-friendly laypeople. Both paint a dire picture of how credible published scientific claims really are, and of how weak our system is for adjudicating them before publication. One piece is called “How science goes wrong”; the other is “Trouble at the lab.” Both are free online, and both, as is the custom with The Economist, are written anonymously.
The main lesson of these pieces is that we shouldn’t trust a scientific result unless it’s been independently replicated—preferably more than once. That’s something we should already know, but what we don’t know is how many findings—and the articles deal largely with biomedical research—haven’t been replicable, how many others haven’t even been subject to replication, and how shoddy the reviewing process is, so that even a published result may be dubious.
As I read these pieces, I did so captiously, really wanting to find flaws in their conclusions. I don’t like to think that there are so many problems with my profession. But the authors have done their homework and present a pretty convincing case that science, especially given the fierce competition to get jobs and succeed in them, is not doing a bang-up job. That doesn’t mean it is completely flawed, for if that were true we’d make no advances at all, and we do know that many discoveries of recent years (dinosaurs evolving into birds, the Higgs boson, dark matter, DNA sequences, and so on) seem solid.
I see five ways that a reported scientific result may be wrong:
1. The work could be shoddy and the results therefore untrustworthy.
2. There could be duplicity, either deliberate fraud or a “tweaking” of results in one’s favor, which might even be unconscious.
3. The statistical analysis could be wrong in several ways. For example, at the standard 5% significance level you will reject a true “null” hypothesis, and accept an incorrect alternative, in about one of every twenty such tests, so false positives—spurious rejections of the null—inevitably make their way into the literature. Alternatively, you could accept a false null hypothesis if you don’t have sufficient statistical power to discriminate between it and a true alternative. Further, as the Economist notes, many scientists simply aren’t using the right statistics, particularly when analyzing large datasets. (A short simulation just after this list shows how the 5% false-positive rate and low power combine.)
4. There could be a peculiarity in one’s material, so that the conclusions apply only to a particular animal, group of animals, species, or ecosystem. I often think this might be the case in evolutionary biology and ecology, in which studies are conducted in particular places at particular times, and are often not replicated in different locations or years. Is a study of bird behavior in, say, California, going to give the same results as a similar study of the same species in Utah? Nature is complicated, with many factors differing among locations and times (food abundance, parasites, predators, weather, etc.), and these could lead to results that can’t be generalized across an entire species. I myself have failed to replicate at least three published results by other people in my field. (Happily, I’m not aware that anyone has failed to replicate any of my published results.)
5. There could be “craft skills”: technical proficiency, gained by experience, that isn’t or can’t be reported in a paper’s “materials and methods” and that makes a given result irreproducible by other investigators.
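To make #3 concrete, here is a minimal simulation sketch. The particular setup (a two-sample t-test on each hypothesis, only 10% of tested effects real, twenty subjects per group, a modest effect size) is my own illustrative assumption, not anything drawn from the Economist or from real data:

```python
# A toy simulation of point #3: with a 5% significance threshold, limited
# statistical power, and mostly-false hypotheses under test, a surprisingly
# large share of "positive" results are wrong. All numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests = 10_000       # hypotheses "tested"
frac_true = 0.10       # assume only 10% of tested effects are real
n_per_group = 20       # small samples mean modest power
effect_size = 0.5      # mean shift (in SD units) when the effect is real

positives = false_positives = 0
for _ in range(n_tests):
    effect_is_real = rng.random() < frac_true
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(effect_size if effect_is_real else 0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:                      # a "positive," publishable result
        positives += 1
        if not effect_is_real:
            false_positives += 1

print(f"{positives} positives, of which {false_positives} "
      f"({false_positives / positives:.0%}) are false")
```

With those made-up inputs, over half of the “significant” results in a typical run come from hypotheses that were in fact false, which is the same squeeze between the 5% false-positive rate and limited statistical power that the Economist works through in the passage quoted below.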
If you read the Economist pieces, you’ll find that all of these are mentioned save #4 (peculiarity of one’s material). And the findings are disturbing. Here are just a few, quoted from the articles:
- Last year researchers at one biotech firm, Amgen, found they could reproduce just six of 53 “landmark” studies in cancer research. Earlier, a group at Bayer, a drug company, managed to repeat just a quarter of 67 similarly important papers. A leading computer scientist frets that three-quarters of papers in his subfield are bunk. In 2000-10 roughly 80,000 patients took part in clinical trials based on research that was later retracted because of mistakes or improprieties.
- . . . failures to prove a hypothesis are rarely even offered for publication, let alone accepted. “Negative results” now account for only 14% of published papers, down from 30% in 1990. Yet knowing what is false is as important to science as knowing what is true. The failure to report failures means that researchers waste money and effort exploring blind alleys already investigated by other scientists.
- Over the past few years various researchers have made systematic attempts to replicate some of the more widely cited priming experiments. [JAC: These are studies in which exposure to a stimulus before taking a test can dramatically affect the results of that test.] Many of these replications have failed. In April, for instance, a paper in PLoS ONE, a journal, reported that nine separate experiments had not managed to reproduce the results of a famous study from 1998 purporting to show that thinking about a professor before taking an intelligence test leads to a higher score than imagining a football hooligan.
- Academic scientists readily acknowledge that they often get things wrong. But they also hold fast to the idea that these errors get corrected over time as other scientists try to take the work further. Evidence that many more dodgy results are published than are subsequently corrected or withdrawn calls that much-vaunted capacity for self-correction into question. [JAC: Many experiments, particularly in organismal biology, are not repeated, nor form the basis of subsequent research. And the dodgy results can be seen by looking at obvious errors in published papers—papers that are not withdrawn or corrected.]
- . . . consider 1,000 hypotheses being tested of which just 100 are true (see chart). Studies with a power of 0.8 will find 80 of them, missing 20 because of false negatives. Of the 900 hypotheses that are wrong, 5%—that is, 45 of them—will look right because of type I errors. Add the false positives to the 80 true positives and you have 125 positive results, fully a third of which are specious. If you dropped the statistical power from 0.8 to 0.4, which would seem realistic for many fields, you would still have 45 false positives but only 40 true positives. More than half your positive results would be wrong. [JAC: this arithmetic is restated in code just below the list.]
- John Bohannon, a biologist at Harvard, recently submitted a pseudonymous paper on the effects of a chemical derived from lichen on cancer cells to 304 journals describing themselves as using peer review. An unusual move; but it was an unusual paper, concocted wholesale and stuffed with clangers in study design, analysis and interpretation of results. Receiving this dog’s dinner from a fictitious researcher at a made up university, 157 of the journals accepted it for publication. Dr Bohannon’s sting was directed at the lower tier of academic journals. But in a classic 1998 study Fiona Godlee, editor of the prestigious British Medical Journal, sent an article containing eight deliberate mistakes in study design, analysis and interpretation to more than 200 of the BMJ’s regular reviewers. Not one picked out all the mistakes. On average, they reported fewer than two; some did not spot any.
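The arithmetic in that 1,000-hypotheses passage is easy to check. The sketch below simply restates the Economist’s illustrative numbers (1,000 hypotheses, 100 of them true, a 5% significance level); it is not a survey of any real literature:

```python
# The Economist's illustrative arithmetic: 1,000 hypotheses, 100 of them true,
# tested at a 5% significance level with a given statistical power.
def specious_fraction(n_hypotheses=1000, n_true=100, power=0.8, alpha=0.05):
    true_positives = power * n_true                    # real effects detected
    false_positives = alpha * (n_hypotheses - n_true)  # type I errors
    return false_positives / (true_positives + false_positives)

print(f"power 0.8: {specious_fraction(power=0.8):.0%} of positive results are specious")
print(f"power 0.4: {specious_fraction(power=0.4):.0%} of positive results are specious")
```

Run as-is it prints 36% for a power of 0.8 and 53% for a power of 0.4, matching the “fully a third” and “more than half” figures in the quote.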
I find this next one very disturbing (my emphasis):
- Fraud is very likely second to incompetence in generating erroneous results, though it is hard to tell for certain. Dr Fanelli has looked at 21 different surveys of academics (mostly in the biomedical sciences but also in civil engineering, chemistry and economics) carried out between 1987 and 2008. Only 2% of respondents admitted falsifying or fabricating data, but 28% of respondents claimed to know of colleagues who engaged in questionable research practices.
And one more, which is pretty disturbing as well:
- Christine Laine, the editor of the Annals of Internal Medicine, told the peer-review congress in Chicago that five years ago about 60% of researchers said they would share their raw data if asked; now just 45% do. Journals’ growing insistence that at least some raw data be made available seems to count for little: a recent review by Dr Ioannidis showed that only 143 of 351 randomly selected papers published in the world’s 50 leading journals and covered by some data-sharing policy actually complied.
The Economist recommends several ways to fix these problems, including mandatory sharing of data and getting reviewers to work harder by reanalyzing, from the ground up, the data in the papers they review. The former is a good suggestion: many people in my own field, for example, refuse to send flies to other workers, even though they’ve published data from those flies. But reanalyzing other people’s data is almost impossible: we’re all busy, and it’s enormously time-consuming to redo a full data analysis.
My own suggestions include mandatory publication of raw data immediately, not after a delay (the current practice); mandatory sharing of the research materials on which you’ve published (in my case, fruit flies); a tenure and promotion review system that emphasizes quality rather than quantity of publication (the Economist mentions this as well); and less emphasis on getting grants. The purpose of a grant, after all, is to facilitate research. But the rationale has become curiously inverted: now the purpose of one’s research seems to be to get a grant, for the “overhead money” of a grant (a proportion of the funds awarded for a project that goes not to the science but to the university itself, for things like maintaining the physical plant) has become an important source of revenue for universities. Yet one can do a lot of good science on little money, especially if the work is theoretical, and the amount of NIH or NSF money you bring in should be relatively unimportant in judging your science. I’m proud that it’s official policy at the University of Chicago that grant monies are not counted when someone is reviewed for tenure or promotion.
Of course, there’s a correlation between grant money and scientific accomplishment, for most experimental scientists simply can’t do their work without external funding. But a lot of that money goes to support what I see as weak science, or science that, at least in my field, is faddish. And in many places the counting of accrued grant dollars or the number of publications becomes an easy but inaccurate way to judge someone’s science. As a colleague once told me when evaluating someone’s publications for promotion: “We may count ’em, and we may weigh ’em, but we won’t read ’em!”
Finally, there should be some provision (and the Economist mentions this as well) to fund people to replicate the work of other scientists. This is not a glamorous pursuit, to be sure: who wants to spend their life re-doing someone else’s studies? But how else can we find out if they’re right? I’m particularly worried about this in ecology and evolution, in which studies are almost never repeated, and I suspect that many of them can’t be generalized. Fortunately, the kind of genetic work I do is easily replicated, but it’s not so easy in field studies of whole organisms.
Let me end by saying that religious people and those who are “anti-scientism” will jump all over these articles, claiming that science can’t be trusted at all—that it’s rife with incompetence and even corruption. Well, there’s more of that stuff than I’d like, but when you look at all the advances in biology (DNA sequencing, for example), chemistry, physics, and medicine over the past few decades, and see how many important results have been replicated or at least re-tested by other investigators, you can see that science is still homing in, asymptotically, on the facts. Religion, in contrast, has made no progress, and the academic humanities often wend their way into dead ends like postmodernism.
Nevertheless, if you’re a scientist you simply must read these two articles, and I recommend them to others as well. They may seem alarmist, but they’re important. And the science journalism—the level of rigor and understanding these pieces show—is admirable. Kudos to the anonymous authors.