Science is in bad shape

October 22, 2013 • 6:05 am

There are two pieces in the latest Economist that are must-reads not just for scientists, but for science-friendly laypeople.  Both paint a dire picture of how credible scientific claims are, and how weak our system is for adjudicating them before publication.  One piece is called “How science goes wrong”; the other is “Trouble at the lab.” Both are free online, and both, as is the custom with The Economist, are written anonymously.

The main lesson of these pieces is that we shouldn’t trust a scientific result unless it’s been independently replicated—preferably more than once. That’s something we should already know, but what we don’t know is how many findings—and the articles deal largely with biomedical research—haven’t been replicable, how many others haven’t even been subject to replication, and how shoddy the reviewing process is, so that even a published result may be dubious.

As I read these pieces, I did so captiously, really wanting to find some flaws with their conclusions. I don’t like to think that there are so many problems with my profession. But the authors have done their homework and present a pretty convincing case that science, especially given the fierce competition to get jobs and succeed in them, is not doing a bang-up job.  That doesn’t mean it is completely flawed, for if that were true we’d make no advances at all, and we do know that many discoveries in recent years (dinosaurs evolving into birds, the Higgs boson, dark matter, DNA sequences, and so on) seem solid.

I see five ways that a reported scientific result may be wrong:

  • The work could be shoddy and the results therefore untrustworthy.
  • There could be duplicity, either deliberate fraud or a “tweaking” of results in one’s favor, which might even be unconscious.
  • The statistical analysis could be wrong in several ways. For example, under standard criteria you will reject a correct “null” hypothesis and accept an alternative but incorrect hypothesis 5% of the time, which means that something like 1 in 20 “positive” results—rejection of the null hypothesis—could be wrong. Alternatively, you could accept a false null hypothesis if you don’t have sufficient statistical power to discriminate between it and an alternative true hypothesis.  Further, as the Economist notes, many scientists simply aren’t using the right statistics, particularly when analyzing large datasets. (A small numerical sketch of this point follows the list.)
  • There could be a peculiarity in one’s material, so that your conclusions apply just to a particular animal, group of animals, species, or ecosystem.  I often think this might be the case in evolutionary biology and ecology, in which studies are conducted in particular places at particular times, and are often not replicated in different locations or years. Is a study of bird behavior in, say, California, going to give the same results as a similar study of the same species in Utah? Nature is complicated, with many factors differing among locations and times (food abundance, parasites, predators, weather, etc.), and these could lead to results that can’t be generalized across an entire species. I myself have failed to replicate at least three published results by other people in my field. (Happily, I’m not aware that anyone has failed to replicate any of my published results.)
  • There could be “craft skills”—technical proficiency, gained by experience, that isn’t or can’t be reported in a paper’s “materials and methods” and that makes a given result irreproducible by other investigators.
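
To make the statistics bullet concrete, here is a minimal simulation sketch, assuming Python with numpy and scipy; the sample size and effect size are arbitrary illustrative choices, not taken from any study. With the usual 5% criterion, roughly 5% of true null hypotheses get rejected, while a small experiment misses most modest real effects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_simulations, n_per_group = 10_000, 20   # arbitrary illustrative numbers

def rejection_rate(effect_size):
    """Fraction of simulated two-group experiments that reach p < 0.05."""
    rejections = 0
    for _ in range(n_simulations):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(effect_size, 1.0, n_per_group)
        _, p = stats.ttest_ind(control, treatment)
        rejections += p < 0.05
    return rejections / n_simulations

print("False-positive rate when there is no real effect:", rejection_rate(0.0))  # about 0.05
print("Power to detect a modest real effect (d = 0.5):  ", rejection_rate(0.5))  # roughly a third
```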

If you read the Economist pieces, all of these are mentioned save #4 (peculiarity of one’s material). And the findings are disturbing. Here are just a few, quoted from the articles:

  • Last year researchers at one biotech firm, Amgen, found they could reproduce just six of 53 “landmark” studies in cancer research. Earlier, a group at Bayer, a drug company, managed to repeat just a quarter of 67 similarly important papers. A leading computer scientist frets that three-quarters of papers in his subfield are bunk. In 2000-10 roughly 80,000 patients took part in clinical trials based on research that was later retracted because of mistakes or improprieties.
  • . . . failures to prove a hypothesis are rarely even offered for publication, let alone accepted. “Negative results” now account for only 14% of published papers, down from 30% in 1990. Yet knowing what is false is as important to science as knowing what is true. The failure to report failures means that researchers waste money and effort exploring blind alleys already investigated by other scientists.
  • Over the past few years various researchers have made systematic attempts to replicate some of the more widely cited priming experiments. [JAC: These are studies in which exposure to a stimulus before taking a test can dramatically affect the results of that test.] Many of these replications have failed. In April, for instance, a paper in PLoS ONE, a journal, reported that nine separate experiments had not managed to reproduce the results of a famous study from 1998 purporting to show that thinking about a professor before taking an intelligence test leads to a higher score than imagining a football hooligan.
  • Academic scientists readily acknowledge that they often get things wrong. But they also hold fast to the idea that these errors get corrected over time as other scientists try to take the work further. Evidence that many more dodgy results are published than are subsequently corrected or withdrawn calls that much-vaunted capacity for self-correction into question. [JAC: Many experiments, particularly in organismal biology, are not repeated, nor form the basis of subsequent research. And the dodgy results can be seen by looking at obvious errors in published papers—papers that are not withdrawn or corrected.]
  • . . . consider 1,000 hypotheses being tested of which just 100 are true (see chart). Studies with a power of 0.8 will find 80 of them, missing 20 because of false negatives. Of the 900 hypotheses that are wrong, 5%—that is, 45 of them—will look right because of type I errors. Add the false positives to the 80 true positives and you have 125 positive results, fully a third of which are specious. If you dropped the statistical power from 0.8 to 0.4, which would seem realistic for many fields, you would still have 45 false positives but only 40 true positives. More than half your positive results would be wrong.
  • John Bohannon, a biologist at Harvard, recently submitted a pseudonymous paper on the effects of a chemical derived from lichen on cancer cells to 304 journals describing themselves as using peer review. An unusual move; but it was an unusual paper, concocted wholesale and stuffed with clangers in study design, analysis and interpretation of results. Receiving this dog’s dinner from a fictitious researcher at a made up university, 157 of the journals accepted it for publication. Dr Bohannon’s sting was directed at the lower tier of academic journals. But in a classic 1998 study Fiona Godlee, editor of the prestigious British Medical Journal, sent an article containing eight deliberate mistakes in study design, analysis and interpretation to more than 200 of the BMJ’s regular reviewers. Not one picked out all the mistakes. On average, they reported fewer than two; some did not spot any.
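
The arithmetic in the 1,000-hypotheses excerpt above is easy to check for yourself; here is a minimal sketch in plain Python, with every number taken straight from the quoted passage (nothing here is new data).

```python
n_hypotheses, n_true, alpha = 1000, 100, 0.05

for power in (0.8, 0.4):
    true_positives = power * n_true                     # real effects actually detected
    false_positives = alpha * (n_hypotheses - n_true)   # type I errors: 45 either way
    share_specious = false_positives / (true_positives + false_positives)
    print(f"power {power}: {true_positives:.0f} true, {false_positives:.0f} false, "
          f"{share_specious:.0%} of positive results are specious")

# power 0.8: 80 true, 45 false -> 36% specious (about a third)
# power 0.4: 40 true, 45 false -> 53% specious (more than half)
```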

I find this next one very disturbing (my emphasis):

  • Fraud is very likely second to incompetence in generating erroneous results, though it is hard to tell for certain. Dr Fanelli has looked at 21 different surveys of academics (mostly in the biomedical sciences but also in civil engineering, chemistry and economics) carried out between 1987 and 2008. Only 2% of respondents admitted falsifying or fabricating data, but 28% of respondents claimed to know of colleagues who engaged in questionable research practices.

And one more, which is pretty disturbing as well:

  • Christine Laine, the editor of the Annals of Internal Medicine, told the peer-review congress in Chicago that five years ago about 60% of researchers said they would share their raw data if asked; now just 45% do. Journals’ growing insistence that at least some raw data be made available seems to count for little: a recent review by Dr Ioannidis showed that only 143 of 351 randomly selected papers published in the world’s 50 leading journals and covered by some data-sharing policy actually complied.

The journal recommends several ways to fix these problems, including mandatory sharing of data and getting reviewers to work harder by reanalyzing the data in a reviewed paper from the ground up.  The former is a good suggestion: many people in my own field, for example, refuse to send flies to other workers, even though they’ve published data from those flies. But reanalyzing other people’s data is almost impossible.  We’re all busy, and it’s enormously time-consuming to redo a full data analysis.

My own suggestions include mandatory publication of raw data immediately rather than after a delay (the current practice); mandatory sharing of research materials on which you’ve published (e.g., fruit flies); a tenure and promotion review system that emphasizes quality rather than quantity of publication (the Economist mentions this as well); and less emphasis on getting grants. The purpose of a grant, after all, is to facilitate research. But the rationale has become curiously inverted: now the purpose of one’s research seems to be to get a grant, for the “overhead money” of a grant (a proportion of the funds given for a project that go not for science, but to the university itself for stuff like supporting the physical plant) has become an important source of revenue for universities.  But one can do a lot of good science on little money, especially if you do theoretical work, and the amount of NIH or NSF money you bring in should be relatively unimportant in judging your science. I’m proud that it’s official policy at the University of Chicago that grant monies are not counted when someone is reviewed for tenure or promotion.

Of course, there’s a correlation between grant money and scientific accomplishment, for most experimental scientists simply can’t do their work without external funding. But a lot of that money goes to support what I see as weak science, or science that, at least in my field, is faddish. And in many places the counting of accrued grant dollars or the number of publications becomes an easy but inaccurate way to judge someone’s science. As a colleague once told me when evaluating someone’s publications for promotion: “We may count ’em, and we may weigh ’em, but we won’t read  ’em!”

Finally, there should be some provision (and the Economist mentions this as well) to fund people to replicate the work of other scientists. This is not a glamorous pursuit, to be sure: who wants to spend their life re-doing someone else’s studies? But how else can we find out if they’re right? I’m particularly worried about this in ecology and evolution, in which studies are almost never repeated, and I suspect that many of them can’t be generalized.  Fortunately, the kind of genetic work I do is easily replicated, but it’s not so easy in field studies of whole organisms.

Let me end by saying that religious people and those who are “anti-scientism” will jump all over these articles, claiming that science can’t be trusted at all—that it’s rife with incompetence and even corruption.  Well, there’s more of that stuff than I’d like, but when you look at all the advances in biology (DNA sequencing, for example), chemistry, physics, and medicine over the past few decades, and see how many important results have been replicated or at least re-tested by other investigators, you see that science is still homing in, asymptotically, on the facts.  In contrast, religion has made no progress, and the academic humanities often seem to wend their way into dead ends like postmodernism.

Nevertheless, if you’re a scientist you simply must read these two articles, and I recommend them to others as well. They may seem alarmist, but they’re important.  And the science journalism—the level of rigor and understanding these pieces show—is admirable. Kudos to the anonymous authors.

105 thoughts on “Science is in bad shape”

  1. Sadly the practice of tweaking results is commonplace because of egos, and also, on the medical side, drug companies are determined to make money one way or another.

    The negative results bit is particularly key – as it means that of all studies seen most are positive or at least said to be. So there’s a whole industry with skewed views. Negative studies can be learned from yet we never see them.

    Also, a 95% confidence level (i.e., a 5% significance threshold) is pretty lax. 1% or 2% should be the standard.

    Sadly like any field there are schemers, money makers and shady characters trying to increase their academic respectability but doing the opposite if found out.

    1. “For example, under standard criteria you will reject a correct “null” hypothesis and accept an alternative but incorrect hypothesis 5% of the time, which means that something like 1 in 20 “positive” results—rejection of the null hypothesis—could be wrong.”

      This is only true if about half of the null hypotheses you’re testing really are false; if you’re testing a null that is unlikely to be false, it’s much worse. Imagine that you test 1000 different plant extracts on mice to see which ones make the mice sick. In reality, 500 of them make mice sick, and 500 don’t. With a significance level of 5%, you have 25 false positives (5% of 500) and 500 true positives (assuming you have sufficient power), so about 5% of your positives are false positives.
      Now imagine that you’re testing 1000 plant extracts to see whether they cure cancer, and in reality, only one of them does. You get one true positive, but 50 (5% of 999) false positives. So 98% of your results are crap–publishable crap, but crap nonetheless.
      While I think a formal Bayesian analysis is usually impractical (we generally don’t have a good idea of the probability of a true positive before we do the experiment, that’s why we’re doing the research), it’s important to keep in mind that rejecting a null hypothesis that is likely to be true should require a much smaller P value than 0.05. Or in other words, “Extraordinary claims require extraordinary evidence.”

      1. Good point, John; I agree. However, as you say, it’s hard to judge a priori how likely a null hypothesis is to be true. But I think the 0.05 level is generally too high. In physics it’s MUCH lower.

        1. Of course in this case every reasonable scientist would adjust the P value, because so many tests are performed. Usually P values are far below 0.05 when there are real effects (but of course dependent on power, so sample size should be wisely chosen).

      2. How do you find this out “in reality 500 of them do and 500 don’t “? Aren’t you trying to find out exactly this in the first place? How do you know if you have false positives?

        1. Scientists are required to entertain the possibility that false positives exist in data. Science journalists idem. Probability levels can be and usually are corrected for in the statistical procedure (correcting for multiple comparisons with a 5% error rate for the entire experiment). If it is very important to eliminate false positives (and to identify true positives), as when screening for effective cancer treatments, more tests are made on candidate drugs until it becomes very very unlikely that false positives (or false negatives) remain.
          Highly reliable results are usually expensive to obtain, but no matter the statistical certainty, results may be suspect as resulting from a failure in the author’s laboratory. Science solves this by having skeptics repeat important studies (i.e., those provoking disagreement because they appear to reveal something outside-the-box). Replication (sensu lato) also helps identify possible methodological and other problems in the original study, even if not fatal to the conclusions.

      3. These are great examples of how the nature and object of a study can affect the choice of methodology and interpretation. There are two points I want to make here regarding these examples.

        First, if you’re testing, at a 5% significance level, 1000 plant extracts to see whether they cure cancer and only 1 actually does, this still produces useful information. Having 1 true positive out of 51 positive results is better than having 1000 plant extracts with no indication of which may produce an effect. This is why replication is so important. A replication of this experiment would not necessarily pick out the same 51 extracts to give positive results (this would be highly unlikely actually, since most are positive just by chance). So several repetitions of this experiment may allow us to zero in on the one (or at least just a few) extracts that actually do exhibit a true effect.

        Second, there may exist far more efficient ways of analyzing this type of problem. The main issue here is that you are simultaneously wishing to test 1000 hypotheses. Classical frequentist methods are likely to be too inefficient (in the sense described above), and perhaps just theoretically less appropriate than a more Bayesian approach. My point here is that statistics should not be treated as a “plug-and-play” tool for objective analysis. Too often we seem to rely on methods and techniques that we have used before without really scrutinizing whether or not those methods are the most appropriate to apply to the particular study at hand.

        1. P-values should be abandoned for most applications of the kind discussed here. Meaningful measures of the magnitude of the effect should be used, and their statistical uncertainty should be expressed as confidence intervals. Ecological and medical research is badly tainted by the insane over-reliance on uninterpreted (and often uninterpretable) measures used simply as a tool to generate p-values.

  2. Science itself can be trusted but scientists at all levels are under pressure to publish. Recent scandals concerning plagiarism and peer review lead to mistrust and make it easier to attack Science rather than the individuals involved.

      1. But arguably, science is what all scientists do, not what a single scientist does. See my comment below about Popper’s take on “Crusonian science”.

      2. I think that “science can be trusted but scientists often can’t” means that the method is a reliable one, but it’s hard to follow.

        When my woo-friends try to tell me that belief in the paranormal, alternative medicine and religion all comes down to nothing more than a decision about WHO you want to trust … and I’ve decided to trust scientists … I argue no. Science is not the system you adhere to if you “trust” scientists. It’s the system you use if you DON’T trust scientists. The way it’s set up is supposed to bypass the concept of placing your belief in an individual’s reliability and instead substitute it with skepticism towards all.

        The other method is the one of mysticism. Someone tells you what happened to him or her and you accept or don’t accept. Testing doesn’t come into it because it can’t.

      3. I think what is meant is that the scientific method can be trusted. The human element, the scientist, is not always trustworthy.
        Also, a less than capable scientist is bound to design a less than suitable experiment leading to throwaway/unhelpful results.
        That’s what my thinking is anyway.
        I oppose the use of “science is in bad shape” and prefer “Some scientists are in bad shape leading to the publication/promotion of a bad body of knowledge”.
        …Or something like that! I need some help here.. I hear Jerry also teaches scientific writing!

        The idea being: let’s also not give the Creationists/other anti-science ghouls false ammunition. They already have plenty of that.

      4. Well, scientists are supposed to apply the scientific method… they must hold themselves up to proper standards as well as what you and John McDonald mentioned above.
        Otherwise, you get studies/results that are not reproducible… and results that are simply not honest.

  3. I would add re: religious criticism that it is obvious that the scientific community sees this problem, is facing it, and will find a way to clean house.

    Ever see a religion do that? L

  4. If all raw data was made available immediately, I could see a lot of scientists waiting to publish. If you collect a bunch of excellent, elusive, and expensive data that you still have ideas on how to utilize, you don’t want to get sniped by someone who did not put the same time and money into collecting it.

    1. Maybe so, but I don’t see a problem with it. They’ll still be pressured to publish the results, especially in competitive fields where there’s a chance somebody might independently discover the same thing, so they won’t delay forever. If such a rule was enforced, there may be an immediate delay in publication, but after a year or two when papers start flowing out of the other end of the delay pipe, the rate of publication would rise back up to normal levels. (Results may be delayed, but once you’re past the gap few would care or notice.) And in addition, we’d have the data.

      Refusing to share your data is just unfriendly and un-Scientific and harms the whole enterprise.

  5. Another issue: If you test 20 different variables, one of them is likely to be significant at the 5% level. This is often the case with pseudoscience (like studies designed to measure the “efficacy of prayer” – something is always significantly different, but it’s never the same thing).

    It’s a problem for those of us in the molecular world, too, in this era of big data, when the levels of, e.g., all mRNAs are measured after some treatment to our favorite cell line. The hard part is judging whether correlations are meaningful in the light of known information about pathways and other interactions. (Not to mention the fact that cell lines that are nominally the same have diverged upon being cultured in different laboratories.)
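
    A back-of-the-envelope check of the 20-variables point above, as a minimal Python sketch (the 20 tests and the 5% level are just the numbers from the comment; the Bonferroni correction is one standard, if conservative, remedy):

    ```python
    alpha, n_tests = 0.05, 20

    # chance of at least one spurious "significant" result if all 20 nulls are true
    p_any_false_positive = 1 - (1 - alpha) ** n_tests
    print(f"P(at least one false positive) = {p_any_false_positive:.2f}")   # ~0.64

    # Bonferroni: test each variable at alpha / n_tests, which holds the
    # family-wise error rate at roughly 5%
    print(f"Corrected per-test threshold = {alpha / n_tests}")              # 0.0025
    ```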

  6. I saw these articles and cringed because of the inevitable outcome of those who are opposed to science jumping on them. I’m glad you did a careful review of what was written.

    I’m proud that it’s official policy at the University of Chicago that grant monies are not counted when someone is reviewed for tenure or promotion.

    This is the classic problem of certain Key Performance Indicators driving the wrong behaviour when grant money is considered for tenure. 🙁 Wrong KPIs are often at the root of bad behaviours in all sorts of professions.

    I also find things most upsetting when it comes to biomedical research – what a mess. It reminds me of Ben Goldacre’s work. I like the idea of forced replication – perhaps it is something that could be built into the Key Performance Indicators when considering tenure and a requirement for publishing in a journal. That way it’s rewarding to replicate stuff as long as it isn’t part of some poor person’s entire job.

  7. One thing that immediately jumps out as something everybody should be aware of is: one study proves nothing. Not even to mention that in statistical considerations a certain number of false positives is seen as acceptable. More importantly, science is a social endeavour, a point well made by Popper (in The Open Society and Its Enemies, ch. 23), when he considers whether Robinson Crusoe could do science—before the arrival of Friday:

    In order to apply these considerations to the problem of the publicity of scientific method, let us assume that Robinson Crusoe succeeded in building on his island physical and chemical laboratories, astronomical observatories, etc., and in writing a great number of papers, based throughout on observation and experiment. Let us even assume that he had unlimited time at his disposal, and that he succeeded in constructing and in describing scientific systems which actually coincide with the results accepted at present by our own scientists. Considering the character of this Crusonian science, some people will be inclined, at first sight, to assert that it is real science and not ‘revealed science’. And, no doubt, it is very much more like science than the scientific book which was revealed to the clairvoyant, for Robinson Crusoe applied a good deal of scientific method. And yet, I assert that this Crusonian science is still of the ‘revealed’ kind; that there is an element of scientific method missing, and consequently, that the fact that Crusoe arrived at our results is nearly as accidental and miraculous as it was in the case of the clairvoyant. For there is nobody but himself to check his results; nobody but himself to correct those prejudices which are the unavoidable consequence of his peculiar mental history; nobody to help him to get rid of that strange blindness concerning the inherent possibilities of our own results which is a consequence of the fact that most of them are reached through comparatively irrelevant approaches. And concerning his scientific papers it is only in attempts to explain one’s work to somebody who has not done it that we can acquire those standards of clear and reasoned communication which too are part of scientific method.

    1. Excellent quote. The communal nature of science is absolutely central. It’s not enough to be able to convince people who want to agree with you or who are easy to convince. Your work has to be clear and rigorous enough to be able to convince skeptics. You need to take your flawless study and beautiful theories to your harshest critics and ask “now — where am I wrong?”

      This is the part that people in love with pseudoscience, mysticism, and spiritual forms of thought simply don’t get. They are in love with the romantic image of the Solitary Thinker being brilliant and right and they applaud his or her ability to pay no attention to the nay-sayers. Believe in yourself. Ignore those who criticize you. The problem is with them, not you.

      And the fact that we’re supposed to be dealing with common pursuit of discovering the nature of reality is lost under the romantic myth of the maverick hero.

      1. The “communal nature of science” is definitely key, which is why Jerry’s suggestion of funding (and, presumably, publishing and counting toward tenure or promotion) research that seeks to replicate existing findings is an important one. Until that happens, scholars will tend to leave existing findings alone and seek out their own, independent contributions to make.

      1. No one these days knows the story of Robinson Crusoe. You need to bring your example up to date with a current line. Try Tony Stark.

        1. Ha ha! Don’t be so sure. Someone at work drove in in an Audi R8 & I said to my co-workers, hey, look there is Tony Stark! No one got it & having to explain the joke took the funny out of it.

  8. Yes, last week’s Economist. I read it carefully and, sadly, was not surprised for all the reasons suggested. I also instantly thought of this blog and how much ammunition was served up to the creationist “friends” of mine. I can see them salivating and rubbing their hands with glee.

    However, of course, we remain unmoved. Ok, well maybe humbled a bit, but only because we scientists are happy to accept criticism and then work to correct the error of our ways.

  9. Refereeing in Maths/Stats journals tends to be quite thorough and harsh, and I’m always struck by how much looser the refereeing is when I publish in an application-area journal rather than a Stats journal. A few lines saying “Yeah, that looks OK” and that’s about it.

    I was very much taken by the claim that about 3/4 of papers in Machine Learning are wrong due to overfitting. It has long been a bugbear of mine that machine learning models/algorithms were nearly always deterministic and hence would almost inevitably overfit the data unless the real-world variation really was tiny.

    Fortunately in recent years Machine Learning folks discovered Bayes and are now doing loads of interesting work with properly probabilistic models.

    1. It has long been a bugbear of mine that machine learning models/algorithms were nearly always deterministic and hence would almost inevitably overfit the data unless the real-world variation really was tiny.

      I don’t get it. What is the connection between determinism and overfitting? You can only overfit when you are evaluating the algorithm on the same data that you used to train it. I do see this happening in papers from time to time but I have a hard time believing that 3/4 of the papers suffer from this problem.

      It is the nature of machine learning algorithms that they only give good answers in the same domain that they have been trained for. Face detection for example usually fails miserably when you are using HD images and the algorithm was trained on low-res images or vice versa.

      Fortunately in recent years Machine Learning folks discovered Bayes and are now doing loads of interesting work with properly probabilistic models.

      Bayes has been used for a long time in machine learning. The problem is that for any other than toy problems you almost never have enough data to get meaningful joint probabilities since more often than not the models depend on quite a few random variables.
      Usually you have to be more crafty and use Bayesian Networks, heuristics or you have to limit the problem domain.

      1. Are we using “overfitting” in different ways?

        I mean fitting a too-complex model so that it fits the original sample data very well but does not generalise to new data from the same population.

        This usually happens when the model treats the “noise” in the original data as though it were “signal”, so just reproduces the unique features of that one sample.

        The toy example I usually use in class is that if you have a scatterplot with 20 data points and fit a model with 20 parameters then it will reproduce the data exactly, but be totally useless.

        Deterministic models are especially prone to this because they don’t explicitly distinguish between signal and noise (mean and variance). Obviously cross validation type methods can be used to get round this to some extent.

        I agree that 3/4 sounds like a lot, so it’s interesting that this is ascribed to a computer scientist at MIT.

        Sorry, I’ll shut up now, I’ve already broken my personal record for most comments on a thread anywhere.
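
        (A minimal numpy illustration of the 20-points toy example above; the linear “true” signal and the noise level are arbitrary choices for the sketch, not anyone’s actual data. The flexible model fits the training sample almost perfectly and does much worse on fresh data than a straight line.)

        ```python
        import numpy as np
        from numpy.polynomial import Polynomial

        rng = np.random.default_rng(1)
        x_train = np.linspace(0, 1, 20)
        y_train = 2 * x_train + rng.normal(0, 0.3, 20)   # linear signal plus noise
        x_test = rng.uniform(0, 1, 200)                  # fresh data from the same population
        y_test = 2 * x_test + rng.normal(0, 0.3, 200)

        for degree in (1, 19):
            fit = Polynomial.fit(x_train, y_train, deg=degree)
            train_mse = np.mean((fit(x_train) - y_train) ** 2)
            test_mse = np.mean((fit(x_test) - y_test) ** 2)
            print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
        ```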

        1. No that is exactly what I mean by overfitting. As you said: If you overfit you can reduce the training error to zero but your model is useless for predicting new data.

          And that is also why it doesn’t matter as long as you evaluate your algorithm on a different dataset than the set you have trained it on. The model may be too complex for the training data but if the algorithm is also good at handling previously unseen data: who cares? Usually it doesn’t but technically it could if the test data happens to be in an area where the learned parameters do give good predictions.

          Deterministic models are especially prone to this […]

          Ok I think I understand what you mean. I wouldn’t use the word “deterministic” for it since everything in a computer is deterministic. I think what you meant are discriminative models as opposed to generative models, which work with an underlying probability distribution.

          1. Yes, by “deterministic” I mean models with no probability distribution involved, as against probabilistic (stochastic) models. That’s usual statistical terminology, I guess the alternative terminology you use is common in machine learning?

            This is as distinct from deterministic v probabilistic algorithms, where the latter could include random numbers in its search for an optimal fit to a possibly deterministic model.

            But, as you say, even those are deterministic really as they use pseudo-random numbers.

            (Broke my promise not to bang on any more, but I find this sort of terminology disentangling very useful).

          2. I guess the alternative terminology you use is common in machine learning?

            In machine learning, yes that is the standard terminology. However in other branches of computer science (usually the ones with a more abstract outlook) it is used differently. But to my mind there are subtle differences in meaning.

            When theoretical computer science people talk about deterministic vs. randomized algorithms they usually mean that a randomized algorithm needs to actually draw (pseudo-)random numbers in order to operate.

            Whereas when we talk about generative models it usually only means that the algorithm uses a probabilistic model (like a Gaussian mixture model) to make inferences. But that doesn’t necessarily mean that the algorithm actually needs to sample (pseudo-) random numbers.

            In order to apply Bayes it is often enough that the parameters of the probability distributions are known and sampling of new data isn’t required. That would be classified as a generative model but not as a randomized algorithm.

            […]where the latter could include random numbers in it’s search for an optimal fit to a possibly deterministic model.

            Right. I would call that a randomized algorithm to fit a discriminative model.

            But regardless of the terminology. I would contest the claim that algorithms that aren’t based on a probabilistic signal-noise modelling are inherently more susceptible to noise. There are many incredibly successful techniques to fit deterministic models in heavily cluttered environments like RANSAC.
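
            (For readers who haven’t met it, here is a minimal Python/numpy sketch of the RANSAC idea mentioned above: fit a line by repeatedly sampling two points and keeping the model with the largest consensus set. The tolerance, iteration count, and demo data are arbitrary illustrative choices, not anyone’s production code.)

            ```python
            import numpy as np

            def ransac_line(points, n_iters=200, tol=0.05, rng=None):
                """Fit y = a*x + b to the largest consensus set found by random sampling."""
                rng = np.random.default_rng(rng)
                best_inliers = np.zeros(len(points), dtype=bool)
                for _ in range(n_iters):
                    i, j = rng.choice(len(points), size=2, replace=False)
                    (x1, y1), (x2, y2) = points[i], points[j]
                    if np.isclose(x1, x2):
                        continue                                   # near-vertical pair, skip it
                    a = (y2 - y1) / (x2 - x1)
                    b = y1 - a * x1
                    inliers = np.abs(points[:, 1] - (a * points[:, 0] + b)) < tol
                    if inliers.sum() > best_inliers.sum():
                        best_inliers = inliers
                # refit by ordinary least squares on the consensus set only
                a, b = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], 1)
                return (a, b), best_inliers

            # tiny demo: a noisy line with 40% gross outliers ("clutter")
            rng = np.random.default_rng(0)
            x = rng.uniform(0, 1, 100)
            y = 3 * x + 1 + rng.normal(0, 0.02, 100)
            y[:40] = rng.uniform(-5, 5, 40)
            (a, b), inliers = ransac_line(np.column_stack([x, y]))
            print(f"recovered slope {a:.2f}, intercept {b:.2f}, {inliers.sum()} inliers")
            ```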

  10. It seems to me that this is a role that the journals could legitimately fill and justify exorbitant fees for, though who should pay the fees would be an important question.

    Specifically, a journal that had a policy of not publishing anything until the results had been independently replicated would be one worth a great deal, indeed.

    In cases such as the LHC where the costs of full independent replication are prohibitive, an ongoing, invasive, intrusive independent audit would likely be the way to go. Think of something at the level (at least) of UN weapons inspectors, present during all phases of research.

    Of course, even these sorts of things don’t eliminate the possibility of collusion and corruption….

    b&

    1. Ugh. I work at a government owned, contractor operated laboratory (I’ll leave the agency up to you to guess) and have enough people looking over my shoulder already including feds, lab management, safety, technical, etc. While that can be important in some areas of research it can be stifling and overbearing in others.

      Also we have outrageously high overhead rates to pay for all these over-the-shoulder-lookers. When you set up bureaucracies like those mentioned above, guess what: there is even less available for actual science. Funding sources are already tight enough.

      1. Obviously, not all fields should be treated the same way. If we had the level of public oversight you’re describing in the pharmaceutical research industry, Jerry likely never would have had reason to write that essay.

        And the real, honest, obvious answer is that we need to properly fund basic research. Bill Gates doesn’t need any more yachts, but taxing him at anything other than today’s unprecedentedly-low levels is apparently tantamount to roasting babies on a spit with apples in their mouths. And, if I’m not mistaken, the entire LHC has only cost as much as a single aircraft carrier, and we’ve got a dozen of those in service (as many as the rest of the world combined, and our smallest one dwarfs the largest of the rest) with several times as many retired.

        Imagine your lab had the budget overseen by a single captain of a single Navy cruiser; I doubt you’d even have a clue where to begin spending that much money.

        Cheers,

        b&

        1. While some might want the budget of an aircraft carrier for their lab, I am not one of them. At that point it wouldn’t really be my lab anyway as it is just not possible to keep up with the amount of people and work that it would take to spend such sums. Running a large research group may sometimes resemble running a military hierarchy, but IMO that does not necessarily get the best ‘bang for the buck’. There is also some evidence to back this up:

          http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0065263

          If we put a couple of aircraft carriers’ worth of money back into the overall scientific funding, I agree that that IMO would be a good thing!

          Despite the tendency to pile on Big Pharma, I think it can be safely said that they already have a higher level of oversight than most researchers (as they should when lives are directly at stake). We can debate the best way to effectively implement such oversight, whether or not it is funded and executed well enough, but the oversight does exist.

  11. This is pure speculation on my part, but I wonder if Ben Goldacre was involved in this article at all. It seems to touch upon many of the issues he’s been discussing for some time.

    For those who don’t know, he’s a British GP and writer who wrote a very good book called Bad Science (he also has a bl*g of the same name). It looks into many issues like this, alongside a great deal of other pseudoscience and poor science reporting by the media. It’s a very good book.

    1. I’d like to echo the love for Ben Goldacre. I’m actually in the middle of reading his next book, Bad Pharma, which echoes many of these points. Maybe there is a movement just waiting to happen for the demand of transparency, particularly in the pharma industry. One can hope.

  12. I guess it takes good science and scientists to expose bad science. Who else could do it? The lesson it teaches is, “Be wary and don’t accept scientific results as if they were proclaimed by God.”

    Another point – So called scientific journals purposely publish trash on Prayerful Healing. Journals choose reviewers carefully so that their religious bias will be substantiated.

  13. I think the underlying factor making so many published papers dubious in my field (cell biology, though I suspect it is equally true in all fields) is confirmation bias. We are all wedded to our own hypotheses, and if we don’t start from the presumption of being incorrect(“extraordinary claims” and all that) it is very easy to fool one’s self and let shoddy data and interpretations through. “Hey, that 50 kDa band on the western blot is about the same size as my protein, and it’s behaving like I expect it to, so it MUST be right!” Having postdoc’d in a very high-profile lab, I know that this is not just a problem limited to supposedly marginal scientists. However, the beauty of the scientific endeavor is that, despite all the bunk (that dramatically impedes progress), with time reality rises over the noise. Even when no one makes an attempt to replicate an experiment directly, we all build upon what was published before, and flawed papers cannot be built upon. Their findings get discarded and those who constantly publish tenuous science eventually develop a reputation for doing so. So, yes, we need to be on a better guard against our own tendency to hoodwink ourselves (as well as the less common outright fraud), but despite it all, science still works because the whole enterprise, even if not some of the individual scientists practicing it, recognizes the need for rigor and replication.

  14. A few years back some serious criticisms came out against studies in medicine (I forget the name of the critic, started iirc with an "I") and the alternative medicine advocates had a field day. For some strange reason they thought that the researcher had VINDICATED them and was on their side.

    It was childish. “See? See? This guy is saying what we’ve been saying — your science is bad. So you have NO RIGHT criticizing us for being sloppy. YOU do it. I bet you do it worse.”

    No. The problem with mainstream scientists not being rigorous enough is that they need to become MORE stringent and careful, not more sympathetic and understanding and accepting of pseudoscience in dodgy alt med journals. The tu quoque approach missed the whole point.

    The folks who wrote those articles in the Economist are not going to do a follow-up article praising reiki and homeopathy studies. If they looked at the kind of crap that passes for research there, they’d probably pass out.

  15. Part of what The Economist suggested is that the self-correcting mechanism of science is broken; I see this issue coming up as evidence that the self-correcting mechanism is working.

  16. It seems like most of the examples come from medicine or social sciences, which, because you’re studying populations of organisms, are enormously complex and difficult to study.

    Would, say, physics suffer from the same problem?

    1. Physics (at least my branch of it) has the advantage of requiring very high significance to claim proof (5 sigma, i.e. less than a 1:1 million chance of the null hypothesis producing your observation), with the big caveat that this always depends on having a good estimate of your errors!

      But we still have the problems of pressure to publish and no reward for replicating results. I don’t think this is a big problem with experimental claims, which people are pretty cautious about, but lots of theoretical papers probably have some calculational or coding error because no one is double checking. If it is an important idea however then mistakes tend to get ferreted out.

      1. “If it is an important idea however then mistakes tend to get ferreted out.”

        That was my thought on the original article; probably a lot of the studies which no one even attempts to replicate don’t have any significance anyway. I would think there would be greater alarm for an unreplicated study that was cited a huge number of times.

    2. There seems to have long been a trend (maybe even a law) that the closer one gets to humans, the shakier the science gets. I think (guess alert!) that this is both because matters become more technological (by accident) and because the systems are more complicated. Political economy makes particle physics look easy, if put in the proper perspective! 🙂

      1. We’re not even capable of predicting the outcome of our own actions on the hysteria-prone stock markets.

        Self-imposed problems seem to be an unavoidable part of our existence.

        Oh, the humanities.

    3. All IMO, but…

      Physics and chemistry do suffer from it somewhat. I’d say the difference is that in physics and chemistry (compared to the social sciences), reported experiments are more often replicable in theory; in practice, however, people don’t do those replications any more often than they do them in the social sciences. The comparatively low professional reward for doing replication (rather than original research) is a problem across all of the sciences.

      1. The reactions in Organic Syntheses are tested and optimized before they are published – a hundred year labor of love which has saved the community an enormous amount of time and money. But I don’t know of many other cases.

  17. Something that has always been a source of wonder for me, is how the peer-reviewing process is most of the time quite fundamentally biased and inadequate.
    I mean it’s considered normal when your work is reviewed not to know the name of the colleagues doing the review. But how come it’s not the other way around? For most journals (and I know there are exceptions here), the reviewer knows the name of the people she’s reviewing, the institution they belong to and so on…
    How on earth can we be naive to the point of considering that this would not affect the outcome of the reviewing process? Imagine a young researcher receiving a manuscript authored by a bigwig in her field. Don’t you think that there will be a bias towards considering that if she doesn’t understand something, then she must be at fault, not the authors who obviously know their business? Of course it’s difficult to quantify, but wouldn’t it seem natural to apply the gold standard of “double blind” in peer-reviewing to avoid this source of bias?

    I’ll be interested to know the thoughts of others on this issue…

    1. Double blind reviewing happens in some fields but is fairly rare in general, I’ve only done it once IIRC.

      As a youngish researcher I did once reject a paper by a major bigwig in my field, although I strongly suspect it was the co-author’s work. I like to think I would have done the same anyway.

      1. I didn’t mean to imply that the integrity of the fictional young researcher I described was in doubt. Just that at some unacknowledged level, her decision would be affected in some way… But it works the other way around too: an innovative paper authored by a no-name might be hastily dismissed, say…

    2. “Double blind” would probably work in some fields, but in others, I think, it would be just a joke.
      I’m a field ecologist, and even though I’m a complete nobody, finished my PhD only 2 years ago and have a horribly short publication list, anybody who gets to review any of my manuscripts can find out who it’s by in 30 seconds via Google – there is simply nobody else working in my study area with my target organisms. And anybody working on a similar topic will at least know what work group it’s from anyway.

      1. This is the case for my field also… probably because I would never have entered an overcrowded field in which individuals were interchangeable.

        Also, after reading some article arguing for double-blind peer-review as an ideal, I opted for anonymity when reviewing for several years, and found that (a) I was never mentioned in acknowledgements, (b) the citation rate of my papers dropped off, and (c) rate of requests to review diminished almost to zero. Maybe other causes were operating, but observations suggest that the anonymity experiment broke my academic career.

  18. Complex issue.

    Better reviewers on the publication end might help, but they are all working for FREE as it is.

    Better statistics might help too. But in the end repetition is what will sort things out. But you cannot get funded to repeat things, and you will not be able to publish the repeated work.

    In my experience, large laboratories with many workers competing for the attention of a PI who is not actually in the laboratory, in combination with a surfeit of inexperienced workers and students, are a recipe for mistakes.

    Labs that have a flatter structure and a backbone of experienced and skeptical people sort out the nonsense.

  19. As for repetition: we know that exact replication is impossible. So why not work to develop “comparable techniques”? This only works if one has a good body of theory and not just data churning. For example, we know the value of the Avogadro constant (Happy Mole Day everyone, once that starts!) because many independent ways can be used to check each other. In some fields, by contrast, we have “simply” “X is correlated with Y under conditions Z.” This doesn’t give understanding either.

    My comparison takes extremes, rather than the fuzzy middle, but why not move towards the former, rather than the latter?

    As for institutional factors: I have long been an advocate for “we didn’t actually find a connection” journals, or at least databases.

  20. There could be duplicity, either deliberate fraud or a “tweaking” of results in one’s favor, which might even be unconscious.

    There is another form of duplicity which (IMO) is far more common in corporate-driven biological research, and which is neither tweaking nor (classic) fraud. It’s cherry picking: running 10 studies, and publishing only one. Typically the one that shows your product had a positive effect.

    One suggestion I heard (several years back) for reining in this practice is for the FDA to only count trials that a company declares to them before it’s carried out…and then to count all of those, with any declared trial that goes unreported reducing the statistical significance of the trials that are reported.

    Unfortunately, the FDA is so stressed for resources right now that AFAIK they really only check out claims of harm now, not claims of efficacy. So even if this solution was implemented, I have my doubts that they’d have the practical resources to do anything about cherry picking. They’d know it was going on, but not be able to punish the offenders.

  21. There may well be postdocs who would be willing to spend time attempting to replicate experiments rather than to be forced out of science altogether because there are not enough openings.

    1. Well, yes, but the reason they are forced out is because there isn’t enough money to employ them all. And that won’t change if you reallocate some money to replicate experiments.

  22. “I’m proud that it’s official policy at the University of Chicago that grant monies are not counted when someone is reviewed for tenure or promotion.”

    Wow. When I read this I almost fell out of my chair. This is the single most stressful aspect of P&T. I had no idea that this policy existed anywhere among research universities. Even my third tier medical school places grant funding near the top of its consideration for tenure.

  23. Granted, it was in a poli-sci/social science context, but I witnessed the “publish or perish” dictum drive many of my grad school peers to regularly falsify data to align with their hypotheses.

    When a group project based on such doctored info was submitted and won a national prize, I became disheartened at the enterprise. Maybe I should have been in a “proper” science field… Or maybe the incentives are perverse?

  24. One thing should be noted here. One of the publications cited by Dr. Coyne (Trouble at the lab in the Economist) in turn cites a paper published in Nature (Nature 483, 531–533 (29 March 2012)) written by two Amgen scientists who claim that they could only repeat 6 of 53 pre-clinical “landmark” studies in oncology. However, none of *those* results – the successful 6 and the failed 47 experiments – were themselves published. The authors could not publish them because of confidentiality agreements that precluded them doing so. So, to me, there are two problems here. One is the confidentiality agreements – these should not have been required. The original studies are public and so anyone should be able to reproduce them without constraint (except of course for IP). Two, they did not publish their own results! So we can’t judge whether the Amgen scientists did the studies correctly or not. We just have to take it on their say-so.

  25. I disagree with much of the economist piece for this reason… Getting it wrong is as much a part of science as getting it right. And just because it is wrong and published is not necessarily a bad thing. If you want your study to be replicated you have to publish it! And not all ‘wrong’ studies should (or will) be retracted. We have to learn and move forward with new methods and more robust designs to test past work, that is a natural process. If a researcher does not have the lit search skills to find the latest test on a subject and trace it back through previous successful, failed or wrong tests, then that is another issue entirely.

    Perhaps this comes from my background in ecology, but we see differential effects on ecosystems of different types. Theory or conclusions that hold true in one, are by no means guaranteed to hold true in another. We thus spend much time thinking why our results differ from others, and this is important!

    There are many reasons we get it wrong, some of which are readily identifiable and have largely to do with study design and statistical power. For example, we do not always have the time and budget to generate large data sets and have to live with probabilities of Type 1 and Type 2 errors.

    Other reasons may be equally unintentional and hardly preventable. Unintended, unexpected, and un-noted effects of an extraction procedure, or unrecognized contaminants in water sources, for example.

    As for the fact that reviewers need to do more… Really? I’m a VOLUNTEER Assoc. Editor and I am lucky if I can get 2-3 reviewers to VOLUNTEER out of a dozen requests. Then I have to get them to follow through! But the point is it is a VOLUNTEER activity. I would never require my reviewers to replicate analyses. We all have our own careers and data to work on!

    Additionally, as stated above, this is part of the reason for publishing a paper. You get a result, and you want to put it out there for others to see, and others to replicate! If my result is wrong, prove it!

  26. Here is another suggestion.

    Experiments should be published in two steps: ‘pre-’ and ‘post-’.

    Pre-publication, as the term implies, is done prior to the experiment, and includes discussion, hypothesis, and most important all the details of the planned method (so others could attempt the same experiment). Science culture should be adjusted so the first person to publish the description of how to do a particular experiment gets appropriate citation & credit if someone else happens to do the experiment first.***

    Post-publication then provides the results, including all raw data, and a thorough explanation/justification of any deviation from the original plan (significant deviations should require updating the pre-publication before conducting the experiment).

    The point of this is two-fold. First it avoids suppression of boring “null” results. Second, it may actually encourage a focus on the importance of forming good hypotheses and giving credit to good & useful experiments even when the result turns up null.

    For some sensitive experiments where scientists don’t want their ideas “stolen”, pre-publication might be made to some kind of confidential jury of peers.

    *** I especially think pharmaceutical companies should be required to pre-publish all clinical trials, so they can’t hide the negative results. No trial which is not pre-published should be accepted as evidence for the effectiveness of a drug.

  27. IMO, some incompetence is not the problem. Every area of human endeavor will necessarily suffer from that, and the important mistakes will be found out and corrected because people will try to build on them.

    The problem is rather the underlying incentives in our field. Making it ever more competitive encourages faddishness, quantity over quality, and sometimes even fraud. The metrics used to evaluate people provide incentives for concentrating on flashy topics, for gaming the system (citation cartels etc), and again for quantity over quality. And then there are the constraints under which editors and university management are operating, which systematically disadvantage certain areas of research.

    Those are the real problems because they constitute a force that pushes science constantly into the wrong direction unless we all consciously steer against it.

  28. “Finally, there should be some provision (and the Economist mentions this as well) to fund people to replicate the work of other scientists.”

    I’d like to float an idea concerning this suggestion. Obviously not many scientists are going to want to do this.

    I recently retired from teaching science at the secondary level. There are a great many retired high school science teachers out there. Perhaps some program could be designed to enlist this group in this effort. They are already trained in the basics of science. With some additional training in research methods, could these former high school teachers become a body of researchers to assist in replicating the scientific work that has yet to be replicated? I am interested what others think of this idea. I admittedly haven’t put a great deal of thought into how to organize such an effort. I offer it as an idea for discussion.

  29. And as some others have already pointed out, quality problems may be worse in some areas than in others. My own, for example, rarely has negative results because it is very descriptive, and thus a systematic bias towards publishing only positive results will not be much of an issue…

  30. For example, under standard criteria you will reject a correct “null” hypothesis and accept an alternative but incorrect hypothesis 5% of the time, which means that something like 1 in 20 “positive” results—rejection of the null hypothesis—could be wrong.

    It should be pointed out that the standard in physics, at least for journals like the Physical Review, is 5 standard deviations.
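
    As a rough illustration of what that convention amounts to (a minimal sketch of my own, not from the comment above, assuming Python with scipy installed), the one-sided tail probability of a standard normal beyond k standard deviations can be computed directly:

        from scipy.stats import norm

        # One-sided tail probability of a standard normal beyond k standard deviations.
        for k in (2, 3, 5):
            print(f"{k} sigma -> one-sided p ~ {norm.sf(k):.2e}")

        # Approximate output:
        # 2 sigma -> one-sided p ~ 2.28e-02
        # 3 sigma -> one-sided p ~ 1.35e-03
        # 5 sigma -> one-sided p ~ 2.87e-07

    So a 5-sigma criterion corresponds to a false-positive probability under the null of roughly 1 in 3.5 million, compared with 1 in 20 for the usual 0.05 threshold.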

  31. It seems to me that this paragraph from Jerry explains one reason behind the discipline of philosophy of science rather well:

    “those who are “anti-scientism” will jump all over these articles, claiming that science can’t be trusted at all—that it’s rife with incompetence and even corruption. Well, there’s more of that stuff than I’d like, but when you look at all the advances in biology (DNA sequencing, for example), chemistry, physics, and medicine over the past few decades, and see how many important results have been replicated or at least re-tested by other investigators, one sees that science is still homing in, asymptotically, on the facts.”

    Given all the problems described in the article, inquiring minds will want to know this about science: When can scientific inferences be trusted? When do we overstep the bounds of evidence? How do we determine when social or political factors are leading to mistakes in scientific results? What role does corruption play in science? What are scientific “facts” in light of all the foregoing? What does it mean that science is a self-correcting community? And what, by god, does it mean to talk about science as asymptotically reaching the facts? All of these are questions philosophers of science like Kuhn have been working on for years. And yet many scientists think the philosophy of science isn’t useful!! Oh boy, my head is starting to hurt with this one.

  32. I agree very much with most of the specific points these articles and Jerry have made about what can be done to improve the way we perform and publish science. But there are several things that bother me about these articles, which seem to try to shock and awe rather than simply inform.

    (1) The examples of Amgen and Bayer scientists not being able to reproduce certain results. Not all results are equally good or reliable, even if they are considered “landmark” by some. Without reading the original studies and then comparing them to their replications, it’s hard to take this as direct evidence for anything, especially when the articles do not mention that there have been other analyses that conclude the opposite (e.g., Jager and Leek, “Empirical evidence suggest most published medical research is true”).

    (2) There is extensive use of the paper by Ioannidis (2005). This paper is interesting, but it suggests far less than what these articles are construing. Ioannidis’ paper was basically an exercise in comparing frequentist to Bayesian statistical methods of hypothesis testing, and he arrives at his impressive-sounding conclusions by assuming very specific things about the scientific designs and analyses in question. His work, most notably the claim that more than half of all published results in the medical literature are wrong, only applies to a certain subset of results, for example in genomics or other fields where thousands of hypotheses may be tested simultaneously. He is (partially) right to criticize the implementation of traditional frequentist methods in this case, but this does not generalize to all of medicine.

    For example, a common type of post-op observational study might seek to investigate a supposed link between post-operative infection after a specific kind of surgery and age, or sex, or surgical apparatus used. These types of studies abound in the literature. But Ioannidis’ criticisms can’t be applied to these types of studies.

    (3) There are many other problems with Ioannidis’ work that these Economist articles overlook. In fact, one problem the articles themselves repeat is treating low statistical power as always a bad thing. First of all, “low” is a relative term. Secondly, if you test a single hypothesis using a test with only 20% power (under, say, a traditional 0.05 significance level) and end up finding a statistically significant result, then there is really no problem. The low power can make it difficult to detect an effect that is present, but the chance of declaring significance when the null hypothesis is actually true remains capped at 5% (see the simulation sketch after this comment). Again, a gigantic portion of the medical literature is concerned with testing only one (or a few) hypotheses. For those disciplines concerned with high-dimensional, multiple testing (like genomics), a power of only 20% might indeed be “dismal.” But in these fields it would be unwise to apply the traditional frequentist methods that both Ioannidis and the Economist articles’ authors imply everyone always uses. More appropriate methods have been developed (and are used, although perhaps not as often or as well as they should be).

    Most published results are most certainly significant; this is by statistical design. In fact, this was arguably the original motivation of Pearson, Fisher, and others for developing the coherent frequentist framework. Of course, all of this is predicated on scientists implementing correct statistical reasoning (and qualifying and tempering their conclusions with appropriate statistical caveats), and *this* is the true problem. Shoddy statistical work abounds in all fields of applied science (much less so in physics, at least in my experience). But this is mainly a product of scientists not consulting with statisticians frequently enough (full disclosure: I am a statistician myself). It is far too common for scientists to simply emulate the design and analysis of a study that seems to address a similar style of question to theirs, without critically considering whether this is really appropriate.
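
    The point in (3) about power and the significance level can be illustrated with a minimal simulation sketch (my own illustration, not part of the comment above, assuming Python with numpy and scipy installed): the rate of false positives under a true null is set by the significance level alone, not by the power of the test.

        import numpy as np
        from scipy.stats import ttest_1samp

        rng = np.random.default_rng(0)
        alpha, n, trials = 0.05, 10, 20_000

        def rejection_rate(true_mean):
            # Fraction of simulated one-sample t-tests (H0: mean = 0) rejecting at level alpha.
            rejections = 0
            for _ in range(trials):
                sample = rng.normal(loc=true_mean, scale=1.0, size=n)
                if ttest_1samp(sample, popmean=0.0).pvalue < alpha:
                    rejections += 1
            return rejections / trials

        # With n = 10, power against a modest effect of 0.4 standard deviations is only about 20%,
        # yet the rejection rate under a true null stays near alpha = 0.05.
        print("False-positive rate under the null:", rejection_rate(0.0))
        print("Power against a 0.4-sigma effect:  ", rejection_rate(0.4))

    Running this gives a null rejection rate close to 0.05 no matter how underpowered the test is against any particular alternative; what low power costs you is the ability to detect real effects, not control of the Type I error rate.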

  33. Those were very good reads, and I also want to add my voice of concern to those expressed in the article and the comments here. As a non-tenured researcher, maybe it’s my own particular bias, but to me the biggest issue seems to be very tough competition combined with imperfect evaluation of potential competitors.

    For example, consider someone like me who, say, has an article to write plus a few others to review. If I put more work into my article, maybe I can get it accepted in a higher-ranked journal, which will directly benefit my own career. If I put more work into my reviewing tasks, I’ll be working for the common good but with very marginal, almost non-existent benefits for myself (the editors of the journal might notice that I did a good job, which might tip the balance … actually, scratch that; it’s science fiction). Now, I really want to do a good job on my reviews, but there are only so many useful hours of work I can get done in one day.

    Now add to that the fact that today’s scientists study and work much longer than those of a few decades ago before reaching the fruitful part of their careers. You need many years for a PhD, postdocs, and temporary positions here and there while your classmates from high school are beefing up their CVs in other fields. If after so many years you do not make it, you are in much worse shape than all those people who did not venture into academia. Of course, it’s not the end of the line, but it adds pressure not to fail, which in turn adds pressure to allocate more time each day to the selfish goals (publishing papers) than to altruistic pursuits (e.g., reviewing papers).

    1. Give thorough, constructive reviews, and then sign them. Then you’ll get some credit, and often later have interesting and productive interactions with the editors and researchers you helped. I now sign almost all my reviews.

      1. Is this commonly done in your experience? I would definitely be open to signing my reviews, but I’ve never seen it done in my field. There are certainly big problems with anonymous reviews, but by “outing yourself” as the author of a critical review, you could create additional problems for yourself (along with additional work).

        1. No, it is not very common even in my fields, ecology and population genetics. I don’t remember ever receiving a non-anonymous review.

          Nevertheless, I have had great post-review interactions with authors as a result of signing. This has been true even after quite critical (but constructive) reviews. However, I admit I have not signed a few of my most negative reviews, because I was afraid to make an enemy.

  34. Those are great articles, thanks for posting, Jerry. In this context, this quote is appropriate:

    “An engineering firm that builds a faulty bridge based on an overfitted model will be sued or fined out of existence; to date, we know of no ecological theorist whose similarly overfitted model has evoked comparable penalties. Because society demands little from theoretical ecology, one can have a successful lifetime career in the field without any of one’s theories being put to the practical test of actual prediction.”

    Ginzburg, L.R. and Jensen, C.X.J. (2004) Rules of thumb for judging ecological theories. Trends Ecol. Evol. 19, 121–126

    Bridges, satellites, weapons, laptops, nearly all medical technologies… these are made with some of our most advanced understanding of nature. Science is not going to stop producing what we use just because we face problems publishing results.

    On another note, I work at a place where many people have dozens of patents… virtually useless, but not abjectly wrong, just like most of the stuff that gets published.

    The internet is good for weeding out what is important; it is like having a secondary peer review. Put it out there and let every wolf tear it apart. If the work is super-specialized, like needing ultracold neutrons, it is harder to judge whether it is reliable, but if someone claims a superconductor works at room temperature you can bet an army of scientists in their respective labs will be up all night working it out.

    Society has priorities and those priorities determine what scientific research is important, not the research itself.

  35. In physics, null hypotheses or null results are the hardest to establish: the electron dipole moment (EDM), modifications to gravity, the proton lifetime (possibly infinite), neutrino masses, etc. People in these fields sometimes publish only once a decade, to improve the accuracy of a measurement by maybe only a factor of two.

  36. There is a separate but related problem concerning the reporting of science in the general media.
    Studies that, say, identify an increase in some factor associated with cancer under certain treatments in lab conditions get reported in the newspapers as “‘x’ causes n-fold increase in cancer risk!” when this is not the conclusion the study’s authors have claimed or what the data can support.

    Britain’s health service has a website, ‘NHS Choices’, with a ‘Behind the headlines’ section and associated RSS news feed that does an excellent job of dissecting health stories in the press, identifying what the source studies were and what they actually showed and said.

  37. To solve the confirmation/repetition issue, it would be nice to have some sort of “secondary publications” following primary publications.
    That is, one could think of a peer-reviewed “secondary” paper that confirms (or not) the experiments of a “primary” paper. This secondary paper could be directly linked to the primary paper at issue (in the journal and in PubMed).
    A certain percentage of public grant money (10%?) could be allocated to such repetitions. This would certainly increase the pressure against cheating and other misconduct and increase confidence in science.

  38. I think Trophy identified a key contributor: unmanageable competition. Each year I am handed an increasing number of junk articles to review, and they really are junk. I usually try to say something nice or encouraging (along with my strong rejection), since these articles often seem to be written by under-trained students who are pressured to “publish or perish,” yet have evidently received no mentorship, training, or oversight in the process. I’m also encountering blatant plagiarism on a regular basis — entire articles are copied from an archive and re-submitted with the authors’ names changed. Occasionally these duplicate articles pass peer review; it’s hard to say how many undetected copycats are still out there. Here is an amazing example from my own field:

    http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5164175

    This article is actually on the topic of improving the quality and reproducibility of research. But the article itself is a FRAUD! It was exactly duplicated from this earlier article:

    http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4218335

    If people are getting away with this, it isn’t hard to imagine that they can get away with more subtle infractions. I’m sure the motives are diverse, but I’m inclined to think that aggressive academic competition plays a significant role. It is wrong to say that researchers pursue over-hyped results to “advance their careers.” Researchers today face extraordinary pressure just to get a job and keep it; forget about advancing it. In my limited experience, the “bar” at my institution was set by a small number of researchers who published upwards of 20 peer-reviewed articles per year. If you’re publishing at that rate, I can’t imagine that you’re being fully careful and rigorous with the content of each article (but maybe I’m slow and stupid). Administrators were happy to terminate untenured faculty who couldn’t measure up, and are continually threatening more senior faculty with post-tenure actions that may lead to dismissal. When I squeaked through our tenure process, our recent success rate (the most recent five years) was 36% to 42% (my institution is evidently a basket case; most departments are more generous than that, with success rates around 60–70%). There has never been a shortage of qualified young applicants ready to replace our supposedly under-performing faculty. The competition arguably drives people insane.

    My own field is very close to machine learning; a key issue is that we disclose partial descriptions of complex systems. There is rarely enough information in an article to permit true replication. A bad article can sit on the shelf forever and never really be refuted, because any irreproducible results are easily attributable to the “secret sauce” that isn’t expressly stated in the article. It isn’t hard to imagine building a stellar career based on a prolific output of that secret sauce.

  39. “Only 2% of respondents admitted falsifying or fabricating data, but 28% of respondents claimed to know of colleagues who engaged in questionable research practices.”

    You were saying something about flawed statistical interpretations?

    That data could be interpreted as: of 50 respondents, only Dr. Schummel admitted to fudging data, and 14 others said, ‘yeah, I know somebody — Schummel does it all the time.’

    Of course, it’s not that simple, but nor should we automatically assume that another whole 26% of the respondents fudge but won’t admit it.

    1. I don’t see any reason to believe that scientists are all ethics superstars; 2% seems like a low number to me. I haven’t studied the details of those surveys, but I wonder if they capture questionable practices that might originate with students, technicians, or other participants — practices that can easily go unnoticed. As a PI, you may be tempted to avoid digging too deep into a student’s work. I know the pain of finding one single sad mistake that knocks out months of lab effort, maybe delays a student’s graduation, or impedes the progress of a post-doc’s career… Wouldn’t it be easier to just avoid finding that mistake? Then success happens, and you don’t have to feel like you’ve done anything bad.

  40. Jerry and all: there is quite a lot of activity that touches on the concerns addressed here. It’s mostly animated by young researchers full of energy and good will, and I think they need all the support they can find (hence this).
    The main article is here:
    http://neuroconscience.com/2013/10/31/birth-of-a-new-school-how-self-publication-can-improve-research
    My own contribution is here:
    http://sergiograziosi.wordpress.com/2013/11/02/birth-of-a-new-school-a-reply-on-self-publishing/

    If you can, please chip in and have your say.
