Dutch psychologist admits research fraud—and the lessons

April 27, 2013 • 12:27 pm

I hadn’t known about this case, reported in today’s New York Times, but perhaps some of you had. It’s a fascinating tale about the Dutch psychologist Diederik Stapel, who fudged data for dozens of papers—data comporting with people’s intuitive ideas about human nature—and became famous along the way.  He eventually got caught and fired.  He seems to mistake explanation for apology, and I think his only regret is that he got caught.

Stapel did not deny that his deceit was driven by ambition. But it was more complicated than that, he told me. He insisted that he loved social psychology but had been frustrated by the messiness of experimental data, which rarely led to clear conclusions. His lifelong obsession with elegance and order, he said, led him to concoct sexy results that journals found attractive. “It was a quest for aesthetics, for beauty — instead of the truth,” he said. He described his behavior as an addiction that drove him to carry out acts of increasingly daring fraud, like a junkie seeking a bigger and better high.

Stapel gives a lot of excuses but his apologies sound lame.  And while waiting for the investigations to end—get this—he published a book called Derailed, designed to get money by detailing his perfidy. Here’s an explanation but not an apology:

What the public didn’t realize, he said, was that academic science, too, was becoming a business. “There are scarce resources, you need grants, you need money, there is competition,” he said. “Normal people go to the edge to get that money. Science is of course about discovery, about digging to discover the truth. But it is also communication, persuasion, marketing. I am a salesman. I am on the road. People are on the road with their talk. With the same talk. It’s like a circus.” He named two psychologists he admired — John Cacioppo and Daniel Gilbert — neither of whom has been accused of fraud. “They give a talk in Berlin, two days later they give the same talk in Amsterdam, then they go to London. They are traveling salesmen selling their story.”

The duplicity started when he did a “priming” experiment as a young professor, showing subjects images of an attractive or less attractive female and asking them to rate their own attractiveness. He assumed that the prettier image would make the students, by comparison, rate themselves less attractive, but it didn’t work. He therefore decided to fudge the data to get the desired result. The laborious fudging—his results had to be significant, but not too big, lest they be suspicious—probably took longer than the experiment itself! Nevertheless, the new outcome jibed with what people intuited was true, and he became famous.

Other experiments followed, all faked, in which, for instance, he showed (i.e., fudged) data that white people waiting on a train platform would become more racist if they were surrounded by garbage. (They’d sit farther from a black person in a row of seats.) That was published in Science. Another fraudulent study purported to show that kids who colored a cartoon became more likely to share their candy if the cartoon character was depicted shedding a tear.

Stapel became famous because he got results that jibed with what people “wanted.” And the journal editors and reviewers liked them too.  The Times lays some blame at the door of those reviewers:

At the end of November, the universities unveiled their final report at a joint news conference: Stapel had committed fraud in at least 55 of his papers, as well as in 10 Ph.D. dissertations written by his students. The students were not culpable, even though their work was now tarnished. The field of psychology was indicted, too, with a finding that Stapel’s fraud went undetected for so long because of “a general culture of careless, selective and uncritical handling of research and data.” If Stapel was solely to blame for making stuff up, the report stated, his peers, journal editors and reviewers of the field’s top journals were to blame for letting him get away with it. The committees identified several practices as “sloppy science” — misuse of statistics, ignoring of data that do not conform to a desired hypothesis and the pursuit of a compelling story no matter how scientifically unsupported it may be.

Well, according to the article Stapel went to great lengths to make his data seem credible, and the writer of the piece, Yudhijit Bhattacharjee, doesn’t seem to have examined the manuscripts himself to see whether sloppy practices were pervasive (I’d like to know what they were). Reviewers, moreover, rarely pore over manuscripts as carefully as committees convened to detect fraud. So I don’t blame the system nearly as much as I do Stapel here. I think his students are also at fault: how can you put your name on a Ph.D. dissertation if you didn’t collect the data yourself?

And if you read the piece, Stapel seems curiously unapologetic, like Jonah Lehrer when he got caught making up quotes. Yes, Stapel became depressed, but it seems more because he was found out, not because he committed fraud and ruined the careers of many of his students.

Fortunately, many psychology studies get repeated.  Daryl Bem’s study on precognition, which showed that subjects’ knowledge seemingly affected their behavior in the past (before they had that knowledge), ultimately proved unrepeatable.

I’ve always thought that there should be Institutes for Repeatability, where studies that yielded flashy or novel results should be repeated by independent investigators. Not only in psychology, of course, but also in areas where experiments usually aren’t repeated. Those include ecology and evolutionary biology which, unlike molecular biology, are fields in which successive studies don’t always build on earlier ones (that “building” often means repeating earlier work).  I’m not accusing my colleagues of widespread fraud, of course, but there’s a tendency to publish only positive results (ensuring that 5% of them are wrong), and the messiness of organismal biology means that whole-animal studies may be influenced by the vagaries of weather, location, the population chosen, or other factors that make the results hard to generalize.

My own guess—and this is pure speculation—is that about 30-40% of whole-organismal biology studies in ecology and evolution would not give the same results if repeated. Two classic examples come to mind. The first is Thoday and Gibson’s early work showing that a population of flies could split into two reproductively isolated units (in effect, species) when selected for divergent bristle numbers but still allowed to mate with each other. That paper was published in Nature, and yet 19 attempts to repeat it failed. The second is the set of early studies showing that putting populations of fruit flies through “bottlenecks” (very small population sizes) could, after a number of generations, make those populations reproductively isolated from each other, suggesting that random genetic drift itself could contribute to speciation. Repeats of those studies didn’t give the flashy results.

The lesson: if a study seems too good to be true, let someone else repeat it.  And give them funding to do so—something that no funding agency wants to do.

59 thoughts on “Dutch psychologist admits research fraud—and the lessons”

  1. In a related matter, I have often thought that a journal specifically devoted to publishing methodologically sound studies having no significant outcomes would make a valid contribution to any scientific field. It would certainly highlight avenues of research that might not be fruitful.

    1. Lots of journals do publish negative results, if they’re deemed interesting.

      Better than that, many online-only journals publish negative results regardless of interest; e.g., some BMC journals, PLoS ONE, and F1000 research require only that the science is valid, not that the result is interesting.

      F1000 research (http://f1000research.com/) makes a point of saying that they are keen to publish negative results.

    2. While it is a good idea, one should remember that a non-significant result doesn’t mean that the null hypothesis is true. Perhaps if there were more subjects or trials, the result would be significant. Looking at confidence intervals and effect sizes is even more important for non-significant results.
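
      To make that concrete, here is a minimal sketch in Python (all numbers invented for illustration, not taken from any real study) of how a non-significant p-value from a small sample can sit alongside a wide confidence interval and a non-trivial effect size:

```python
# Hypothetical two-group comparison: a true effect of 0.5 SD, but only
# 12 subjects per group, so the test may well come out "non-significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(loc=0.0, scale=1.0, size=12)
treatment = rng.normal(loc=0.5, scale=1.0, size=12)

t, p = stats.ttest_ind(treatment, control)
diff = treatment.mean() - control.mean()

# 95% confidence interval for the difference in means (pooled-variance t interval)
n1, n2 = len(treatment), len(control)
sp = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
              (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
se = sp * np.sqrt(1 / n1 + 1 / n2)
tcrit = stats.t.ppf(0.975, df=n1 + n2 - 2)

print(f"p = {p:.3f}, observed difference = {diff:.2f}")
print(f"95% CI for the difference: [{diff - tcrit * se:.2f}, {diff + tcrit * se:.2f}]")
print(f"Cohen's d = {diff / sp:.2f}")
```

      Whatever the p-value turns out to be, an interval from a sample this small is usually wide enough that it cannot rule out a sizeable effect, which is exactly the reason not to read “non-significant” as “no effect”.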

  2. Additionally, journals need to be willing both to publish results which aren’t statistically significant and also to publish papers which are straight replicates of earlier papers, without novelty. I understand that a straight replication with nonsignificant results, contradicting a seminal paper in psychology, had difficulty getting published because of a lack of novelty in its experimental design.

      1. Perhaps. But it would be nice to have replications, especially contradictory replications, reach the same audience as the initial study being replicated.

  3. “…if a study seems to good to be true…”

    should be “too good…”

    signed, Grammar Police.

  4. In the Netherlands, Mr Stapel has become something of a cult villain. The department of justice is currently investigating whether it can bring criminal charges against him (for committing fraud with public funds). Here in the Netherlands almost everyone knows about Mr Stapel, and in the aftermath more similar cases have been discovered by Dutch authorities.

  5. The clinical medical literature has been awash in this problem for a long time, even infecting the high-octane ones such as Lancet, NEJM, JAMA. It is difficult to feel certain these days what is reliable and what is not. The editors have been less than rigorous all too often in these areas, and clinicians wonder why hard scientists are dubious of their ‘science’.

  6. I remember that Drosophila study on bristle numbers and speciation. The number of times it could not be repeated gives me the creepy feeling it was faked.

    1. If they used the standard significance threshold of 0.05 (i.e. a 5% chance of a positive result arising by chance when there is no real effect), then 19 negative repeat attempts is precisely what you’d expect. It doesn’t require fakery, just bad standards for probability analysis.
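
      As a quick back-of-the-envelope check of that intuition (treating each repeat as an independent test with a 5% false-positive rate when there is no real effect; the numbers are illustrative, not a claim about the actual replication attempts):

```python
# If there is truly no effect, each replication attempt has only a 5% chance
# of producing a "significant" result by chance, so a long run of failed
# repeats is unremarkable rather than suspicious in itself.
n_repeats = 19
alpha = 0.05

p_all_negative = (1 - alpha) ** n_repeats    # probability every repeat "fails"
expected_positives = n_repeats * alpha       # expected number of chance "hits"

print(f"P(all {n_repeats} repeats negative | no effect) = {p_all_negative:.2f}")   # ~0.38
print(f"Expected chance 'positives' in {n_repeats} repeats = {expected_positives:.2f}")  # ~0.95
```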

  7. I wonder if you’d further explain what you are trying to say with:

    “I’m not accusing my colleagues of widespread fraud, of course, but there’s a tendency to publish only positive results (ensuring that 5% of them are wrong)”

    I don’t want to argue about what I think is a mistaken understanding unless I am sure that I understand what you are saying first.

    I’m worried about the 5% business, not the fraud part. Peer review isn’t particularly good at spotting fraud…but isn’t really meant to be, is it?

    1. That 5% thing caught my eye too. I think the statement would only be true if all published statistical results were “significant” at P = exactly 0.05.

  8. What’s interesting to the lay-person (like me) is the fact that many scientific experiments are designed and carried out with the hope of getting a particular result. Although failure to get that result might, logically, be just as valuable scientifically, it’s bound to be an emotional let-down. I wonder if the kind of fraud described in Jerry’s piece is simply an extreme example of a fairly common practice. It’s hard to exclude human nature entirely from the scientific method. Studies which turn out to be unrepeatable (like the precognition study) must surely have been either faked or badly designed; there is no third alternative.

    1. This guy seemed to have a narcissistic need to manipulate studies to get his own results though – he seems like the George Costanza of psychology, spending more time lying about and manipulating his results than actually doing real work.

      1. Yes, I agree, but wonder how widespread lesser forms of manipulation are.

        Years ago, I helped to compile a dictionary which the publisher claimed took all its examples from a huge corpus of real language. We did indeed consult a huge corpus, but in fact the real language never exemplified the target word quite as well as we’d hoped. So we often rewrote the real language to bring it more closely in line with what we already KNEW the word meant. I’m sure there are many parallels to this in the scientific world. Perpetrators will think of it as tidying up the data rather than manipulating it, just as we thought of it as ‘editing’ the real language.

        1. The recent Reinhart and Rogoff fiasco is a good example of a lesser form of manipulation.

    2. Well, there is human error, faulty equipment, misunderstood novel equipment, misunderstood novel interactions. Lots of other things. I wouldn’t necessarily call a study with issues like those poorly designed, though they certainly could be. Looking back at them from the future they could be said to be poorly designed with respect to what is known as of that future time, but that is sort of the whole idea.

    3. There are statistical flukes as well, in which well-designed experiments carried out in good faith yield “significant” results by pure chance. As Jerry indicated, if “significant” is defined in terms of a 95% confidence level, then 5% of all “significant” results will be flukes.

      1. I suppose I was assuming that a well-designed experiment would, by definition, exclude statistical flukes. This might be ignorance on my part, though, and I can appreciate (on reflection) that you can only reduce the chances of a statistical fluke and never eradicate them, however well-designed the experiment is.

      2. No, that is not correct. There is no way of computing what proportion of “all significant results” will be flukes. It could be anywhere from 0 to 100%. The 5% refers to what proportion of results will be declared significant *if the null hypothesis model is true*.

    1. Perhaps it was a test of priming: there was garbage in his office that primed him to lie.

  9. Re: repeatability – my colleague and I have just had a paper rejected (but with enough useful comments that I think we can legitimately take on those comments and submit elsewhere), and both reviewers suggested we drop a section of the data as it only ‘confirms what has been shown in a few other recent papers’. My feeling is that the result should be published for precisely that reason. Repeatability of data is seen as important but not sexy enough for inclusion in a manuscript. I work in behavioural ecology.
    As for making up data – as I discovered when I wanted to create a data set for students to analyse as part of a stats course, it can be bloody difficult to fudge a large data set to say just what you want it to say!

  10. The 5% is a result of the statistics people use to decide if the data show a result or not – a feature rather than a bug.

    Suppose, for example, you want to see if giving sugar to baby rats increases their eventual tail length when fully grown. It would of course be nice and easy if you gave the rats the sugar, and every single rat in the no-sugar group had a 3 inch tail and every single rat in the sugar group had a 4 inch tail.

    In reality, however, rats are not all the same. Even if it turns out that sugar does increase tail length, some of the rats that you gave sugar will have short tails and some of the rats in the other group will have long tails.

    There is no easy way to deal with this. What people usually do is something like this:

    You imagine what the data might look like if there was no difference between the two groups – that is, if both the sugar rats and the no-sugar rats were drawn from the same distribution of tail lengths. If this “null hypothesis” has less than a 5% chance of producing a difference as extreme as the one you saw, you accept the result as “significant”, i.e. real.

    This 5% is totally arbitrary. If you wanted, you could be more stringent and choose something like 1% – but if you did, this would also increase the chances of incorrectly deciding that there was no effect when there in fact was one. So people have settled on 5%.

    Technically, this doesn’t quite mean that 5% of published results are wrong (I think). A more precise way of saying it is that in 5% of experiments in which there “should be” no difference between the two groups being studied, a difference will be falsely reported, just due to chance. To actually know what fraction of published (presumably positive) results are wrong, you’d need to know what fraction of experiments conducted “should” produce a negative result. (Again, I think, I’m not too good with statistics…)
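
    For anyone who wants to see both halves of that in action, here is a minimal simulation sketch in Python; the proportion of real effects, the effect size, and the group sizes are all assumptions chosen purely for illustration:

```python
# Simulate many two-group experiments. In a fraction of them there is a real
# effect; in the rest the null hypothesis is true. Count (1) how often the
# null experiments come out "significant" (should be about 5%), and (2) what
# share of all "significant" results are flukes, which depends on the mix.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 20_000
n_per_group = 20
alpha = 0.05
prop_real = 0.2      # assumed: only 20% of tested hypotheses reflect a real effect
effect_size = 0.8    # assumed size of the real effects, in standard deviations

null_total = sig_null = sig_real = 0
for _ in range(n_experiments):
    real = rng.random() < prop_real
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(effect_size if real else 0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(treated, control)
    if not real:
        null_total += 1
    if p < alpha:
        sig_real += real
        sig_null += not real

print(f"Share of no-effect experiments declared significant: {sig_null / null_total:.3f}")
print(f"Share of 'significant' results that are flukes: {sig_null / (sig_null + sig_real):.3f}")
```

    The first number hovers around the nominal 5%; the second is whatever the assumed mix of true and false hypotheses, the power, and the threshold make it, which is the commenter’s closing caveat.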

  11. Even if his studies weren’t faked (tho’ it’s clear they were), I don’t understand why they were thought to be worth publishing. Some Dutch university students’ attitudes about beauty, or reactions of a few white folks in the Netherlands on a train platform, just seem to be scientifically uninteresting. Of what significant larger populations are such subjects representative? None that I can think of!

    1. Your bar is way too high for mere mortal scientists, who can only study what it is logistically possible to study. The applicability of any study’s results to other populations is a legitimate question for further research.

      1. Psychology studies are too full of college students! I want to see Dan Ariely’s studies of dishonesty carried out in a prison. (but I admire what he’s done otherwise)

      2. I admit my bar may be too high for students working on PhDs – but not for publications in reputable journals! See John Ioannidis, whose critiques of medical articles should apply doubly to social psychology.

  12. The committees identified several practices as “sloppy science” — misuse of statistics, ignoring of data that do not conform to a desired hypothesis and the pursuit of a compelling story no matter how scientifically unsupported it may be.

    With this set of skills Stapel must be kicking himself for ignoring the potential and safe harbor of doing research for 1.) the Templeton Foundation; 2.) the Discovery Institute; or 3.) NCCAM or some other ‘alternative medicine’ institution. Another good option would have been to just say ‘screw it’ and go full New Age.

  13. I’m not sure how meaningful a public apology is. Does it ever mean anything other than “I’m sorry I got caught?”

    I also wonder if it’s ever possible to put the genie back into the bottle; once you decide that you don’t need to follow the rules, isn’t that likely a permanent change in perception? I certainly feel that way about religious rules.

    1. Yes, it is possible to put the genie back in the bottle. But it is difficult, so the success rate is low. You have to change behavioral patterns forever.

  14. I have to agree with you on the students, Jerry, but maybe go a little farther. These studies were not that complicated and additional follow-ups could have been readily spun off, so I see no reason why the students should be so far from their data. Why didn’t it ever strike one of them as strange? There are more than a few whistle-blower cases against PIs. Either a student was complicit or he selected one he could fool easily, and in either case I think the degree wasn’t earned.

    1. Agreed. If you do not collect and analyze your own data, you do not deserve a PhD.

  15. “- the Dutch psychologist Diederik Stapel”
    Professor Diederik Stapel is not a psychologist but a ‘social psychologist’, which is (more or less) the same as a sociologist. Sociologists work mainly with surveys and not with experiments. Students are sent out with questionnaires to interview people. In my view this is not a very reliable method of research.

    1. what?
      No, social psychology is not the same thing as sociology. No, the studies in question here did not rely on questionnaires.

  16. how can you put your name on a Ph.D. dissertation if you didn’t collect the data yourself?

    It’s been explained to me that, in the social sciences, such things require surveys of a representative sample of the (desired) population, which is such specialised and time-intensive work that it cannot reasonably be expected that PhD students perform it themselves and finish their research in the allotted 4 years. It is therefore normally outsourced to specialised survey agencies.

    So what does a PhD student who needs a survey do? S/he composes a list of questions and hands it over to the financial authority aka research head, who is then supposed to hand it over to a competent third party and pay their bill. Which in our case was Diederik Stapel, who was both his own surveyor and surveyed in one.

    Tilburg is my native city (Netherlands’ 6th in size) and Tilburg University was my employer for 9 years. It’s sad to see that, the one time they reach the international news, it has to be with a fraud.

    1. I can understand that it may not be feasible for a PhD student to do all of the data collection. But apparently in these cases they didn’t do any of it, nor did they look critically at the data delivered, or even have any discussion of the data with the collectors at all. And that seems to me indefensible.

    1. Sort of. Their “policy” has got stiffer, but as I point out in my post today, their practice leaves the door wide open for future Stapels and Schoens.

  17. Well, the takeaway from this affair certainly jibes with what we always intuited about what the soft sciences get away with. Can we imagine, say, cold fusion having remained unchallenged for almost a decade?

    1. To be fair, there was far more at stake in the case of cold fusion: Nobel prizes for Pons and Fleischmann if they’d been vindicated, along with potentially huge economic implications for the world energy supply. Even in the hard sciences, few results get that level of scrutiny.

      1. Indeed. And the Piltdown Man hoax stood for about 40 years, though Waterston correctly identified it within a year of its “discovery” and suspicion was rife from the outset. Had they been able to analyse DNA (a hard-science implement, thank you very much) back then, he would have been vindicated instantly.

        However, in the natural sciences you won’t find anything close to the entirely bullshit-based academic disciplines such as theology or post-modernism. The BS overhead in psychology, though less than in sociology, is still considerable. We remember Freud.

  18. Deliberately skewed research findings are not as uncommon as folks like to think. Think say of all the studies funded by the tobacco industry as an example.

    False findings in fact happen quite often where someone has something to gain and thus a motive to stretch the truth.

    Does anybody really think a pharmaceutical company is going to research its products in a way that compromises their bottom line?

    And doesn’t it follow that many organizations politically have an interest in suppressing knowledge about negative aspects of what they do?

    So yeah, let’s face it: there are indeed many forces lobbying against the government (or other entities) spending money to reveal the truth.

    Not that everyone is so blatant as to falsify their data, but indeed many topics are likely to get researched in a selective manner such that the full story doesn’t get told.

    For instance, a power company under some permit renewal requirement for a dam will fund certain aquatic studies which put its environmental impacts in a positive light and avoid lines of study which do the opposite.

    Same thing happens with government agencies making land use decisions which are politically/monetarily motivated.

    Local governments often have a true bias toward growth and development (as this obviously raises tax revenue but more importantly feathers the nest of the movers and shakers who financially control local politics) and so these tend to fund research minimizing the impacts of development run amok.

    And likewise, isn’t it pretty clear that universities, which are the seat of much of the research being conducted these days, are controlled to a major extent by those who fund what they can or cannot do?

    The truth is much of the research we see is guided to some extent by an aim well outside the abstract search for the truth. Anyone who believes otherwise is just kidding themselves.

    So just as much as folks should be outraged about some researcher altering his data and findings for his own aims, they should also be concerned about research shaped by other agendas that undermines GOOD science.

    Point being that not all criminals are low life street thugs, as in fact some of the worst are found on Wall Street living ever so high and mighty.

    Nuff Said!

  19. …Wow. Did it really never occur to him that he was completely in the wrong field? Did he have such job security that he thought trying to stay in science was the only option?

    Concerning peer review, let me just confirm that reviewers simply do not expect fraud. We don’t have the time or (usually) money to repeat all the observations, so we assume the raw data are all correct (unless maybe if they look very strange) and instead look for innocent mistakes elsewhere. We expect everything up to outright incompetence in interpreting data, but we don’t expect fraud.

    That’s also part of why scientists get so outraged over fraud or plagiarism. Large parts of the public don’t quite get why two German ministers had to resign for having plagiarized large parts of their doctoral theses.

    Taking the data for granted is also why the following was possible: Once I submitted a manuscript with a table of data that had suffered from being sent back and forth between different versions of MS Word. After it was rejected for not being interesting enough, I noticed that the entire first row of the table had been deleted by this version incompatibility stuff – the table began with character 16 instead of character 1. Neither of the two reviewers had noticed, and the editor hadn’t either. That’s particularly cringeworthy because the manuscript was about not taking phylogenetic data matrices for granted because they often contain typos and other innocent mistakes that can scramble the results. But, hey, nobody is paid to pore over the appendices.

  20. I noticed that the entire first row of the table had been deleted by this version incompatibility stuff –

    In the version I had submitted, mind you. The table was incomplete when it was sent to the journal, so the reviewers and the editor cannot have seen the first row.

  21. Diederik Stapel?
    You guys haven’t heard about that director of the university research institute Caphri, doctor Onno van Schayck, who supposedly saw a leg grow back by 2 centimeters after prayer? Allegedly this happened 25 years ago and x-rays confirmed it. Only, the x-rays got lost. Supposedly he suggested that this miracle had been proven scientifically.
    For those who read Dutch (or want to use a translator):
    http://www.trouw.nl/tr/nl/5091/Religie/article/detail/3405616/2013/03/08/Directeur-onderzoeksinstituut-stapt-op-na-genezingswonder.dhtml

  22. Outright fraud is still a rare occurrence in science (I hope). If you are an optimist about human nature, you can explain this by the fact that people are, in general, honest. If you are a pessimist, you may point to the grave consequences that fraud, once discovered, has for the perpetrator: it’s simply not worth risking one’s career.

    What I find much more widespread (at least in biological sciences) and in a way more worrisome, is a tendency to pick and choose which data are included in manuscripts for publication. Often only those data sets that support the paper’s conclusions will be included, and others are rejected for reasons that sound dangerously close to wishful thinking (or worse). A researcher desperate for more publications will apply all kinds of excuses and rationalizations. Repeating an experiment will be too costly; it’s OK to ignore a particular experiment that “didn’t work” (i.e. no desired effect was observed); it is OK to eliminate outliers, as they are likely the result of an error unrelated to the scientific question at hand (an easy way to make a data set seem statistically significant); in a figure demonstrating raw data it is OK to present carefully selected results (which best support the author’s conclusions) as “typical”, etc., etc. (A small simulation after this comment sketches how quickly this kind of selection inflates false-positive rates.)

    Given the reality of shrinking scientific funding, pressing grant proposal deadlines and “publication inflation”, the communal acceptance of such sloppy practices is very dangerous. It leads to the production of an ever growing body of practically worthless data, not to mention the demoralizing effect on young researchers. Of course, science is self-correcting, and truly valuable results will eventually be noticed and reproduced. But this way of doing science comes at an enormous cost to the scientific community and to society as a whole.
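
    As a rough illustration of how much the selective reporting described above can distort the literature on its own, here is a small Python simulation; the number of attempts per paper and the group sizes are invented for the sketch:

```python
# Every experiment here is a pure null (no real effect anywhere). Each "paper"
# runs several attempts but writes up only the best-looking one, so the
# nominal 5% false-positive rate is quietly inflated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_papers = 5_000
attempts_per_paper = 5    # experiments run; only the smallest p-value is reported
n_per_group = 15
alpha = 0.05

significant_papers = 0
for _ in range(n_papers):
    best_p = 1.0
    for _ in range(attempts_per_paper):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(0.0, 1.0, n_per_group)
        _, p = stats.ttest_ind(a, b)
        best_p = min(best_p, p)
    if best_p < alpha:
        significant_papers += 1

# With 5 tries per paper, roughly 1 - 0.95**5 (about 23%) of these purely null
# "papers" end up reporting a significant effect instead of the nominal 5%.
print(f"Fraction of null 'papers' reporting a significant effect: {significant_papers / n_papers:.3f}")
```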

Comments are closed.