On the poor reproducibility of psychology studies

September 3, 2015 • 10:45 am

I wrote a short post yesterday about a huge attempt to answer the question, “What proportion of results reported in psychology journals can be repeated?” This was a massive study in which dozens of psychology researchers simply went and repeated 100 studies published in three respectable experimental psychology journals: Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition. The full paper, along with a one-page summary, is published in Science (see reference and free download below); the authors call themselves the “Open Science Collaboration” (OSC). There’s also a summary piece in the New York Times, a sub-article highlighting three famous studies that couldn’t be repeated by OSC (including one on free will, which I wrote about yesterday), and a newer op-ed in the Times arguing that this failure to replicate doesn’t constitute a scientific crisis, but simply shows science behaving as it should: always scrutinizing whether published results are reliable.

Even before this paper was published, I argued that people should do in biology what these folks did in psychology: test experimental results that are impressive but rarely repeated. In psychology, as in evolutionary biology and ecology, significant findings aren’t often repeated, for doing so takes hard-to-come-by money and a concerted effort, one that isn’t rewarded. (You don’t get much naches or professional advancement by simply repeating someone else’s work.) In other areas of biology (and presumably in parts of experimental psychology), by contrast, work is often repeated as the normal by-product of building on previous results. For example, if you want to use new gene-replacement methods, you are obliged to indirectly replicate other people’s protocols before you can begin to insert your own favorite gene.

It’s thus been my contention that about half of published studies in my own field (I include ecology along with evolution) would probably not yield the same results if they were replicated. I’m excluding those studies that use genetics, as genetic work is easily repeated, particularly if it involves sequencing DNA.

Failures to repeat a published result don’t mean that the experimenters cheated, or even that the work was faulty. They could mean, for instance, that the results are peculiar to a particular location, time, or experimental setup, or that there’s a publication bias towards impressive results, so only the ones whose results are highly statistically significant get published. Finally, given the conventional significance threshold of 0.05, about 5% of experiments in which the null hypothesis is true will nonetheless yield a significant deviation from chance (and thus wrongly reject that null hypothesis).
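To make that last point concrete, here is a minimal simulation sketch. It is an illustration only, not anything the OSC did: the group size, the number of simulated experiments, the simple z-test with known variance, and the function names are all assumptions chosen to keep the example short. Every simulated experiment compares two groups drawn from the same distribution, so any "significant" difference is a false positive.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample_a, sample_b, sigma=1.0):
    """Two-sided p value from a z-test on the difference in means (sigma assumed known)."""
    se = sigma * (1 / len(sample_a) + 1 / len(sample_b)) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_experiments=10_000, n_per_group=30, alpha=0.05, seed=1):
    """Fraction of experiments declared significant when the null hypothesis is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: there is no real effect.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(group_a, group_b) < alpha:
            hits += 1
    return hits / n_experiments


if __name__ == "__main__":
    print(f"False positive rate: {false_positive_rate():.3f}")
```

The printed fraction should land close to 0.05, which is simply what the threshold means: it fixes how often a true null hypothesis gets rejected, and says nothing by itself about how many published findings are real.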

On to the experiment. The OSC decided to finally test reproducibility in a quite rigorous way. A whole group of people agreed to test a passel of papers taken from three journals, winding up with 100 replicated experiments. To enforce rigor, they chose papers from only prominent journals, replicated only the last study in each paper (so that they weren’t just replicating preliminary results, which are often reported first), and then did each replication, as far as they could, in the same way as the initial study, with the exception that they sometimes used larger sample sizes, giving them even greater power to detect effects.

To the credit of the original authors, they provided the OSC team with complete data and details of their experiments, ensuring that the replications were as close as possible in design to the original studies. There were many other controls as well, including the use of statisticians to independently recalculate the probability values for the replication experiments.

All the original studies had results that were statistically significant, with p values (i.e., the chance of getting an effect at least as large as the one observed when there is no real effect) below 5% (a few were just a tad higher). When the p value is 0.05 or less, researchers generally consider the result “statistically significant,” which is a key to getting your paper published. That cutoff, of course, is arbitrary, and is lower in areas like physics, where, for experiments like detecting the Higgs boson, it drops to about 0.0000003 (the “five sigma” standard).

So what happened when those 100 psychology studies were replicated? The upshot was that most of the significant results became nonsignificant, and the effects that were found, even if nonsignificant, dropped to about half the size of effects reported in the original papers. Here are the salient results:

  • Only 35 of the original 100 experiments produced statistically significant results upon replication (62 did not, and three were excluded). In other words, under replication with near-identical conditions and often larger samples, only 35% of the original findings were judged significant.
  • That said, many (but not nearly all) of the results were in the same direction as those seen in the original studies, but weren’t large enough to achieve statistical significance. If the replications had been the original papers, most of them probably wouldn’t have been published.

Here’s a chart showing the correlation between the effect sizes for the original papers and those for the replicates. Each dot plots the size of the effect seen in the replicate (Y axis) against the effect size for the same study in the original paper (X axis). If a dot is green, the replicate was also statistically significant (as were all effects in the original study). Pink dots mean that the replicate study did not yield statistically significant results. This shows that effect sizes were generally lower than those of the original studies (most points fall below the diagonal line), and most of the replicates (62%, to be precise) did not show significant effects.

(From the paper): Original study effect size versus replication effect size (correlation coefficients). Diagonal line represents replication effect size equal to original effect size. Dotted line represents replication effect size of 0. Points below the dotted line were effects in the opposite direction of the original. Density plots are separated by significant (blue) and nonsignificant (red) effects.

The chart also shows that the larger the effects observed in the original study, the more likely they were to replicate, for the pink dots are clustered on the left side of the graph, where the original effect sizes (normalized) are small. This goes along with the investigators’ findings that the lower the p value seen in the original experiment, and thus the more significant the result, the more likely it was to also be significant in the replicate.
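Publication bias alone can produce this pattern of shrinking effects, and a small simulation shows how. The sketch below is a toy model, not the OSC’s analysis: the true effect size (0.3 standard deviations), the group size, the “publish only significant positive results” rule, and all function names are assumptions made for illustration. Every study, original or replication, measures the same real effect with the same sample size; the only difference is that originals are filtered by significance and replications are not.

```python
import random
from statistics import NormalDist, mean


def run_study(rng, true_effect, n_per_group):
    """Return (observed effect, two-sided p) for one two-group study with unit variance."""
    treated = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    observed = mean(treated) - mean(control)
    se = (2 / n_per_group) ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(observed) / se))
    return observed, p


def simulate(true_effect=0.3, n_per_group=30, n_studies=5_000, alpha=0.05, seed=2):
    """Compare published original effects with unfiltered replications of them."""
    rng = random.Random(seed)
    originals, replications = [], []
    for _ in range(n_studies):
        effect, p = run_study(rng, true_effect, n_per_group)
        # Assumed publication filter: only significant, positive results appear in print.
        if p < alpha and effect > 0:
            originals.append(effect)
            replications.append(run_study(rng, true_effect, n_per_group))
    mean_original = mean(originals)
    mean_replication = mean([effect for effect, _ in replications])
    frac_significant = mean([1 if p < alpha else 0 for _, p in replications])
    return mean_original, mean_replication, frac_significant


if __name__ == "__main__":
    original, replication, significant = simulate()
    print(f"Mean published effect:     {original:.2f}")
    print(f"Mean replication effect:   {replication:.2f}")
    print(f"Replications significant:  {significant:.0%}")
```

Under these assumptions the published effects overestimate the true effect, because only the lucky draws clear the significance bar, while the replications average out near the true value and most of them miss significance. That doesn’t show this is what happened in the psychology literature, only that a real but modest effect plus selective publication is enough to give points falling below the diagonal.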

  • While most of the results of replications were in the same direction as the original study, there was an appreciable number (I count about 20%) that were close to showing either the opposite direction or no effect at all. And remember, even if there is no real effect in the original study, half of the replications will, by chance alone, be in the same direction as in the original study.
  • The OSC team also asked each team doing a replication whether they considered that their results actually replicated those of the original paper. This assessment was subjective, but mirrored the results based on p-value significance: only 39% of investigators concluded that their results replicated those of the original study.
  • Finally, it’s possible that many of the p values in replications came close to the magic p = 0.05 cutoff point, which of course is more or less an arbitrary threshold for significance. To see if that was the case, the authors did a density plot of p values in the original paper versus those found in the replicates. Here are the results, with p values from original studies on the left and from the replicates on the right.
Density plots of original and replication P values and effect sizes. P values.

As you can see, the p values for replications were distributed widely, and so were not hovering somewhere near the magic cutoff value for significance (0.05). Of course, all the p values in the original studies (left) were at or below that level of significance, or they wouldn’t have been published.

What does it all mean?

There are two diametrically opposed views about how to take this general failure to replicate. The first is to celebrate this as a victory for science. After all, science is about continually testing its own conclusions, and you can only do that by trying to see if what other people found out is really right. This, in fact, is the conclusion the authors come to. I quote from their paper:

Scientific progress is a cumulative process of uncertainty reduction that can only succeed if science itself remains the greatest skeptic of its explanatory claims.

The present results suggest that there is room to improve reproducibility in psychology. Any temptation to interpret these results as a defeat for psychology, or science more generally, must contend with the fact that this project demonstrates science behaving as it should. Hypotheses abound that the present culture in science may be negatively affecting the reproducibility of findings. An ideological response would discount the arguments, discredit the sources, and proceed merrily along. The scientific process is not ideological. Science does not always provide comfort for what we wish to be; it confronts us with what is. Moreover, as illustrated by the Transparency and Openness Promotion (TOP) Guidelines, the research community is taking action already to improve the quality and credibility of the scientific literature.

We conducted this project because we care deeply about the health of our discipline and believe in its promise for accumulating knowledge about human behavior that can advance the quality of the human condition. Reproducibility is central to that aim. Accumulating evidence is the scientific community’s method of self-correction and is the best available option for achieving that ultimate goal: truth.

There’s a lot of sense in this, of course. A result isn’t widely accepted (in most fields) unless it’s repeated or makes firm predictions that can be tested. Self-correction is a powerful tool, one of the most important characteristics of science, and one that makes it different from, say, theology.

The “all is well in science” interpretation is also that pushed by Lisa Feldman Barrett in her new NYT op-ed about the study, “Psychology is not in crisis.” (Barrett is a professor of psychology at Northeastern University.) But her piece is a mess, comparing failure of psychology-study replication to changing the environment in which a gene is expressed. In some environments, she says, a gene producing curly wings makes the wings less curly, a common phenomenon that we geneticists call “variable expressivity”. And that’s indeed the case, but it doesn’t mean that the “Curly” mutation doesn’t cause the wings to become curled, something she implies. Variable expressivity is not a failure to replicate the finding that a particular genic lesion is responsible for curly wings.

Barrett also compares the OSC study’s failure to replicate to other studies in which failure to replicate depends on “context” (e.g., mice given shocks when they hear a sound develop a Pavlovian response), so that one doesn’t see the same results under different conditions (mice won’t develop the Pavlovian response if they’re strapped down when shocked). But that, like the curly-wing result, is irrelevant to the OSC’s efforts, which tried to ensure that the context and experimental conditions were as close as possible to those of the original studies. In other words, the OSC tried to eliminate context-specific effects. In Barrett’s eagerness to defend and exculpate her field, and affirm the strength of science, she makes arguments based on false analogies.

One thing that we can all agree on (the middle ground, so to speak) is that there’s a problem with the culture of science, which always favors big and impressive positive results over negative results, and favors publication of novel results while largely ignoring attempts to replicate. (Sometimes a failure to replicate isn’t even accepted by scientific journals!) That’s even more true of the popular press, which is quick to tout findings of stuff like a “gay gene,” but can’t be bothered to publish a caveat when that study failed to replicate, as it did. This problem, at least in the scientific culture, can be somewhat repaired. Most important, we need more studies like that of the OSC, but with replications applied to other fields, especially biology.

And that brings me to my final point, which gives a less positive view of the results. As I said above, I think many studies in biology—particularly organismal biology—aren’t often replicated, especially if they involve field work. So such studies remain in the literature without ever having been checked, and often become iconic work that finds its way into textbooks.

In this way biology resembles psychology, although molecular and cell biology studies are often replicated as part of the continuing progress of the field.  I think, then, that it’s not as kosher to claim that ecology and evolution experience the same degree of self-checking as, say, physics and chemistry. Yes, all work should in principle be checked, but you find precious few dollars handed out by the National Institutes of Health or the National Science Foundation to replicate work in biology. (That’s because there isn’t that much money to hand out at all!) In my field of organismal biology, then, the self-correcting mechanism of science, while operative at some level, isn’t nearly as strong as it is in other fields like molecular and cell biology.

My main conclusion, then, is that we need an OSC for ecology and evolutionary biology. But it will be a cold day in July (in Arizona) when that happens!

Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science, 349 online, DOI: 10.1126/science.aac4716

130 thoughts on “On the poor reproducibility of psychology studies”

      1. The current consensus is that the above-mentioned paradigm is correct. My colleagues and I have been begging to differ for over a decade, but have been marginalized for pointing out what seem to us to be obvious flaws in the consensus (see below).

    1. I would put “wonderful” in scare quotes, but that’s my professional opinion (and what I think you may have meant by including that link in this thread, I hope).

      A bunch of pissed-off researchers have labeled us as HIV-deniers for daring to suggest that most of the African HIV study literature is garbage, because of failure to properly control for anal and blood exposures, the way it was done in the west. Apparently there are two standards for HIV studies. In any event, here’s a link to what we managed to squeak into the literature along the circumcision lines.

      And this list contains a sordid litany of what we consider to be flaws in the literature (from pub #90 on – synopses are provided, but you can also pester me for specific refs, if interested).

      1. Interesting. I went to quite a few talks about AIDS in the late 80s and early 90s and several speakers noted that one of the reasons why AIDS was so common in sub-Saharan Africa was that one of the most common methods of contraception used there was anal sex.

        Certainly back then it seemed the standard view that the fast spread was partly due to the relatively high rates of non-spousal and/or anal sex.

        1. It turns out this view is quite correct, in my opinion. It is also our opinion that much the same thing is happening in the US, esp. the SE US among African-Americans, esp. the rural poor. In some cultures in Africa, anal sex is not “sex” per se, as “sex” means what one does to make babies. Naturally, much was lost in translation when western researchers design their studies, so the relevant data are never captured.

        2. Way back when, we published a letter bringing into question some researchers’ conclusion that there was rampant “heterosexual” transmission among a cohort of poor Black women in Florida (1993). We questioned in the letter if anal exposure was collected. (“heterosexual” is stupid and uninformed — what researchers really should be saying is “penile-vaginal”, but this is yet another blind spot that has us bashing our faces against our desks all the time).

          The researchers replied that anal exposure was not collected… not because of reluctance on the part of the patients to reveal such information, but that the researchers themselves were TOO EMBARRASSED to ask. They did reply, in general, that anal exposure was prevalent among “high-risk” females, but that was it. Potterat JJ. HIV infection in rural Florida women (Letter). New England Journal of Medicine 1993; 328: 1351-1352.

          1. What is the current best estimate for the proportion of women who have penile-anal sex on a more-than-experimental basis? (I wouldn’t be surprised to find that the numbers differ for different populations.)
            From my days slumming it, I’d estimate for the UK, it’s 25-30% regularly have anal sex. Certainly large enough to be a significant issue.

            1. I don’t know offhand, but it will definitely differ by culture, age, and I would guess socioeconomic status (where it concerns so-called “survival sex” – either paid, or merely as cheap birth control). I do know it has been changing recently esp. among the young, perhaps in response to Internet porn.
              I also know that I have to stamp my little feet to get it included as a variable in network-type study designs, but rarely are my recommendations heeded (and I’m never in a position to force the issue).
              According to William Saletan who, despite writing for Slate, really does good research when he writes stuff, it’s somewhere around 1/3 of women who report it in the last month. (1-3% in the last encounter). There’s a link there to the science article (5 yrs old). There’s other stuff in the literature about a rise in rectal cancers being hypothesized to be caused by anal sex. (and that one is 10 years old). The trend goes way up in the <45 crowd.

              1. There’s very, very good reason to think that it’s actually very common amongst religious girls.

                Even a quarter century ago, it was common to hear about.


              2. Gotta keep that virginity. (I’d add it is also common for Muslim women, so that they can “save it for marriage”, as well) It explains the ridiculous incidence in SE African, predominantly Muslim, countries – where the punishments for sex outside of marriage are severe indeed.

              3. I can’t help but think of the blood on the hands of the Catholic Church (and, of course, Islam) with respect to the AIDS genocide. Their theories about human sexuality, after all, are the ultimate example of non-reproducible…and this insane notion that the non-corporeal perils of condoms outweigh the corpses produced by the policy…damn, but that’s some fucked-up evil shit.


              4. So … my back of the bedpost-notches estimate isn’t far from normality. At least, if I were in Seattle.
                What delusional idiot at Marie Claire thought that I’d be interested in beauty tips? Ohhh, it almost makes me wish Google had sold them my internet history. But why on earth did they get an anal virgin to write about anal? Bizarre – almost as bizarre as having celibate priests massaging the choir into impassioned cries.

            2. A PuffHo piece on buttsecs has a link to a CDC report, which I’d guess is also an underestimate.

              When I was working at the local health department, the CDC forms for HIV testing did a massive shift around 1993-4, and stopped including even asking the question as a risk factor. This HAD to have been intentional, most likely as a response to political pressure in response to the (then new) assumption coming from Jon Mann at the WHO that the astonishing attack rates coming out of sub-Saharan African women HAD to be due to penile-vaginal sex. Fucking idiotic assumption (based merely on a 50:50 M:F ratio) – but Jon Mann was no idiot. I think he realized he wasn’t ever going to get any funding out of the Reagan administration unless he concocted that fiction. It was war, and gay men (especially) knew it. The disease had to be “liberalized” to the general population for anybody to give a shit. Then Mann was martyred in SwissAir 111 off Nova Scotia, and it has been said that millions died with him.

              1. What derail? You’re giving first-hand from-the-trenches accounts of the perils of poor reproducibility! And in human behavior, too, even if in a subfield not typically labeled as psychology.


              2. And there’s no way to run the clock back, and do the experiment again. There’s only one lab. One shot. We fought like Marines, published like crazy, presented… it didn’t matter. Ideology trumped science when it really mattered. We’re self-publishing a book, BTW, and this is the final chapter in it. I’ll let you know when we get it out. Should be a week or two, I hope.

              3. Stephen, the Potterat studies seem to me to be very much focused on non-sex transmission in SS Africa (esp. reuse of needles) and, if I recall, attributed up to 60% of infections to non-sex transmission. What do you feel is the relative contribution of penile-anal transmission relative to non-sex transmission?

                p.s I too am interested in that book!

              4. Hi Colin! The most honest answer I should give to that question is really: “dipped if I know”. And that’s a travesty, considering we’re 30 years into the African epidemics. Unasked questions yield no answers. Considering how much money has been blown on SS African studies, and the lives at stake, it is worse than unconscionable. Lifting a relevant passage (I’m working on it now) from Potterat:

                Lastly, two of the most seasoned epidemiologists working in sub-Saharan Africa recently admitted: “We still do not fully understand why the spread of HIV has been (and still is) so different in sub-Saharan Africa compared to heterosexual populations in other parts of the world and why the incidence of HIV infection in young women in southern Africa is so high”.(Buve A, Laga M. Epidemiological research in the HIV field: towards understanding what we do not know. AIDS 2012; 26:1203-1204.) This stunning confession of ignorance, made in 2012 — three decades into the African HIV epidemics — indicates that the original question “Why Africa?” is still very much with us and, therefore, a crucial and urgent challenge. Any bets that comprehensive risk factor assessment, rather than the monochromatic focus on heterosexual sex, might help solve this puzzle?

                I’m not sure, but I think the “studies” are actually only estimates, probably the work of David Gisselquist and colleagues (incl. later collaborations with Potterat). I’d have to ask when I see him tomorrow or Saturday. I’ll let him know you’re interested, thanks!

              5. Hi Colin, trying to keep the vertical space down… Apparently, Gisselquist & John had been canvassing the African literature and had come up with a *rough* (generous) estimate of 25%-35% incidence due to PVI. Nowhere near 90%, which is STILL the @#$#@ consensus. The whole point of the paper was not to get any hard numbers (no way to do that, anyway), but to show that there is trouble in toyland, and to get other researchers to prove them wrong. Instead, the ad hominems flew, accusations of Duesbergian denialism, comparisons with Linus Pauling in his dotage, etc. Others who dared so much as to say they had a point, or just to raise the issue in print, got pushed out of their positions or otherwise intimidated – instead of scientists doing their damned jobs. The paper in question is here. Twelve years ago now, and still no appreciable numbers of studies that properly ask about the biggest non-PVI routes (subcutaneous, anal). Maybe 10 studies do (out of ~950 by now). So that’s what that was about… just trying to get the ball rolling. And then you wouldn’t believe what happened next… (trying to sound like clickbait, but we are really disappointed and baffled by all the crap that shook out of this and other papers they did – for merely pointing to anomalies that should be addressed). Since then it’s been either deafening silence, appeals to ecologic data, or ad hominem / power plays / turf defense. And still no answers. nuff said.

              6. I almost hate to ask…but what’s the religious demographics of the field like? The official line sounds like what I’d expect the Catholic Church to dictate.


              7. I cannot think of a single person dominated by any religious ideology, whatsoever. The only thing everybody seems to worship is Jon Mann’s 90% figure, which was never published anywhere… has no basis in nothing. (We asked officials at the CDC, at the WHO in Geneva, at UNAIDS for a reference. People were like “sure, we’ll get back to you on this”. Then silence, then more back and forth re-requests. Eventually, people had to concede they didn’t know.) It was literally pulled out of Mann’s ass in desperation in the 80s, from a table-napkin estimate he made from ASSUMING a M:F ratio of approx 50:50 in Uganda was due to “sex” and nothing else (not dirty needles in pre-natal clinics, for example). Then he hopped a doomed plane, and now the behaviorists are holding onto this figure from Jesus Mann because it keeps their money flowing. You can dump all the condoms you want on Africa (it’s been done), and the only thing you are controlling are the STD rates (they have predictably plummeted, HIV has not). You’d think just this one piece of evidence in isolation would make our case, wouldn’t you?
                Seriously… the local street people here have more sense than the leaders in my field.

              8. Then…do you have any ideas, even wild-assed guesses, why they’re so staunchly dogmatic? Could help come up with ideas of how to get through to them….


              9. Wild-assed guesses, but evidence based: the weight of folks like Roy Anderson & Sevgi Aral-CDC (among others at the top) is behind it. They contend that, *even if John is correct*, such conclusions should be repressed… because if it got out, Africans would stop taking their anti-malarial injections, and the death toll would be worse. They invited John to Geneva to meet with a panel, including Roy & Sevgi, ostensibly to have a collegial discussion about the issue of nosocomial risks driving the epidemics. When John showed up, they merely read John the riot act, telling him to shut up about it. It was not a collegial discussion, but a one-way list of demands. There was regrettably no video or recording devices. Trust no one, is now my motto, when dealing with these scum.

              10. Would a single-use needle regimen really be that big a deal? Seems like a truly negligibly marginal fractional cost in Western medicine, which makes me wonder why it wouldn’t be the global standard. Or am I playing the “let them eat cake” song?

                And it’s not like AIDS is the only blood-borne disease. I seem to remember single-use needles becoming the norm in the West long before the AIDS epidemic, with AIDS simply being yet another reason why needle reuse is a very bad idea. And, certainly, it’s been common to sterilize surgical implements almost since the days of Louis Pasteur….


              11. We’ve argued for auto-disable syringes since we met Gisselquist & Brody in 2000 or so (and Gisselquist was spearheading the argument before then). But sucking from the behavioral/modeling/policy teat is just like alcoholism: the first step is to admit you actually have a problem. The other issue is that, even if this was the norm in formal settings, there are informal settings that are likely to be more prevalent (backyard dentistry, family members, e.g.). I don’t think it was a coincidence that the HIV explosion in South Africa coincided with the fall of Apartheid. Back then we were all dancing in the street. The reality was that the ANC would become an entity that Desmond Tutu would eventually blast as being worse than Botha’s regime. The suppression of traditional healers was suddenly lifted, and unlicensed medicine proliferated. YAY!

              12. Oh, wow. I had no clue. The description of common medical practices in that link….

                It now occurs to me that it would do basically the entire world a world of good to make basically all syringes be auto-disable, with reusable syringes something practically impossible to get unless you’re somebody with a legitimate medical need. And tax the shit out of reusable syringes, as well, with the tax money directly subsidizing single-use syringes.

                Never mind AIDS; what it would do for all bloodborne illnesses would be of huge benefit for everybody.

                And I’m again flabbergasted with the AIDS link. It’s common wisdom that needle sharing amongst drug users is at least as big a risk factor as any sort of sexual practice, and, for decades, it’s been a pre-screening question for blood donors and an automatic disqualification. Along with “accidental needle sticks.” No matter how much transmission of AIDS is sexual, it would seem blindingly obvious that poor needle hygiene would simply have to be a major vector.

                I would have previously assumed that the fact that nobody talks about needle hygiene being a factor in the African AIDS crisis would be the obvious naïve conclusion: needle hygiene in Africa is roughly up to Western standards. But this short paragraph completely rewrote my understanding:

                As for “iatrogenic” (which means physician-caused), it’s probably most fair to think of non-sterile punctures as being mostly perpetrated by non-physicians. There are too few physicians in Africa to cause such a massive epidemic (or so it seems to me); they tend to practice at the district or academic institution levels. Rather I suspect that it is the “barefoot doctor” medical or dental provider, or the informal village injectionist, or family members who inject each other (a common practice in Uganda, for example) who have probably done the most damage. Thus prevention initiatives to discourage iatrogenic HIV transmission won’t be as easy as you may think: these folks may be hard to reach and teach. My guess (who knows?)

                That suggests to me that the common person in rural Africa is exposed to worse needle hygiene than the average American heroin addict — preposterous on the face of it, but such an obvious cause for suspicion in epidemiology!

                Damn. Scary stuff, indeed.


              13. Yep. Needle-sharing is second only to outright transfusion. Worse than anal. (which is really bad). And there’s totally unacceptable amounts of Heps B & C there. Roy Anderson’s protege, Geoff Garnett (who we once invited to Colo Spgs, early 90s, to discuss the network science we were doing) attacked us in the pages of Nature with an article that (along with a snarl of differential equations) “proved” we were wrong because the geographic distributions of Hep C compared to HIV were so dissimilar. It was the cover article, with a big glossy map of Africa’s hep C distribution on the cover. We carefully crafted a collegial, well-supported (and excellently written) response, which was rejected by Nature, of course. (it’s the mouthpiece of people like Roy Anderson – plebes like us need not apply). We showed the argument was a red herring, as there were unacceptable amounts of bloodbornes across the board. The rebuttal had to be published in some minor journal, tucked away in a corner of the literature no one reads. I wound up begging off on that piece, as I didn’t agree with regressing those data… it was regressing a Rorschach blot of points, that had no linearity to them whatsoever. I merely took a ruler, eyeballed it, and concluded that simple proportions sufficed. (“if it ain’t in the percentages, it ain’t there”.) So much for HIV epidemiology as a science, eh? If most people knew this shit, they would scream bloody fucking murder. This makes Tuskegee look like a picnic, by public health malpractice standards.

              14. “proved” we were wrong because the geographic distributions of Hep C compared to HIV were so dissimilar

                Wait — I’m confused.

                Is it expected that different diseases should have the same geographic distributions simply because they share some modes of transmission? Considering that different strains of influenza don’t have the same geographic distributions, I’m not sure that it’d even occur to me to consider that as a factor. And wouldn’t you have to establish that you’re not just looking at noise by comparing with rates of dissimilar, even non-infectious diseases? What do the African geographic distributions of gonorrhea, influenza, malaria, and lung cancer look like compared to HIV and hepatitis? For that matter, how do different strains of hepatitis map out?

                What am I missing?

                …and…again…if you’ve got a big needle contamination problem in Africa, wouldn’t you want that to be a major focus of health care reform, regardless of what role it plays in the spread of AIDS? I mean, if we found that health care workers in Africa weren’t washing their hands properly, we’d devote a lot of attention to that, no?


              15. And these arguments were promoted by Roy Anderson. He’s the guy who, along with Robert May, invented modern epi. R0=BcD… the reproductive rate of disease. A startlingly useful concept. Esp. in influenza. It’s boggling. All the onlookers see this, look at the diff eqs, and go “yup, they’re right”, end of discussion. My personal feeling is that most pros in my field just don’t want to admit they don’t understand the math. No one wants to look like a fool, so they fall for the argument from authority (in this case, the most appropriate authority there is, but still).
                Outsiders see right through this shit in a second.

              16. Going to a demographic conference in Copenhagen – back around ’96 or so… too lazy to look it up… I was walking there with Geoff Garnett, later one of the authors of that article. I mentioned that I thought it was transmitted the same way there as here… essentially up the butt and through the skin. He told me that the demographic surveys ALL indicate that homosexuality is rare in Africa – virtually non-existent. Swallowing my incredulity for a moment, I politely asked him if he knew what the typical punishment for being a homosexual in Africa was. Answer: a gruesome death… sometimes followed by having your family members offed as well. He just looked at me sideways w/ a little half-smile, had no rejoinder to that, and we kept walking. Similar ignorance re: injections – their response: IV drug use is virtually unheard of there. Yes, but not similar behaviors. “I toured all the hospitals in Kinshasa, and the facilities were all fine – medical personnel were using new or autoclaved sharps. You don’t know what you are talking about.” The argument from “I’ve been to Africa, and you have not.” blpblpblpblpblp. Later demo surveys have shown that there are indeed (surprise) male homosexuals in Africa. Silence from those fuckers. We point out that we’re not talking only about formal settings… silence. It really has been this stupid. I know it sounds like I’m making this all up, but the various arguments have all made it into the literature for all to see. So people in my field should all be eager to dig up these turds and shove them in the faces of the assholes that cranked out this stuff in the first place, right? Silence. No one wants to look stupid, and no one wants to “point fingers” at the affected “groups” in the west. It’s a confusion of moral thinking and practical thinking. Still think buttfucking is a major route? You aren’t hip to the last few decades of enlightened thinking, you homophobe. Etc. It goes on and on.

              17. He told me that the demographic surveys ALL indicate that homosexuality is rare in Africa – virtually non-existent.

                At best, that’d have to be some sort of “noble savage” claim — that our primitive ancestors had the pure archetypal form of sexuality that the West has corrupted but those closer to our roots still retain. For an epidemiologist to seriously consider something like that, and not ineffective and / or inadequate survey techniques (“what the typical punishment for being a homosexual in Africa was. Answer: a gruesome death”)…damn.


              18. I still have to pinch myself. These are the leaders in the field. But if we’re right, then huge swaths of their work can be easily shown to be barking up the wrong tree. Their precious R0=BcD breaks down (which is actually what we showed, under conditions of the disease being less “democratic” – and I cannot think of a more unfair disease than HIV, transmission-wise, besides strictly genetic diseases).

              19. That was another thing…wouldn’t geographic distribution be relatively meaningless in the growth phase of a new epidemic? Lots of opportunities for chaos, especially with today’s near-instant transportation infrastructure — so-and-so went home to this-and-such town where the disease spread, but just as easily could have spread from any other town. Indeed, I’d think it would really only be useful for things like malaria that have reached equilibrium and have local causes.


              20. Yep. And this was the argument used originally by the CDC: “the fact that HIV rose among gays first in the west was an accident. It just happened to hit there first.” That argument was the central motivation behind Project 90 here… we turned over every pebble to demonstrate a more “democratic” HIV, by elucidating the networks of female pros & IDU, precisely to demonstrate the veracity of this hypothesis. (the hypothesis was stupid in the first place, though, since HIV was similarly popping up in the gay communities of NYC, Houston, Dallas, Chicago, Minneapolis, you name it). One would expect an exception or two here and there. This is not rocket science. Laypersons tend to know this stuff. You would be astonished to attend an HIV conference and talk to true believers, though. It’s like entering upside-down world.

              21. Note, also, that the only folks web-publishing discussions like this are conservative rags. The Horowitz crowd. I have to go wash now. (John is one of the most apolitical people you would ever meet).

              22. oh… my guesses. The behaviorists/modelers got there first, and their NGOs are making tons of money with their condoms and sex-ed programs. To actually admit to the fact that the problem would be better dealt with by clamping down on blood safety & really have a good look at touchy political issues like traditional healing — is like asking them to willingly chop off their fingers and toes. Besides, everybody has kids they need to put through their Ivy League educations. There’s absolutely no reasoning with them. Not even in the scientific discourse. It’s over, done. What we have is the new “truth”. It is YOU that is now corresponding with a crank (me). Post-modernists are actually correct here. What everybody thinks they know is really a hegemonic construction, and there’s not a damned thing anybody can do about it anymore. A whole new generation of researchers will have to learn from our documentation of the events, and this is unlikely indeed. But we’re still documenting, anyway.

              23. BTW, that Gisselquist/Potterat paper was in the Int J STD&AIDS that got its founder and Editor-in-Chief Wallace Dinsmore pushed out. Fired, essentially by his own Board. For devoting an issue to airing problems with the consensus view.
                Another person who worked for the Science Director of the CDC decided to take us up on the challenge, and used one (rare) dataset that had medical injection info in it as well as sex info…. just to show how wrong we were. Controlling for injections, the sex variance disappeared. Her boss, the head of science at the CDC, prohibited her from publishing, but (because her underling had a conscience) she found a way to get the results out anyway. She was subsequently pushed out and is now in Mississippi, and is now a “convert”.
                Now you know things are bad when I have to use religious language to describe a scientific situation. A review she subsequently published on a book of Gisselquist’s: http://std.sagepub.com/content/20/8/592.1.extract

              24. Oh you can have anal sex if you’re not sterile. It’s not a very effective form of contraception anyway. Jizz dribbles into all sorts of places.
                I should have kept that old bedpost.

  1. In this way biology resembles psychology, although molecular and cell biology studies are often replicated as part of the continuing progress of the field. I think, then, that it’s not as kosher to claim that ecology and evolution experience the same degree of self-checking as, say, physics and chemistry.

    Point granted — though I suspect psychology is probably on the further end of that spectrum. Individuals routinely interpret questions and/or situations in unexpected and unique ways and that must be a real bitch to control for. It adds yet another layer to the variance of biology.

    1. My guess would be that both regular biology and psychology are more reproducible than corporate-based pharmaceutical research. Once a company has sunk a few hundred million or few billion into a drug development, there is going to be a lot of institutional bias and pressure to come up with marketable, positive results and a strong institutional bias not to discover anything that would cause the drug to be flushed.

  2. The original articles were published in specialized psychology journals. The article on the replications has been published in Science, a general journal with a much broader readership and “impact”. My point is that there could be funding for this type of research in any STEM field (replication only) as it would have greater impact than the original research. Reproducibility in scientific research is a hot topic these days and would get coverage and scientists’ attention. The bold funding bodies willing to pay for this kind of work would also benefit from this exposure.

    1. Reproduction by itself is not relevant. Consider a psychology or physics or chemistry experiment that reports a result that no one cares about. Reproducibility is not a concern when the result is viewed as non-innovative or uncovering very little truth about the world.

      Much of science is so incremental it is easy to let much of it just slip into the cracks without verification, because the results concluded in some research are just not very profound, or worse, arbitrary.

      If I set up an experiment that shows that listening to Soundgarden instead of Subotnick helps fifth graders finish their math homework faster and with fewer mistakes…what is the point? Is it arbitrary? It might be completely reproducible, but to what point? What about all the other music comparisons or environmental aspects? In the next study I will have some of them smell lavender while the others get a whiff of kitty poop.

  3. It seems to me that one reason (in addition to the much simpler systems) that one gets reproduction in physics and chemistry (say) and less in high level fields is because of the theoretical integration that gets performed. There are few general theories in biology and almost none in psychology as far as I can tell, so the “consilience of inductions” that happens routinely especially in physics does not happen in these fields.

      1. But it isn’t *just* replication that gets performed. Think of Newton putting together the works of Galileo, Kepler, Descartes, etc. into one unified theory – this allows a sort of “strength in numbers” thing.

      1. I suppose the answer to Feynman is to ask what such laws might consist in and whether ‘tangible’ laws such as one has in physics and chemistry could possibly exist where the study of society is concerned and how useful such laws might be.

  4. Interesting to hear from a biologist that biology might be closer to psychology than physics in terms of reproducibility.

    We usually tend to think that the problem is mostly in areas like psychology or pharmacology, where the large numbers of fairly speculative studies, and publication bias, mean that it’s almost inevitable that a high proportion of reported positive results will be false positives.

    In contrast, the Higgs boson work didn’t just use the usual physics cutoff of 5 standard errors (hence type 1 error less than one in a million), it also had two separate experiments running on it (Atlas and CMS).
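
    As a rough check on the one-in-a-million figure, here is a minimal Python sketch that converts a threshold in standard errors into a one-sided Gaussian tail probability. The function names are made up for illustration, and treating the two detectors as statistically independent is a simplifying assumption of the sketch, not a claim about how the Atlas and CMS results were actually combined.

```python
import math


def one_sided_tail(z: float) -> float:
    """Probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def both_exceed_by_chance(z: float) -> float:
    """Chance that two independent experiments both exceed z under the null.

    Independence is an assumption made for this illustration only.
    """
    return one_sided_tail(z) ** 2


if __name__ == "__main__":
    p = one_sided_tail(5.0)
    print(f"one-sided tail at 5 standard errors: {p:.3g}")
    print(f"roughly 1 in {1 / p:,.0f}")
    print(f"two independent experiments: {both_exceed_by_chance(5.0):.3g}")
```

    Under these assumptions the tail at 5 standard errors comes out at a few in ten million, which is consistent with "less than one in a million".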

  5. What about a site like PubPeer (https://pubpeer.com/)? I know it doesn’t deal with the issue of how to pay for the work to test for reproducibility, but do you think that allowing for anonymous, referenced post-publication review by other scientists could help?

    By the way, this is my first time commenting, so I’d just like to say how much I enjoy this site and find the science coverage here to be particularly insightful.


  6. More people will watch the current film “The Stanford Prison Experiment” today than will read this column or its sources in a few weeks.

      1. Indeed. While I think it probably would be impossible to ever get a proposal to conduct such a study today accepted for ethical reasons, there tragically exists, as far as I understand it, ample evidence from the real world to shore up Zimbardo’s observations, not least what happened on the night shift in Tier 1A at Abu Ghraib in Iraq back in 2003.

  7. Is there any indication that the successfully repeated experiments had any correlation? In other words, are some varieties of psychological experiments easier to repeat than others, and if so, what are the similarities?

  8. Two days ago, I came across this piece of psych in Nature on a rather complex (convex hull) semantic analysis, and how the researchers claim 100% predictive value for progression to schizophrenia (among 34 people already classified at-risk). My better half is going over this in more detail (she’s more of a linguist than I am). Extremely exciting to us, if true… right up our alley. But small sample size as well as knowledge of repeatability problems in the literature have us taking this with a lot of circumspection.

    It would be nice to hear what any of youse alls think about this piece…

  9. Excellent article!

    However, I think it fails to emphasize a key point; scientists need to hold journalists (editors) accountable when their work is misrepresented. I can’t count the number of popular press articles I’ve read that have announced something, “amazing, revolutionary, will change the field, etc.”, and then when I actually read the articles, the authors are very tentative about their results.

    What I don’t see is the authors of the original articles firing back about how their results have been overstated. Maybe they actually do but their replies aren’t published. Or maybe the authors think it’s just not worth the effort. Either way, I think it is a problem that needs to be addressed.

    1. How would that even work? Journalists don’t answer to scientists; we aren’t their bosses.

      And forget libel or slander, there’s pretty clearly no journalistic malice towards scientists here.

      1. Same way they publish corrections of other articles – or better, an improvement on same. Also better (less sensationalistic) headlines could be done.

        Peter Danielson (a philosopher at UBC) invites journalism students to take or audit his courses in ethics of technology, etc. because he figures students (of technology and of journalism) should learn to interact in both directions in the technology (and science, but that’s less his thing) type situation.

  10. “arguing that this failure to replicate doesn’t constitute a scientific crisis, but simply shows science behaving as it should: ”

    Can’t it be both? It does concern me that we know even less about psychology than we supposed. A “three steps forward, two back” sort of thing.

  11. I do not want to seem cynical, but one hears from time to time that a lot of psychology studies focus on testing college students. If that is the case, the results can be of limited value even if the replications (on other college students) gave similar results.

    1. If I understand the argument (and data) correctly, this appears in most cases not to be an issue, since much of psychology is interested in the functioning of normal people, in normal everyday situations, and college students, by and large, do not seem to differ fundamentally from other people in most respects… (as hard to believe as that might be 😉 )

      1. College students are overwhelmingly 18-25 years old, from financially secure households, and have above-average intelligence and academic achievement. And that, indeed, is unrepresentative of society — let alone the species.


        1. The question is not (I believe) whether they differ along some variables, of course they do, but whether those differences are relevant. In many cases it seems not to be the case. To quote from Roy Baumeister’s “Social Psychology, 3rd Ed (2014)”, page 31-32,

          “Periodically social psychologists seek to replicate their studies using other groups. In general, the results are quite similar … when they do differ, it is often more a matter of degree than of behaving according to different principles. A social psychology experiment typically seeks to establish whether some causal relationship exists – such as whether frustration causes aggression. As it happens college students do become more aggressive when frustrated, but so do most other people. It might be that some groups will respond with more extreme aggression and others with less, but the general principle is the same: Frustration causes aggression … when college students do differ from other people, these differences are probably limited to a few specific areas, and researchers interested in them should be cautious.”

          1. Even still, these sorts of things can and do change, and sometimes radically, over a lifetime. Just look at all the regulars here who were fire-breathing evangelists when college aged and are now sober rationalists…that sort of thing alone is going to radically skew those sorts of results.

            I know I myself would likely have significantly different results today on the tests the grad students performed on the non-major undergrads as part of their studies. One in particular stands out, though fuzzily through the mists of time…I was given some sort of meaningless task, followed by some feedback, with some sort of self-assessment before and after. I don’t remember the details, but the experimenter gave negative feedback that later turned out to have been scripted and not at all related to my actual performance. And, of course, my own self-assessment was negatively influenced.

            Today? I’d laugh in the experimenter’s face, as would many my age and older.


            1. I wonder if we are talking past each other here?

              The critical point (to my mind) is not if you can find exceptions, but rather if those exceptions are frequent, and large enough to invalidate the results. The scientific data seem to indicate that that is not the case, as stated in my quote above; ‘when these experiments are replicated with other age groups, from white haired old geezers to midlife working men and women, the results are in general, similar.’

              While you can easily point to dramatic changes in personal belief, the question is (I think) how much, if at all, those changes impact fundamental psychological and cognitive processes and constructs. Will, for example, a change from devout Christian to stout atheist change a person’s performance on a Stroop test, the degree to which resisting temptation drains willpower, or their score on an IQ test?

              We can probably all also find some aspects on which we have changed quite a bit since our student years, but the more urgent (and relevant) question is rather what proportion of our total psychological makeup this really represents.

              While those instances where we have changed often stand out like sore beacons in our memories, those where we have not are more likely relegated to oblivion. I think this often traps us into believing that our student selves have changed much more than we truly have, which was the background to my quip in my first comment to Mark above… 🙂

              1. We’re in agreement that people change, and that at least some of those changes are significant.

                I think we’re also in agreement that we don’t have anything save for some very, very spotty research suggesting that there are some similarities. That’s not surprising; most of us would also agree that, despite changes, there’s a great deal of similarity.

                What’s missing is a calibration standard that can be reliably mapped for extrapolating from the very small, very homogenous, very unrepresentative sample to the entire population.

                A very significant data point in my favor: tell a political pollster that your sample only included students at a certain college, and the pollster will tell you that you’d be nuts to use that data for anything outside of the college. Think of all the various psychological factors that go into choosing a political party or candidate, and how hard it is to get a good measure of a simple near-binary choice like that, and how important it is to have a broad, representative sample…and we’re now supposed to think that the whole field of psychology rests on surveys of college students?


  12. Science does not always provide comfort for what we wish to be; it confronts us with what is.

    That’s the faith v fact dichotomy distilled down to its essence.

    Turning the scientific method on itself, using science to determine how science can be made more effective…that sort of recursion is not only the most powerful way to drive science forward but is the answer to all those philosophical complaints that science needs some sort of philosophical justification.

    Two reforms that come to mind that would help…first, a standard part of the grant process should include independent verification. You get your grant to do your research, but the award comes with some mechanism for funding somebody, perhaps somebody entirely unknown, to take your results and attempt to replicate them. Of course, this would be twice as expensive as the current practice and would likely lead to half as much original work…but, at the same time, today we’re laboring under the false-but-comforting thought that we’re twice as productive as we really are.

    Second, we need desperately for people to publish negative results. If nothing else…how is anybody else to know that you’ve already looked for something but found nothing? How many people are wasting time chasing down dead ends that have already been explored? The journals should be eager to publish papers that say nothing more than, “Well, I was hoping to find such-and-such but there just wasn’t anything there that I could see. Here’s where I searched; you might want to think about looking elsewhere.”

    That last one would, of course, have the additional benefit of putting findings into perspective. If you do twenty experiments and find a single one with p < 0.05 and report on all twenty experiments then people, hopefully, won’t get nearly so excited about the non-anomalous anomaly.
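
    To put a number on the twenty-experiments point, here is a minimal Python sketch. It assumes the experiments are independent and that none of the tested effects is real; the function names are hypothetical and chosen only for this illustration.

```python
def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent tests of a true null gives p < alpha."""
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Average number of spurious 'significant' results among n tests of true nulls."""
    return n_tests * alpha


if __name__ == "__main__":
    n = 20
    print(f"P(at least one p < 0.05 in {n} null experiments): "
          f"{prob_at_least_one_false_positive(n):.2f}")
    print(f"expected number of false positives: {expected_false_positives(n):.1f}")
```

    Under those assumptions a lone significant result out of twenty is about what chance alone would produce, which is why reporting all twenty changes how the one "hit" should be read.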


    1. Very good points. My only thought – where are the incentives coming from to direct the participants in this direction? Who would fund replications? Who would be willing to perform the replications? What journal would want to publish a wider spectrum of results? We can imagine a single authority to herd all these cats toward a brave new enlightenment, but isn’t a single authority anathema to the culture and philosophy of science?

      1. I think it’ll have to come organically from within…which should, hopefully, come naturally as a result of critiques such as this one. When scientists and publishers and grantors alike all come to value the importance of replication, and to come to accept that it’s no more reasonable to omit replication than it is to omit peer review…then they’ll adjust budgets accordingly.

        After all, look at it from the other direction. Imagine how many more papers could be cranked out if you didn’t bother with the hassle of peer review. Wouldn’t it be so much more efficient to eliminate that entirely? And yet the mere suggestion seems insane because we know what would happen if we did so.

        Well…we now know what does happen without replication and that, in reality, it’s just about as insane to rely only on peer review and eschew replication as it would be to eschew peer review itself.

        …hmmm…another possible adaptation might be that peer review could be used as an initial filter, and only those studies that make it past peer review could be considered for replication. If you’ve gotten peer review, great, but that’s no more cause for celebration than getting the grant approved. It’s when your work has been replicated that it’s considered solid.


        1. At least there could be a replicator signed up before publication. Nice thought.

          I certainly hope “from within” takes hold.

    2. Yep, yep, and yep. Yeppity-yep.

      I think there has been a push from various corners to do just this, and I was able to see such an initiative in psych.

      I also see a published paper calling for replication, this one in the field of economics.

      We’ve also been vexed in our field by authors who do not make relevant data available, sometimes despite a publication’s policy that authors do so upon request. Five years ago I was co-author on a Lancet letter (Brewer DD, Potterat JJ, Muth SQ. Withholding access to research data (Letter). The Lancet 2010; 375: 1872.) which was a reply to an editorial there about making data available to fellow researchers. We pointed out in that letter that one of the editorial’s authors denied us access to their data on several occasions.

      Things are not right in science-land. (at least in my field). Pretty abysmal, actually.

    3. I thought the same – there need to be better measures of success that would compel scientists to replicate studies. If it’s important, we should measure it and if we measure it in the right way, people will want to do it.

        1. Yes, it pretty much reflects my ideas about managing and improving processes. So many times I’m disappointed though – I fear, like the many people I’ve tried to help, they want to change but they don’t really want to do what it takes to change. I guess that’s the human condition.

          1. For whatever reason, it seems that it’s very, very, very difficult for humans to see the world as it is as opposed to how they think it must be. And then even more difficult for people to figure out how to change to address reality, let alone to effectively change reality to their liking.

            …I know I struggle with it, myself, even with something like the trumpet that I’ve been working on since I was ten and have a college degree in….


    4. What you said at the end there, especially. The TOP guidelines that Jerry referenced include preregistration, which does exactly the record-making of all twenty experiments which you demand – and more. The link explains what preregistration entails (and also argues that it shouldn’t be required for certain cases – but I think it should).

      1. Hmmm…seems pretty clear that preregistration is certainly an ideal to strive for. The link makes some good arguments for why it might not be a good straightjacket…but I think it would be safe to suggest that any exceptions would necessitate some full disclosure. “We set out to do such-and-such and would have called it quits, but we noticed this amazing thing we didn’t expect and so made the decision to deviate from protocol.”

        …and probably followed by an urgent request for independent replication….


  13. Very well written assessment. Thank you.

    This is old news in some sectors however (though it is certainly worth repeating and looking at other fields). Steven Novella has a nice summary of a similar problem in medical trials: http://theness.com/neurologicablog/index.php/are-most-medical-studies-wrong/

    He succinctly summarizes the publication bias of p<0.05 thus:
    "This results from the fact that most new hypotheses are going to be wrong combined with the fact that 5% of studies are going to be positive (reject the null hypothesis) by chance alone (assuming a typical p-value of 0.05 as the cutoff for statistical significance). If 80% of new hypotheses are wrong, then 25% of published studies should be false positives – even if the research itself is perfect."

  14. I think this is a much bigger problem for psychology than other sciences, regardless of what the actual numbers are.

    Psychology is about us. People with agendas will latch onto any study that seems to support their pet view.

    Beyond that, the studies themselves are usually conducted by people who have a particular agenda themselves. They don’t rig the results, but they create tests they think will show the results they want, whether consciously or not.

    One example that comes to mind, though I can’t remember the details to cite it, involved color association and timing. The gist was that participants were asked to indicate that something was positive or negative, and were slower to choose positive when the color black was involved (“good” in black text, or something like that). The effect was stronger among white people, but existed among blacks as well.

    The interpretation of the results was that people are racist, and that racism was so prevalent that even black people had some internal racism.

    The rational interpretation would be that the colors black and white have a long history of negative and positive associations, respectively. He had a black heart. It’s always darkest before the dawn. White knight. She has a bright and sunny personality. It’s not hard to see why a diurnal species would have an innate preference for light over dark.

    The difference in delay between white and black people can plausibly be explained by the latter group having created a culture of racial pride (which is racist, of course, but done in response to overwhelming racism), which would make such a person more likely to automatically associate blackness with goodness. Though still with a delay due to the diurnal prejudice.

    So I think it’s not only important to check that results can be replicated, but that the interpretation of those results actually makes sense – that the tests involved actually show what they are purported to show.

    1. The gist was that participants were asked to indicate that something was positive or negative, and were slower to choose positive when the color black was involved (“good” in black text, or something like that). The effect was stronger among white people, but existed among blacks as well.

      There was an online version of that somewhere. Something along the lines of good and bad words paired with photos of Europeans and (recent) Africans, one pairing on the left and the other on the right, and you had to press a key corresponding with good or bad. Probably not exactly that, but you get the gist.

      There were two rounds, one associating good with Europeans and the other with Africans. I don’t remember the particulars, but I did fairly well on the first one and rapidly accelerated with the progression. And then they swapped, and I slowed down dramatically as I worked to unlearn and invert everything from the first round.

      The test interpreted it to mean that I had an overwhelming instinctual love of Africans and a powerful hate of Europeans. In reality, all they had measured was how quickly I could make and subsequently reverse a particular pattern association. Give me the same test with, say, birds v fish paired with intelligent and stupid words, and I’m sure I’d again do much better on the first round than the second.

      Or, again: even if you really have measured something, how do you know that what you actually measured is what you think you measured?


  15. I am reminded of the story about the man who had cancer. His doctor recommended a newly-introduced chemotherapy drug. He wearily replied “Yeah, let’s go ahead with that. I want to use it while it still works.”

    In the NYT op-ed by Lisa Feldman Barrett, she implied that most of the nonreplication could be due to changes in the conditions of the experiments, when there are subtle interactions present. But as you imply, some of it could simply be what is expected owing to the false positives in the original studies: 5% of all studies where the effect is not real will show P 0.05.

    What fraction this will be of all results depends on what fraction of studies test effects that are actually real. This is unknown, and may well be different from field to field.

    It may not indicate a field in crisis, so much as a field that tests a lot of wrong hypotheses.
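
    A rough way to see how much the unknown fraction of real effects matters is the minimal Python sketch below. The function name, the prior fractions, and the 80% power figure are all assumptions chosen for illustration, not numbers from the OSC paper.

    ```python
    def significant_result_breakdown(prior_true, power, alpha=0.05):
        """Return (share of significant results that are real,
        expected share that reach significance again on replication).

        prior_true: assumed fraction of tested effects that are real
        power:      assumed chance that a real effect reaches P < alpha
        """
        true_positives = prior_true * power
        false_positives = (1 - prior_true) * alpha
        share_real = true_positives / (true_positives + false_positives)
        # On replication (same power and alpha), real effects come up
        # significant with probability `power`, unreal ones with `alpha`.
        replication_rate = share_real * power + (1 - share_real) * alpha
        return share_real, replication_rate


    # Hypothetical fields that test mostly wrong, mixed, or mostly right hypotheses.
    for prior in (0.1, 0.5, 0.9):
        share_real, rate = significant_result_breakdown(prior, power=0.8)
        print(f"prior={prior:.1f}  real={share_real:.2f}  replicates={rate:.2f}")
    ```

    Under these assumptions, a field in which only one tested hypothesis in ten is true would see a much lower replication rate than a field in which most are true, even with identical methods.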

    1. Somehow the comment I just made turned out to say “P 0.05” when I typed “P < 0.05”. I also added that, upon replication, the cases where the effect is not real will show nonsignificance 95% of the time.

      (The field may not be in crisis, but my ability to type a coherent comment is in crisis).

      1. Interesting observation, which may tie in with Keith’s comment above on the lack of theories in some areas. (Or, I would add, the less constraining regularities, cf. selection under a selective pressure vs falling under gravitation.)

  16. Random thought: Surgical interns (for instance) typically don’t get to do cutting edge procedures right out of med school. They have to learn how to do the routine stuff first.

    So maybe a similar model could apply to science grad students. Instead of pushing them into original research right off the bat, let them first learn how to replicate work done by others. Whether or not the replication succeeds, they’ve learned something about the nuts and bolts of doing science. And if they manage to debunk a celebrated result, then they’ve contributed something useful and earned the attention of luminaries in their field.

    This might also help improve the quality of published papers, if authors know their work is going to be thoroughly checked by an army of grad students at other institutions.

    1. I really like those ideas.

      And it’s not at all a stretch from the current system. In an undergraduate physics class for non-majors, I replicated, for example, the derivation of absolute zero; in high school chemistry, buffer systems. It would be a very natural progression for more advanced students to replicate more advanced and more modern studies, with the most advanced students replicating the most advanced and modern studies.


    2. Good idea. They would still have to be funded and probably work in guided teams in order to replicate what some of the high end experimenters do. But at least that idea is a move in the right direction.

    3. I thought that might be a solution as well but it would have to be set up so that grad students didn’t get only to replicate. I assume some of the fun of science is designing your own experiments & you want to attract the brightest and most motivated people so you don’t want to completely turn them off with a long process of only replication.

      1. Tim Vines at #21 makes a distinction between reproduction and replication that I think would be very beneficial. Earlier in education, you’d want to reproduce experiments — carefully follow the exact same experiment exactly as described. This would help perfect the craft, and master the techniques involved. Later, you’d want to replicate the experiments — to come up with a different method of testing the same phenomenon. That would require creativity in methodology, but not in theory. Finally, you’d want to blaze your own trails.

        And, of course, trailblazing should be heartily encouraged even in the youngest and most inexperienced of students. It’s where a lot of the fun is at, and kids do sometimes strike it lucky and find something new. But you still need to learn your scales and arpeggios….


  17. I’m pleasantly surprised; who doesn’t remember the scientists at Amgen who tried to replicate more than 50 key studies in cancer research and could only replicate 6?

  18. It’s worth distinguishing between reproducing an experiment and replicating it: the former is what the OSC did here (same hypothesis, try to get same experimental conditions), the latter is testing the same idea under different circumstances to assess its generality. (Roger Peng wrote about this here http://simplystatistics.org/2013/08/21/treading-a-new-path-for-reproducible-research-part-1/)

    There’s not much work in ecology/evolution that attempts to reproduce famous results, but we do seem to do a lot of replication work, with different organisms, geographic locations etc. I think this is more important than reproducibility, because failure to replicate implies that the original result isn’t a general phenomenon and we can move on to other questions. It doesn’t actually matter whether that failure to replicate is because the original study was based on a statistical fluke or that the phenomenon only exists under those exact conditions.

    1. Agreed – I would add that in theory-poor areas determining what *could count* as a replication in the sense you are introducing seems to be (especially) difficult. In a theory rich area one could look for appropriate boundary conditions and so on – especially if mathematicization is available.

  19. Jerry, allow me to offer a few corrections:

    In the eighth paragraph you write, “…and the effects that were found, even if nonsignificant, dropped to about half the size….” I think you meant “even if significant.”

    The first sentence in the paragraph immediately preceding the first figure says that the figure shows the correlation between p-values in the original and replicate papers. It doesn’t, since the axes are effect sizes (transformed to correlation coefficients), not p-values. Also, in this paragraph you say that all the effects in the original studies were statistically significant. In fact, three of the original effect sizes were nonsignificant.* Believe it or not, the color of the border of each dot, which may differ from the color of the interior, shows whether the original effect size was significant. [Finding the three dots representing nonsignificant original effects is left as an exercise. 🙂 ]

    * Same comment regarding the second sentence in the paragraph immediately following the second figure.

  20. Yea. The variables in psychology are all over the place. And the underlying assumptions are more assumed, in some ways not even theoretical.

    On more of a cognitive science side, we can task the percentage of people who see a checker board shadow illusion, or something similar. There is good reason to think that such visual processing may repeat across all humans, and the test will be repeatable and tell us something robust about visual processing. Problems like that should be repeatable and deliver useful information.

    If we try to task people with studying shadows intensely for two months, and then ask about this task’s effect on depressive states 12 months from here on, you run into endless confounding variables. And it is questionable what you will have learned even if you find a meaningful result.

    Those are bad examples. But much of psychology is using underlying assumptions and theories and trying to connect those beliefs with experimental findings in ways that no other science is working within. And that includes most animal behavior studies. To me, the idea of repeatable experimentation to hone theories makes more sense when we have narrowed down theories to acceptable degrees. What we believe is happening in the brain/mind, that is the kinds of things we think we are holding steady and what an experiment is supposed to show, is too multifaceted and untheorized to make heads or tails of the result in many of these cases.

    For those of us who believe free will should be tossed, and that society and theories will eventually write it out of belief, the Vohs study will become unthinkable. Such a test, or something similar, will necessarily be unrepeatable in those subjects. People within such a society will not be able to make heads or tails of the experimental design. Which goes to show that psychology studies that rely on certain contingent knowledge/characteristics of their subjects, and use that knowledge to test things about our psychology in general, are already on some bizarre ground.

  21. “… significant findings aren’t often repeated, for doing so takes hard-to-come-by money and a concerted effort— an effort that isn’t rewarded. (You don’t get much naches or professional advancement by simply repeating someone else’s work.)” I read somewhere long ago that instead of awarding doctorates for “new” contributions to the field, grad students should be assigned to repeat prior studies and try to reproduce experimental results – that this would be of more use to both the students and their field of study than all the papers written on ever more obscure points of interest. It might also free up some grant money for truly new research. (Not that I know anything about such things – never got to grad school.)

    1. Makes a hell of a lot of sense to this undergrad, too. A really huge problem, at least in the US, is paying for it. With Congress shrinking the NIH budget (responsible for the overwhelming amount of health science work) year after year, and new grants getting accepted only if they are “innovative” (i.e. “new”), it would seem to me to be an excellent use of grad students, especially with the current glut of students we have now, plus the sometimes shameful ways PhDs tack their name onto grad students’ work. It seems to me that PhDs should be pulling more research weight instead of merely padding their CVs on the sweat of the students in their team. But I’m a commie.

  22. Many psychology experiments can be very interesting. They may not ever lead to a particular result that can be reproduced in a meaningful way, however that does not mean we should do a full stop on psychology experiments. Too many of them are interesting, even if they are kind of silly.

    On the other hand, hard sciences and mathematics and computer science can have extraordinarily reproducible results that are extremely boring. Are these any better or worse?

    Reproducibility is not necessarily the most important thing in science. Making discoveries and/or opening up avenues of inquiry even if hypothesized incorrectly can sometimes be much more important and motivational.

  23. Nice post. Some responses gleaned from Twitter that I’ll add here:

    (Author of Science report) @BrianNosek said: .@Evolutionistrue, on your last point, @OSFramework hosting mtg for ecology and evobio in Nov & @BrunaLab has RepProj. proposal for trop eco

    More details on the latter, from Emilio Bruna (fellow Davis trainee) in this Biotropica blog post on “Reproducibility & Repeatability in Tropical Biology: a call to repeat classic studies” http://biotropica.org/reproducibility-repeatability/

    1. I’d say it depends on a host of elements: sample size, plausibility, how critical the question(s) is/are to other research that depends on it, and whether the phenomena can be expected to vary across different persons/places/times, for starters.

      1. You can have optimum sample size and the rest by aggregating the findings of different studies that test the same hypothesis, aka meta-analysis (MA). Thereby was Gene Glass (circa 1977) able to answer one of the great questions of the ages: Does psychotherapy work? There may have been 500 findings in the MA, not subjects, that produced a healthy aggregated effect size, just about proving once and for all that psychotherapy does indeed work!
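
        For readers unfamiliar with how findings are aggregated, here is a minimal Python sketch of one common textbook approach, fixed-effect inverse-variance pooling. It is not necessarily the procedure Glass used, and the effect sizes and standard errors are made-up numbers for illustration only.

        ```python
        import math


        def pool_fixed_effect(effects, std_errors):
            """Pool per-study effect sizes with inverse-variance weights.

            Returns (pooled effect, standard error of the pooled effect).
            """
            weights = [1.0 / se ** 2 for se in std_errors]
            total_weight = sum(weights)
            pooled = sum(w * d for w, d in zip(weights, effects)) / total_weight
            pooled_se = math.sqrt(1.0 / total_weight)
            return pooled, pooled_se


        # Hypothetical findings: (effect size, standard error) for four studies.
        effects = [0.30, 0.10, 0.50, 0.25]
        std_errors = [0.20, 0.25, 0.30, 0.15]

        pooled, pooled_se = pool_fixed_effect(effects, std_errors)
        print(f"pooled effect = {pooled:.3f} +/- {1.96 * pooled_se:.3f} (95% CI half-width)")
        ```

        More precise studies (smaller standard errors) get more weight, and the pooled standard error is smaller than that of any single study, which is where the "optimum sample size" comes from.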

        1. This isn’t to negate your point, Dave — it is a good one. What I am pointing out, though, is that a really good MA should be exhaustive – using all such studies, both neg & pos results. And we know that, for the most part, neg results (esp. in social sciences) are not only not getting published, but not even systematically registered as having been done. So trying to control for the bias is a crapshoot at best, IMO. http://www.ncbi.nlm.nih.gov/pubmed/17275251

          1. I agree with you totally. I thought that meta-analysis is relevant in the present context and should be put on the table for discussion. I do not really believe that Glass’s MA answered the question once and for all of psychotherapy’s effectiveness.

    2. It is a good start to push down the problem.

      If it is used (relied on) massively (which isn’t what Jerry proposes) as a sole means while the basic structural problems remain, it will ultimately have its own data fishing problem, for one.

  24. Thank you, very useful!

    My immediate response is if reproducibility had been a problem for science in general, it wouldn’t have gotten started. My impression is a problem in biological sciences is that many effects are weak and/or multicausal. Ideally people would look at effect sizes too.

    Here is what I think I have learned from experience:

    – When you do the form of statistical physics which is called “chemical reactions” in physics, it is often reaction rate limited surface reactions (growing surfaces such as crystals or films from gas phases), and hence the overall uniformity can be pushed to less than 0.1 %.

    – Wet chemistry is most often diffusion rate limited by the liquid medium, so you immediately get 1-2 orders of worse uniformity.

    – Biology is … well, many more factors added. Mostly unavoidably you have individuals, statistics of crowded cellular environments, couplings of many genes (which of course push structure and control onto the system too), et cetera. It is awesome that we can even discuss uniformity in simple cases of antibody reactions, say!

    “That cutoff, of course, is arbitrary, and is lower in areas like physics, which, for experiments like detecting the Higgs boson, drops to 0.00001.”

    The 5 sigma choice in detectors is most often a choice to mitigate the “look elsewhere” effect of data fishing when you don’t know the energy of the sought for particle. (Such as in the Higgs case, it had an initial large range of possible energies.) When the particle is found, initial observations correspond to having 3 sigma.

    In astronomy, where phenomena can be unique and the search space is … astronomical (if not genomical) … I hear they want 7-9 sigma to make sure they aren’t seeing smudges on the telescope.

    1. In other news, I read somewhere of a new statistical method that alleviates data fishing somewhat if it works.

      They treat, computer science fashion, data as “a resource”. That way they can tell if searches (in say medicine) run out of independent experimental data when trying to fit hypotheses (test experimental drugs).

  25. Although it wouldn’t ‘solve’ the problem, a different setting for ‘p’, say ‘p < 0.01’, even if run alongside the traditional ‘p < 0.05’, would add an extra check on the results.

    Or perhaps the calculated value of ‘p’ should be included in all results?

    I can’t help wondering if scientists ‘relax’ once they feel they have achieved ‘p < 0.05’ when perhaps they should have continued a little longer. I know I did.

    1. That illustrates a problem with p (data) fishing, or unblinded experiments in general. You ought to preserve a sufficient “black box” data set after method development and method testing, and then only do the data analysis once. That would remove temptation to “continue a little longer”.
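
      The effect of stopping as soon as ‘p < 0.05’ appears can be sketched with a small simulation. This is only an illustration under stated assumptions: there is no real effect, the data are standard normal with known variance, and the function names, peek schedule (every 10 observations up to 100), and seed are all hypothetical choices.

      ```python
      import math
      import random


      def z_test_p_value(sample):
          """Two-sided p-value for 'mean is zero', assuming unit variance."""
          z = sum(sample) / math.sqrt(len(sample))
          return math.erfc(abs(z) / math.sqrt(2))


      def false_positive_rate(peek, trials=2000, max_n=100, step=10,
                              alpha=0.05, seed=1):
          """Share of null experiments declared significant.

          peek=True : test after every `step` observations, stop at first p < alpha.
          peek=False: analyse once, after all `max_n` observations.
          """
          rng = random.Random(seed)
          hits = 0
          for _ in range(trials):
              data = []
              significant = False
              for _ in range(max_n // step):
                  data.extend(rng.gauss(0.0, 1.0) for _ in range(step))
                  if peek and z_test_p_value(data) < alpha:
                      significant = True
                      break
              if not peek:
                  significant = z_test_p_value(data) < alpha
              hits += significant
          return hits / trials


      print("analyse once:      ", false_positive_rate(peek=False))
      print("stop when p < 0.05:", false_positive_rate(peek=True))
      ```

      Analysing once keeps the false positive rate near the nominal 5%, while repeated peeking pushes it well above that, which is the temptation the single “black box” analysis is meant to remove.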

  26. We shouldn’t sell biology short in the process of criticism. Here is one famous repeat:

    “However, failure to replicate the experiment and criticism of Kettlewell’s methods by Theodore David Sargent in the late 1960s led to general scepticism. When Judith Hooper’s Of Moths and Men was published in 2002, Kettlewell’s story was more sternly attacked, accused of fraud, and became widely disregarded. The criticism became a major argument for anti-evolutionists. Michael Majerus was the principal defender. His seven-year experiment since 2001, the most elaborate of the kind in population biology, the results of which were published posthumously in 2012, vindicated Kettlewells’ works in great detail. This restored the peppered moth evolution as “the most direct evidence”, and “one of the clearest and most easily understood examples of Darwinian evolution in action”.[5]”

    [ https://en.wikipedia.org/wiki/Peppered_moth_evolution ]

    [And, I note, the successful repetition increased the value of the experiment. It should be possible to use that against funding agencies.]

  27. “The 5 sigma choice in detectors is most often a choice to mitigate the “look elsewhere” effect of data fishing when you don’t know the energy of the sought for particle. (Such as in the Higgs case, it had an initial large range of possible energies.) When the particle is found, initial observations correspond to having 3 sigma.”

    I know that the so-called look-elsewhere effect increases the false positive probability; however, are you sure that the result is all the way down to 3-sigma? That’s around a 1-in-700 chance of a false positive, 1-tailed, under the null hypothesis, and sounds a couple orders of magnitude short of a Nobel prize to me.
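
    For reference, the conversion from a sigma threshold to a one-tailed false positive probability can be checked with a few lines of Python. This sketch assumes a normally distributed test statistic under the null hypothesis; the function name is just an illustrative choice.

    ```python
    import math


    def one_tailed_p(sigma):
        """Upper-tail probability of a standard normal beyond `sigma`."""
        return 0.5 * math.erfc(sigma / math.sqrt(2))


    for sigma in (3, 5):
        p = one_tailed_p(sigma)
        print(f"{sigma} sigma: p = {p:.3g}, about 1 in {1 / p:,.0f}")
    ```

    Under that assumption, 3 sigma works out to roughly 1 in 740 and 5 sigma to roughly 1 in 3.5 million.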

Leave a Reply