At the beginning of September I wrote about a paper in Science produced by a large group called “The Open Science Collaboration.” That paper reported the repeatability of 100 papers whose results were published in three prestigious psychology journals. My brief summary of the conclusions is below, though my original post gave a lot more data:
- Only 35 of the original 100 experiments produced statistically significant results upon replication (62 did not, and three were excluded). In other words, under replication with near-identical conditions and often larger samples, only 35% of the original findings were judged significant.
- That said, many (but not nearly all) of the results were in the same direction as those seen in the original studies, but weren’t large enough to achieve statistical significance. If the replications had been the original papers, most of them probably wouldn’t have been published.
Now this doesn’t mean that the original studies were wrong, for of course the replications could have produced the wrong answer. But given that the replication studies generally used larger sample sizes than did the original work, it suggests that there are endemic problems with the way science is adjudicated and published. My suspicion is that the main cause is a bias toward publishing positive rather than negative results, combined with “p-hacking”: looking for those statistical analyses that give you probability values lower than the cutoff needed to reject the null hypothesis, or a tendency to collect data only up to the point where your “p” values become significant, and then stopping and writing a paper.
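To make the optional-stopping point concrete, here is a minimal simulation sketch (my own illustration, nothing from the Science paper): it repeatedly "runs a study" on pure noise, peeking at the p-value as the sample grows and stopping as soon as it dips below 0.05.

```python
# Illustrative sketch only: how often does a true null hypothesis come out
# "significant" if we peek at the data and stop as soon as p < 0.05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_trial(max_n=200, check_every=10, alpha=0.05):
    """Simulate one study on pure noise, testing after every batch of data."""
    data = []
    while len(data) < max_n:
        data.extend(rng.normal(0.0, 1.0, check_every))  # no real effect exists
        if len(data) >= 20:
            _, p = stats.ttest_1samp(data, 0.0)
            if p < alpha:
                return True   # stop collecting and "write the paper"
    return False

n_trials = 2000
hits = sum(peeking_trial() for _ in range(n_trials))
# With a fixed, pre-specified sample size this rate would be about 5%;
# optional stopping typically pushes it well above that.
print(f"False-positive rate with optional stopping: {hits / n_trials:.1%}")
```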
The Science paper set off a flurry of self-scrutiny and self-recrimination as scientists began to worry that, even if they’re not psychologists, the problem may cut across disciplines—and probably does. I’ve long thought that studies in ecology and evolutionary biology, particularly field work or experimental work that doesn’t involve DNA sequencing (sequencing data are easily replicated), may also be largely unrepeatable, both for the reasons given above and because field and lab results may be particularly sensitive to experimental or environmental conditions. Next month I am in fact going to a meeting of biologists and editors to address the issue of replication in my field.
In the meantime, lots of articles have come out highlighting the problem. Before people conclude that the replication problem is a big problem for all of science, I’d suggest that some fields, particularly molecular biology, may be largely immune, because a) their results are easily replicated, and b) those results are very often the building blocks for future work, so researchers not only have the incentive to get their results right, but their studies will automatically be replicated as a first step in other people’s followup work. Deciphering the DNA code, the subject of Matthew Cobb’s new book, for instance, automatically involved other people calibrating their systems using the codons worked out by earlier researchers. Science journalists should realize this before sounding a general alarm.
That alarm, however, was sounded by writer Regina Nuzzo in an article in Nature about replication called “How scientists fool themselves, and how they can stop.”
The summary is shown in her diagram of the problems and possible solutions.
Here’s my take on Nuzzo’s analysis, which is by and large pretty good, and on her solutions, which are somewhat problematic but still worth considering.
THE PROBLEMS
The first line, “cognitive fallacies,” is pretty self-explanatory. It’s simply doing experiments that are contaminated by confirmation bias: neglecting alternative hypotheses and data inimical to your favored hypothesis. That is the epistemic method of most religions.
The “Texas sharpshooter” problem involves, among other things, p-hacking, and also doing a gazillion different tests on a diversity of data, some of which will be significant by chance alone, and then seizing on those as your publishable results.
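As a toy illustration of that "gazillion tests" problem (again my own sketch, not anything from Nuzzo's piece), here is what happens when you run a hundred unrelated comparisons on pure noise and keep only the "hits":

```python
# Run many independent tests on data with no real differences, then report
# only the ones that came out "significant". Purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n_per_group = 100, 30

hits = []
for i in range(n_tests):
    a = rng.normal(0.0, 1.0, n_per_group)  # group A: no true effect
    b = rng.normal(0.0, 1.0, n_per_group)  # group B: same distribution
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        hits.append((i, round(p, 3)))

# Roughly five of the hundred null comparisons will look "significant" by
# chance; seizing on just those is drawing the target around the bullet holes.
print(f"{len(hits)} of {n_tests} null tests were 'significant': {hits}")
```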
“Asymmetric attention” is self-explanatory. Nuzzo gives two examples:
A 2004 study observed the discussions of researchers from 3 leading molecular-biology laboratories as they worked through 165 different lab experiments. In 88% of cases in which results did not align with expectations, the scientists blamed the inconsistencies on how the experiments were conducted, rather than on their own theories. Consistent results, by contrast, were given little to no scrutiny.
In 2011, an analysis of over 250 psychology papers found that more than 1 in 10 of the p-values was incorrect — and that when the errors were big enough to change the statistical significance of the result, more than 90% of the mistakes were in favour of the researchers’ expectations, making a non-significant finding significant.
“Just-so storytelling”, in which you make up a story post facto to explain your results, seems to me less of a problem. If your “story” is simply something you say in the discussion to rationalize a result you didn’t expect, well, other readers (and presumably the reviewers of the paper) should catch that, and realize that it’s just a rationalization. Things become more serious if you pretend that the result was your initial hypothesis and then “confirm” it with the data, which is an inversion of your scientific history and basically dishonest. But while I see a lot of the former tactic in evolution, it’s not a big problem, for we all know when somebody’s grasping at straws or rationalizing. The latter problem I haven’t seen—but we wouldn’t see it anyway unless the “hypothesis” being tested was not obviously plausible a priori.
THE SOLUTIONS
Nuzzo offers four solutions, one of which is already in play and another that seems unrealistic. Two are feasible.
“Devil’s advocacy,” considering and testing alternative hypotheses, is part of all science, and should be ingrained in every researcher. “How might I have gone wrong?” is a question all good scientists ask themselves, and then we test to see if we’ve erred. Now some people don’t do that, but they’re often caught by reviewers of their papers or grants, who ferret out the hypotheses that are neglected. That’s why every paper should have at least two good reviewers familiar with the field, and grants should have at least four people who scrutinize proposed research. This issue doesn’t seem to be a big problem unless journals and funding agencies do a sloppy job of reviewing papers and proposals. At least in the US, the two major granting agencies (NSF and NIH) are very careful in vetting proposals, using in part a “devil’s advocacy” approach.
“Team of rivals”, getting your scientific opponents to collaborate with you in hopes that opposing views will help bring out the truth, is in principle a good idea but will rarely work in practice, at least in my field. For one thing, there would be authorship fights: who gets the credit? Also, who wants to drop their research to work on somebody else’s problems? Anyway, Nuzzo gives one example of such a collaboration in psychology, which didn’t appear to work so well.
In “blind data analysis”, you shift your real data around or even add made-up data, and then do all the analysis on several “blind” data sets—which aren’t fully blind, since most researchers know their data, and the procedure also involves removing outliers, which you can often recognize as well. Then, when you’re satisfied that you’ve done the analysis as you wanted, you lift the blind. (This is sort of like John Rawls’s “veil of ignorance,” where you make up moral rules for society without knowing which position you’ll eventually occupy in that society.) This method will work in some situations but not others, for, as I said, you’re often familiar with your real data. At any rate, it did work in one study:
[Astrophysicist Saul] Perlmutter used this method for his team’s work on the Supernova Cosmology Project in the mid-2000s. He knew that the potential for the researchers to fool themselves was huge. They were using new techniques to replicate estimates of two crucial quantities in cosmology — the relative abundances of matter and of dark energy — which together reveal whether the Universe will expand forever or eventually collapse into a Big Crunch. So their data were shifted by an amount known only to the computer, leaving them with no idea what their findings implied until everyone agreed on the analyses and the blind could be safely lifted. After the big reveal, not only were the researchers pleased to confirm earlier findings of an expanding Universe, Perlmutter says, but they could be more confident in their conclusions. “It’s a lot more work in some sense, but I think it leaves you feeling much safer as you do your analysis,” he says. He calls blind data analysis “intellectual hygiene, like washing your hands”.
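For readers who want the flavor of that offset trick in code, here is a toy sketch (my own, not the Supernova Cosmology Project's actual pipeline): the analysts only ever see data shifted by an amount known to the machine, and the shift is removed once the analysis is frozen.

```python
# Toy illustration of offset blinding: do the whole analysis on shifted data,
# then remove the secret shift only at the end.
import numpy as np

rng = np.random.default_rng(42)
true_measurements = rng.normal(loc=0.7, scale=0.1, size=500)  # hypothetical quantity

secret_offset = rng.uniform(-1.0, 1.0)        # known only to the computer
blinded = true_measurements + secret_offset    # what the analysts actually see

# ...all cuts, outlier handling, fitting and error estimation happen on `blinded`...
blinded_mean = blinded.mean()
std_error = blinded.std(ddof=1) / np.sqrt(len(blinded))

# Only after everyone agrees the analysis is final is the blind lifted:
final_estimate = blinded_mean - secret_offset
print(f"Unblinded result: {final_estimate:.3f} ± {std_error:.3f}")
```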
The final method, transparency, has promise but also problems. It comes in two forms:
[Form one] Another solution that has been gaining traction is open science. Under this philosophy, researchers share their methods, data, computer code and results in central repositories, such as the Center for Open Science’s Open Science Framework, where they can choose to make various parts of the project subject to outside scrutiny. Normally, explains Nosek, “I have enormous flexibility in how I analyse my data and what I choose to report. This creates a conflict of interest. The only way to avoid this is for me to tie my hands in advance. Precommitment to my analysis and reporting plan mitigates the influence of these cognitive biases.”
I’m fully in favor of this: all data used in a paper, along with the methods of analysis and the experimental protocols, should be available to researchers INSTANTLY after a paper is published. This has long been the custom in the Drosophila community: it’s unthinkable not to share data, or even laboriously constructed genetic stocks, with colleagues and rivals. This is being done already by many scientists, and is mandated by some journals. Other journals, however, either don’t require such data storage or impose a year’s moratorium on it so you can milk your data for more papers before others get their hands on it. I don’t favor the moratorium; it’s just too bad if other people use your published data to their own ends.
Form one could, however, be construed as sharing your aims, methods, and analyses BEFORE you get your data, and that’s just not on. It makes you vulnerable to intellectual theft, and no researcher wants to do that. This is why grant proposals are strictly confidential, and why no grant reviewer is allowed to lift ideas from grants they’ve reviewed. It’s also why reviewing papers is confidential.
[Form two] An even more radical extension of this idea is the introduction of registered reports: publications in which scientists present their research plans for peer review before they even do the experiment. If the plan is approved, the researchers get an ‘in-principle’ guarantee of publication, no matter how strong or weak the results turn out to be. This should reduce the unconscious temptation to warp the data analysis, says Pashler. At the same time, he adds, it should keep peer reviewers from discounting a study’s results or complaining after results are known. “People are evaluating methods without knowing whether they’re going to find the results congenial or not,” he says. “It should create a much higher level of honesty among referees.” More than 20 journals are offering or plan to offer some format of registered reports.
This method, in which the methods and aims are confidential, seems to be gaining popularity. It’s like giving a grant proposal to a journal before you do the experiment, and if they approve of the analysis and methods, they’ll accept the paper no matter how the results turn out. That’s fine, but there are some caveats. First, during a study new experiments often crop up that you haven’t planned, and often those are the most exciting ones. (This has often happened to me.) How do you deal with those? I don’t see how.
Second, this deals only with the analysis of data; it doesn’t deal with the importance of the results. But journals won’t publish just any soundly executed paper regardless of the results: every scientist knows that there are some highly visible “top-tier” journals where publication can make your career (e.g., Science and Nature), and journals like that aren’t going to publish on just any subject, as they specialize in “important” results. (Journals like PLOS ONE, which consider only methods and not importance, are venues for all sorts of work, important and not-so-important.) For regular journals this means that there has to be pre-vetting not just of the methods of research and how data are analyzed, but of whether or not the problem is interesting. This doesn’t seem to be considered in any such proposals for “transparency,” but I haven’t read them all.
Scientists will be chewing over this problem in detail over the next year or so, and it is indeed a problem, though more so for some areas than for others. The solutions, however, vary in quality and depend on the field. The “transparency” method seems to be gaining popularity, but to me it seems the least practical of all solutions.
h/t: Ben Goren

Not enough attention is being paid to one of the major culprits, the whole idea of testing a null hypothesis by p-values. Many problems would be resolved by replacing p-values with meaningful measures of the actual magnitude of the observed effect, with confidence intervals expressing the uncertainty. It is truly shameful that so much of science is based on this methodology. There are some questions for which it is the correct tool, but these are very rare.
“this methodology” being the null-hypothesis-testing model.
I know that some of those opposed to the idea of human-made Climate Change have latched onto this to condemn Science in toto.
Yes! The mistaken idea that a single number captures the essential meaning of a result. Given a large enough sample, extremely small differences can be found to be statistically significant.
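To put a number on that point, here is a small sketch with invented data (not from anyone's study): the p-value looks impressive, while the effect size and confidence interval show that the difference is trivial.

```python
# With a huge sample, a trivially small true difference becomes "significant";
# the effect size and confidence interval make its irrelevance obvious.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 1_000_000
control = rng.normal(100.0, 15.0, n)
treated = rng.normal(100.1, 15.0, n)   # true difference of 0.1 on a scale with SD 15

_, p = stats.ttest_ind(control, treated)
diff = treated.mean() - control.mean()
se = np.sqrt(control.var(ddof=1) / n + treated.var(ddof=1) / n)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"p = {p:.2g}  (looks like a big deal)")
print(f"effect = {diff:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})  (clearly negligible)")
```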
And then there’s the data dredging/p-value trolling.
THIS! Missed this comment first time around. In molecular and cellular biology, at least, understanding of statistics is woeful, with p-values frequently misinterpreted and abused. Power calculations and effect sizes are rarely considered in advance (in my experience). In computer science it’s even worse, and often results are published as pure numbers, with no estimate of error at all! (Area under a ROC curve, for example.)
I consider my own statistical training and awareness to be quite poor – although I do at least recognise this fact and take extra care accordingly. I only really realised the superiority of effect sizes plus confidence intervals over p-values whilst preparing to give our third years an intro to statistics a few weeks ago.
I wonder whether this might be solved in part if every paper underwent a separate statistical review – but statisticians who really understand biology are like gold dust.
Of course, the beauty of science is that it is self-correcting, and the errors are outed eventually. It is just a shame that so much time, money and effort is being used less efficiently than it could be.
It’s certainly a contributor, or to be more precise, the fact that it is so easy to misuse is a contributor. Any method of deciding whether a result is “genuine” or not will have false positives, and at least hypothesis tests explicitly quantify type I errors.
However, a 5% test is just a 5% type I error for a single test. If you do lots of tests you need to correct for this, e.g. by Bonferroni or similar, to get an overall type I error of 5% (a small sketch of the correction appears after this comment). This doesn’t seem to be done often enough, and of course isn’t done when it’s different sets of people doing the tests. Except in meta-analysis or in particle physics, where the use of 5 sigma for “significance” is in effect doing just this.
If I can I always teach confidence intervals first, and then hypothesis tests as an adjunct to them.
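Here is the small sketch of the Bonferroni adjustment referred to above (example p-values invented):

```python
# Keep the family-wise type I error at 5% across m tests by comparing each
# p-value to 0.05/m rather than 0.05. (The particle-physics 5-sigma convention,
# roughly p < 6e-7 two-sided, plays a similar role.)
def bonferroni_survivors(p_values, alpha=0.05):
    """Return which of the p-values remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

p_values = [0.001, 0.012, 0.030, 0.049]   # four "significant-looking" results
print(bonferroni_survivors(p_values))      # [True, True, False, False] at 0.05/4 = 0.0125
```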
Let’s hope care is taken not to stifle or inhibit research. The “registered reports” idea seems like it could easily move in that direction.
Interestingly, no one’s mentioned pressure to publish yet.
Pressure to publish, and general pressure to be someone who gets results – even if those results are specious.
I saved a great comment I found on reddit a while back that sums up the problem very well.
Sadly, this comment rather reflects my research experience during my PhD.
By Rockthem1s from Reddit:
“Sadly, this is what happens in a publish-or-perish academic research environment.
Overreaching and handwaving by funding-starved PI’s is quite common in my field (Structural Biology).
Once the project is funded, it falls on to the post-docs and grad students to validate the ideas. More often than not, it takes 3-6 months in my field to get a workable biological system up and running for characterization.
“Get results” begins to take precedence over “Do it right” and favouritism sets in rather fast, as anyone bringing in positive results is seen as “someone who can get the job done”. Their ideas are pushed and their voices get heard more often. However, many of these positive results are hollow, and have massive failure rates.
Optimization is meticulous, and requires time and a true scientific mind. Unfortunately some PI’s see this as a waste of time. Anyone approaching their projects by meticulously having all the variables in an experiment controlled, doesn’t have positive results to report at their weekly group meeting. This instantly is seen as “making excuses” and said person becomes “unreliable”.
Some PI’s truly don’t care and will publish results that are based on a 10% success rate because they don’t report on the number of failed experiments, just the ones that worked.
This is a huge problem, and fundamentally plagues reproducibility in the end.”
That’s how I remember a few labs from when I was in academia. Nice quote!
Even if you have pressure to publish, it still seems like you have to be guilty of some other process error in order to convince yourself that you have something to publish.
True, but the pressure and the culture of “getting the right results” pushes people in the direction of making those mistakes – mostly subconsciously.
I recall this being brought up at both ACS and AAAS meetings many years ago, in both (or all) meetings related to private sector pharmaceutical research. So yeah, not a new problem, though maybe new in psychology. I’ll even bet that some of the impetus to study psychology experiments came from concerns in other areas of science.
I believe the one main suggestion for fixing the problem in the pharmaceutical research area was Nuzzo’s “Pre-commitment.” Specifically, the suggestion was that the FDA should require corporations to pre-register any trial studies they want to count towards an efficacy or safety determination. Those results would then have to be reported regardless of whether they were positive or negative, and no non-registered experiments would be counted. The goal here was to prevent ‘study farming’: the practice of performing multiple efficacy or safety studies until one of them comes up positive, then only handing that one result over to the FDA for evaluation.
As far as I know, this suggestion has never been implemented.
I thought some sort of voluntary procedure such as you suggest was implemented?
And did I read that other nations already do such a thing?
I think we could get around this problem fairly easily, though it might take some rearranging of journal content. If a journal like Science is going to “pre-approve” publications based on a review of a proposal, it would be relatively easy for them to dedicate a few pages to abstracts for those pre-approved experiments that didn’t pan out. You could pre-approve 10x the number of experiments you ever expect to publish, for example, with the expectation that about 9/10 will only get abstract-level coverage after they are completed.
This would be quite useful to the community, IMO, as it would inform other researchers in the field of all the various methods different people are trying. After all, maybe your methodology was innovative and quite sound, and got a negative result simply because of the details of your subject matter. But with it published as an abstract, I can now read about that method and use it on a different problem, where it might work better.
Physics or engineering demonstration experiments do not stand out as obviously fitting this mold: either the measurement is correct and the technology works, or it does not. Medicine and biology are much more likely to involve complex systems in which the outcomes are statistically tied to variables that an experimenter simply cannot control.
Research that attempts to make sense of complex systems is likely to have reproducibility issues. This is something we just need to learn how to deal with.
If only it were that simple… When you try to describe new phenomena (test models or other hypotheses) there isn’t always an obvious “works”/“doesn’t work”. Consider, for example, how Einstein’s GR was tested. [ https://en.wikipedia.org/wiki/Tests_of_general_relativity#Deflection_of_light_by_the_Sun ]
Indeed, GR is a great example of how science should be viewed: long-term success. Granted, most scientists need to get next year’s NSF, NASA, or DOE grants, and I am OK with a little sexy exaggeration, but in the long run, mischief in science is worked out.
If I raise a NIST Al+ ion clock one meter higher in Boulder, CO, you can bet (pragmatic, real money) that the redshift from Earth’s gravity has to be accounted for to reach the clock’s precision of nearly 1 part in 10^18…and it is good ole GR that does the job. It may not be the final word on gravity (and likely is not), but whatever outdoes GR has to be damn good.
Simple in physics/engineering, I mean. (And for the latter, ask software developers if they are finished/trust their ware/how many bugs it is infested with. =D)
I agree with the rest! Also, if there are thousands of genes that affect, say, body length, couldn’t effect sizes be inherently low for each factor?
Reproducibility and peer-review are the two great pillars of science.
The failed replications of some scientific studies are simply yet another reminder that we should never get too excited about the outcome of one lone study.
I think that was good news! Theoretically the error rate (with the non-replication rate as a maximum) could have been well over 50%; now it is boxed in to likely be less than 30%. And that in the worst possible science we can do. (E.g., I read yesterday that professional psychologists do not improve with experience, so diagnoses do not get better after their training years. They blame stress, complexity of the work, et cetera.)
So we can guess physics has an error rate in the percent range, and as we go to messier sciences it moves into the lower tens of percent range.
So a lot of work to do, but at last science is observing itself and there is harm but not deadly damage.
I don’t think that is how it works. In particle physicists’ blind analysis they insert a bias that makes the data unrecognizable (working “blind”). The bias is removed after the rest of the analysis is finished.
Isn’t that what your example describes? Maybe I misunderstand your description.
One method we’ve used in statistical genetics is to hold one variable constant (say, expression at one locus) and allow all others to vary, then re-do the analysis. Repeat this X number of times (X depends on computational capacity) for each of Y variables (which can be in the thousands, so the caveat to X applies). This helps to tease out the real factors at play, since only those that are real will show up in the analysis.
It also has the potential to hint (at least) at any confounding variables. Using this kind of approach in a large-scale F2 mouse cross looking at gene expression traits, we discovered that the position of the animals’ cages in the vivarium mattered (males housed near females, cages near the door or vents, etc.), as well as who did the necropsy. This “blindedness” allowed us to account for those variables.
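For readers unfamiliar with this kind of analysis, here is a generic permutation-test sketch in the same spirit (invented data and names, not the commenters' actual pipeline): test one candidate variable against the trait, then rerun the analysis many times with the candidate shuffled to see how strong an association arises by chance.

```python
# Generic permutation test: compare the observed association with the
# distribution of associations obtained after shuffling the candidate variable.
import numpy as np

rng = np.random.default_rng(3)
n = 200
expression = rng.normal(size=n)                  # hypothetical locus expression
trait = 0.4 * expression + rng.normal(size=n)    # trait with a modest true effect

def assoc(x, y):
    """Absolute Pearson correlation as a simple association score."""
    return abs(np.corrcoef(x, y)[0, 1])

observed = assoc(expression, trait)

n_perm = 10_000
null = np.array([assoc(rng.permutation(expression), trait) for _ in range(n_perm)])
p_perm = (null >= observed).mean()   # fraction of shuffles at least this strong

print(f"observed |r| = {observed:.2f}, permutation p = {p_perm:.4f}")
```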
Yes. +-35% is very good.
Reminds me of the numbers in top cancer-research:
“A few years ago scientists at Amgen, an American drug company, tried to replicate 53 studies that they considered landmarks in the basic science of cancer, often co-operating closely with the original researchers to ensure that their experimental technique matched the one used first time round. According to a piece they wrote last year in Nature, a leading scientific journal, they were able to reproduce the original results in just six.”
From:
http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble
Two points.
First, on blinding data…for certain types of mathematical operations, there are provably cryptographically secure methods of performing computations on data such that whoever is doing the computation can’t possibly know what the data or the result is. Anything that lends itself to that sort of computation…well, obviously, it would be prudent to take advantage of it. But it’s going to have to be a large set of data that you’re working with — and, of course, it doesn’t protect against biases creeping in from the collection of the data in the first place.
Second, taking many steps back…this whole exercise is an application of the scientific method to itself. We’re experimenting with experimentation: which methods of research are more likely to produce fruitful results? This sort of meta-analysis is very powerful and could potentially lead to a rapid acceleration of the quality of science produced.
b&
Such comprehensive blinding may also interfere with the discovery and analysis of systemic errors. Sometimes you need to know what the data set really looks like to find them, and the sort of blinding you’re talking about is likely to hide any systemic biases along with the data itself. But that’s a quibble; yes data blinding is certainly a useful tool in the toolbox, so long as we use it intelligently.
I’m not enough of a mathematician nor cryptographer to be able to authoritatively contribute an answer, but it’s my understanding that you’d still be able to do many of the expected operations, such as identifying outliers or seeing if there’s any statistical significance or the like. You wouldn’t have any idea what the actual outliers were until you unblinded the data, is all.
…but do please take all that with a fist-sized grain of salt….
b&
Outliers are a random error, not a systemic one. A systemic error would be, for example, if the entire energy spectrum was shifted up by 100 keV. Or if every data point you take in the afternoon has a bias in it. If your cryptographic mixing is shifting around the absolute value of the spectra or the time stamp on the data, you’ll never see it. You’ll basically have to wait until you remove the blind, then do a second round of data analysis to check for systemic errors.
Again, this is a quibble. This one tool may not help with all types of errors, but it is still useful.
Yes, you’re right. But it might also provide an extra incentive to dot all the i’s and cross all the t’s before doing the analysis, as well as come up with additional checks and balances to prevent those sorts of errors in the first place.
I’m still excited by the fact that all these ponderings are perfectly suited to empirical analysis. Is any of this actually a good idea? Let’s try it and find out!
b&
I like the suggestion of blinding data. And as a rule, such protocols/algorithms should be made available online for people to test their data against bias. However, I think making an assertion that is not fully backed up can potentially lead to better science.
If someone publishes a fabricated result in Nature, the reality is that the claimed result may be
a) a particular observation that is probably not even important enough to reproduce
b) a result that theory already predicted and most knew would work anyhow (even if the procedure published is flawed)
c) in the long run, ignorable, but something that makes others think about how to improve upon the technique.
Overall, I am not entirely sure reviewing the way scientific investigations are done is important. Making mistakes is, in many ways, more revealing than trying to discover the most efficient way to not report potentially erroneous results.
” it’s just too bad if other people use your published data to their own ends.”
Not sure about this one, especially if you are from a smaller and less well-funded institution where you don’t have the same time to process all your data. I also wonder what the Vice-Chancellor might say when s/he sees outside papers using the data developed in his or her institution!
We run into this issue a lot in bioinformatics, which largely embraces the “transparency” approach of open source & open data etc. The generally agreed solution is to make the data itself citeable and carry more kudos – there are many different useful contributions one can make to science. Despite the systems in place that promote competition, it’s a team game. (Rather than hide your ideas and hope that no one finds/steals them, just share them in the public domain and you (might) still get some of the credit. If credit is what you seek, of course.) The real problem is how we currently reward different kinds of contribution, i.e. by giving Nature and Science papers too much weight.
“there are many different useful contributions one can make to science”
(This is not directly related to the discussion, but) Yes, including emptying the rubbish bins in the lab.
Very interesting.
Continuous improvement in methodology. Something that the theological community could learn to do.
In theology it is better to use the worst methods available. When reasoning about things that don’t exist “Continuous improvement of methodology” probably won’t help much.
I have myself seen the sharpshooter’s fallacy or data mining in action, I have seen a PhD supervisor who couldn’t accept his student’s results because they refuted his favoured hypothesis, and I once got a crucial grant proposal rejected with the reason that I couldn’t yet be sure if I would be able to “confirm” my hypothesis, so funding the project would be too much of a gamble (no joke).
But as stupid and unfair as an individual scientist’s behaviour can be, I still think that the discussion of this topic is generally overblown and sensationalist. Let’s be honest: stuff that matters, people try to build on, and then it quickly comes out if a result wasn’t sound; and stuff that doesn’t matter, well, how much of a problem is it?
Also, of course, I work mostly in an area – systematics – where I really do not see the reproducibility problem (my bad experiences above were all ecology related).
“it suggests that there are endemic problems with the way science is adjudicated and published”
And therefore Jesus is Lord, amen.
This is basically an expansion and elaboration of Richard Feynman’s famous words, “The first principle is that you must not fool yourself, and you are the easiest person to fool.”
This is a relatively simple problem to solve: just require more standard deviations to publish. For instance, the Physical Review requires that any effect be significant at the 5 standard deviation level in order to publish in that journal.
This would not work in most situations. It simply is not feasible – nor necessary – to reach this level of confidence. In biology, we tend to prefer multiple independent lines of corroborating evidence rather than one single killer experiment as in Physics. (It is impossible to boil any biological system down into such an experiment – a p-value is only as good as the assumptions being made when establishing the expected results under the null hypothesis. Biology is too messy and has too many unknowns to do this well.)
I fail to see why insisting on a higher standard of statistical significance is “not feasible”. Biology is indeed messy. That just means that sometimes your result will be “X happens Y percent of the time”. There are standard statistical techniques for calculating the confidence level in the results. Achieving 3 or 4 sigma requires larger and more expensive experiments, but you can trust the results.
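As a rough back-of-envelope sketch of that cost (my own arithmetic, not the commenter's): for a simple two-group comparison, the sample needed per group grows roughly with the square of the significance threshold expressed in sigmas.

```python
# Approximate sample size per group for a two-sided z-test at an n-sigma
# threshold with 80% power, for a standardized effect size d.
from scipy.stats import norm

def n_per_group(d, n_sigma, power=0.80):
    alpha = 2 * norm.sf(n_sigma)       # two-sided tail probability at the threshold
    z_alpha = norm.isf(alpha / 2)      # equals n_sigma by construction
    z_power = norm.isf(1 - power)
    return 2 * ((z_alpha + z_power) / d) ** 2

for sigma in (2, 3, 5):
    print(f"{sigma}-sigma threshold, d = 0.3: ~{n_per_group(0.3, sigma):.0f} subjects per group")
```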
Only if you can accurately model the system. Which you can’t. Multiple testing, non-independence of variables, confounding factors, hidden biases in your assumptions… there are lots of reasons why statistical significance and biological significance are not the same.
That’s without even considering the limited availability of biological material and cost limits. Then there’s the issue that it might be “real” in that specific scenario but is it generalisable? Better to have lower confidence in lots of situations than high confidence in one situation that may be unrepresentative.
I would argue the opposite: rather than focus on trying to achieve results we can blindly trust, we should put more emphasis on our confidence intervals. It is fine to have a 5% or 10% chance of being wrong as long as this is clear. The mistake comes that people treat “statistical significance” as something magic, which is far from the truth.
This would require increasing the funding for each grant given. Since money doesn’t grow on trees, that means fewer grants. You think that payoff is worth it? Less science for better science?
I think there is definitely a case to be made for less, better science. However, I don’t think stricter p-values are the way to achieve this. Most poor science is down to bad experimental design, failing to understand the statistics correctly in the first place, missing biases in data, or failing to correct for (sometimes hidden) multiple testing. Playing with p-value thresholds will not help.
We also need to be mindful that there are FALSE NEGATIVES as well as false positives. More stringent p-values reduce false positives but at the cost of more false negatives. Depending on the question being asked – and the availability of additional (in)validating experiments – this might be much more harmful (and certainly not less “wrong”).
One way to help would be to incentivise work that either validates or contradicts original research. I personally think we put too much stock on originality and novelty. Coming up with ideas is easy. Testing whether those ideas match reality is difficult – and more important in the long run. In the meantime, I think that every scientist is aware of this and approaches new findings skeptically. I certainly don’t trust anything I read just because it’s published. If it’s important for my own research, I will run my own little tests or reanalyses before committing too much time or effort. And if they don’t make the data available – it’s just an anecdote and not proper science.
“I personally think we put too much stock on originality and novelty.”
Indeed, that’s what gets the headlines.
I am always telling my wife (after she reads about some “you won’t believe the amazing breakthrough these scientists made!!!!”): Beware the First Study Effect, which is a part of the problems outlined here.
I have a relative who has a degree (no shit) in Ayurveda. Whenever she comes out with, “you should try X, it’s really effective for Y,” I am sorely tempted to reply, “well, sure, send me a link to the peer-reviewed literature on it and I’ll have a look.” My wife has forbidden this reply (in the interest of family harmony).
My wife tends to believe anything that the “alternative” medicine and nutrition people say.
For instance, “they” go on and on about how bad high-fructose corn syrup is; but then push “agave nectar” as a holy-istic alternative. When I asked my wife if she had investigated what was in that “agave nectar” (higher fructose than high-fructose corn syrup), she hadn’t.
Interesting conversation! I am a coauthor (with Saul Perlmutter) of the paper on blind analysis that is mentioned above (see http://www.nature.com/news/blind-analysis-hide-results-to-seek-the-truth-1.18510). I just wanted to mention that we specifically address the idea of wanting to explore your data, and we suggest that blinding using multiple decoys (plus the true results) can actually enhance exploration. But of course whether we are right about that is an empirical question.
Hmm.
Gene sequences replicable? Look at GenBank U68312.1. The author didn’t voucher the carcass her tissue sample came from, but she deposited a specimen collected with it in the USNM. I’ve compared it with the type specimen of B. punctifer. The USNM specimen is Priapichthys puetzi. I have P. puetzi cyt b sequences (plural) that don’t cluster with GenBank U68312.1. I just ran a BLAST search on it. The first match is GenBank U68312.1. The rest of the page are all Homo sapiens. Hmm.
I’m not as sure as I’d like that published gene sequences can be replicated.
There is a difference between individual mistakes on one side and systems being so noisy that experiments are sometimes hard to reproduce even if you do your very best on the other. In the case you describe there would not have been any unsolvable problem if the investigator had made the voucher from the specimen they sequenced, as we usually do.
Interesting discussion. It is not entirely unconnected with yesterday’s question about public vs private funding of research.
A major difference between public and private organisations is that public organisations are required to show accountability (and this contributes to the apparent expense of public endeavours), while private enterprises are usually required to show a profit. (It’s a generalisation.) It seems obvious to me that this strongly favours publicly funded research, since accountability (which is sort of what we are discussing here) is a positive force for science, while competition and the profit motive are less so.
I broadly agree with what has been stated above, that competition and the “publish or perish” motivation can have serious negative effects on research. Furthermore the push to “corporatise” everything is also putting pressure on research organisations.
I think we should be prepared to stump up the money while expecting no return. If you want a sure return on investment then research might not be the vehicle for you. Overall I think research delivers positive gains for society, but some individual projects are likely to be almost a dead loss (although that doesn’t make them a total waste).
Consider how much money and effort were lost demonstrating that vaccines don’t cause autism. Was that a waste of money? In a sense it was, but the result is still positive for society.
Researchers at the beginning of their careers should all be required to take a course on these sorts of biases.
And perhaps there should be periodic continuing education requirements every couple of years.
I’ve just volunteered to give a 15 minute talk on this stuff at our place.
It’s a start…
One thing I’d add to the problems is funding. Many (most?) groups wouldn’t have the money needed to work with statisticians to determine a good selection scheme and collect a large enough sample. Of course to work out what would constitute a good sample size etc. you’d need to have a clear idea of what you want to measure – a properly designed experiment should preclude any post-hoc fishing expeditions.
If you are in a University then contact your local Maths or Stats department to see if anyone can help. There should be someone willing and able to help with relatively non-specialist stuff, although of course not everyone will be familiar with all of the newer and/or more advanced stuff (as in any other subject).
If you are willing to offer joint authorship, if the contribution is substantive enough, there’s a good chance of finding someone.
I’d add to this that graduate students in various disciplines are expected to do a lot of their own “crunching”. For example, my sister’s PhD in psychology.
In my view, with increasingly sophisticated methods needed, it is difficult to expect all researchers in X to also be experts in data processing for X, at least to the extent that is sometimes necessary. Perhaps this would be an area of improvement to consider too.
Where I work (government) the methodologists and the subject matter experts are distinct people, for example.
As mentioned in the last thread on the subject, I’d also add (a) the building of theories for background compatibility rather than “sawdust” investigations, and (b) journals of null results (perhaps as a result of preregistrations).
We’ve seen several high-profile changes in government dietary recommendations recently (fats, cholesterol, etc.), due in large part to the fact that they were largely based on disparate studies with self-reporting by people (often ones who had gotten sick). Memory is a bad means of quantifying food and exercise. Lots of bad data, though seductive, does not make good data.
Somewhere I saw a figure (forget the value but well over half) of papers that have never been replicated.
Another problem is when researchers working from the same basic data set (population, epidemiology, economics, and climate studies can fall into this) manage to get the same results. They may well be right, but the multiple agreement does not actually add anything to the certainty of the results.
Ideology can also be a factor:
http://heterodoxacademy.org