Psychology journal deep-sixes use of “p” values

March 5, 2015 • 8:45 am

Reader Ed Kroc sent an email about a strange development in scientific publishing—the complete elimination of “p” (probability) values in a big psychology journal. If you’re not a scientist or statistician, you may want to skip this post, but I think it’s important, and perhaps the harbinger of a bad trend in the field.

Before I present Ed’s email in its entirety, let me say a word (actually a lot of words) about “p values.” These probabilities derive from experimental or observational tests of a “null hypothesis”—i.e., that an experimental treatment does not have an effect, or that two sample populations do not differ in some way. For example, suppose I want to see if rearing flies on different foods, say cornmeal versus yeast, affects their mating behavior. The null hypothesis is that there is no effect on mating behavior. I then observe the behavior of 50 pairs of flies raised on each food, and find that 45 pairs of the cornmeal flies mate within an hour, but only 37 pairs of the yeast flies do.

That looks different, but is it really? Suppose both kinds of flies really have equal propensities to mate, and the difference we see is just “sampling error”—something that could be due to chance alone. After all, if we do two runs of 10 coin tosses, perhaps in the first run we’ll see 7 heads and in the second only 4. That is surely due to chance, because we’re using the same coin. Could that be the case for the flies?

It turns out that one can use statistics to calculate how often we’d see a given difference (due to sampling error) if the two populations were really the same. What we get is a “p” value: the probability that we’d see a difference at least as large as the one we observed if the populations were really the same. The higher the p value, the more consistent the observed difference is with mere sampling error, and the less evidence we have that the populations really differ. For example, if the p value were 0.8, that means there’s an 80% probability of getting the observed difference—or one that’s larger—by chance alone if the populations were the same. In that case we can’t have much confidence that the observed difference is a real one, and so we retain the null hypothesis and fail to support the “alternative hypothesis”—in our case that the kind of food experienced by a fly really does affect its behavior. But when a p value is small, say 0.01 (a 1% chance that we’d see a difference that big or bigger resulting from chance alone), we can have more confidence that there really is a difference between the sampled populations.

There’s a convention in biology that when the p value is lower than 5% (0.05), meaning that an observed difference that big or bigger would occur less than 5% of the time if the populations really were the same, we consider it statistically significant. That means that you’re entitled by convention to say that the populations really are different—and thus can publish a paper saying so. In the case above, the p value is 0.07, which is above the threshold, and so I couldn’t say in a paper that the differences were significant (remember, we mean statistically significant, not biologically significant).  There are various statistical tests one can use to compare samples to each other (you can do this not just with two samples but with multiple ones), and most of these take into account not just the average values or observed numbers, but also, in the case of measurements, the variation among individuals. In the test of two fly samples above, I used the “chi-square” test to get the probabilities.
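
To make the arithmetic concrete, here is a minimal sketch in Python (using SciPy) of that chi-square test on the hypothetical fly counts above; with the continuity correction that SciPy applies by default to a 2×2 table, it returns a p value of roughly 0.07.

```python
# Chi-square test on the hypothetical fly-mating counts above.
# Rows: food (cornmeal, yeast); columns: mated within an hour, did not mate.
from scipy.stats import chi2_contingency

observed = [[45, 5],    # cornmeal: 45 of 50 pairs mated
            [37, 13]]   # yeast:    37 of 50 pairs mated

# Yates' continuity correction is applied by default for 2x2 tables.
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")  # p comes out near 0.07, above the 0.05 convention
```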

Of course, even if your samples really do come from the same population, so that there is no effect at all, you’ll still see a “significant” difference about 5% of the time purely through sampling error, and so you can draw incorrect conclusions from the statistic. That gave rise to the old maxim in biology, “You can do 20 experiments, and one will be publishable in Nature.” And of course some of the papers you read that report p < 0.05 will be rejecting the null hypothesis (of no difference) erroneously; how many depends on how often the null hypotheses being tested really are true, not simply on the 5% cutoff.
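
As a rough illustration of that maxim, here is a small simulation sketch (Python/NumPy, with made-up numbers): draw two samples from the very same population over and over and count how often a t-test dips below p < 0.05. Roughly one “experiment” in twenty comes out “significant” through sampling error alone.

```python
# Simulate many "experiments" in which the null hypothesis is exactly true:
# both samples are drawn from the same normal population.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_experiments, n_per_group = 10_000, 50

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)   # same population as a, so any "effect" is sampling error
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(false_positives / n_experiments)  # close to 0.05: about one "publishable" result in twenty
```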

I should note that the cut-off probabilities differ among fields. Physicists are more rigorous, and only accept p values of much less than 0.001 as significant (as they did when detecting the Higgs boson). In psychology some journals are more lax, accepting cut-off p’s of 0.1 (10%) or less. All of these numbers are of course arbitrary conventions, and some have suggested that we not use cut-off values to determine whether a result is “real”, but simply present the probabilities and let the reader judge for herself. I don’t disagree with that. But, according to statistician Ed Kroc, one journal has gone further, banning the reporting of p values altogether! I think that’s a mistake, for then one has no way to judge how strongly the data weigh against the null hypothesis. Ed agrees, and reports the situation below:

*******

by Ed Kroc

I wanted to pass this along in case no one else has yet, as it could be of interest to you, as well as to anyone who has the occasion to use statistics. Apparently, the psychology journal Basic and Applied Social Psychology just banned the use of null hypothesis significance testing; see the editorial here.

As a statistician myself, I naturally have a lot to say about such a move, but I’ll limit myself to a few key points.

First, this type of action really underlines how little many people understand common statistical procedures and concepts, even those who use them on a regular basis and presumably have some minimal level of training in said usage. I appreciate the editors trying to address the very real problem of seeing statistical decision making reduced to checking whether or not a p-value crosses an arbitrary threshold, but their approach of banning the use of p-values and their closest kin just proves that they don’t fully understand the problem they are trying to address. p-values are not the problem. Misuse and misinterpretation of what p-values mean are the real problems, as is the insistence by most editorial boards that publishable applied research must include these quantities calculated to within a certain arbitrary range.

The manipulation of data and methods by researchers to attain an arbitrary 0.05 cutoff, the effective elimination of negative results by only publishing results deemed “statistically significant”, the lack of modelling, and the lack of proper statistical decision making are all real problems within applied science today. Banning the usage of (frequentist) inferential methods does nothing to address these things. It’s like saying not enough people understand fractions, so we’re just going to get rid of division to address the problem.

Alarmingly, the editors say “the null hypothesis testing procedure is invalid”. What? No caveats? That’s news to me. Invalid under what rubric? They never say.

Interestingly, they no longer require any inferential statistics to appear in an article. I don’t actually categorically disagree with that policy—in fact, I think some research could be improved by including fewer inferential procedures—but their justification for it is ludicrous: “because the state of the art remains uncertain”. Well, then we should all stop doing any kind of science I guess. Who is practicing the state of the art anywhere? And who gets to decide what is or is not state of the art?

Finally, the editors say this:

“BASP will require strong descriptive statistics, including effect sizes. We also encourage the presentation of frequency or distributional data when this is feasible. Finally, we encourage the use of larger sample sizes. . . because as the sample size increases, descriptive statistics become increasingly stable and sampling error less of a problem.”

First off, no, as sample size increases, sampling error does not necessarily become less of a problem: that’s true only if your sampling procedure is perfectly correct to begin with, something that is likely never to be the case in an experimental psychology setting. More importantly, they basically admit here that they only want to see descriptive statistics [means, variances, etc.] and they don’t need to know any statistics the discipline doesn’t understand. Effect sizes and frequency distributions? p-values are still sitting behind all of those, whether they’re calculated or not; they are just comparative measures of these things accounting for uncertainty. The editors seem to be replacing the p-value measure with the “eyeball measure”, effectively removing any quantification of the uncertainty in the experiments or random processes under consideration. A bit misguided, in my opinion.

I could go on—in particular, about their comments on Bayesian methods—but I’ll spare you any more of my own editorializing. Part of me wonders if this move is a bit of a publicity stunt for the journal. I know nothing about psychology journals or how popular this one is, but it seems like this type of move would certainly generate a lot of attention. I do hope though that other journals will not follow suit.

127 thoughts on “Psychology journal deep-sixes use of “p” values”

  1. In other news, the carpenters’ profession has banned the use of crosscut saws.

    Yeesh. L

  2. Please do go on about their statements about Bayesian statistics!
    I’d find it interesting what you think for example in light of the efforts of the “Science Based Medicine” crowd to establish the use of Bayesian methods in order to better judge the actual efficacy of treatments.

    1. Bayesian methods are very attractive in many circumstances, although I would hardly say that I am a carte-blanche supporter.

      Regarding what the editors of BASP said about Bayesian methods, they seem to be unsure of what it is they really want. They claim that they need a probability of the truth of the null hypothesis in order to make a strong case for rejecting it – which is *exactly* what Bayesian methods yield – but then go on to say that Bayesian methods will be judged on a case-by-case basis and will not be required. So they want the results of a Bayesian analysis, but not necessarily the analysis itself? Again, to me it sounds like they are trying to remove objective statistical decision making from the journal, replacing it with an “eyeball measure”.

      Further, it is patently false that the probability of a null hypothesis “is needed to provide a strong case for rejecting it.” Here, they misunderstand what frequentist methods actually yield, that they can provide an equally strong case for rejecting a hypothesis given reasonable statistical decision making protocol *and repeatability*. The decision making protocol gets at what others have said below (that low p-values don’t always signal real effects, that higher sample sizes necessarily drive p-values down, etc.), but it’s the repeatability piece that I would like to emphasize. Traditional frequentist methods demand that experiments, and so tests of hypotheses, are repeated in order to become truly confident about drawing a conclusion. This is of course a huge problem with much of published science since few journals want to publish repeat experiments. Few scientists too want to spend their time on repeat experiments (although some do). This is a critical piece of the puzzle though. If I had my way, I would institute some kind of rule where in order to get research funding, a certain percentage of your work has to fall into the “repeat” category (preferably not a repeat of your own work). It’s a duty the entire scientific community should share, like peer-review.

      Now, you might say, why not avoid all this and just push for a complete conversion to Bayesian methods? Some statisticians do advocate for this, and maybe eventually this is what will happen. But I don’t think it is reasonable to expect non-statisticians to suddenly change all their methods (where those methods have developed Bayesian counterparts – not always a given) and relearn how to do their science statistically overnight, or even within a generation or two. It’s a slow process. More Bayesian procedures is good, and we are already seeing that, but the frequentist framework can work quite well *if* it is properly used.

      Finally, to bring it back to what the editors of BASP said, they latch onto the customary criticism of Bayesian methods by complaining about the use of uninformative priors. This is only a problem where no previous research has been conducted on a topic, which is not often the case in psychology I would imagine. And even if you are restricted to a case where there is absolutely no prior information to use, their concerns can be offset by doing a bit of sensitivity analysis (not done nearly enough in my opinion) on the choice of prior.

      1. “Finally, to bring it back to what the editors of BASP said, they latch onto the customary criticism of Bayesian methods by complaining about the use of uninformative priors. This is only a problem where no previous research has been conducted on a topic, which is not often the case in psychology I would imagine. And even if you are restricted to a case where there is absolutely no prior information to use, their concerns can be offset by doing a bit of sensitivity analysis (not done nearly enough in my opinion) on the choice of prior.”

        The usual complaint about Bayesian statistics that I see is about the lack of a non-informative prior, or to put it another way, about subjectivity in the choice of prior. But the trend in applied Bayesian statistics in the social sciences is to use minimally informative priors for hypothesis testing, whose influence, as you say, can be checked by sensitivity analysis.

        But the choice of prior is primarily a concern in Bayesian hypothesis testing, where its influence is not diluted by the data. If experimental psychologists switch their emphasis from hypothesis testing to effect size estimation, then the choice of prior would hardly matter, because even a badly chosen prior will be overwhelmed by a moderate amount of data from the experiment, and in experimental psych, subjects are cheap and easy to come by.
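
        To illustrate that last point, here is a toy Beta-Binomial sketch (Python/SciPy; the counts and the priors are entirely made up): two quite different priors end up with nearly identical posteriors once a couple of hundred observations arrive.

```python
# Toy conjugate Beta-Binomial example: estimating a proportion under two different priors.
# All numbers are invented for illustration.
from scipy.stats import beta

successes, failures = 120, 80               # hypothetical data

priors = {
    "flat Beta(1, 1)":      (1, 1),
    "skeptical Beta(2, 8)": (2, 8),
}

for name, (a, b) in priors.items():
    posterior = beta(a + successes, b + failures)
    low, high = posterior.interval(0.95)
    print(f"{name}: posterior mean = {posterior.mean():.3f}, 95% interval = ({low:.3f}, {high:.3f})")
# Both posterior means land near 0.58-0.60: the data, not the prior, dominate the estimate.
```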

  3. “Some might view the NHSTP ban as indicating that it will be easier to publish in BASP, or that less rigorous manuscripts will be acceptable. This is not so. On the contrary, we believe that the p < .05 bar is too easy to pass and sometimes serves as an excuse for lower quality research."

    Of course it will be easier! If this is their excuse, why not raise the bar and require a p < 0.001 for publication rather than eliminate NHSTP altogether? I smell something fishy…

    1. In fact, larger sample size causes bigger problems for the null-hypothesis-testing crowd. As sample sizes approach infinity, virtually any test against a null hypothesis will always yield a statistically significant p value, at whatever level you choose. The cut-off level is not the problem. The problem is that it is very unlikely that any null hypothesis is exactly true to an infinite number of decimal places. And unless it is exactly true, to an infinite number of decimal places, a large enough sample will detect this departure and (correctly) return a significant but meaningless p-value.
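
      A quick simulation sketch of that point (Python/SciPy, with made-up numbers): give two populations a true difference of a twentieth of a standard deviation and the p-value collapses as the sample grows, even though the difference never stops being negligible.

```python
# Two populations whose true means differ by a trivial 0.05 standard deviations:
# with enough data, a t-test will flag the difference at any alpha you choose.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.00, 1, n)
    b = rng.normal(0.05, 1, n)   # tiny but nonzero true effect
    print(f"n = {n:>9,}   p = {ttest_ind(a, b).pvalue:.2g}")
# Typically unremarkable at n = 100, "significant" by n = 10,000, and vanishingly
# small by n = 1,000,000, even though the effect itself remains negligible.
```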

      1. I sometimes think the problem stems from the term ‘significance’. The p-value is what it is and no more. It speaks to the confidence with which one can rely on a result and says nothing about the result’s impact in the real world (or its ‘significance’ in a colloquial sense). A p-value cannot be used to assess the magnitude or importance of a difference in means without taking account of sample size (more or less by definition).

      2. “As sample sizes approach infinity”? Unless the population you are sampling from is very, very large, that is unlikely. Why do you take a sample? So that you can make inferences about the population, when sampling the entire population is not feasible. That rests on reducing bias in taking your sample, blocking when necessary, etc. A large sample size can allow you to detect differences that are tiny yet significant, but that does not automatically mean they are nonsense. Statistical significance does not mean biological significance, which means that further research is always required to test the results you have obtained. Multiple testing will give you more false positives, but that is why you do statistical corrections when using multiple tests.

        1. “Statistical significance does not mean biological significance” is also the point I am making. P-values don’t help us answer the question we really want to answer. Most questions in biology are better answered by estimating a biologically significant parameter, and giving confidence intervals for it, rather than by a binary test of a null hypothesis (especially a point null hypothesis, like H_0: θ = X).

      3. Lou Jost: “In fact, larger sample size causes bigger problems for the null-hypothesis-testing crowd. As sample sizes approach infinity, virtually any test against a null hypothesis will always yield a statistically significant p value, at whatever level you choose.”

        The above is true if the null hypothesis is false because, as is well-known, power increases with N, all else being equal.

        But the above is not true if the null hypothesis is true because alpha is fixed (by convention) at a constant value.

        The suggestion – which may have just been poor writing – that a larger sample has a better (or even different) chance of rejecting a true null hypothesis is the sort of thing that causes backwater journals like BASP to drop NHST completely.

        1. What I said:”…larger sample size causes bigger problems for the null-hypothesis-testing crowd… The problem is that it is very unlikely that any null hypothesis is exactly true… And unless it is exactly true…a large enough sample will detect this departure and (correctly) return a significant but meaningless p-value.”

          What you answered: “The suggestion – which may have just been poor writing – that a larger sample has a better (or even different) chance of rejecting a true null hypothesis…”

          Please re-read my comment and you will see that I make no such claim. My whole series of comments is about the case in which the null hypothesis is NOT true.

          1. Sorry for misreading your comment, but falling back on the claim that the null hypo is very unlikely to be true puts the cart before the horse (even if that is how most Bayesians roll). Some of us aren’t in the mood to reject the idea that all phenomena have natural causes and explanations because someone else thinks that this null is highly unlikely to be true. We want evidence. If you’d also like to have an attached measure of association, such as an eta-square, that’s cool, but there are too many cases where the null is plausible to use the unjustified argument that nulls are very unlikely to be true.

          2. My point is that in most ecological and genetic experiments, we usually don’t really care if the null hypothesis is exactly true or just almost exactly true (say, to five decimal points). So null hypothesis testing doesn’t really tell us what we want to know; parameter estimation does.

            In cases where we really do care whether the null hypothesis is exactly true, I agree that null hypothesis testing is valid. Above I used the Michelson-Morley experiment as an example of such a case. These cases are very rare in real-life biology.

      4. For designing simple experiments with a null hypothesis of y-bar1 = y-bar2, that’s what the power of a test is for. You’re right that the null hypothesis is almost guaranteed to not be exactly true, but you can calculate the quantity required to test for a difference deemed to be appropriate for the circumstances.

        But that doesn’t mean it’s a “bigger problem for the null-hypothesis-testing crowd.”

        My datasets routinely contain millions of observations, and while I’d frequently reject a null hypothesis that two samples are equal with regard to some metric, I can establish a more reasonable null hypothesis to search for a difference that is relevant to the situation; i.e., being able to detect a difference of 0.0000001 might be completely meaningless (and in my work, usually is), so we might want to test for a difference of 100. The null hypothesis MUST be appropriate for the purpose of the test.

        But ultimately, this just speaks to the necessity of having someone well-versed in statistics involved in the test.
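
        Picking up the earlier point about calculating the sample size needed to detect a difference that actually matters, here is a minimal sketch (Python, using statsmodels; the effect size, alpha, and power are hypothetical choices).

```python
# Solve for the per-group sample size needed to detect the smallest difference
# we care about, at a chosen alpha and power. The numbers are illustrative only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.3,   # smallest difference worth detecting, in standard-deviation units (Cohen's d)
    alpha=0.05,        # tolerated false-positive rate
    power=0.8,         # desired chance of detecting the effect if it is real
)
print(round(n_per_group))  # roughly 175 per group under these assumptions
```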

        1. Yes, null hypotheses that are ranges are much better than point null hypotheses. But again, the more appropriate philosophy is to consider your experiment as a measurement of the magnitude of the difference (with confidence intervals), not a binary decision process.

          Suppose you do your experiment, and you find a non-significant result. The interpretation of this depends on the confidence interval of your measurement of the difference. If your confidence interval is narrow, then you can say that the difference is almost certainly less than 100. If the confidence interval is broad, you can’t say much except that you need more data. That’s why it is helpful to have confidence intervals. We generally want to estimate parameters, not test null hypotheses.
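
          A minimal sketch of that estimation-first approach (Python/SciPy, with made-up data): report the estimated difference together with its 95% confidence interval rather than just a significant/non-significant verdict.

```python
# Estimate the difference between two group means and attach a 95% confidence
# interval to it, instead of reporting only whether p crosses 0.05. Data are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
treatment = rng.normal(10.4, 2.0, 40)
control   = rng.normal(10.0, 2.0, 40)

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
df = len(treatment) + len(control) - 2          # simple approximation to the Welch degrees of freedom
margin = stats.t.ppf(0.975, df) * se
print(f"difference = {diff:.2f}, 95% CI = ({diff - margin:.2f}, {diff + margin:.2f})")
# A narrow interval hugging zero says "any effect is small"; a wide one says "collect more data".
```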

          1. What we want is to draw a conclusion on whether or not the treatment had an impact (or the observations differ from nothingness, in the case of things like the existence of the Higgs boson), and that is a binary conclusion.

          2. Is that what we normally want in biology? I’d say no. As I mentioned in my other comments here, if the mere existence of an effect is newsworthy, I agree that p-values are appropriate. In biology, though, this is rarely the case. We don’t want to know if the treatment has an effect different from zero, we want to know how big of an effect the treatment has. Parameter estimation, not null hypothesis testing, as DeGroot and Schervish note in their widely used textbook, and as many other experts have said.

      5. In my work, the drive is always to reduce sample sizes (commensurate with maintaining the needed resolving power; which is always a risk-based call, regardless) because larger samples cost more in money and time, sometimes a lot more time.

        Delay is very costly in my work, so these are carefully considered.

  4. I think this is a very important issue, so please excuse the long comment which will follow this one. P-values almost never are relevant in biological work, and should be used only in those rare cases when they are relevant. They should almost always be replaced by confidence intervals on interpretable, unscaled measures of the magnitude of an effect. Please see
    https://golem.ph.utexas.edu/category/2010/09/fetishizing_pvalues.html
    for a long discussion of this, triggered by a paragraph in a paper I wrote for Molecular Ecology.

    This is not the first journal to prohibit p-values. Apparently this journal does not understand the issue either and bans them for the wrong reasons.

    The reason to avoid p-values, and null hypothesis testing generally, is that a significant p-value only means that the null hypothesis is (probably) false. But we nearly always know the null hypothesis is false without leaving our office. For example, it is virtually impossible that two forests will have exactly the same compositions (in terms of species relative abundances). Getting a significant p value is guaranteed, if sample size is large enough. This is true for most null hypotheses. This approach reduces science to a meaningless game.

    This does not mean abandoning statistical methods. The uncertainty in our conclusions is properly expressed by confidence intervals.

    The only time null hypothesis testing should be used is if the mere falsity of the null hypothesis were itself newsworthy. An example would be the Michelson-Morley experiment on the constancy of the speed of light (independent of the direction of travel of the observer). If there were statistically significant evidence that the null hypothesis of constant velocity were false, it would win someone a Nobel Prize. But these kinds of examples are very rare in biology.

    1. A nice short way of saying all this: “[A solution] is to regard the statistical problem as one of estimation rather than one of testing hypotheses.” DeGroot and Schervish, Probability and Statistics, 3rd edition, p. 530.

    2. Yeah, it’s very hard to see what’s actually going on without a whole lot more analyses of the data than a simple test for p-value.

      In my work, we typically look at several ways of presenting the data (generally though not always with confidence bars) and then think about what’s going on, based on that.

      A low p-value isn’t enough to decide, except in very specific circumstances.

      However, a high p-value makes it very hard to get your managers to spend any money …

    3. Hi Lou,

      I fully agree that p-values should never be stand alone quantities in applied science. I would also support replacing p-values with their confidence interval kin, as these are far more informative measures of what the analyst is ultimately interested in.

      Your point that an exact point null hypothesis is, a priori, always known to be false is a good one. Have you read any of Andrew Gelman’s work? He has a nice reformulation of the Type 1 / Type 2 error idea in terms of what he calls Type M (for magnitude) and Type S (for sign) errors. I find these ideas to be quite useful. See for example: http://goo.gl/ElxOcv

      I do disagree though about your reasons to avoid hypothesis testing in general, “that a significant p-value only means that the null hypothesis is (probably) false.” I think this is too strong a description of what p-values actually represent. The way I think about it is that a low p-value indicates that the data are less consistent with the null hypothesis than with the alternative. Exactly how much less depends on many things, but the point of hypothesis testing is not literally to determine if one hypothesis is probably true or false, given the data; it is to decide which hypothesis is more consistent with the data, something similar but not the same.

      1. Hi Ed, Lou

        Great thread: very informative. I especially appreciate Ed’s comment:

        “…but the point of hypothesis testing is not literally to determine if one hypothesis is probably true or false…”

        I’ve noticed over the years that in too many seminars the presenter does just that: the p value is stated as if it verifies the “truthiness” of the difference being noted.
        (Always makes me think of Inigo Montoya…)
        But almost never with any context about possible biological significance.

        I think Lou brings this up nicely in comment 11.

        Many of the points brought up remind me of this article:

        Steve Goodman: “A Dirty Dozen: Twelve P-Value Misconceptions” Seminars in Hematology (2008) doi: 10.1053/j.seminhematol.2008.04.003

          1. You are welcome. And, thank you for the recent eagle and ocelot pictures. I would never have guessed they would bother to prey on each other. Amazing.

  5. Thanks to both Jerry and Ed for helping to clarify/remind me about some of the basics of statistics — and why they matter.

    One of the popular ways people have of ducking epistemic accountability on extraordinary claims is to go into what I call Therapist Mode. “Let’s not look at whether belief X is technically true; let’s only consider whether or not it works well for the believer. We need to focus on helping individuals, not get side-tracked into cold ivory tower debates.” An immunizing strategy, iow.

    I would hate to think that this analogy drawn from a parody of psychology is actually reflecting an increasing sloppiness in the field.

    1. Actually some of the comments here are much more informative than are the pieces written by Jerry and Ed, which seem to lack an appreciation of why p is being rejected.

      1. This is an example of an uncivil comment. Let’s see you try to give a complicated explanation at 6 a.m. in limited time, and do it without making a few errors. I suggest that, until you can learn to have the requisite civility in my living room, you go criticize other websites.

        1. Sorry, Jerry, my comment was not meant to be uncivil. But there are posters here who seem to be experts in statistics who have added depth to the discussion and who have explained why the journal banned p values.

          I think you and Ed missed those reasons, and this has caused many here who know little about statistics to equate their actions with a lessening of scientific standards, when their aim is actually to improve them.

          I know you want civil discussion, but I hope you don’t want a site of “yes” posters bowing their heads to everything you say. I was appealing to these posters to think a bit more about what they say and have a basis for saying it rather than just mouth acceptance of whatever they read here.

          Posters who post here merely to say “good post Jerry” don’t really add much to the discussion, unless they actually say why they think it’s a good post and perhaps why it is not completely good.

          Anyway, it’s your website, so I guess you will decide what you want, and if you don’t want me to be part of it then I have to accept that decision. I will add that I agree with most of what you write about here. It’s a good antidote to a lot of rubbish I read elsewhere.

          1. Apology accepted. But really, don’t even imply that I want only people who agree with me. There was no need for you to raise the accusation of me fostering an echo chamber: look at that thread, for crying out loud. In the meantime, I suggest you learn to disagree in a civil manner, and treat the other posters (and me) with respect. Many people don’t see how they come off to others, or to me, and I think you need to do a bit of brushing up on that.

          2. Okay, thanks. Perhaps my post was clumsily disrespectful reading it again now a day later.

          3. I suggest you reread my original post, the cited editorial, and the comments I have made since, if you think I missed what the journal’s reasons were for banning p-values. I do indeed believe that the journal’s actions represent a lessening of scientific standards: as I said originally, “The editors seem to be replacing the p-value measure with the ‘eyeball measure’, effectively removing any quantification of the uncertainty in the experiments or random processes under consideration.” Remember that they have banned *all* (frequentist) inferential procedures, not just p-values. Their aim is undoubtedly to improve science in their field, but the main point I was making (and Jerry too, I believe) is that their method does not meet their aim; i.e., the action they are taking here is misguided and ultimately reduces the scientific quality of the published work.

  6. My first thought, based on how many psych studies don’t seem to be reproduced, was a comment by a chiropractor leaving an unsuccessful test of some woo, “That’s why we don’t like double-blind studies.”

    – quoted on The Thinking Atheist podcast 17Feb2015.

  7. Oh my! What’s the point of looking at descriptive statistics if your p value shows that you should accept the null hypothesis anyway? I actually like statistics even though I suck at math (maybe that’s because statistics is supported with lots of tools that will do the math for you). Even I, as a non-scientist, used p-values, statistical tests & descriptive statistics in my own work (but not really anymore because what I’m doing doesn’t really need that).

    Doing this is just going to make the Scientologists think they are right.

    1. Similar for me. I’m not a scientist and my stats education doesn’t go past 101, but I think statistics are incredibly important. This to me is like embracing woo. Surely there needs to be some form of validation?

      1. By which I mean I don’t know if p values are the best measure, and probably shouldn’t be the only way to get data accepted, but to ban a measure that can provide valid information seems short-sighted.

        I like what Ben Goren says below too.

  8. p-values have their problems, but eliminating them is not a good solution. I admit to only a vague understanding of Bayesian logic, but the use of Bayesian analysis would (I believe) allow one to calculate a probability value which is more intuitive to most people’s understanding. Also, one can calculate the probability of a model of interactions/ cause and effect relationships, instead of simple null hypothesis testing, as Lou Jost stated.

  9. I think that biologists’ reliance on p-values and null hypothesis testing has had a terrible effect on the field. The example I have the most experience with is in the measurement of diversity and compositional differentiation in genetics and ecology. In those fields, measures of diversity and differentiation tended to be used as mere tools to generate p-values. In genetics, for example, it is still common to say something like “The genetic differentiation Fst was 0.04, which is highly significant, p<<0.001." This satisfied the researchers since they got their significant p-value and could publish. But they never asked themselves whether that low value of Fst really meant anything. And they rarely noticed that in fact Fst does not measure differentiation. They didn't notice because they never tried to interpret the actual magnitude of the measure. They were satisfied with the fact that it was statistically significantly different from zero.

    As a result large chunks of population genetics (including key elements of evolutionary theory surrounding speciation) are wrong, because scientists used invalid measures as their basis. The lack of validity of these measures would have been instantly apparent had workers tried to interpret their actual magnitudes instead of being satisfied with rejecting a null hypothesis.

  10. My wife is a developmental biologist at Washington University and uses a p-value of 0.01. She is also very critical of people who rely on the p-value to bolster the argument that the results are ‘true’, as the p-value just means the numbers are different and, in-and-of-itself, does not mean the difference is true.

    You can get different numbers from the same population just out of sampling variations/errors/observer biases.

    You can apply a treatment to homogenous populations differently (or as my wife calls them — pipette errors) and end up with different results. Even though they shouldn’t have generated a difference.

    Your populations could be different after your treatment, but it’s not because your hypothesis is correct, but because you blew it in your observations.

    And so on and so on and so on.

    All of which can pass a p-value test, but (ultimately) produce absolute rubbish science.

    So, while they can be helpful, and she uses them, they are not the be-all and end-all of biological research and not, in-and-of-themselves, a winning argument. Rather they are just one of many factors that may indicate you’ve conducted your research properly and have successfully made a new advance in science.

  11. Shorter BASP editors: Look, the people who submit to us aren’t really doing science, so let’s drop the pretense.

    I think the last few sentences were interesting: they talk about not limiting creativity, and not using p-values as a “crutch.”

    I think there is a struggle for identity going on there …

    1. This sounds, from a Bayesian point of view, inherently more likely than any of the more detailed explanations.

      1. I’ve never read a scientific psychology paper, but it’s hard to imagine how one goes about the task of aligning results or even measuring them when the subjects are basically self-reporting results. I don’t mean to disparage psychology, by any means, as a profession: there’s nothing wrong with being more art than science, if that’s what you are. I’m reminded of a truism I heard from a campaign consultant in a poli sci class: 90% of what we spend money on in political campaigns doesn’t work, but we’re never sure what makes up the 10% that does. Too many variables.

  12. I think a really big part of the problem is that only the “successful” results get reported on.

    The journals and researchers should agree before the research starts on the plan of the study and the fact that the journal is going to publish the results. Then the researchers can go do their thing, and the journal goes ahead and publishes whatever they find, even if they don’t find anything that rises above random chance.

    It’s important to know about all those areas of potential research where nothing interesting lies. By publishing the fact that nothing was found there, they either save other people the bother of looking, or they inspire other people to look in a way that they didn’t.

    And, this would mean that the journal would have the full spread of possibilities represented in their collective results, giving a more accurate picture of what people are actually finding.

    Yes, of course; give special attention to those studies that find something interesting…but include everything.

    Especially in this day and age of the Internet where it costs literally nothing to add even hundreds of pages to a Web site.

    b&

    1. “The journals and researchers should agree before the research starts on the plan of the study and the fact that the journal is going to publish the results.”
      I don’t agree with this part of your post – that would mean that the journal controls and drives research. To deal with funding organisations is more than enough without adding a new layer of research control.
      But I fully agree with the idea to be able to publish negative results.

      1. Some places (I believe they’ve already done it for some medical journals) have a *third party* repository for stuff in progress to avoid the “file drawer effect”.

        1. Ah, OK – I hadn’t seen your comment when I wrote mine. Glad to hear that some people are taking this up.

    2. I don’t agree with determining in advance that a journal would publish the results. One of the reasons for having journals in the first place is to filter the vast amount of papers written down to those of some reasonable interest to researchers in that area. Publishing everything may or may not be achievable by the journal, but it would be a nightmare for the readers.

      But it would be healthy if each study were at least preceded by publication of the plan. “Publication” might in this case simply mean placing it on a suitably authoritative website.

      If no journal publication results, it would then at least be possible to get a figure for the number of unpublished studies in that area. It would also provide some protection against significance-fishing.

      (Preferably there would also be a reason appended to the plan for the failure to publish: statistically insignificant, funding proved insufficient, researcher fell under a bus …)

      1. This is a good idea, at least when it comes to certain types of research. Preregistration of research protocols is already in place for certain clinical trials – this is where it is needed most. It would be good to see a bit more of this too, although I wouldn’t advocate for this type of thing across the board. The downside would be that we would restrict our ability to do a lot of exploratory data analysis. That type of analysis can often lead to new research questions the current study perhaps can’t answer, but that future studies could.

        I am much more an advocate of a reproducibility plan; i.e. somehow ensuring that research is repeated by other people. Right now, it is both harder to fund and to publish research that is considered “unoriginal” in that it is repetitive of someone else’s work. This is a big mistake, in my opinion.

        1. That was one of my first thoughts. Science always claims that replicability is a big part of the method, but in real life almost no one has the time and/or money to afford to try replicating research.

          Maybe we need to start rewarding somehow those who try to replicate results; or set up some sort of replication facilities that do nothing BUT address that issue.

          Or maybe just establish a random auditing of published results, a la the IRS. Put out the word that each year a random set of experiments will be chosen for replication tests.

          But I’m sure it would be tough to come up with money to support any of these ideas.

          1. Before retiring I headed up an NIH grant specifically to try to replicate promising therapies for spinal cord injuries. It was a 5-year grant to attempt 5 therapies. I don’t think NIH has plans to fund this type of work in the future. We were partially successful in 3/5 studies.

          2. One can hardly think of a more important area for replication to be standard in than medicine. What a scandal that work such as yours is being curtailed for political reasons!

            People also have career reasons for not “wasting time” trying to duplicate results. All the more reason for the establishment of oversight and priority-setting by entities charged (and funded) to do just that.

            (People get mad at Big Pharma for running their own clinical trials, but no one else is stepping forward to run them for them. And as I understand it, the FDA doesn’t cover medical device testing at all.)

          3. There’s a more interesting form of replicability that takes place all the time, however. It is based on cumulativity, and on how well-established results are used to help select plausible hypotheses. One reason pseudoscience is pseudoscience is its refusal to take these background hypotheses seriously – homeopathy, for one particularly egregious example. It is while the “background” consolidates that the replication takes place – but as a matter of “less than settled, more than unknown”, etc. So *similar* things get tried, or things get tried presupposing it, and then something funny happens and we realize we weren’t as right as we thought about the original matter.

  13. Probably the worst abuse of p-values is in medicine. A drug company will often report (correctly) that their drug had a highly significant effect in treating X. But with a large enough sample, even the tiniest non-zero effect can always be made to reach whatever p-value they wanted. It is just a question of money to get a large enough sample size.

    To do this right, the drug companies should instead state how much their drug affects a meaningful measure of health or disease severity, and should give the uncertainty in their estimate of the size of that effect. (They should also include health risks and costs, of course.)

    1. Are there measures of effect size that aren’t also inextricably bound up with sample size the same way p-values are? I’m thinking of effect sizes in meta-analysis, which are effectively the same as t-statistics and take into account the sample size.

      1. The best advice these days is to use non-scaled measures of effect size, so that the magnitude of the effect can be directly interpreted using units that are meaningful to the problem at hand.

        If you do want to get a single summary p-value from a meta-analysis of a set of experiments that each report (one-tailed) p-values, I derived a simple formula for that here:
        http://www.loujost.com/Statistics%20and%20Physics/Significance%20Levels/CombiningPValues.htm
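
        For readers who would rather use something off the shelf, SciPy also ships Fisher’s classic method for combining independent p-values; a minimal sketch with hypothetical per-study values (this is the textbook approach, not necessarily the same as the formula at the link above):

```python
# Combine one-tailed p-values from independent studies using Fisher's method.
# The p-values are hypothetical, and this is the textbook combination rule,
# not necessarily the formula derived at the link above.
from scipy.stats import combine_pvalues

per_study_p = [0.08, 0.12, 0.04, 0.20]
statistic, combined_p = combine_pvalues(per_study_p, method='fisher')
print(f"combined p = {combined_p:.3f}")
```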

        But the best approach to meta-analysis is to treat the studies as independent estimates of the actual, unscaled magnitude of the effect. The trouble with that is the studies may not all use the same measure of magnitude….

      2. Huh? No. The effect size is what it is. It is independent of the sample size used to estimate it. Of course, for a given effect size and population, the larger the sample size, the more precise we expect our estimate of it to be.

        In meta-analysis, a relationship between effect size and sample size usually indicates publication bias—most often, non-publication of small, null studies.

    2. This is definitely a problem in an advertising brochure that just quotes the p value but normally in a scientific paper we would be given rather more information. We would normally have at least the sample sizes and the means of the treatment and control groups so in the hypothetical case you describe we could conclude with confidence that there is a difference between the treatment and the control but with a similar degree of confidence that the size of that difference is very small. With the other information that you mention – health risks, costs, etc – that would be a reasonable basis for deciding if the treatment in question was medically valuable, surely?

      1. Yes, I was definitely thinking more about the PR than the scientific work in this case. In genetics and ecology, though, people often use measures with no simple interpretation, and are often satisfied by p-values.

    3. Jost is arguing that large samples mislead; the opposite is true. If the sample size is the whole population, then a measured difference is a real difference.

      1. No, I am arguing that in biology, large samples are almost certain to produce statistically significant results at whatever p-value level you want. This happens because the null hypothesis is usually not exactly true to an infinite number of decimal places. So the null hypothesis will correctly be rejected (the method is doing its job) but this is not usually the question we should be asking. We know the null hypothesis cannot be exactly true to twenty decimal places, so we could have rejected it without stepping out the door of our office. What we (usually) really want to know is the SIZE of the departure from the null hypothesis, along some biologically meaningful dimension, and we also want to know the uncertainty in our estimate of that size difference. This is a parameter estimation paradigm.

  14. p-values are definitely overused and overrated.

    What we’re interested in is the probability of the hypothesis given the evidence. Instead, what we’re doing is assuming some other hypothesis and then seeing what the probability of the evidence is, given that hypothesis. The entire approach seems backwards. Choosing the null hypothesis itself can be somewhat arbitrary. In some cases it’s quite clear, in others it’s a judgment call.

    Psychologists themselves have known about the issues with null hypothesis testing and its misinterpretations for a while now. It was old news back in 1993:
    http://www.stats.org.uk/statistical-inference/Cohen1994.pdf

    All that said, I’m not sure outright banning is the right approach. While I’m a Bayesian and think the frequentist school of thought (which null hypothesis testing comes from) is incoherent, there are certain advantages to their methods. They are generally much easier to implement and are more familiar to researchers, especially older ones. Asymptotically, frequentist and Bayesian methods frequently (no pun intended) get the same result. Thus, despite their shaky philosophical underpinnings, frequentist methods are still of practical use. In general, I’d prefer to see a p-value than nothing at all.

    Perhaps a more middle-of-the-road approach, where the journal discouraged the use of null hypothesis testing, rather than outright banned it, would have been more appropriate.

    1. If psychologists know this is an issue, they should start cracking down on their peers who “abuse” the P-value and send them all back to statistical school!

      1. Some psychologists are aware that there are pervasive problems in how practitioners conduct and analyze their experiments. See the references in post #24 and #26, for examples of published criticisms. Much of the self-reflection in the field was kicked off by the publication in a top psych journal of the supposed paranormal findings in Daryl Bem’s “Feeling the Future” paper. IIRC, 10 out of 11 p-values were less than .05.

  15. As a person with some statistical training at university level but otherwise an observer of the field, I’ll take a stab at two points:

    @Alex (2): Why does “Science Based Medicine” like Bayesian statistics?
    I think the answer lies in what they’re commenting on – namely, non-science-based medicine, like homeopathy (loosely, “woo”).
    Frequentist statistics don’t consider the underlying plausibility of a technique – for instance, the plausibility that diluting cinchona to well beyond Avogadro’s number would increase its potency in curing malaria because concentrated cinchona causes fever. Roughly speaking, you have 1 chance in 20 of getting a positive result at the p = 0.05 level with a given test of woo. Torture the data enough and you will find something. See also http://xkcd.com/882/ on jelly beans and acne.
    Bayesian statistics start with a prior probability of something, then modify that probability as new data come in to get the posterior probability. Now, if you consider the prior probability of water curing malaria, it’s probably around 0. Run a test 20 times and get one “positive”, and it’s not going to change the posterior probability much. The trick is, outside of homeopathy, say in the real world of clinical trials, what is the prior probability?
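
    A back-of-the-envelope sketch of that updating step (plain Python; the prior and the test characteristics are entirely made up): if the prior probability that a remedy works is essentially nil, a single “positive” trial at p < 0.05 barely moves the posterior.

```python
# Bayes' rule applied to one "positive" trial of a highly implausible remedy.
# Every number here is invented purely for illustration.
prior = 1e-6      # prior probability that the remedy really works
power = 0.8       # chance of a positive trial if it does work
alpha = 0.05      # chance of a false-positive trial if it does not

posterior = (power * prior) / (power * prior + alpha * (1 - prior))
print(f"posterior probability the remedy works: {posterior:.1e}")  # about 1.6e-05, still essentially zero
```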

    @Lou (20): p-values in medicine.
    It is my experience, and I’ve spent most of my working life in and around pharma, that companies try to measure something that is clinically significant when they conduct a trial – for example, the decrease in the number of blood transfusions required in a given interval in patients with a blood cancer, or the number of strokes on an anticoagulant. The FDA wants clinically significant endpoints. But this is not to say that the same companies sell the drug to the public in the same way. Anticoagulants are a particularly egregious example: the newer ones are largely being sold on convenience.

    1. They also only compare people receiving the drug to those receiving a placebo, a false analogy considering they should actually compare the people receiving the drug to those receiving the old drug used to treat a specific disease. We want to know if new drugs are better than old medicines, not if they are better than receiving nothing.

      1. I don’t think that’s the case – most trials (at least Phase 3 trials) are run against best current therapy. The reason is that you want/need clinical equipoise in the trial (the likelihood of benefit to be equal in both arms) for it to be ethical: a trial in which the risk to the subjects is significantly unequal would be unethical and would not get Institutional Review Board approval. You may see A+B versus A trials, where you are looking at the benefit of adding B to A; but you won’t see a B versus placebo trial unless there is no best current therapy.
        Also, even if you make the unjustified assumption that a company would run a placebo-controlled trial just to show “benefit” of their drug, it would be unhelpful – regulatory agencies and national health services are increasingly reluctant to approve drugs merely because they are safe and effective, they are looking for clinical benefit and often cost benefit over current therapy.

      2. Perhaps that had something to do with the over-hyping of COX-2 inhibitors, when in most cases they were no better than existing NSAIDs, and turned out to have more frequent serious side-effects.

  16. The problem is not the statistics itself, but that people aren’t trained to understand them. They use it without thinking. For example, I’ve seen articles claiming that the Bonferroni correction (for multiple testing) reduces the false positive rate to zero! Of course, this can never be true (we can never know whether we have any false positives or not): all the Bonferroni correction does is keep the overall, family-wise significance level at 0.05 (or whatever significance level you are looking at) across all the tests, by tightening the threshold for each individual one. Without correction, you are vastly increasing your chances of getting false positives, but that doesn’t imply that by correcting you are not going to get any!
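
    A concrete sketch of what the correction actually does (Python, using statsmodels; the p-values are hypothetical): it holds the family-wise chance of at least one false positive at the nominal alpha by tightening each individual test’s threshold; it cannot make false positives impossible.

```python
# Bonferroni adjustment of a batch of hypothetical p-values from multiple tests.
# The family-wise false-positive rate is held at alpha = 0.05; it is not driven to zero.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.010, 0.020, 0.040, 0.200]
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')

for p, p_adj, r in zip(raw_p, adjusted_p, reject):
    print(f"raw p = {p:.3f} -> adjusted p = {p_adj:.3f}, reject null: {r}")
```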

  17. Certainly care is needed when using p-values, but to ban them altogether seems extreme. Part of the problem is that when RA Fisher decreed p=0.05 to be relevant, he intended this to be interpreted as worth further investigation, and he emphasised the importance of replication. This seems to have been forgotten. Furthermore, when WS Gosset introduced the t-test, his concern was to demonstrate that there would be no discernible difference between batches of Guinness brewed with ingredients from different sources. The use of the p-value to corroborate the null hypothesis is still both valid and valuable. I would also like to know the alternative to p-values in the analysis of variance for multi-factor experiments. Perhaps psychology does not find a need for such approaches.

    1. The reason biologists use a p-value cut-off of 0.05 is that it gives the “best” balance between your false positive rate (Type I error) and your power (the ability to detect differences between your populations). You could use a lower cut-off, but you would increase your false negative rate and hence decrease the power of your test. Many statisticians are of the opinion that you should publish the actual p-value obtained, because then the reader can make a judgement on how significant the results are. For example, a p-value of 0.048 carries less evidence against the null hypothesis than a p-value of 0.00018, and hence may be less “significant” even though it still passes the test if a cut-off of 0.05 is used.

      1. The relationship between the false positive rate (alpha) and the power (1-beta) of a hypothesis test is a bit more complex though. For one thing, the relationship depends on the actual test that you are performing.

        In general, once a test statistic has been fixed, power is a function of the false positive rate, the sample size, the effect size of interest, and the measurement error. The balance between what is the “best” calibration of alpha to beta levels changes with all these parameters. So, for example, if you run an experiment with a huge sample size, you can afford to lower your false positive rate while still retaining high power, *keeping all other factors fixed*. If you want to use that high sample size to detect a smaller effect though, then you may be forced to keep the alpha level where it is.

        There really isn’t anything special about the alpha = 0.05 cutoff; it’s a convention, and one that is often inappropriate. I understand Fisher’s original motivation for introducing the cutoff from a philosophical level, but I think it’s a very simplistic and outmoded way of making decisions statistically. Our knowledge and methods have certainly expanded enough to be a lot more sophisticated (and accurate) in our decision making abilities.
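
        A small numerical sketch of those trade-offs (Python, using statsmodels; all the combinations are illustrative): fix the test and watch power shift as the sample size, the effect size, and the alpha level change.

```python
# Power of a two-sample t-test across a few sample sizes, effect sizes (Cohen's d),
# and alpha levels. The values are illustrative only.
from statsmodels.stats.power import TTestIndPower

calc = TTestIndPower()
for n in (25, 100, 400):
    for d in (0.2, 0.5):
        for alpha in (0.05, 0.01):
            pw = calc.power(effect_size=d, nobs1=n, alpha=alpha)
            print(f"n = {n:3d}, d = {d}, alpha = {alpha:.2f}: power = {pw:.2f}")
# Larger samples and larger effects buy power; a stricter alpha spends some of it back.
```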

    2. “I would also like to know the alternative to p-values in the analysis of variance for multi-factor experiments.”

      Effect-size estimates, confidence intervals, Bayes factors, credible intervals, likelihood ratios, decision trees, regression models,…

        1. Some of the techniques in your list are readily incorporated in the standard frequentist approach to AOV. However, I am interested in Bayes factors as an alternative to p-values. Can you recommend a useful introduction, preferably with a bias towards practical applications? The stuff that I have found on the internet has been heavily theoretical.

        Thanks.

        1. The best paper that comes to mind is “Default Bayes Factors for ANOVA Designs” by Rouder et al (2012), available from Jeff Rouder’s website, although it is fairly technical. If you’re interested in a practical method of computing such Bayes factors, there is a very nice R package, “BayesFactor,” which can be installed into R in the usual manner (of course, you need R, but it’s free). Finally, Jeff Rouder has an online Bayes factor calculator for regression problems at his website, which, since any ANOVA problem can be formulated as a regression problem, could be useful.

          1. Thanks. I have downloaded those to study when I have a good chunk of time.

  18. Physicists are more rigorous, and only accept p values of much less than 0.001 as significant (as they did when detecting the Higgs boson).

    Minor nit-pick. The standard they went for in the case of the Higgs was exceptionally high, since they’d spent oodles of bucks on the experiment, and the publicity consequences of getting it wrong would have been horrendous.

    But less stringent thresholds are also widespread in physics; it all depends on the result in question and how much is riding on it.

    1. Physics has the advantage of dealing with much less variation than biology, and thus is correct in setting the bar higher, as it were.

      In much of biology it is difficult to keep one set of factors constant while testing another particular one. As the old laboratory sign says, “under carefully controlled conditions of temperature, humidity, and pressure, the organism will do as it damn well pleases.” This “messiness” is why biology’s p-value standards are conventionally more lenient.

      1. It is precisely this messiness that makes the whole null-hypothesis-testing scheme meaningless in most biological applications, where it is known in advance that the null hypothesis is not true to an infinite number of decimal places. Such a null hypothesis can always be rejected if sample size is large enough. That’s why it is important to shift focus away from null hypothesis testing and towards a parameter-estimation approach.

        1. It’s nice to know that my long-standing annoyance at null hypotheses might have more going for it than just my reflexive contrarianism.

          Joking aside, thank you for your informative and easy-to-grasp posts here. I’ve learned quite a bit.

  19. Oh dear. I really do hope that our host does not actually believe what he wrote about p-values.

    “For example, if the p value were 0.8, that means there’s an 80% probability of getting the observed difference by chance alone if the populations were the same”

    No, p-values measure the probability of seeing an effect at least as extreme as the one you see. So there’s an 80% chance of getting that difference or greater.

    p-values can be useful, but they are slippery characters and need very careful thought.

    1. Yeah, noticed that too. However, I’ve seen far worse mischaracterizations of p-values so decided to let it slide.

    2. Yes, that’s what I meant: a difference that large or larger. I apologize for that, but what irks me a bit is how willing people are to jump on me for what is, after all, an error committed in haste (I write these things quickly, as you must realize). Some of the “corrections” aren’t especially polite.

  20. Thanks Jerry and Ed. Nice explanation of p-value.

    It seems to me that 99+% of Americans are statistically illiterate. I can’t tell you how many times I’ve had to hammer this stuff home to people.

    “No, a single sample does not tell you how a population will behave”

    “The average is more significant than the extreme values”

    etc., ad nauseam.

    I will admit to being basically pretty thick about statistics. I had one Probability & Statistics class at university, which was much more focused on probability than on statistics and what they mean — didn’t take.

    I had another one in a previous career life, provided by the company I worked for. Same issues, same result.

    Finally, I had an excellent hands-on statistics course (forget the probability part, except where it is integral to the stats), and at last became statistically literate; I can now apply statistics, explain them, and explain why they matter.

    I often say: “There is no exact anything in the real world, only statistical probabilities” (for which I am sure someone will provide counterexamples).

    I also try to explain to my son that there’s almost never any point in measuring past three significant figures, because almost nothing can be (or at least will be, given the cost) made more precisely than that.

  21. I strongly recommend:
    Greenland S, Poole C. Problems in common interpretations of statistics in scientific articles, expert reports, and testimony. Jurimetrics 2011; 51: 113-29.
    http://www.ph.ucla.edu/epi/faculty/greenland/Epi204/GreenlandPoole2011.InterpretingStats.pdf

    From the blog of Columbia University statistician Andrew Gelman:
    Statistical Significance – Significant Problem?
    http://andrewgelman.com/2015/02/20/statistical-significance-significant-problem/

    I agree with Ed Kroc: Abuse of hypothesis testing does not mean we can do better by pretending that sampling variability doesn’t exist.
    There are better solutions. See for instance:

    Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn: False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant
    Psychological Science, 2011, 22, 1359-1366 DOI: 10.1037/e636412012-001
    http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf

    From the abstract: “we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem [the misuse of hypothesis testing]”.

    Also the treatment of hypothesis testing is excellent in the introductory-level textbook by David Freedman, Robert Pisani and Roger Purves: Statistics. 4th edition, 2007
    http://books.wwnorton.com/books/webad.aspx?id=11597

  22. A quick correction to the penultimate paragraph of Jerry’s piece:

    “Of course even if your samples are really from the same population, and there’s no effect, you’ll still see a “significant” difference 5% of the time even if it just reflects sampling error, so you can draw incorrect conclusions from the statistic. That gave rise to the old maxim in biology, “You can do 20 experiments, and one will be publishable in Nature.” And of course one out of twenty papers you read that report p < 0.05 will be rejecting the null hypothesis (of no difference) erroneously."

    The first part of this is true, immortalized in the xkcd cartoon http://xkcd.com/882/ (already posted, I know, but too good not to repost). The converse though does not necessarily hold; i.e. 1 out of 20 papers that report p < 0.05 will not necessarily be rejecting the null hypothesis erroneously. What we are asking for here is an estimate on the size of the posterior probability of a hypothesis. But this depends on the prior probability of the hypothesis being correct.

    This comes from Bayes' Theorem: that the posterior probability of a testable hypothesis is proportional to the likelihood times the prior,

    Pr(H_0 true | observe extreme test statistic) ~ Pr(H_0 true) * Pr(observe extreme test statistic | H_0 true).

    The second factor on the right hand side is the p-value. The quantity we are interested in now though is the left hand side, the probability that the null hypothesis is actually true.

    On a practical level, if it is reasonable to think that most papers appearing in a reputable biology journal, say, are testing null hypotheses that are expected to be false (i.e. their prior probability Pr(H_0 true) is low), then we can expect that the number that erroneously reject the null is far less than 1 in 20. In the harder sciences, this is most certainly the case. In the social sciences, well, that's debatable. For a field like biology though, this is a good thing, for sure!
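
    To put rough numbers on that argument, here is a minimal sketch in R; the prior and power values are purely illustrative assumptions:

      # Of all "significant" results, what fraction are false positives?
      prior_null <- 0.5    # assumed fraction of tested nulls that are actually true
      alpha      <- 0.05   # false positive rate
      pwr        <- 0.8    # assumed probability of rejecting a false null

      p_reject            <- alpha * prior_null + pwr * (1 - prior_null)
      p_null_given_reject <- alpha * prior_null / p_reject
      p_null_given_reject  # about 0.06 here; it climbs steeply as prior_null approaches 1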

  23. Experimental psychology is in big trouble. With two studies (here and here) finding that over 80% of published findings were likely based on improper statistical practices, it has become questionable whether the field is producing much valid scientific work at all.

    Much of the problem in the field is due to the pervasive practice of “p-hacking”: doing whatever it takes to get a p-value below .05, including dropping inconvenient data points, selectively reporting subgroup analyses, changing the hypothesis to fit the data, suppressing null findings, and stopping data collection if the p-value happens to randomly dip below the sacred cutoff.

    P-values do more harm than good. They tell you the inverse of what you want to know: the probability of the data, given the null, rather than the probability of the null, given the data. They are almost ubiquitously misinterpreted as the posterior probability of the null hypothesis. And, as has been well established, they overstate the evidence against the null hypothesis: a p-value of .05 indicates at very most odds against the null of 2.7:1, which is nowhere near strong enough evidence to declare a finding “true.”

    Presumably, the editors of BASP recognize these problems and are trying to improve the field. But banning all statistical inference in favor of mere descriptive statistics is not the answer. Experimental findings are difficult to interpret without some probabilistic measure of error. And as one commenter at the blog of statistician Andrew Gelman put it, banning p-values will likely replace torturing the data until you get a significant p-value with “tortur[ing] the descriptive stats, tables, and displays until you ‘see’ a ‘finding’.” If researchers can p-hack, they can just as easily graph-hack.

    N.B.: In the first paper linked above, one of the papers we criticized was “Analytical Thinking Promotes Religious Disbelief,” which Jerry had written an article about in 2012.

    1. Agree with a lot of what you wrote but this:

      “But banning all statistical inference in favor of mere descriptive statistics is not the answer.”

      is incorrect.

      In their statement they indicate that while they are banning null hypothesis significance testing and confidence intervals and favoring descriptive statistics, they will sometimes accept Bayesian inference methods:

      “with respect to Bayesian procedures, we reserve the right to make case-by-case judgments, and thus Bayesian procedures are neither required nor banned from BASP.”

      Quite frankly, I think the banning of confidence intervals is an even worse mistake than the banning of p-values.

      1. I was going to write “banning all frequentist inference and sitting on the fence about Bayesian inference,” but I thought it was splitting hairs. I guess no hair is too fine to split on the Internet.

        1. “I guess no hair is too fine to split on the Internet.”

          A recent study found that this was indeed the case (p < 0.05). 🙂

          Actually, I think the distinction is important, but am willing to agree to disagree.

    2. On the other hand, if one has a good theoretical reason to suspect outliers, one should do *something* with them. Millikan, for example, discarded some values in the famous oil-drop experiment. (This is physics, of course, which has more robust theories and background knowledge than psychology, admittedly.)

  24. These exchanges remind of 2 t-shirts I own.
    One has “Statistics. Never having to say you’re certain” silkscreened across the front.
    The other has “When all else fails, manipulate the data”
    We should see a change in this journal’s impact factor.
    Last 5 year average was an “impressive” 1.182, if my source is correct. I think it is, p<0.001 😉

  25. “If you’re not a scientist or statistician, you may want to skip this post”
    I am happy I did not skip it.
    I have great interest in how science in made and found this topic layman-friendly and enlightening.
    While some of the comments contain technical details and terminology which I don’t fully understand, many of them are interesting even for my ignorant kind and I am grateful for that 😉

  26. “In that case we can’t have much confidence that the observed difference is a real one, and so we accept the null hypothesis and reject the ‘alternative hypothesis’.”

    Two common misunderstandings here. You can neither accept the null nor reject the alternative (nor accept the alternative for that matter). The only options under null hypothesis significance testing (NHST) are reject the null and fail to reject the null. The logic of NHST is p(D|H_null). It says nothing about the alternative.

    ” In psychology some journals are more lax, accepting cut-off p’s of 0.1 (10%) or less.”

    I’ve never seen a psychology journal use p = .1 for a cutoff.

    1. The logic of NHST is p(D|H_null). It says nothing about the alternative.

      And therein lies the problem with NHST. P(D|H_0) being small does not rule out P(D|H_1) being even smaller! The logic of p-values fails because it is based on the fallacy that if the probability of the data (or more extreme data) under the null is low, then we should accept the alternative hypothesis (H_1), because, implicitly, the probability of the data (or more extreme data) must be greater under the alternative hypothesis. But that is a flat-out fallacy! No matter how improbable the data are under the null, they can always be (and often are) more improbable under the alternative hypothesis.

      It is this fallacy that Bayesian hypothesis testing avoids, because a Bayes hypothesis test compares P(D|H_0) with P(D|H_1). Thus, if the probability (technically, we should say “likelihood”) of the data under the null is low, but it is even lower under the alternative, then the Bayes test will favor the null over the alternative. Which leads to another disadvantage of NHST, which you mentioned: NHST can never favor the null; Bayes tests can.
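
      A toy numerical illustration of that point, using a point alternative chosen only to keep the arithmetic transparent:

        # Observe z = 2.0 from a standard normal test statistic
        z <- 2.0
        2 * pnorm(-abs(z))   # two-sided p-value of about 0.046: "significant" at alpha = 0.05

        # Now compare likelihoods under a point null (mean 0) and a point alternative at mean 5
        dnorm(z, mean = 0)   # about 0.054
        dnorm(z, mean = 5)   # about 0.004: the data are even less probable under H_1
        dnorm(z, mean = 0) / dnorm(z, mean = 5)   # likelihood ratio of roughly 12 in favour of the null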

        1. And let’s hope that you learn how to comment with civility. I have fixed one thing, but the corrections are in the comments, and I’m clearly not the final arbiter of statistics.

          1. Unless this was also written in haste, how about showing us some evidence of a single psychology journal that uses a p = .1 cutoff. Excluding “failure to replicate” papers, can you support the claim that a peer-reviewed psychology journal publishes papers where the lowest p-value reported is no lower than p = .1? You said “some journals,” so I assume you had in mind more than one. But I think you’ll have a hard time finding even one.

            2. You are a rude person, you know, so after I answer you I want you to go away and be rude at other sites. I distinctly remember reading papers in psychology (or it could have been social science, in which case I erred on the exact field) when I took stats in grad school, and seeing 0.1 used as the significance level.

            Here’s from the Encyclopedia of Survey Research Methods:

            Alpha, Significance Level of Test
            Andrew Noymer

            “Alpha is a threshold value used to judge whether a test statistic is statistically significant. It is chosen by the researcher. Alpha represents an acceptable probability of a Type I error in a statistical test. Because alpha corresponds to a probability, it can range from 0 to 1. In practice, 0.01, 0.05, and 0.1 are the most commonly used values for alpha, representing a 1%, 5%, and 10% chance of a Type I error occurring (i.e. rejecting the null hypothesis when it is in fact correct).”

            From the Wiki of Science:

            In this context, a predetermined significance level can be used as a “cutoff point” for deciding whether the results of a statistical test are improbable enough for the null hypothesis to remain valid. A significance level of ‘0.05’ is conventionally used in the social sciences, although probabilities as high as ‘0.10’ as well as lower probabilities may also be used. Probabilities greater than ‘0.10’ are rarely used.

            I found several other references by Googling.

        2. If you think I was criticizing Jerry, then you didn’t understand my comment. I was criticizing NHST.

          The essential fallacy underlying NHST is that a low p-value implies that we should “accept” the alternative hypothesis. However, as various Bayesian analyses have shown, even when the p-value is low, the data can be more consistent with the null than with the alternative hypothesis.

          This occurs because the alternative hypothesis in NHST is just the negation of the null. Indeed, the alternative is usually that the effect size is anything other than 0, a very broad statement that often includes both plausible and implausible values of the hypothesized effect size.

          Consequently, sometimes when the p-value “rejects” the null, it does so because the effect size observed in the experiment is in the implausible range of the alternative hypothesis. When this happens, the data can be more consistent with the null than with the alternative hypothesis. But since NHST never considers the likelihood of the data under the alternative, it misses these cases and says to reject the null (and, necessarily, accept the alternative).

          In contrast, a Bayes test compares the likelihood of the data under the null with the likelihood of the data under a realistic statement of the alternative. Therefore, a Bayes test will catch paradoxical cases (where the alternative is favored despite the classical p-value being low) and recommend accepting the null over the alternative.

          1. Oops. In the last paragraph I wrote a sentence, but meant its opposite.

            What I wrote:

            Therefore, a Bayes test will catch paradoxical cases (where the alternative is favored despite the classical p-value being low)

            What I meant to write:

            Therefore, a Bayes test will catch paradoxical cases (where the null is favored despite the classical p-value being low)

      1. One of the main advantages of Bayesian methods is that inference and decision making can essentially be combined into a single process. This is an attractive feature, for sure. But it is entirely possible (and indeed, has been successfully performed for a century) to perform sound statistical decision making within a frequentist framework. The complication lies in the fact that making an inference alone is not sufficient to make a sound decision.

        It is of course not good practice to make decisions based solely on whether or not a null hypothesis has been rejected, and I don’t think anyone is advocating for that here. What is being argued is that it is worse than useless to dismiss frequentist inferential procedures as irrelevant to sound decision making. Bayesian methods are often another way to go, but decision making within a Bayesian framework suffers from many of the same difficulties as decision making in a more classical one.

    2. Fisher argued against “accepting the null,” but Neyman-Pearson do “accept the null” and make it clear (at least in some writings) that accepting the null does not mean one believes the null hypothesis to be true. Rather, accepting the null means only to decide on action A, which action may be “do nothing.” So regardless, it would be wrong to find P > alpha and claim something like “we find no difference in mating behavior” (or whatever one is investigating). This is a hugely common mistake in the literature and one I’m sure I’ve made many times.

  27. Weird. If anybody who fears stats and probability wants to see how the null hypothesis significance test works, how easily it is misinterpreted, and the subtle difference in wording between the question it does answer and the question it doesn’t (the two are often confused), I hope this link will help: adnausi.ca/post/12640080262

    I agree. The NHST itself is perfectly valid, if used correctly. It tells you the likelihood that “something” is going on vs nothing. Yes, it gets used horribly wrong often, and the “significance” threshold is arbitrary and perhaps should be eliminated; indeed, alternative approaches have been proposed. (Steven Pinker has been a promoter of such changes, for instance.) However, I don’t see that doing away with useful information, such as the odds of the results being random (no effect), is an improvement. I’d be happy enough just to leave the calculations alone and throw away the thresholds. Science is really all about the probability of something being true or false anyway, not the binary conclusion of true or false.

    1. “However, I don’t see that doing away with useful information, such as the odds of the results being random (no effect), is an improvement.”

      That sums up the reason for continuing to use p-values. (IMHO)

      1. Except that in biology, this is usually not useful information. It is more likely to mislead than inform. In many applications of NHST the null hypothesis cannot be exactly true (for example, the difference in some parameter between one group and another is almost certainly never going to be exactly zero, to ten decimal places). So a result that shows that this difference is not exactly zero, with p<<.001, is still not telling us anything we didn't already know before we did the experiment.

        Again, what we really want to know is the size of the difference, and our uncertainty in our estimate of that size. This gives us everything that p-values give us, and much more.
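
        A quick simulation of the kind of situation being described (a sketch; the numbers are arbitrary):

          set.seed(1)
          n <- 200000                             # huge samples
          a <- rnorm(n, mean = 100.0, sd = 15)
          b <- rnorm(n, mean = 100.3, sd = 15)    # a "real" but practically trivial difference

          tt <- t.test(a, b)
          tt$p.value    # typically far below 0.001: the point null of exactly zero difference is rejected
          tt$conf.int   # but the estimated difference is about 0.3 units on a scale where the SD is 15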

  28. Here is an instructive video that demonstrates, using a computer simulation complete with graphs and musical notes, that p values are more or less totally unreliable.

    https://www.youtube.com/watch?v=ez4DgdurRPg

    There is also the problem of what has become known as p-hacking, mentioned by other posters here and detailed in this paper:

    http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf

    It is also worth noting that the journal that banned NHST initially made it optional, but found that an outright ban was the only way to deal with this unhealthy emphasis on the unreliable p-value.

    1. p-values are not the only measures in question here; BASP has eliminated *all frequentist inferential measures*, including confidence intervals and test statistics. Again, p-values are not the problem when properly understood. It is of course entirely *improper* to think that a p-value alone is sufficient to make a sound statistical decision, and that is one of the real problems here. p-values can be useful pieces of the decision making process, but they are never sufficient.

    2. What that video really shows is that power of 0.5 is unreliable. It means that you have a 50-50 chance of getting a significant result, which is pretty much the definition of unreliable.
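
      That is easy to check by simulation; a minimal sketch in R:

        set.seed(123)
        # sample size giving roughly 50% power for a medium effect (d = 0.5)
        n <- ceiling(power.t.test(power = 0.5, delta = 0.5, sd = 1,
                                  sig.level = 0.05)$n)    # about 32 per group

        pvals <- replicate(10000, t.test(rnorm(n, 0), rnorm(n, 0.5))$p.value)
        mean(pvals < 0.05)   # close to 0.5: identical experiments "succeed" only about half the time
        hist(pvals)          # and the p-values themselves are scattered all over the place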

  29. Speaking of psychology…the most recent faking-data scandal I remember reading about was the one written up in the NYT Sunday magazine involving a Dutch social psychologist.
