Reader Ed Kroc sent an email about a strange development in scientific publishing—the complete elimination of “p” (probability) values in a big psychology journal. If you’re not a scientist or statistician, you may want to skip this post, but I think it’s important, and perhaps the harbinger of a bad trend in the field.
Before I present Ed’s email in its entirety, let me say a word (actually a lot of words) about “p values.” These probabilities derive from experimental or observational tests of a “null hypothesis”— i.e., that an experimental treatment does not have an effect, or that two sample populations do not differ in some way. For example, suppose I want to see if rearing flies on different foods, say cornmeal versus yeast, affects their mating behavior. The null hypothesis is that there is no effect on mating behavior. I then observe the behavior of 50 pairs of flies raised on each food, and find that 45 pairs of the cornmeal flies mate within an hour, but only 37 pairs of the yeast flies do.
That looks different, but is it really? Suppose both kinds of flies really have equal propensities to mate, and the difference we see is just “sampling error”—something that could be due to chance alone. After all, if we toss a coin 10 times and then repeat the whole set of tosses, perhaps the first time we’ll see 7 heads and the second time only 4. That difference is surely due to chance, because we’re using the same coin. Could that be the case for the flies?
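Here’s a quick illustration of that kind of chance variation (a simulation sketch of my own, not part of the original example): toss one fair coin in sets of ten, over and over, and look at how the head counts bounce around.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toss the very same fair coin in 20 sets of 10 tosses each and record
# the number of heads in each set.  The coin never changes, yet the
# counts wander; gaps like 7 heads vs. 4 heads arise by chance alone.
heads_per_set = rng.binomial(n=10, p=0.5, size=20)
print(heads_per_set)
```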
It turns out that one can use statistics to calculate how often we’d see a given difference (due to sampling error) if the two populations were really the same. What we get is a “p” value: the probability that we’d see a difference at least as big as the one we observed if the populations were really the same. The higher the p value, the more consistent our data are with the two populations being identical, with the observed difference reflecting nothing more than sampling error. For example, if the p value were 0.8, that means there’s an 80% probability of getting the observed difference—or one that’s larger—by chance alone if the populations were the same. In that case we can’t have much confidence that the observed difference is a real one, and so we retain the null hypothesis rather than the “alternative hypothesis”—in our case that the kind of food experienced by a fly really does affect its behavior. But when a p value is small, say 0.01 (a 1% chance that we’d see a difference that big or bigger resulting from chance alone), we can have more confidence that there really is a difference between the sampled populations.
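To make that definition concrete, here is a minimal simulation sketch (mine, not part of the original example) for the fly counts above. It assumes the null hypothesis is true, i.e., that both groups of flies share one common mating probability (estimated by pooling the two samples), replays the experiment many times, and asks how often chance alone produces a gap of eight or more mating pairs. That proportion approximates the p value.

```python
import numpy as np

rng = np.random.default_rng(1)

n_pairs = 50               # pairs observed per food treatment
observed_gap = 45 - 37     # cornmeal maters minus yeast maters

# Under the null hypothesis both foods give the same mating probability;
# the pooled estimate from the combined data is (45 + 37) / 100.
p_null = (45 + 37) / (2 * n_pairs)

n_sims = 200_000
cornmeal = rng.binomial(n_pairs, p_null, size=n_sims)
yeast = rng.binomial(n_pairs, p_null, size=n_sims)

# Two-sided p value: how often does chance alone produce a gap at least
# as large as the one observed, in either direction?
p_value = np.mean(np.abs(cornmeal - yeast) >= observed_gap)
print(f"simulated p value: {p_value:.3f}")
```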
There’s a convention in biology that when the p value is lower than 5% (0.05), meaning that an observed difference that big or bigger would occur less than 5% of the time if the populations really were the same, we consider it statistically significant. That means that you’re entitled by convention to say that the populations really are different—and thus can publish a paper saying so. In the case above, the p value is 0.07, which is above the threshold, and so I couldn’t say in a paper that the differences were significant (remember, we mean statistically significant, not biologically significant). There are various statistical tests one can use to compare samples to each other (you can do this not just with two samples but with multiple ones), and most of these take into account not just the average values or observed numbers, but also, in the case of measurements, the variation among individuals. In the test of two fly samples above, I used the “chi-square” test to get the probabilities.
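If you want to check the fly numbers yourself, here is a short sketch using scipy (my choice of tool; I’m assuming the standard 2×2 chi-square test with the Yates continuity correction, which reproduces the roughly 0.07 figure quoted above; without the correction the p value comes out a bit lower).

```python
from scipy.stats import chi2_contingency

# Rows: food treatment; columns: mated within an hour vs. did not.
table = [[45, 5],    # cornmeal: 45 of 50 pairs mated
         [37, 13]]   # yeast:    37 of 50 pairs mated

# scipy applies the Yates continuity correction to 2x2 tables by default.
chi2, p, dof, expected = chi2_contingency(table)
print(f"with continuity correction:    chi2 = {chi2:.2f}, p = {p:.3f}")  # p ~ 0.07

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"without continuity correction: chi2 = {chi2:.2f}, p = {p:.3f}")  # p ~ 0.04
```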
Of course, even if your samples really are from the same population and there’s no effect, you’ll still see a “significant” difference about 5% of the time purely through sampling error, so you can draw incorrect conclusions from the statistic. That gave rise to the old maxim in biology, “You can do 20 experiments, and one will be publishable in Nature.” And if the null hypothesis were true in every study, about one out of twenty papers reporting p < 0.05 would be rejecting that hypothesis (of no difference) erroneously.
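Here’s a quick simulation of that (a sketch of my own, using made-up data drawn from a single normal population): run many “experiments” in which both samples come from exactly the same population, test each one at the 0.05 level, and roughly one in twenty comes out “significant” even though nothing real is going on.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

n_experiments = 10_000
false_positives = 0
for _ in range(n_experiments):
    # Both samples are drawn from the identical population: no real effect.
    a = rng.normal(loc=0.0, scale=1.0, size=50)
    b = rng.normal(loc=0.0, scale=1.0, size=50)
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

# The fraction of "significant" results should hover around 0.05.
print(f"fraction significant at p < 0.05: {false_positives / n_experiments:.3f}")
```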
I should note that the cut-off probabilities differ among fields. Physicists are more rigorous, and only accept p values of much less than 0.001 as significant (as they did when detecting the Higgs boson). In psychology some journals are more lax, accepting cut-off p’s of 0.1 (10%) or less. All of these numbers are of course arbitrary conventions, and some have suggested that we not use cut-off values to determine whether a result is “real”, but simply present the probabilities and let the reader judge for herself. I don’t disagree with that. But, according to statistician Ed Kroc, one journal has gone further, banning the reporting of p values altogether! I think that’s a mistake, for then one has no way to judge how readily an observed difference could have arisen by chance alone. Ed agrees, and reports the situation below:
*******
by Ed Kroc
I wanted to pass this along in case no one else has yet, as it could be of interest to you, as well as to anyone who has the occasion to use statistics. Apparently, the psychology journal Basic and Applied Social Psychology just banned the use of null hypothesis significance testing; see the editorial here.
As a statistician myself, I naturally have a lot to say about such a move, but I’ll limit myself to a few key points.
First, this type of action really underlines how little many people understand common statistical procedures and concepts, even those who use them on a regular basis and presumably have some minimal level of training in said usage. I appreciate the editors trying to address the very real problem of seeing statistical decision making reduced to checking whether or not a p-value crosses an arbitrary threshold, but their approach of banning the use of p-values and their closest kin just proves that they don’t fully understand the problem they are trying to address. p-values are not the problem. Misuse and misinterpretation of what p-values mean are the real problems, as is the insistence by most editorial boards that publishable applied research must include these quantities calculated to within a certain arbitrary range.
The manipulation of data and methods by researchers to attain an arbitrary 0.05 cutoff, the effective elimination of negative results by only publishing results deemed “statistically significant”, the lack of modelling, and the lack of proper statistical decision making are all real problems within applied science today. Banning the usage of (frequentist) inferential methods does nothing to address these things. It’s like saying not enough people understand fractions, so we’re just going to get rid of division to address the problem.
Alarmingly, the editors say “the null hypothesis testing procedure is invalid”. What? No caveats? That’s news to me. Invalid under what rubric? They never say.
Interestingly, they no longer require any inferential statistics to appear in an article. I don’t actually categorically disagree with that policy—in fact, I think some research could be improved by including fewer inferential procedures—but their justification for it is ludicrous: “because the state of the art remains uncertain”. Well, then we should all stop doing any kind of science I guess. Who is practicing the state of the art anywhere? And who gets to decide what is or is not state of the art?
Finally, the editors say this:
“BASP will require strong descriptive statistics, including effects sizes. We also encourage the presentation of frequency or distributional data when this is feasible. Finally, we encourage the use of larger sample sizes. . . because as the sample size increases, descriptive statistics become increasingly stable and sampling error less of a problem.”
First off, no, as sample size increases, sampling error does not necessarily become less of a problem: that’s true only if your sampling procedure is perfectly correct to begin with, something that is likely never to be the case in an experimental psychology setting. More importantly, they basically admit here that they only want to see descriptive statistics [means, variances, etc.] and they don’t need to know any statistics the discipline doesn’t understand. Effect sizes and frequency distributions? p-values are still sitting behind all of those, whether they’re calculated or not; they are just comparative measures of these things accounting for uncertainty. The editors seem to be replacing the p-value measure with the “eyeball measure”, effectively removing any quantification of the uncertainty in the experiments or random processes under consideration. A bit misguided, in my opinion.
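To see the first point concretely, here is a toy sketch with made-up numbers, purely for illustration: when the sampling procedure is biased, say because the people who volunteer differ systematically from the population you care about, a bigger sample shrinks the random scatter but leaves the bias untouched, so the estimate converges ever more confidently on the wrong value.

```python
import numpy as np

rng = np.random.default_rng(3)

true_mean = 0.0   # the quantity we actually want to estimate
bias = 0.3        # systematic offset introduced by a flawed sampling procedure

for n in (25, 100, 1_000, 10_000):
    # Every sampled value carries the same selection bias, no matter how
    # many respondents we collect.
    sample = rng.normal(loc=true_mean + bias, scale=1.0, size=n)
    std_err = sample.std(ddof=1) / np.sqrt(n)
    print(f"n = {n:>6}: estimate = {sample.mean():+.3f} ± {std_err:.3f}  "
          f"(true value = {true_mean})")
# The ± term shrinks as n grows, but the estimate stays near 0.3, not 0.0:
# more data makes a biased design more precise, not more correct.
```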
I could go on—in particular, about their comments on Bayesian methods—but I’ll spare you any more of my own editorializing. Part of me wonders if this move is a bit of a publicity stunt for the journal. I know nothing about psychology journals or how popular this one is, but it seems like this type of move would certainly generate a lot of attention. I do hope though that other journals will not follow suit.