# The New York Times might have used Bayes’ Theorem

January 4, 2022 • 10:00 am

by Greg Mayer

The New York Times has a data analysis division which they call The Upshot; I think they created it to compensate for the loss of Nate Silver’s 538, which was once hosted by the Times. The Upshot reporters and analysts tend to be policy wonks with some statistical savvy, so I took note of a big story they had on page 1 of Sunday’s (2 January) paper on why many prenatal tests “are usually wrong.”

The upshot, if you will, of the story is that many prenatal tests for rare chromosomal disorders unnecessarily alarm prospective parents because, even if the test result is positive, it is unlikely that the fetus actually has the disease. This is because when a disease is rare most positives are false positives, even when the test is quite accurate. For the five syndromes analyzed by the Times, the proportion of false positives (i.e. “wrong” results) ranged from 80% to 93%!

The Times does not go into detail about how they got those figures, but from links in their footnotes, I think they are empirical estimates, based on studies that did more conclusive follow-up testing of individuals who tested positive. My first thought, when looking at Sunday’s paper itself (which of course doesn’t have links!), was that they had used Bayes’ Theorem, the manufacturers’ stated sensitivity and specificity for their tests (the two components of a test’s accuracy), and the known prevalence of the condition to calculate the false positive rate.

Bayes’ Theorem is an important result in probability theory, first derived by the Rev. Thomas Bayes and published posthumously in 1763. There is controversy over the school of statistical inference known as Bayesian statistics; the controversy concerns how one can form a “prior probability distribution”. But in this case we have an empirically derived prior distribution, the prevalence, which can be thought of as the probability that an individual drawn at random from the population (in which the prevalence is known, or well-estimated) has the condition. There is thus no controversy over applying Bayes’ Theorem to disease diagnosis when the prevalence of the condition is known, as in the cases at hand.

Here’s how it works. (Remember, though, that I think the Times used empirical estimates of the rate, not this type of calculation.)

Using Bayes’ Theorem, we can say that the probability of having a disease (D) given a positive test result (+) depends on the sensitivity of the test (= the probability of a positive result given an individual has the disease, P(+∣D)), the specificity of the test (= the probability of a negative result given an individual does not have the disease, P(-∣ not D)), and the prevalence of the disease (= the probability that a random individual has the disease, P(D)). Formally,

P(D∣+) = P(+∣D)⋅P(D)/P(+)

where the terms are as defined above, and P(+) is the probability of a random individual testing positive. This is given by the sensitivity times the prevalence plus (1 − the specificity) times (1 − the prevalence), or

P(+) = P(+∣D)⋅P(D) + P(+∣ not D)⋅(1-P(D))

The whole thing in words can be put as

probability you are ill given a positive test =

sensitivity⋅prevalence/[sensitivity⋅prevalence + (1-specificity)⋅(1-prevalence)]

Let’s suppose we have a sensitive test, say P(+∣D)=.95, which is also quite specific, say P(-∣ not D)=.95 (sensitivity and specificity need not be equal; this is only a hypothetical), and a low prevalence, say P(D)=.01. Then

probability you are ill given a positive test =

= (.95)(.01)/[(.95)(.01)+(.05)(.99)]

= .16.

Thus, if you had a positive test, 84% of the time it would be “wrong”! This is right in the neighborhood of the rates found by the Times for the five conditions they examined. Notice that in this example, both sensitivity and specificity are high (which is good– you want both of these to be near the maximum of 1.0 if possible), but because prevalence is low (.01), the test is still usually “wrong”.
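The arithmetic above can be checked in a few lines of Python (a sketch; the function name and structure are mine, the numbers are the hypothetical ones from the text):

```python
def posterior_given_positive(sensitivity, specificity, prevalence):
    """P(D | +) via Bayes' Theorem for a binary diagnostic test."""
    true_pos = sensitivity * prevalence               # P(+|D) * P(D)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(+|not D) * P(not D)
    return true_pos / (true_pos + false_pos)

# The hypothetical test from the text: 95% sensitive, 95% specific, 1% prevalence.
p = posterior_given_positive(0.95, 0.95, 0.01)
print(round(p, 2))      # 0.16 -- probability of disease given a positive test
print(round(1 - p, 2))  # 0.84 -- share of positive results that are "wrong"
```

Plugging in other prevalences shows how quickly the test becomes trustworthy as the condition gets more common: at a prevalence of .20, the same test gives a posterior above .80.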

In an earlier discussion of Bayes’ Theorem, Jerry noted:

This [tests for rare conditions being usually wrong] is a common and counterintuitive result that could be of practical use to those of you who get a positive test. Such tests almost always mandate re-testing!

He’s absolutely right. A test with these properties is useful for screening, but not for diagnosis– you’d usually want to get a more definitive test before making any irreversible medical decisions. (For COVID-19, for example, PCR tests are more definitive than the quicker antigen tests.) The Times also discusses some of the unsavory aspects of the marketing of these tests, and the tragedy of the truly life-and-death decisions that can ensue, all of which flow from the results of the tests being misunderstood.

(Note: an alert reader spotted a mistake in the verbal equation, and in checking on it I spotted another in one of the symbolic equations. Both corrections have now been made, which are in bold above. The numerical result was not affected, as I’d used the correct numbers for the calculation, even though my verbal expression of them was wrong!)

For a nice but brief discussion, with some mathematical details, of the application of Bayes’ theorem to diagnosis, see sections 1.1-1.3 of Richard M. Royall’s Statistical Evidence: A Likelihood Paradigm (Chapman & Hall, London, 1997). Royall is not a Bayesian, which demonstrates the uncontroversial nature of the application of Bayes’ Theorem to diagnosis.

# Further thoughts on the Rev. Bayes

April 19, 2015 • 11:37 am

by Greg Mayer

I (and Jerry) have been quite pleased by the reaction to my post on “Why I am not a Bayesian”. Despite being “wonkish”, it has generated quite a bit of interesting, high level, and, I hope, productive discussion. It’s been, as Diane G. put it, “like a graduate seminar.” I’ve made a few forays into the comments myself, but have not responded to all, or even the most interesting, comments– I had a student’s doctoral dissertation defense to attend to the day the post went up, plus I’m not sure that having the writer weigh in on every point is the best way to advance the discussion. But I do have a few general observations to make, and do so here.

First, I did not lay out in my post what the likelihood approach was, only giving references to key literature. No approach is without difficulties and conundrums, and I’m looking forward to finding the reader-recommended paper “Why I am not a likelihoodist.”  Among the most significant problems facing a likelihood approach are those of ‘nuisance’ parameters (probability models often include quantities that must be estimated in order to use the model, but in which you’re not really interested; there are Bayesian ways of dealing with these that are quite attractive), and of how to incorporate model simplicity into inference. My own view of statistical inference is that we are torn between two desiderata: to find a model that fits the data, yet retains sufficient generality to be applicable to a wider range of phenomena than just the data observed. It is always possible to have a model of perfect fit by simply having the model restate the data. In the limit, you could have the hypothesis that an omnipotent power has arranged all phenomena always and everywhere to be exactly as it wants, which hypothesis would have a likelihood of one (the highest it can be). But such an hypothesis contains within it an exact description of all phenomena always and everywhere, and thus has minimal generality or simplicity. There are various suggestions on how to make the tradeoff between fit (maximizing the likelihood of the model) and simplicity (minimizing the number of parameters in the model),  and I don’t have the solution as to how to do it (the Akaike Information Criterion is an increasingly popular approach to doing so).
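The fit-versus-simplicity tradeoff can be made concrete with a small sketch. The toy data and the Gaussian least-squares form of AIC used here (n·ln(RSS/n) + 2k, where k is the number of parameters) are my own illustrative choices, not anything from the discussion above:

```python
import math

# Toy data with a roughly linear trend (illustrative numbers only).
xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.1, 2.9, 4.2, 5.0, 5.9]
n = len(xs)

def rss_constant(ys):
    """Residual sum of squares for the one-parameter 'mean only' model."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def rss_line(xs, ys):
    """Residual sum of squares for a least-squares line (two parameters)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def aic(rss, n, k):
    """Gaussian least-squares AIC: a fit term plus a 2k complexity penalty."""
    return n * math.log(rss / n) + 2 * k

aic_const = aic(rss_constant(ys), n, 1)
aic_line = aic(rss_line(xs, ys), n, 2)

# For trending data, the line wins despite its extra parameter:
print(aic_line < aic_const)
```

The penalty term is what keeps the criterion from always preferring the model that merely restates the data: adding parameters always lowers RSS, but each one costs 2 units of AIC.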

Second, there are several approaches to statistical inference (not just two, or even just one, as some have said), and they differ in their logical basis and what inferences they think possible or desirable. (I mentioned likelihood, Fisherian, Neyman-Pearson, Bayesian, and textbook hodge-podge approaches in my post, and that’s not exhaustive.) But it is nonetheless the case that the various approaches often arrive at the same general (and sometimes specific) conclusion in any particular inferential analysis. Discussion often centers on cases where they differ, but this shouldn’t obscure the at times broad agreement among them. As Tony Edwards, one of the chief promoters of likelihood, has noted, the usual procedures usually lead to reasonable results, otherwise we would have been forced to give up on them and reform statistical inference long ago. One of the remarks I did make in the comments is that most scientists are pragmatists, and they use the inferential methods that are available to them, address the questions they are interested in, and give reasonable results, without too much concern for what’s going on “under the hood” of the method. So, few scientists are strict Bayesians, Fisherians, or whatever– they are opportunistic Bayesians, Fisherians, or whatever.

Third, one of the differences between Bayesian and likelihood approaches that I would reiterate is that Bayesianism is more ambitious– it wants to supply a quantitative answer (a probability) to the question “What should I believe?” (or accept). Likelihoodism is concerned with “What do the data say?”, a less complete question that leads to less complete answers. It’s not that likelihoodists (or Fisherians) don’t think the further questions are interesting, but just that they don’t think they can be answered in an algorithmic fashion leading to a numerical result (unless, of course, there is a valid objective prior). Once you have a likelihood result, further considerations enter into our inferential reasoning, such as

There is good reason to doubt a proposition if it conflicts with other propositions we have good reason to believe; and

The more background information a proposition conflicts with, the more reason there is to doubt it.

(from a list I posted of principles of scientific reasoning taken from How to Think about Weird Things). Bayesians turn these considerations into a prior probability; non-Bayesians don’t.

Fourth, a number of Bayesian readers have brought attention to the development of prior probability distributions that do properly represent ignorance– uninformative priors. This is the first of the ways forward for Bayesianism that I mentioned in my original post (“First, try really hard to find an objective way of portraying ignorance.”). I should mention in this regard that someone who did a lot of good work in this area was Sir Harold Jeffreys, whose Theory of Probability is essential, and which I probably should have included in my “Further Reading” list (I was trying not to make the list too long). His book is not, as the title would suggest, an exposition of the mathematical theory of probability, but an attempt to build a complete account of scientific inference from philosophical and statistical fundamentals. Jeffreys (a Bayesian) was well-regarded by all, including Fisher (a Fisherian, who despite, or perhaps because of, his brilliance got along with scarcely anyone). These priors have left some unconvinced, but it’s certainly a worthy avenue of pursuit.

Finally, a number of readers have raised a more philosophical objection to Bayesianism, one which I had included a brief mention of in a draft of my OP, but deleted in the interest of brevity and simplicity. The objection is that scientific hypotheses are not, in general, the sorts of things that have probabilities attached to them. Along with the above-mentioned readers, we may question whether scientific hypotheses may usefully be regarded as drawn from an urn full of hypotheses, some proportion of which are true. As Edwards (1992) put it, “I believe that the axioms of probability are not relevant to the measurement of the truth of propositions unless the propositions may be regarded as having been generated by a chance set-up.” As reader Keith Douglas put it, “no randomness, no probability”. Even in the cases where we do have a valid objective prior probability, as in the medical diagnosis case, it’s not so much that I’m saying the patient has a 16% chance of having the disease (he either does or doesn’t have it), but rather that individuals drawn at random from the same statistical population in which the patient is situated (i.e. from the same general population and showing positive on this test) would have the disease 16% of the time.

If we can array our commitments to schools of inference along an axis from strict to opportunistic, I am nearer the opportunistic pole, but do find the likelihood approach the most promising, and most worth developing further towards resolving its anomalies and problems (which all approaches, to greater or lesser degrees, suffer from).

Edwards, A.W.F. 1992. Likelihood. Expanded edition. Johns Hopkins University Press, Baltimore.

Jeffreys, H. 1961. Theory of Probability. 3rd edition. Oxford University Press, Oxford.

Schick, T. and L. Vaughn. 2014. How to Think About Weird Things: Critical Thinking for a New Age. 7th ed. McGraw-Hill, New York.

# Why I am not a Bayesian*

April 16, 2015 • 8:45 am

JAC: Today Greg contributes his opinion on the use of Bayesian inference in statistics. I know that many—perhaps most—readers aren’t familiar with this, but it’s of interest to those who are. Further, lots of secular bloggers either write about or use Bayesian inference, as when inferring the probability that Jesus existed given the scanty data. (Theists use it too, sometimes to calculate the probability that God exists given some observations, like the supposed fine-tuning of the Universe’s physical constants.)

When I warned Greg about the difficulty some readers might have, he replied that, “I tried to keep it simple, but it is, as Paul Krugman says about some of his posts, ‘wonkish’.” So wonkish we shall have!

___________

by Greg Mayer

Last month, in a post by Jerry about Tanya Luhrmann’s alleged supernatural experiences, I used a Bayesian argument to critique her claims, remarking parenthetically that I am not a Bayesian. A couple of readers asked me why I wasn’t a Bayesian, and I promised to reply more fully later. So, here goes; it is, as Paul Krugman says, “wonkish”.

Approaches to inference

I studied statistics as an undergraduate and graduate student with some of the luminaries in the field, used statistics, and helped people with statistics; but it wasn’t until I began teaching the subject that I really thought about the logical basis of the subject. Trying to explain to students why we were doing what we were doing forced me to explain it to myself. And, I wasn’t happy with some of those explanations. So, I began looking more deeply into the logic of statistical inference. Influenced strongly by the writings of Ian Hacking, Richard Royall, and especially the geneticist A.W.F. Edwards, I’ve come to adopt a version of the likelihood approach. The likelihood approach takes it that the goal of statistical inference is the same as that of scientific inference, and that the operationalization of this goal is to treat our observations as data bearing upon the adequacy of our theories. Not all approaches to statistical inference share this goal. Some are more modest, and some are more ambitious.

The more modest approach to statistical inference is that of Jerzy Neyman and Egon Pearson. In the Neyman-Pearson approach, one is concerned to adopt rules of behavior that minimize one’s mistakes. For example, buying a mega-pack of paper towels at Sam’s Club, and then finding that they are of unacceptably low quality, would be a mistake. They define two sorts of errors that might occur in making decisions, and see statistics as a way of reducing one’s decision making error rates. Although they, and especially Neyman, made some quite grandiose claims for their views, the whole approach seems rather despairing to me: having given up on any attempt to obtain knowledge about the world, they settle for a clean, well-lighted place, or at least one in which the light bulbs usually work. While their approach makes perfect sense in the context of industrial quality control, it is not a suitable basis for scientific inference (which, indeed, Neyman thought was not possible).

The approach of R.A. Fisher, founder of modern statistics and evolutionary theory, shares with the likelihood approach the goal of treating our observations as data bearing upon the adequacy of our theories, and the two approaches also share many statistical procedures, but differ most notably on the issue of significance testing (i.e., those “p” values you often see in scientific papers, or commentaries upon them). What is actually taught and practiced by most scientists today is a hodge-podge of the Neyman-Pearson and Fisherian approaches. Much of the language and theory of Neyman-Pearson is used (e.g., types of errors), but, since few or no scientists actually want to do what Neyman and Pearson wanted to do, current statistical practice is suffused with an evidential interpretation quite congenial to Fisher, but foreign to the Neyman-Pearson approach.

Bayesianism, like the Fisherian and likelihood approaches, also sees our observations as data bearing upon the adequacy of our theories, but is more ambitious in wanting to have a formal, quantitative method for integrating what we learn from observation with everything else we know or believe, in order to come up with a single numerical measure of rational belief in propositions.

So, what is Bayesianism?

The Rev. Thomas Bayes was an 18th century English Nonconformist minister. His “An Essay Towards Solving a Problem in the Doctrine of Chances” was published in 1763, two years after his death. In the Essay, Bayes proved the famous theorem that now bears his name. The theorem is a useful, important, and nonproblematic result in probability theory. In modern notation, it states

P(H∣D) = [P(D∣H)⋅P(H)]/P(D).

In words, the probability P of an hypothesis H in the light of data D is equal to the probability of the data if the hypothesis were true (called the hypothesis’s likelihood) times the probability of the hypothesis prior to obtaining data D, with the product divided by the unconditional probability of the data (for any given problem, this would be a constant). Ignoring the constant in the denominator, P(D), we can say that the posterior probability, P(H∣D), (the probability of the hypothesis after we see the data), is proportional to the likelihood of the hypothesis in light of the data, P(D∣H), (the probability of the data if the hypothesis were true), times the prior probability, P(H), (the probability we gave to the hypothesis before we saw the data).

The theorem has many uncontroversial applications in fields such as genetics and medical diagnosis. These applications may be thought of as two-stage experiments, in which an initial experiment (or background set of observations) establishes probabilities for each of a set of exhaustive and mutually exclusive hypotheses, while the results of a second experiment (or set of observations), providing data D, are used to reevaluate the probabilities of the hypotheses. Thus, knowing something about the grandparents of a set of offspring may influence my evaluation of genetic hypotheses concerning the offspring. Or, in making a diagnosis, I may include in my calculations the known prevalence of a disease in the population, as well as the test results on a particular patient. For example, suppose a 95% accurate test for disease X is positive (+) for a patient, and the disease X is known to occur in 1% of the population. Then, by Bayes’ Theorem

P(X∣+) = P(+∣X)⋅P(X)/P(+)

= (.95)(.01)/[(.95)(.01)+(.05)(.99)]

= .16.

The probability that the patient has the disease is thus 16%. Note that despite the positive result on a pretty accurate test, the odds are more than four to one against the patient actually having condition X. This is because, since the disease is quite rare, most of the positive tests are false positives. [JAC: This is a common and counterintuitive result that could be of practical use to those of you who get a positive test. Such tests almost always mandate re-testing!]
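The same diagnosis calculation can also be sketched in odds form, a standard reformulation of Bayes’ Theorem (my framing, not used in the post): posterior odds equal the likelihood ratio times the prior odds.

```python
# Odds form of Bayes' Theorem: posterior odds = likelihood ratio * prior odds.
likelihood_ratio = 0.95 / 0.05   # P(+|X) / P(+|not X) = 19
prior_odds = 0.01 / 0.99         # P(X) / P(not X), about 1 to 99

posterior_odds = likelihood_ratio * prior_odds   # about 0.19, i.e. odds against
posterior_prob = posterior_odds / (1 + posterior_odds)

print(round(posterior_prob, 2))  # 0.16, agreeing with the calculation above
```

The odds form makes the intuition plain: a 19-fold boost from a positive test cannot overcome prior odds of 99 to 1 against.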

So what could be controversial? Well, what if there is no first stage experiment or background knowledge which gives a probability distribution to the hypotheses? Bayes proposed what is known as Bayes’ Postulate: in the absence of prior information, each of the specifiable hypotheses should be accorded equal probability, or, for a continuum of hypotheses, a uniform distribution of probabilities. Bayes’ Postulate is an attempt to specify a probability distribution for ignorance. Thus, if I am studying the relative frequency of some event (which must range from 0 to 1), Bayes’ Postulate says I should assign a probability of .5 to the hypothesis that the event has a frequency greater than .5, and that the hypothesis that the frequency of the event falls between .25 and .40 should be given a probability of .15, and so on. But is Bayes’ Postulate a good idea?

Problems with Bayes’ Postulate

Let’s look at a simple genetic example: a gene with two alleles (forms) at a locus (say alleles A and a). The two alleles have frequencies p + q = 1, and, if there are no evolutionary forces acting on the population and mating is at random, then the three genotypes (AA, Aa, and aa) will have the frequencies p², 2pq and q², respectively. If I am addressing the frequency of allele a, and I am a Bayesian, then I assign equal prior probability to all possible values of q, so

P(q>.5) = .5

But this implies that the frequency of the aa genotype has a non-uniform prior probability distribution

P(q²>.25) = .5.

My ignorance concerning q has become rather definite knowledge concerning q² (which, if there is genetic dominance at the locus, would be the frequency of recessive homozygotes; as in Mendel’s short pea plants, this is a very common way in which we observe the data). This apparent conversion of ‘ignorance’ to ‘knowledge’ will be generally so: prior probabilities are not invariant to parameter transformation (in this case, the transformation is the squaring of q). And even more generally, there will be no unique, objective distribution for ignorance. Lacking a genuine prior distribution (which we do have in the diagnosis example above), reasonable men may disagree on how to represent their ignorance. As Royall (1997) put it, “pure ignorance cannot be represented by a probability distribution”.
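A quick Monte Carlo sketch (my own construction) makes the non-invariance concrete: a uniform prior on q puts half its mass below q² = .25, where a genuinely uniform prior on q² would put only a quarter of its mass there.

```python
import random

random.seed(1)

# 'Ignorance' prior: allele frequency q drawn uniformly from [0, 1].
qs = [random.random() for _ in range(100_000)]

# Half the prior mass on q lies below .5, so half the implied mass on q**2
# (the recessive-homozygote frequency) lies below .25 ...
frac_q2_below_quarter = sum(q ** 2 < 0.25 for q in qs) / len(qs)

# ... whereas a uniform prior on q**2 would put only a quarter of its mass
# below .25. Uniform 'ignorance' about q is definite, non-uniform
# 'knowledge' about q**2.
print(round(frac_q2_below_quarter, 2))  # close to 0.5, not 0.25
```

The simulation is just the squaring transformation in the text made visible: the prior distribution changes shape when the parameter is transformed.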

Bayesian inference

Bayesians proceed by using Bayes’ Postulate as a starting point, and then update their beliefs by using Bayes’ Theorem:

Posterior probability ∝ Likelihood × Prior probability

which can also be given as

Posterior opinion ∝ Likelihood × Prior opinion.

The appeal of Bayesianism is that it provides an all-encompassing, quantitative method for assessing the rational degree of belief in hypotheses. But there is still the problem of prior probabilities: what should we pick as our prior probabilities if there is no first-stage set of data to give us such a probability? Bayes’ Postulate doesn’t solve the problem, because there is no unique measure of ignorance. We must choose some prior probability distribution in order to carry out the Bayesian calculation, but you may choose a different distribution from the one I do, and neither is ‘correct’: the choice is subjective.

There are three ways round the problem of prior distributions. First, try really hard to find an objective way of portraying ignorance. This hasn’t worked yet, but some people are still trying. Second, note that the prior probabilities make little difference to the posterior probability as more and more data accumulate (i.e. as more experiments/observations provide more likelihoods), viz.

P(posterior) ∝ P(prior) × Likelihood  × Likelihood × Likelihood × . . .

In the end, only the likelihoods make a difference; but this is less a defense of Bayesianism than a surrender to likelihood. Third, boldly embrace subjectivity. But then, since everyone has their own prior, the only thing we can agree upon are the likelihoods.  So, why not just use the likelihoods?
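The “priors wash out” point can be sketched with conjugate Beta-Bernoulli updating (the priors and coin-flip counts are illustrative choices of mine):

```python
def beta_posterior_mean(a, b, heads, tails):
    """Posterior mean for a Bernoulli rate under a Beta(a, b) prior."""
    return (a + heads) / (a + b + heads + tails)

# Two analysts with very different priors about a coin's heads rate:
flat_prior = (1, 1)      # uniform 'ignorance'
biased_prior = (10, 1)   # strong prior belief that the rate is high

# With little data, the priors dominate and the analysts disagree ...
small = (3, 7)           # 3 heads in 10 flips
gap_small = abs(beta_posterior_mean(*flat_prior, *small)
                - beta_posterior_mean(*biased_prior, *small))

# ... with lots of data, the likelihood dominates and they converge.
big = (300, 700)         # 300 heads in 1000 flips
gap_big = abs(beta_posterior_mean(*flat_prior, *big)
              - beta_posterior_mean(*biased_prior, *big))

print(gap_big < gap_small)  # True: more data, smaller disagreement
```

With 10 flips the two posterior means differ by nearly .3; with 1000 flips they differ by less than .01. Both have converged toward what the likelihood alone says.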

The problem with Bayesianism is that it asks the wrong question. It asks, ‘How should I modify my current beliefs in the light of the data?’, rather than ‘Which hypotheses are best supported by the data?’. Bayesianism tells me (and me alone) what to believe, while likelihood tells us (all of us) what the data say.

*Apologies to Clark Glymour and Bertrand Russell.

The best and easiest place to start is with Sober and Royall.

Edwards, A.W.F. 1992. Likelihood. Expanded edition. Johns Hopkins University Press, Baltimore. An at times terse, but frequently witty, book that rewards careful study. In many ways, the founding document of likelihood inference; to paraphrase Darwin, it is ‘origin all my views’.

Gigerenzer, G., et al. 1989. The Empire of Chance. Cambridge University Press, Cambridge. A history of probability and statistics, including how the incompatible approaches of Fisher and Neyman-Pearson became hybridized into textbook orthodoxy.

Hacking, I. 1965. The Logic of Statistical Inference. Cambridge University Press, Cambridge. Hacking’s argument for likelihood as the fundamental concept for inference; he later changed his mind.

Hacking, I. 2001. An Introduction to Probability and Inductive Logic. Cambridge University Press, Cambridge. A well-written introductory textbook reflecting Hacking’s now more eclectic, and specifically Bayesian, views.

Royall, R. 1997. Statistical Evidence: a Likelihood Paradigm. Chapman & Hall, London. A very clear exposition of the likelihood approach, requiring little mathematical expertise. Along with Edwards, the key work in likelihood inference.

Sober, E. 2002. Bayesianism– Its Scope and Limits. Pp. 21-38 in R. Swinburne, ed., Bayes’ Theorem. Proceedings of the British Academy, vol. 113. An examination of the limits of both Bayesian and likelihood approaches. (Read this first!)