The New York Times might have used Bayes’ Theorem

January 4, 2022 • 10:00 am

by Greg Mayer

The New York Times has a data analysis division which they call The Upshot; I think they created it to compensate for the loss of Nate Silver’s 538, which was once hosted by the Times. The Upshot reporters and analysts tend to be policy wonks with some statistical savvy, so I took note of a big story they had on page 1 of Sunday’s (2 January) paper on why many prenatal tests “are usually wrong.”

The upshot, if you will, of the story is that many prenatal tests for rare chromosomal disorders unnecessarily alarm prospective parents because, even if the test result is positive, it is unlikely that the fetus actually has the disease. This is because when a disease is rare most positives are false positives, even when the test is quite accurate. For the five syndromes analyzed by the Times, the proportion of false positives (i.e. “wrong” results) ranged from 80% to 93%!

The Times does not go into detail of how they got those figures, but from links in their footnotes, I think they are empirical estimates, based on studies which did more conclusive followup testing of individuals who tested positive. My first thought, when looking at Sunday’s paper itself (which of course doesn’t have links!), was that they had used Bayes’ Theorem, the manufacturers’ stated sensitivity and specificity for their tests (the two components of a test’s accuracy), and the known prevalence of the condition to calculate the false positive rate.

Bayes’ Theorem is an important result in probability theory, first derived by the Rev. Thomas Bayes, and published posthumously in 1763. There is controversy over the school of statistical inference known as Bayesian statistics; the controversy concerns how one can form a “prior probability distribution”, but in this case we have an empirically derived prior probability distribution, the prevalence, which can be thought of as the probability of an individual drawn at random from the population in which the prevalence is known (or well-estimated) having the condition. There is thus no controversy over the application of Bayes’ Theorem to cases of disease diagnosis when there is a known prevalence of the condition, such as in the cases at hand.

Here’s how it works. (Remember, though, that I think the Times used empirical estimates of the rate, not this type of calculation.)

Using Bayes’ Theorem, we can say that the probability of having a disease (D) given a positive test result (+) depends on the sensitivity of the test (= the probability of a positive result given an individual has the disease, P(+∣D)), the specificity of the test (= the probability of a negative result given an individual does not have the disease, P(-∣ not D)), and the prevalence of the disease (= the probability that a random individual has the disease, P(D)). Formally,

P(D∣+) = P(+∣D)⋅P(D)/P(+)

where the terms are as defined above, and P(+) is the probability of a random individual testing positive. This is given by the sensitivity times the prevalence plus (1- the specificity) times (1- the prevalence), or

P(+) = P(+∣D)⋅P(D) + P(+∣ not D)⋅(1-P(D))

The whole thing in words can be put as

probability you are ill given a positive test =

sensitivity⋅prevalence/[sensitivity⋅prevalence + (1-specificity)⋅(1-prevalence)]
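The verbal formula translates directly into a few lines of code. What follows is a minimal sketch in Python, not anything the Times is known to have used; the function name `positive_predictive_value` and its argument names are illustrative labels for the quantities defined above.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """Return P(D | +), the probability of disease given a positive test.

    sensitivity = P(+ | D), specificity = P(- | not D), prevalence = P(D).
    The names are illustrative; the arithmetic is Bayes' Theorem as stated above.
    """
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1, got {value}")

    # Numerator: ill individuals who test positive.
    true_positive = sensitivity * prevalence
    # Healthy individuals who nevertheless test positive.
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    # P(+): everyone who tests positive, ill or not.
    p_positive = true_positive + false_positive
    if p_positive == 0.0:
        raise ValueError("P(+) is zero, so P(D | +) is undefined")
    return true_positive / p_positive
```

The function only restates the formula; the inputs still have to come from somewhere, such as a manufacturer's stated accuracy and a well-estimated prevalence.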

Let’s suppose we have a sensitive test, say P(+∣D)=.95, which is also quite specific, say P(-∣ not D)=.95 (sensitivity and specificity need not be equal; this is only a hypothetical), and a low prevalence, say P(D)=.01. Then

probability you are ill given a positive test =

(.95)(.01)/[(.95)(.01)+(.05)(.99)]

= .16.

Thus, if you had a positive test, 84% of the time it would be “wrong”! This is right in the neighborhood of the rates found by the Times for the five conditions they examined. Notice that in this example, both sensitivity and specificity are high (which is good– you want both of these to be near the maximum of 1.0 if possible), but because prevalence is low (.01), the test is still usually “wrong”.
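The same arithmetic can be checked by counting people instead of multiplying probabilities. The sketch below assumes a hypothetical cohort of 10,000 tested individuals (the cohort size and the function name `screening_counts` are illustrative choices, not figures from the Times) and uses the hypothetical sensitivity, specificity, and prevalence from the example above.

```python
def screening_counts(cohort: float, sensitivity: float,
                     specificity: float, prevalence: float) -> dict:
    """Split a tested cohort into the four possible outcomes.

    All names here are illustrative. Counts are expected values,
    so they need not be whole numbers for arbitrary inputs.
    """
    ill = cohort * prevalence
    healthy = cohort - ill
    return {
        "true_positives": ill * sensitivity,
        "false_negatives": ill * (1.0 - sensitivity),
        "false_positives": healthy * (1.0 - specificity),
        "true_negatives": healthy * specificity,
    }


# Hypothetical values from the example: sensitivity .95, specificity .95, prevalence .01.
counts = screening_counts(10_000, 0.95, 0.95, 0.01)
all_positives = counts["true_positives"] + counts["false_positives"]
share_of_positives_wrong = counts["false_positives"] / all_positives

print(f"Share of positive results that are wrong: {share_of_positives_wrong:.0%}")
```

With these inputs the cohort contains 100 ill people, 95 of whom test positive, and 9,900 healthy people, 495 of whom also test positive; 495 of the 590 positives, or about 84%, are false.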

In an earlier discussion of Bayes’ Theorem, Jerry noted:

This [tests for rare conditions being usually wrong] is a common and counterintuitive result that could be of practical use to those of you who get a positive test. Such tests almost always mandate re-testing!

He’s absolutely right. A test with these properties is useful for screening, but not for diagnosis– you’d usually want to get a more definitive test before making any irreversible medical decisions. (For COVID 19, for example, PCR tests are more definitive than the quicker antigen tests.) The Times also discusses some of the unsavory aspects of the marketing of these tests, and the tragedy of the truly life and death decisions that can ensue, all of which flow from the results of the tests being misunderstood.
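Why re-testing helps can be sketched by applying Bayes’ Theorem twice, with the probability after the first positive result serving as the prevalence for the second test. This is only an illustration under loud assumptions: the follow-up test's sensitivity (.99) and specificity (.999) are invented for the example and describe no real product, and the two tests are assumed to err independently of one another given the true condition, which need not hold in practice. The function name `update` is likewise illustrative.

```python
def update(prior: float, sensitivity: float, specificity: float) -> float:
    """Probability of disease after a positive result, starting from `prior`."""
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Screening test: the hypothetical values used in the worked example.
after_screening = update(prior=0.01, sensitivity=0.95, specificity=0.95)

# Follow-up test: INVENTED accuracy figures, assumed independent of the first test.
after_follow_up = update(prior=after_screening, sensitivity=0.99, specificity=0.999)

print(f"After a positive screening test:    {after_screening:.3f}")
print(f"After a positive follow-up as well: {after_follow_up:.3f}")
```

Under these assumptions a positive screening result leaves the probability of disease at about .16, while a second positive on the more specific follow-up raises it above .99. The improvement comes almost entirely from the follow-up test's much higher specificity.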

(Note: an alert reader spotted a mistake in the verbal equation, and in checking on it I spotted another in one of the symbolic equations. Both corrections have now been made, which are in bold above. The numerical result was not affected, as I’d used the correct numbers for the calculation, even though my verbal expression of them was wrong!)

For a nice but brief discussion, with some mathematical details, of the application of Bayes’ theorem to diagnosis, see sections 1.1-1.3 of Richard M. Royall’s Statistical Evidence: A Likelihood Paradigm (Chapman & Hall, London, 1996). Royall is not a Bayesian, which demonstrates the uncontroversial nature of the application of Bayes’ Theorem to diagnosis.

27 thoughts on “The New York Times might have used Bayes’ Theorem”

  1. An excellent and striking real-world application of good statistics! I have rarely ever known a doctor who understood this problem, though it applies widely to many medical tests.

    1. The psychologist Gerd Gigerenzer mentions this in one of his books. During a lecture, he pointed out how many doctors misinterpret test results. A doctor stood up and said “I teach statistics to medical students and I can assure you that my students understand this”. A lady stood up and said “I went to his lectures and I can assure you that we don’t” !

    1. Yes, it is extremely important, especially for things like cancer, since testing is widely promoted. For breast cancer, the probability that a positive result is false = 9/10. That’s an astounding number. We might expect specialists to know this, but doctors are almost as ignorant of statistics as other people.

      In one study of gynecologists presented with the actual breast cancer data and error rates, only 21% of them got the right answer to this problem. About half the doctors thought the number was 10% instead of the true answer of 90%. This is a huge error.

      There are potentially millions of women victimized by this ignorance annually.

      1. In states that allow abortion, the restrictions on abortion immediately set up pressure to get one “just in case”. Parents in such a scenario will be traumatized, either way.

      2. Unfortunately, actual risk, while widely acknowledged by most of my colleagues, is only one of the considerations in medical practice. As has been said here, most people (i.e., most patients) don’t understand risk calculations. They are looking for a simple answer, so statistical discussions are frequently not productive. Then there is the reality that lack of precision in statistical modeling isn’t usually an effective defense when something goes wrong. Hence defensive medicine. People want binary answers.

    2. Grant Sanderson cites Gerd Gigerenzer in this video – I thought he was obscure when I watched it but the readers here suggest he’s a big name to cite!

  2. It was interesting to read the previous post (from 2015) that I missed at the time, although in fact I completely disagree with the assessment made by Greg. In physics, astronomy, and computer science, Bayesian methods are used for model comparison, machine learning, etc.
    I actually teach how to apply them to astronomical problems avoiding issues with simpler likelihood approaches. Marginal likelihoods (“evidence values”) are computed to e.g. discriminate between cosmological models of dark matter. Note also the deeper philosophical underpinnings of Bayesianism vs Frequentism on interpretations of quantum mechanics.

    1. I agree with you that physicists and astronomers are acutely aware of Bayesian statistics and use it heavily. But Greg is right about biologists and doctors. There is data to back him up on that. See my comment above.

      By the way, the YouTube Channel “Cool Worlds”, my favorite science channel, makes extensive use of Bayesian statistics to answer difficult astronomical puzzles (like evaluating the probability of alien civilizations).

    2. I’m interested in cosmology, so I know that it has become a tremendously useful tool. But I am also trained as a bioinformatician, where it has become essential to analysing large sets of genetic sequence information for homology (treeing) as in Markov Chain Monte Carlo methods [say, the MrBayes software]. The cosmology use appears similar.

      [But my current picture is still myopic, for instance I hear that bayesian methods in general are thought of as “pessimistic” in multidimensional use by statisticians while in phylogenetic use they tend to be “optimistic”. They complement maximum likelihood methods in boxing in trees that have nodes with high bootstrap support values and posterior probabilities.]

      Models of statistical probabilities differ between Fisher null hypothesis and Neyman–Pearson hypothesis comparison testing, so I wouldn’t worry about it. Models (“interpretations”) in quantum mechanics are mostly of no interest in modern relativistic quantum field physics*, so I wouldn’t worry about them having some deeper meaning either.

      *Disclaimer: I have here earlier proposed the relativistic model for wavefunctions, part of a quantum field as they are, as promising and simple. So I may be a bit myopic here too.

  3. Usually prenatal tests come in two forms: a less accurate but non-invasive screening test and a more accurate but invasive test used to check a positive result from the first test. The chance of getting a false positive on both is quite low, though I don’t know exactly what the chance is. (I don’t know if they mentioned it, since the article is behind a paywall.)

    Prospective parents being alarmed at a positive result shows, to me, the usefulness of the screening, because it’s information parents are anxious to know. (I don’t doubt the claim that marketing unnecessarily heightens anxieties, though.) But it’s still better than ignorance!

    Anyway, I’m sure that was just an example used to introduce Bayes’ theorem, which I’ve relearned repeatedly and keep forgetting because I don’t have a chance to use it…

  4. And of course saying that the test is wrong 80% of the time is very misleading. Virtually all the negative results are correct, saving all those people from having to decide whether to undergo an invasive procedure. If positive results occur 1% of the time, the test is actually correct more than 99% of the time even if the false positives outnumber the true positives 100 to 1.

    1. Yes, the 80%-93% is the frequency with which positive results are wrong, not the total error rate. For the numbers in the hypothetical example, of all people tested (assuming a random sample, of course), 4.95% will be healthy and get a positive result and .05% will be ill yet get a negative result, for a total of 5% erroneous results. As Roz points out below, though, epidemiologists would want a screening test to have a specificity greater than .95, which would lower the total proportion of erroneous results. (For example, if the specificity increased to .98, there would be a total of 2.03% erroneous results, a rather substantial reduction.)


  5. For those wanting a deep dive into Bayesian statistics, and how one can construct it based on notions of intuition, I can’t recommend “Probability Theory: The Logic Of Science” by Edwin Jaynes too highly. There appear to be PDF copies of the book freely available on the web.

  6. In my day job, this would be an example of confusing screening with classification. AI systems used in medical applications for instance must do both since this mimics well established, efficient, and effective diagnostic principles. Genetic tests like this that depend on statistical methods are extremely useful for identifying (screening) at risk populations. Those identified as part of an at risk population must then be individually classified – i.e., evaluated more broadly (or monitored more closely) to convert the statistical screening into a binary – you either are afflicted or you are not.

  7. Minor quibble with Greg: He wrote that 95% specificity is “quite specific”. But I want to quickly point out that a specificity of 95% is ghastly for rare-disease screening in a large population of humans. It means TONS of false-positives. This was his point, but while 95% may look high for lab scientists or lay people, most anyone who works in epidemiology knows 95% specificity in the context of rare-disease screening means you have a nightmare on your hands. Bayes is taught in the most introductory of epidemiology courses…

    1. That is to emphasize that epidemiologists start their training as scientists with Bayes. If one can’t pass the brain-puzzle quizzes on Bayes in intro Epi courses, it is all but pathognomonic that one will fail in the field. Bayes is used a lot for prediction in Epi. Admittedly, not done by hand, usually.

      (I wrote this as I gather most who read WEIT are basic scientists and lay people. Both groups usually have mistaken notions of what epidemiologists do. For instance, it’s a myth that epidemiologists often mistake correlation for causation. I see more of that mistake made by those in bioinformatics and computational biology, where people forget that their data being molecular doesn’t protect it from being correlational or reverse-caused, and by journalists misinterpreting epi papers. In contrast, epidemiologists usually spot those errors quickly, whether their data are disease outcomes, genes, or other molecular features.)

      1. As a bioinformatician I can assure you that basic courses teach excellent statistics, and have corrected some of my misperception of using such methods in higher dimensional spaces that comes from my physics background.

        But you do see a lot of crappy papers that make problematic claims, since “bioinformatics” is pasted on anything demanding statistics while you often don’t see a bioinformatician involved in the analysis.

        That suggests to me that the solution could be more bioinformatics proper, not less.

    1. Jeff, I recommend, if you want to understand this, you take a look at Gerd Gigerenzer’s book “Calculated risks : how to know when numbers deceive you” (2002). He looks at the example of disease screening, using both Bayes’ theorem and natural frequencies. When you see it in terms of natural frequencies you will understand it.
      You can look at the book in the Internet library here (sign-up is for free, without any strings attached):

      You could also look at this article (it’s ungated), co-authored by Gerd Gigerenzer:
      Helping Doctors and Patients Make Sense of Health Statistics
      Psychological Science in the Public Interest, 2007

      In the same vein, this book (again available for free download):
      Know Your Chances: Understanding Health Statistics. University of California Press, 2008

      1. Another freely accessible article (written for non-specialists) is this:
        Understanding Bayes’ rule. Journal of Economic Perspectives, 1996

        Gilbert Welch: Sensitivity & specificity [of screening tests]. 2012, 8 mins
        Welch is one of the authors of “Know Your Chances: Understanding Health Statistics” cited above.

        Julia Galef: Bayes: How one equation changed the way I think. 3 mins
        Julia Galef: A visual guide to Bayesian thinking. 11 mins

        A. Philip Dawid & Donald Gillies: A Bayesian analysis of Hume’s argument concerning miracles. Philosophical Quarterly, 1989, 39(154), 57-65

  8. As a former teacher of critical appraisal I will stick my neck out. As Roz says, to be useful for population screening, where the prevalence of any given disease is very low, tests must be highly specific. Most ordinary tests with sensitivity and specificity in the 90-95% range are useful only when the true prevalence of the disease is quite high. In an Emergency Dept, the likelihood that an adult is having a heart attack if s/he has chest pain, sweating, shortness of breath, and a normal ECG is between 30 and 70%. A “heart attack” test that has 95% sensitivity and specificity can “rule in” the diagnosis sufficiently strongly (at least 89% for even the low-end pre-test estimate of 30%) to justify immediate treatment. The impact of a negative test can be worked out mathematically, but then the question is: is that post-test probability low enough to allow the patient to be discharged safely to follow up with his GP? (Answer: no. That’s not a mathematical decision; it’s what the standard of care says.)

    False-negatives are more dangerous than false-positives in treating the sick. In screening the well, it’s the reverse.

  9. There’s a visual calculator for this problem for Covid lateral flow tests here

    The final answers are very slightly wrong as they use the rounded numbers in the percentages.

    I set this as an exercise for our Forensic students this year.

    Note that there is a huge difference between Bayes’ theorem and Bayesian inference.

    Bayes’ theorem is completely standard, as illustrated here.

    Bayesian inference is a particular approach to statistical inference where the uncertainty in the parameters of probability distributions is itself modelled by probability distributions.
