The coronavirus and some basic evolutionary genetics

February 7, 2021 • 9:15 am

by Greg Mayer

Jerry and I were both working independently on posts about the coronavirus. When we realized this, we conferred and decided to continue our efforts, but with some coordination and cross-fertilization. Jerry’s piece was posted on Friday. 

[JAC: Greg has a “technical notes” section at the end which clarifies terms in the text that might confuse nonbiologists.]

1). Getting people vaccinated will impede the origin of new variants, because adaptive evolution is faster in larger populations. Widespread vaccination, by reducing the number of cases, will reduce the population size of the virus. Adaptive evolution is faster in large populations because selection is more effective in large populations; this is a well-known population-genetical result. And it’s also faster because large populations, by having a greater total number of mutations, explore more of the total mutational space—including the possibility of favorable double (or more) mutations in which the component single mutations are not favored but the combinations are. This is, in part, the principle behind the AIDS “cocktail” treatments: by attacking HIV in multiple ways at once, no single resistance-conferring mutation will allow the virus to escape, because if one drug doesn’t get it, another one will. Only having multiple mutations will confer resistance to the whole “cocktail”, but this is very improbable because the individual mutations, not being favored, will not accumulate. But in a very large sample (i.e., a large population), improbable things can happen.
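The point about improbable double mutations can be made concrete with a little arithmetic. The sketch below is an illustration only: it assumes two specific mutations that each arise independently with the same probability `mu` every time a genome is copied, and both the value of `mu` and the population sizes are hypothetical round numbers, not measurements for any real virus. The function name is likewise made up for the example.

```python
import math


def prob_double_mutant(n_genome_copies: float, mu: float) -> float:
    """Chance that at least one of n_genome_copies carries both of two
    specific mutations, each arising independently with probability mu
    per copy (a deliberately simplified model)."""
    p_both = mu * mu  # both mutations arise in the same genome copy
    # Equivalent to 1 - (1 - p_both) ** n_genome_copies, but numerically
    # stable when p_both is tiny.
    return -math.expm1(n_genome_copies * math.log1p(-p_both))


if __name__ == "__main__":
    mu = 1e-6  # hypothetical per-copy mutation probability
    for n in (1e6, 1e9, 1e12, 1e15):
        print(f"{n:.0e} genome copies: P(double mutant) = {prob_double_mutant(n, mu):.3g}")
```

Under these assumptions the double mutant is essentially impossible in a small population and close to inevitable in a very large one, which is the sense in which large populations explore more of the mutational space.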

There are also interesting issues of components of fitness or levels of selection in the evolution of viruses (or any disease-causing micro-organism, for that matter). Jerry discussed this in his piece, contrasting the evolution of virulence within an infected host versus transmissibility between hosts. These can be viewed as two components of reproductive fitness: competition to reproduce within the host, and competition to move to new hosts. Or they can be viewed as different levels of selection: individual selection among virus particles within a host, and group selection among the virus populations of different hosts, since the particles in one host all get sneezed out to the next host as a group. The evolution of myxoma virus in rabbits in Australia, which Jerry discusses, has been interpreted from both points of view. The interest comes from the potential conflict between what’s “good” within the host (reproducing very rapidly) and what’s good for getting to the next host. If you are too good at “taking over” the host, you might kill off the host before you can spread to the next one. And if you don’t spread, you go extinct. So, what’s good in the host may not be good for getting to the next host.

There’s also the interesting question of what the proper measure of population size is for the virus. Is it the number of viral particles? The number of hosts? For within-host selection, it would be the number of viral particles in that host. For selection between host populations, it might be nearer to the number of hosts. (I would guess that the theory for this has already been developed in the context of group selection theory.) Either way, fewer hosts, with lower viral loads within hosts, means a lower rate of adaptive evolution of the virus.

2). By a *very* crude analysis, the UK variant does not show evidence of selection on its protein sequences. The ratio of nonsynonymous (N) to synonymous (S) mutations is 13/6 = 2.17, which is close to the expected ratio of 2.66 for neutral (i.e., unselected) mutation in a completely *random* genome, especially given that only 19 mutations are involved. The defect of this analysis is that the virus’s genome is of course not random. I would expect that someone with the genomic sequence and the right software is already carrying out a proper analysis using the actual nucleotide and codon distribution of the virus. (In fact, I wouldn’t be surprised if it’s already been done; not being a virologist, I don’t follow that literature.) A second, and perhaps more important, defect, which would apply even to a proper analysis, is that nonsynonymous/synonymous ratios average over sites for a whole protein or genomic sequence, so even strong selection at one or a few sites in a protein can be lost in a sea of neutral change in the rest of the protein. (See Technical note below for more details.)
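How surprising would 13 nonsynonymous mutations out of 19 be if the neutral expectation held? One rough way to ask is an exact binomial test. The sketch below takes the figures in this post at face value (13 and 6 observed; 399 of 549 possible changes nonsynonymous), treats the 19 mutations as independent draws, and uses a helper name, `binomial_two_sided_p`, invented for the example. It inherits all the defects of the crude analysis.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float) -> float:
    """Exact two-sided binomial p-value: the total probability of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))


if __name__ == "__main__":
    n_obs, s_obs = 13, 6                # observed N and S mutations
    p_nonsyn = 399 / (399 + 150)        # expected fraction nonsynonymous
    p_value = binomial_two_sided_p(n_obs, n_obs + s_obs, p_nonsyn)
    print(f"expected fraction nonsynonymous: {p_nonsyn:.3f}")
    print(f"observed fraction nonsynonymous: {n_obs / (n_obs + s_obs):.3f}")
    print(f"two-sided p-value: {p_value:.2f}")
```

The p-value comes out well above conventional significance thresholds. That only says that 19 mutations are too few to tell 2.17 apart from 2.66; it does not show that selection is absent.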

There are other ways of inferring selection, and Jerry stressed one of those: if the virus evolves in parallel in multiple locations, that suggests the action of selection. We seem to be seeing that, independently, in several different locations, the same variant is spreading widely and increasing in frequency. If the variants were neutral, their frequencies would change only due to chances of sampling and which variant happened to get somewhere first, so we wouldn’t expect the same variant to “get lucky” and take over all the time.
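The logic of the parallel-evolution argument is just multiplication of probabilities. As a minimal sketch, assume (hypothetically) that a neutral variant has some fixed chance of drifting upward in frequency in any one location, and that locations are independent of one another. The 0.5 figure, the number of locations, and the function name are all illustrative assumptions.

```python
def chance_same_variant_rises_everywhere(p_rise_one_location: float,
                                         n_locations: int) -> float:
    """Probability that one particular neutral variant happens to increase
    in every one of n_locations independent locations."""
    return p_rise_one_location ** n_locations


if __name__ == "__main__":
    for n in (1, 3, 5, 10):
        chance = chance_same_variant_rises_everywhere(0.5, n)
        print(f"{n} locations: chance of 'getting lucky' in all = {chance:.4f}")
```

The more places the same variant rises, the harder luck becomes as an explanation, provided the spread really is independent in each place and not the result of the variant simply being carried from one location to another.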

Another hint of selection would be if substitutions affecting function (such as nonsynonymous mutations and deletions) are concentrated in a part of the genome known to be of adaptive significance, such as the spike protein. That protein is a highly functional part of the virus, for it’s the part it uses to stick to host cells. The UK variant shows at least two nonsynonymous mutations and one deletion in the spike protein, but without full data, I can’t say if this is a greater than expected number for the spike protein (which forms ca. 10% of the genome).

3). The variants are differentiated strains, not “mutations”. The identified variants differ by multiple substitutions, and thus are not a mutation, but the accumulation of multiple mutations. Some substitutions in a strain may be subject to selection, but others will not be. If we think of the virus as a “species” (which, being a collection of asexual lineages, is not quite what the virus is), then the variants or strains are like “subspecies”: differentiated descendants of a common ancestor, differing in a number of ways, some of which may be adaptive, while others may not be. (In biological species, subspecies interbreed, and thus are a form of geographical variation; in viruses, however, the variants can exist without interbreeding in the same geographic area, including inside the same host, so the analogy to subspecies is inexact.)

4).  Some of the media, or at least reporter Apoorva Mandavilli of the NY Times, are grasping that virus evolution is key to the course of the pandemic. Words and phrases in her article include: “selection pressure”, “evolve” (4 times!), “evolving”, “evolutionary biologist”, “adaptation”, and “coronavirus can evolve to avoid recognition”. And here’s a statement in the article of the distinction between genetic drift and selection:

Some variants become more common in a population simply by luck, not because the changes somehow supercharge the virus. But as it becomes more difficult for the pathogen to survive — because of vaccinations and growing immunity in human populations — researchers also expect the virus to gain useful mutations enabling it to spread more easily or to escape detection by the immune system.

This article is a pretty direct affirmation of the importance of understanding how evolution works when dealing with viral diseases.

5).  After the AIDS epidemic, we all should have learned the importance of evolutionary biology for transmissible diseases. The lessons learned during the spread, evolution, and control of HIV and other viruses are so clear that they have become textbook examples of evolutionary principles, from elementary grades to college texts. Epidemics are all about evolution.

6). You should call it the “UK variant”. The article at Ars Technica from which I got the (limited) genomic data I used above falls over itself trying not to use geographic terms because they cause “stigma”. This is stupid. One of the oldest practices in taxonomy is to name species after the place they are found. The native anole of the southern United States is named Anolis carolinensis, because the description was based on lizards supposed to be from Carolina. It was later found to occur all over the southeastern United States, with closely related forms (sometimes considered conspecific) on a number of West Indian islands. It has also been introduced all over the world, from California to Hawaii to Japan. It is still Anolis carolinensis. Stability of names is important, and names related to place are a useful mnemonic, since they require no knowledge of Latin or an arcane numbering system. (The article refers to the UK variant as “B.1.1.7”. If there’s only one variant this might do, but with multiple ones it becomes an exercise in memorization.)

Technical note. “Nonsynonymous” mutations are mutations of the DNA sequence which change the amino acid sequence of the resulting protein. Because the genetic code is redundant (DNA codes for the same amino acid in more than one way), some mutations are “synonymous”, resulting in an unchanged protein. There are 549 possible mutations of the 61 amino acid coding codons (61 codons × 3 nucleotides per codon × 3 possible nucleotides to change into). Of these possible mutations, 399 are nonsynonymous and 150 are synonymous. (I couldn’t find these numbers anywhere, so I counted them up myself from the table in Muse and Gaut (1994); my count could be off, but, I hope, not by much.) If a protein coding DNA sequence has a completely random sequence (i.e. all 61 protein coding codons are equally represented), then mutations occurring at random will occur with a nonsynonymous to synonymous ratio of

N/S = 399/150 = 2.66

and, if the mutations are neutral, will be fixed (i.e. will reach a frequency of 100%) in the same ratio, which is where I got the expected N/S ratio of 2.66 for evolution by neutral mutation.
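The hand count in this note can be checked by brute force. The sketch below assumes the standard genetic code (NCBI translation table 1), treats all nine single-nucleotide changes of each codon as equally likely, and tallies changes that create a stop codon separately; the function and variable names are made up for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order;
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_single_base_changes(table=CODON_TABLE):
    """Classify every single-nucleotide change of every amino-acid-coding codon."""
    counts = {"synonymous": 0, "amino_acid_changing": 0, "stop_creating": 0}
    for codon, amino_acid in table.items():
        if amino_acid == "*":
            continue  # only the 61 coding codons are mutated
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue
                mutant = codon[:position] + base + codon[position + 1:]
                new_amino_acid = table[mutant]
                if new_amino_acid == amino_acid:
                    counts["synonymous"] += 1
                elif new_amino_acid == "*":
                    counts["stop_creating"] += 1
                else:
                    counts["amino_acid_changing"] += 1
    return counts


if __name__ == "__main__":
    counts = count_single_base_changes()
    print(counts, "total:", sum(counts.values()))
```

This enumeration yields 134 synonymous changes, 392 that change the amino acid, and 23 that create a stop codon (549 in all), somewhat different from the 150 and 399 counted by hand. On that basis the expected ratio would be roughly 2.9 to 3.1, depending on whether stop-creating changes are counted as nonsynonymous, so the figure of 2.66 is best treated as approximate.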

However, the DNA sequence is not random, so we usually express the nonsynonymous/synonymous ratio by looking at the rate of substitution per site. Thus, we divide the number of nonsynonymous mutations by the number of nonsynonymous sites (i.e. the number of nucleotide positions which would give rise to a different amino acid if mutated), and similarly for synonymous mutations. This gives us the dN/dS ratio, which is expected to be 1 under neutrality, because we have normalized by the expected rates of each type of mutation. It is greater than 1 when there is positive selection in favor of new mutations, and less than 1 when selection mostly weeds out amino acid changes. In calculating dN/dS, adjustments can be made for known biases in the process of mutation (e.g. the different rates of mutations that change the ring structure of the nucleotide and those that do not).
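The normalization can be written out in a few lines. This is a bare-bones sketch, not what dedicated software does: real analyses count sites from the actual sequence and correct for mutational biases and multiple changes at the same site. For want of real site counts, the example reuses this post's random-genome figures (399 and 150) as stand-ins for the relative numbers of nonsynonymous and synonymous sites, which is an assumption made purely for illustration, as is the function name.

```python
def dn_ds(nonsyn_changes: float, syn_changes: float,
          nonsyn_sites: float, syn_sites: float) -> float:
    """Per-site rate of nonsynonymous change divided by the per-site
    rate of synonymous change."""
    dn = nonsyn_changes / nonsyn_sites
    ds = syn_changes / syn_sites
    return dn / ds


if __name__ == "__main__":
    # 13 nonsynonymous and 6 synonymous changes, as in the crude analysis;
    # site counts are stand-ins taken from the random-genome proportions.
    print(f"dN/dS = {dn_ds(13, 6, 399, 150):.2f}")
```

With these stand-in numbers the ratio is about 0.81, that is, the raw 2.17 divided by the expected 2.66.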

dN/dS ratios are subject to some of the same limitations as raw N/S ratios, including the averaging effect noted above. Yang and Bielawski (2000) is a modestly readable introduction to using rates of nonsynonymous versus synonymous substitution to detect selection.

Charlesworth, B. and D. Charlesworth. 2010. Elements of Evolutionary Genetics. Roberts, Greenwood Village, Colorado. An upper level text, but not as daunting as some.

Diamond, J., ed. Virus and the Whale: Exploring Evolution in Creatures Small and Large. NSTA Press, Arlington, Va. Uses HIV as an example of viral evolution.

Emlen, D.J. and C. Zimmer. 2020. Evolution: Making Sense of Life. 3rd ed. Macmillan, New York. Uses influenza as an example of viral evolution.

Herron, J.C. and S. Freeman. 2014. Evolutionary Analysis. 5th ed. Pearson. Uses HIV as an example of viral evolution.

Muse, S.V. and B.S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Molecular Biology and Evolution 11:715-724.

Yang, Z. and J.P. Bielawski. 2000. Statistical methods for detecting molecular adaptation. Trends in Ecology and Evolution 15:496-503.

h/t Brian Leiter for the Ars Technica piece.

28 thoughts on “The coronavirus and some basic evolutionary genetics”

  1. “The article below[…] falls over itself trying not to use geographic terms because they cause “stigma”. This is stupid. ”


    1. Unfortunately, it isn’t stupid. The non-geographic labels are hopelessly inconsistent and confusing (quick, is 501Y.V1 the same as B.1.177 or B.1.1.7?), but geographic names aren’t a good idea either. People are naturally prone to prejudice and superstitious associations (even when they aren’t being intentionally stoked, as by Trump). Using terminology that’s likely to trigger prejudice and false assumptions isn’t wise.

      The association between disease names and prejudice goes back a long way. Consider syphilis:

      From the very beginning, syphilis has been a stigmatized, disgraceful disease; each country whose population was affected by the infection blamed the neighboring (and sometimes enemy) countries for the outbreak. So, the inhabitants of today’s Italy, Germany and United Kingdom named syphilis ‘the French disease’, the French named it ‘the Neapolitan disease’, the Russians assigned the name of ‘Polish disease’, the Polish called it ‘the German disease’, The Danish, the Portuguese and the inhabitants of Northern Africa named it ‘the Spanish/Castilian disease’ and the Turks coined the term ‘Christian disease’. Moreover, in Northern India, the Muslims blamed the Hindu for the outbreak of the affliction. However, the Hindu blamed the Muslims and in the end everyone blamed the Europeans.

      In 2015, the WHO recommended against naming diseases after places (or people):

      “In recent years, several new human infectious diseases have emerged. The use of names such as ‘swine flu’ and ‘Middle East Respiratory Syndrome’ has had unintended negative impacts by stigmatizing certain communities or economic sectors,” says Dr Keiji Fukuda, Assistant Director-General for Health Security, WHO. “This may seem like a trivial issue to some, but disease names really do matter to the people who are directly affected. We’ve seen certain disease names provoke a backlash against members of particular religious or ethnic communities, create unjustified barriers to travel, commerce and trade, and trigger needless slaughtering of food animals. This can have serious consequences for peoples’ lives and livelihoods.”

      In the particular case of SARS-CoV-2 variants, this probably isn’t a major concern, but it’s not “stupid” either.

      1. “People are naturally prone to prejudice and superstitious associations”

        some people aren’t. what is your point?

        Rocky Mountain Spotted Fever
        Epstein-Barr virus
        Hodgkin’s Lymphoma
        Lyme Disease

        Might there be someone somewhere who would take these names personally or as a means to harm the economy of an area? Yes. Will they? Maybe.

        Is the name of a disease important? Yes.

        Is the name of a disease important to find cures for it? No.

        I’ll call a disease by its proper name as the experts are naming them, be it the WHO or the discoverers. But some Decision Makers in San Francisco definitely will have problems with that.

  2. For another layer of complexity, beyond the sequence, consider the glycosylation. All I know is that, besides HIV envelope glycoprotein having various sequences of carbohydrate attached to the membrane bound protein (gp120?) and exposed to the milieu, there are a lot of articles in Pub Med on coronavirus glycosylation (that I have not read).

  3. Lots of good information in this posting, thanks for that. I read a piece the other day that specifically addressed those who have had the virus. Since that includes me I was interested to listen. It recommended that I wait 60 to 90 days before getting the vaccination. Apparently the antibodies I have will cover me for that period and probably longer. There was mixed opinion on whether I should get just the 1st of two shots if I got that type of vaccine. One thought was that with the immunity I have plus one shot would be the same as getting both vaccines for those who have not contracted the virus.

  4. Thanks for providing this write-up, greg. It is extremely helpful to have subject matter experts like you and jerry providing thoughtful commentary on matters of such importance. I, with no formal biology background, relying on what i have read since retiring from an aerospace engineering career, find your comments particularly helpful..for example i had never thought about synonymous vs nonsynonymous mutations due to nonuniqueness of sequences. I also distribute articles like this to a couple of dozen other non-experts…avoiding information mutations through judicious use of copy and paste! Thanks again.

  5. Always interesting to read this kind of stuff.

    “You should call it the “UK variant”. The article at Ars Technica from which I got the (limited) genomic data I used above, falls over itself trying not to use geographic terms because they cause “stigma”. This is stupid.”

    While it might be stupid, the problem with these geographic names is that politicians can use them to falsely further their own agenda. While people should learn the lesson you suggest, perhaps it would be better to avoid the problem entirely when it comes to vectors of infection. It’s not a problem for animals of course.

    Turning the argument around, If tagging a particular pathogen with the location it was first detected is not useful to biology or medicine, a more neutral naming scheme would seem to be called for. Besides the problem of misuse by politicians and the public, this would avoid the potential naming problem that occurs when more than one is discovered in the same location.

    1. “If tagging a particular pathogen with the location it was first detected is not useful to biology or medicine, a more neutral naming scheme would seem to be called for.”

      Indeed. Given that Britain currently leads the world in sequencing clinical samples of the virus (in large part due to the magnificent work of my colleagues at the Wellcome Sanger Institute), it was perhaps hardly surprising that a major new variant was discovered in Britain at the start of 2021.

    2. “ politicians can use them to falsely further their own agenda“

      They do that with whatever they get – what are we supposed to do, bowdlerize science? They will use that too.

      1. They don’t do it with numbers, for example. They don’t do it with the “19” in COVID-19. (Actually, at least one politician assumed that it meant there had been 19 pathogens in the COVID line. They were appropriately corrected.)

        1. And the reason it is called COVID19 is because of a conscious effort to get away from using place names to label viruses. However, that didn’t work with one “politician” and his followers in this case. In fact, arguably, it made things worse. “Why are they refusing to call it the China virus? What are they hiding from us?”

          1. Nothing will stop bad actors from inventing their own names for things. I think we free speech supporters have to go along with that. Of course, when someone does invent their own name for something, they have to take responsibility for doing so. We shouldn’t help them with terms like “Beijing Flu”.

        2. That was Kellyanne Conway, and when I heard that I flipped out, expecting, correctly as it turned out, that people would drink that Kool-Aid. Sure enough, the only Tr*mp supporter I talk to – whose wife has since come down with and recovered from COVID and who himself was asymptomatic but is now seropositive, made oblique reference to the virus being v 19.0. He was swiftly corrected.

          (I’m sympathetic to this guy because he’s friendly and helpful. And I think he has a suspicion of the stuff he gets via Fuxx News. I think I may be his only source of rebutting some of that stuff.)

  6. Thoughts on Weinstein and Heyer’s assertion that there is a 90% chance that the virus came from a lab… And that it was a lab construct based on gain of function research?

    1. I don’t buy that at all, especially the 90% figure. I don’t think there’s any universal agreement about whether it escaped from a lab, but I think there’s general agreement that the virus does NOT show signs of human tampering to make it dangerous. I wouldn’t believe Weinstein’s and Heyer’s assertion about this.

      1. when solving problems one must consider all possibilities, as well as their likelihoods.

        life-threatening viruses are studied in labs in order to find cures. since some coronaviruses are life-threatening, the likelihood it has been studied in a lab is 100% – we know this from the vaccines.

        if a virus was found that was life-threatening, the likelihood it would be taken to a lab to study it is high – but not perhaps 100%. the likelihood of an accident is non-zero – contamination, poor maintenance, etc. In general, things can get everywhere. In the United States and Canada, study of viruses has excellent safeguards in place, the biosafety levels BSL-1 to BSL-4, with 4 being the highest.

        whether this was the other way around is the thing that apparently some talk show hosts are confused by. Do they think the virus was never found in nature but was introduced to nature? It is unclear but also would be a conspiracy theory.

      2. “ I think there’s general agreement that the virus does NOT show signs of human tampering”

        But even if there were signs that appeared to be evidence of “human tampering”, these would possibly (based on my limited knowledge) be indistinguishable from e.g. nucleic acid sequences that molecular biology labs all over the world have been using for decades, having been found in viruses or bacteria to begin with. Off the top of my head, I think of various promoters – repressors – phage display – restriction endonuclease sites (some of those are bacterial) I don’t know enough without looking them up, but it is an absolute staple of molecular biology to use these nucleic acid fragments to get work done in research. To say nothing about the sequences in the human genome which originated from viruses. I don’t know enough about what it means if bacterial sequences are found in viruses or vice versa. But as I write this out, it appears to me the notion of the “nefarious plot” is unfalsifiable the whole way down.

    2. Damnit I wrote a comment on their youtube about that. I’m under the impression (and I am NO expert at all) …from a virologist that “man made alterations leave a large genetic fingerprint” and I think their speculation about the matter is unhelpful, much as I like them and their show.

  7. All interesting. The bit about the ratio between synonymous and nonsynonymous mutations was also interesting to me. It was nice to see some actual numbers behind the point that for most amino acids, the codons that specify them can have different bases in the 3rd position, so point mutations that change that 3rd base do not change the amino acid specified. Meanwhile, the 1st and 2nd bases of a codon are usually the ones that specify the amino acid, so changes in those lead to a change in the amino acid that is specified.
    So ‘back of the envelope’ and very crude calculation: 2/3 of the base changes will be “nonsynonymous”, they will change the amino acid, while 1/3 of the base changes will be “silent”, or “synonymous”. The amino acid is not changed. There are various wrinkles in the genetic code that throw off the 2/3 to 1/3 ratio slightly, but it was still nice to see that the actual proportions are in the ballpark.

  8. Thanks for that 2.66 number. I was wondering about that just last night and supposed that it was out there but didn’t know it offhand myself.

    Since the number observed is lower than 2.66, that would imply that there is selection against just any change that yields a different amino acid. Also, if I’m thinking this through correctly while looking at the codon table, there should be 14 mutations that result in stop codons, right? I guess those usually cancel out as non-viable and are ignored.

    Another thing to look at, I think (and this depends on whether whole-genome sequencing is still going on at the same pace as it was almost a year ago when something like 3200 had already been done by March or so) is whether nonsynonymous changes are preferentially showing up in the spike v the rest of the expressed genome. Whatever the result, it will be equivocal, though. If changes are showing more in the spike, it likely reflects that the spike is more free to change since it’s surface-exposed without the packing constraints that other nucleocapsid proteins have, and those that pack around the RNA. If there are fewer in the spike it would point to a delicate balance re what will work for docking with ACE-2.

    As far as actual codons, the virus itself should not have any preferences since it relies on the host for reproducing it. What would be interesting is an analysis of the codons in the reference (Wuhan) sequence vs. what’s out there now, to see if the new variants are heading in the direction of adopting human codon preferences. That might just be more of a reason for enhanced virulence v actual amino acid substitutions.

    Another thought on that – if indeed the codons are becoming humanized, the reference strain sequence might provide a clue as to the species it jumped from. (Absolutely no idea whether different mammalian species have different codon preferences.)

  9. I gather from point 1 that it might be useful to vaccinate with 2 or even 3 different vaccines (when they target different parts of the virus)?

    1. The “cocktail” approach is used in tri- and quadrivalent vaccines, like the seasonal flu shot. So, you could make a vaccine that has components designed to raise immunity against various antigenic parts or strains of the coronavirus. Such vaccines are formulated to be given simultaneously, rather than sequentially, but I don’t see an in-principle objection to sequential vaccines. You would of course need to ensure that the combinations do not have adverse interactions, and work out the proper timing.

