A lovely graph that tells our story

October 9, 2017 • 8:00 am

by Matthew Cobb

I came across this beautiful graph in an article in the journal Cell this week. It shows declining levels of genetic variability among 51 populations of humans across the planet, plotted against the distance of each population from East Africa:

The data in the figure are from a 2008 paper in Science by Jun Li and co-workers [JAC: reference at bottom; free access] looking at human genetic variation. They studied 938 unrelated people and 650,000 genetic variants, measuring the levels of heterozygosity in each population – this is the frequency with which individuals had different copies of each genetic variant.
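To make that definition concrete, here is a minimal sketch of how a mean heterozygosity could be computed from genotype data. The genotypes below are invented for illustration and are not from the study, and the paper's own estimator may well differ in detail; the function names are likewise just assumptions.

```python
# Toy illustration only: these genotypes are invented, not data from Li et al. (2008).
# Each individual is a list of genotypes, one per locus; a genotype is a pair of alleles.

def individual_heterozygosity(genotypes):
    """Fraction of loci at which the two allele copies differ."""
    het_loci = sum(1 for a, b in genotypes if a != b)
    return het_loci / len(genotypes)

def mean_heterozygosity(population):
    """Average of the per-individual heterozygosity across a population."""
    total = sum(individual_heterozygosity(g) for g in population)
    return total / len(population)

toy_population = [
    [("A", "G"), ("C", "C"), ("T", "C"), ("G", "G")],  # heterozygous at 2 of 4 loci
    [("A", "A"), ("C", "C"), ("T", "C"), ("G", "G")],  # heterozygous at 1 of 4 loci
    [("A", "G"), ("C", "T"), ("T", "T"), ("G", "A")],  # heterozygous at 3 of 4 loci
]

print(mean_heterozygosity(toy_population))  # 0.5
```

With 650,000 loci instead of four, the calculation is the same, just longer.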

This striking result is additional evidence that we originated in Africa and gradually moved around the planet, losing genetic variability as we went. The last places we reached in this survey – the Americas – show the lowest levels of variability.

This is exactly what you would expect: in species that have spread geographically, the ancestral populations have the highest levels of genetic variability. Populations that have moved into new areas tend to lose variability for two reasons. First, they initially contain just a subset of the variability present in the original population. This is probably the explanation for most of the effect in this figure, as many of the genetic variants the authors studied will be in ‘junk’ DNA that has no effect on the phenotype. Second, where the variants are in genes that do have an effect, variability can be lost again as the population is subject to new selection pressures in its new environment, which further reduces heterozygosity. Or, as the authors put it:

This trend is consistent with a serial founder effect, a scenario in which population expansion involves successive migration of a small fraction of individuals out of the previous location, starting from a single origin in sub-Saharan Africa.
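A rough way to see why a serial founder effect produces a steady decline: under the standard result that one round of random sampling of N diploid founders retains, on average, a fraction 1 − 1/(2N) of the heterozygosity, repeated founder events shrink it geometrically. The sketch below assumes that simple model (one generation of sampling per event, no mutation, no later gene flow), and the starting value, founder number and number of events are hypothetical, not estimates from the paper.

```python
def heterozygosity_after_founder_events(h0, founders, events):
    """Expected heterozygosity after 0..events successive founder events.

    Assumes each event is a single round of random sampling of `founders`
    diploid individuals, which retains on average 1 - 1/(2 * founders)
    of the heterozygosity present before the event.
    """
    retained = 1.0 - 1.0 / (2.0 * founders)
    return [h0 * retained ** k for k in range(events + 1)]

# Hypothetical parameters, chosen only to show the shape of the decline.
trajectory = heterozygosity_after_founder_events(h0=0.7, founders=50, events=30)

for k in (0, 10, 20, 30):
    print(k, round(trajectory[k], 3))
```

When the loss per step is small, a geometric decline like this looks close to a straight line over a modest number of steps, which fits the near-linear trend in the figure.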

The final reason why this figure is so pleasing is that it gives a straight line—that doesn’t happen very often in biology!

However, if we look closely, it’s not totally linear – in particular, African populations can show varying levels of variability that do not appear to be related to geographical distance from East Africa (in fact, from Addis Ababa). If you plotted only the African data, you wouldn’t be very impressed. This African variability may be for a number of reasons: the origin of humans may not have been precisely in East Africa, or humans have lived for far longer in Africa than anywhere else on the planet, and may have been subject to particular selection pressures reducing their variability (for example, in an isolated group). An explanation of those four African points at the top, which show essentially identically high levels of variability, may be that there were consistently high levels of gene flow between these groups, maintaining the variability.

Whatever the case, this figure underlines that we are a global species, spanning out across the planet, adapting and losing genetic variability as we traveled.


Li, J. Z., D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers. 2008. Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Science 319:1100-1104.

48 thoughts on “A lovely graph that tells our story”

      1. It is high time that the Ontario Ministry of Education get into the 20th century and start including this in the curriculum. We are cheating our kids by preventing them from learning these fascinating facts about human origins.

    1. I was thinking along the same lines. Until I read yours, my post was going to be: “Shouldn’t they have measured the heterozygosity against distance from Mount Ararat? Oh, wait, I keep forgetting I’m no longer a know-nothing creationist wack-a-loony tune.”

      Since this graph is about as close as you’re going to get to a simple “smoking gun” for human evolution, I hope it gets wide distribution.

  1. I wonder how this might change far into the future due to modern mobility? Assuming humans are even here that long.

      1. Thanks…Very interesting videos. The next billion years does not look good and our only chance will be getting off this planet.

        The idea of population peaking out at 11 billion or so is very interesting as well.

        1. Hello Randy,

          Thank you for visiting my special post and watching some of the videos there. I would appreciate the chance of reading your feedback in the form of one or more comments at the comment section of my special post.

          Regarding your statement “The next billion years does not look good and our only chance will be getting off this planet.”, it may not be enough to leave the Earth or the solar system, as poetically explained and graphically depicted at https://soundeagle.wordpress.com/2012/09/07/soundeagles-poetry-with-enigmas-goodbye-milky-way/

          The multiple comments at the comment section of the post are also worth reading.

          Please enjoy!

    1. My guess is that the effect of mobility will be negligible compared to the effect of genetic engineering, and that era of natural genetic variation in humans is almost over. A thousand years from now, most human genomes (for some definition of “human”) will likely be the result of deliberate design rather than random recombination.

  2. I haven’t but could read the paper. With mathematical training I might possibly be able to understand the actual procedure for calculating the term “mean heterozygosity” of the vertical (‘dependent’) axis of the graph. But I’d be pessimistic on that, i.e. whether I could understand the specialized DNA language needed to understand it—getting too damn old!

    So could anyone here explain it in simple enough terms for a non-biologist?

    Also I’m a native North American, born in Noranda, Quebec, but not indigenous/aboriginal, so likely that’s what was meant by “native” in a couple of comments above.

    In any case, that graph is a very convincing simple geometrical explanation of the change over the last century of understanding the ‘where’ of human origins. DNA has advanced stuff like that, and e.g. like whether Thor Heyerdahl’s conjectures were correct, by several orders of magnitude.

    1. It’s pretty simple. The oldest populations tend to be the most genetically variable, as they’ve been around the longest to accumulate mutations in their DNA. But a small sample of this population will have less genetic variation. If you see a variable population, like that in East Africa, and then the variation declines as one moves away from that location, a reasonable inference—and one now supported by copious data—is that humans colonized the world by moving out of East Africa, with each colonization involving a small sample of mobile humans from the preceding population. Thus, as we see variation declining as we move away from Africa, it’s fair to conclude (which we now know from fossil data) that East Africa was our ancestral home, and then humans moved, like stepping stones, away from that home to populate the world, with each stone containing successively less genetic variation.

      Does that make sense?

      1. Yes it does, very clear on the overall argument from those numbers in the graph. Thanks.

        I was wondering though how numbers in the graph are actually computed from DNA data of people. For example, it has very roughly the number 0.5 for Americans, both North and South presumably, and similarly approximately 0.7 for Africans.

        Maybe there’s a good reference for non-experts on the technicalities of DNA. That non-expertise is a failure of mine which I regret more than most other such failures. And I must admit that when feeling ‘bitchy’, I myself might internally reply to somebody: “Well, get out the damn elementary textbooks and learn the stuff!” Perhaps that is the only possible way to get more understanding.

      2. I’d like to add that this is quite a misleading graph because of its use of heterozygosity as the y axis. If we sample two alleles from the population at a given locus, expected heterozygosity is the probability that the two alleles are different alleles. Like any probability, heterozygosity cannot exceed unity. If we had one population with 20 equally common alleles at a locus, and another population that had 200 equally common alleles at that locus, the expected heterozygosities of the two populations would be within 5% of each other, and within 5% of unity, in spite of the vastly greater diversity of the second population.

        So this graph greatly understates the degree to which African diversity exceeds that of the other groups. A mathematically more reasonable approach would linearize the diversity measure with respect to pooling of groups, using Kimura and Crow’s “effective number of alleles”. The conversion is n_e = 1 / (1 − H), where H is the heterozygosity.
        The y-axis would be stretched and we would see the differences between populations on a more meaningful scale.
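A quick numerical check of the 20-versus-200 allele example, as a sketch. It assumes the usual definitions: expected heterozygosity H = 1 − Σp², and Kimura and Crow's effective number of alleles n_e = 1 / (1 − H). The function names are just illustrative.

```python
def expected_heterozygosity(freqs):
    """Probability that two alleles drawn at random from the population differ."""
    return 1.0 - sum(p * p for p in freqs)

def effective_number_of_alleles(h):
    """Kimura and Crow's effective number of alleles for heterozygosity h (h < 1)."""
    return 1.0 / (1.0 - h)

for k in (20, 200):
    # k equally common alleles, each at frequency 1/k
    h = expected_heterozygosity([1.0 / k] * k)
    print(k, round(h, 3), round(effective_number_of_alleles(h), 1))
```

The two heterozygosities come out at about 0.95 and 0.995, barely distinguishable on the graph's scale, while the effective numbers of alleles recover the tenfold difference.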

        1. But changing the scale wouldn’t change the direction of the relationship would it? Wouldn’t the line still slope down from left to right with data points falling similarly along the line?

          1. No, it’s a monotonic transformation, preserving the order of the points on the y-axis. It just gives a more meaningful representation of the magnitudes of the diversity differences between populations. Oceania and Americas would still be similar in diversity, but the highest African points would be MUCH higher than the East Asian points. The highest African populations are nearly twice as diverse as the Oceanic populations.

            1. “The highest African populations are nearly twice as diverse as the Oceanic populations”
              should have been
              “The highest African populations are nearly twice as diverse as the East Asian populations”

    2. Recall that everyone has two copies of each gene, one from each parent. If these two copies are identical, that individual is said to be homozygous at that location; conversely, someone with two different gene variants is said to be heterozygous at that locus.

      So given the 650,000 loci in the study, it’s straightforward to calculate the fraction of those loci for which any given individual is heterozygous. Averaging those numbers over a population then gives the mean heterozygosity for that population.

      Does that help?

      1. Yes, thanks very much. It seems simpler than I feared, and not dependent on the deeper knowledge of DNA that I don’t have.

        Somebody changing the definition of that
        so-called dependent variable by squaring, or logging, or whatever, would destroy the approximate linearity. But that would be silly, with it being a fairly simple probability by frequency.

        1. See my argument above about why treating heterozygosity (which can never be greater than 1) as a legitimate measure of diversity (which is unbounded above) is a bad idea.

          1. Yes, I wrote that before seeing your argument, after which I realized that argument contradicted my last sentence there! You are clearly well ahead of me on the biology side, maybe the mathematical too. In any case, the noting of linearity seems not very important here, nor slam-dunk convincing.

      1. Yes, because lack of genetic diversity in your one crop means that a pathogen like Phytophthora infestans wipes out your Irish Lumper variety of potato. Other potato varieties, sadly not in Ireland, survived and thus so did the potato. Phytophthora infestans may have killed off the Lumpers but not all the potatoes. Humans are the Lumpers in this analogy, not the Irish.

        1. I do not understand why you are saying this apropos of this article though. This is just about the relative genetic diversity of human groups. We still have higher diversity than a specific breed of potato. The lowest diversity group, Native Americans (or Amerindians, as often called in genetic articles), were pretty much decimated by a mixture of diseases upon European contact, but even they weren’t wiped out by them.

  3. “populations moved to new areas” – a frequent comment. But never any comments about how much of a given population moved, and how much stayed behind, or why those who moved did move. Maybe because there can only be speculative guesses? When, eg., people trekked across the Bering land path to North America, how many, if any of their groups stayed behind in Asia? And why? Was it only the adventurous who migrated?

    1. I don’t think it would be guesses, but more like estimates with error bars. No idea how it’s done, though.

      1. Well you can calculate the effective population size (not the actual size, but the breeding size) and this can tell you how big the breeding population was at various times. Jerry posted on this a while back:


        But more to the point, you don’t need to think of those movements being some long-distance migration or bold expedition.

        Look at the X-axis: 25,000 km. Think about the route we’d have taken – we could have walked (with the exception of Oceania). Around 70,000 years ago we left Africa; we reached Tierra del Fuego about 10,000 years ago. So we could have gone all that distance by just moving our home range slightly, by less than 1/2 a km, each year. Nothing amazing, or bold (except Oceania). Just hunter-gatherer people walking and living in the landscape and going a little bit further each year.

        That having been said, we don’t actually know – the archaeology isn’t there. But this scale shows you can envisage it in very small steps.

        – MC
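The back-of-the-envelope arithmetic in that comment can be written out as follows. The figures are the round numbers quoted above, not precise estimates, and the variable names are only illustrative.

```python
# Round numbers taken from the comment above; none of these are precise estimates.
distance_km = 25_000                         # approximate walking distance from East Africa
left_africa_years_ago = 70_000
reached_tierra_del_fuego_years_ago = 10_000

elapsed_years = left_africa_years_ago - reached_tierra_del_fuego_years_ago
km_per_year = distance_km / elapsed_years

print(f"{elapsed_years} years, {km_per_year:.2f} km per year")  # 60000 years, 0.42 km per year
```

So the average shift in home range needed is a bit over 400 metres a year, consistent with "less than 1/2 a km".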

        1. So we could have gone all that distance by just moving our home range slightly, by less than 1/2 a km, each year. Nothing amazing, or bold (except Oceania). Just hunter-gatherer people walking and living in the landscape and going a little bit further each year.

          I sometimes use the “row with the in-laws” scenario. Every generation (~20 years) someone has a row with the in-laws and moves off far enough that it’s barely possible to walk to see them and get home in one day – around 15km each way. Which gives an average of around a km/year.

  4. This very clearly illustrates a nice part of our story. It works with other representations that look at genetic diversity in our mitochondrial DNA. These, as shown here: http://www-old.hud.ac.uk/research/researchcentres/targ/ show differences in branch length as a measure of diversity within a population. And here one can see that there are sub-Saharan African populations that are most diverse and therefore older. The connections of the branches also show that the European and Asian populations are descended from particular African populations.

  5. The data are clearly heteroskedastic (it’s a statistical term and I don’t mean heterozygosity). A GLS (generalized least squares) model would likely fit very well where the error variance is modelled as a function of the independent variable. There is an obvious outlier in the American data which would need special statistical treatment.

  6. I couldn’t help but notice that the data aren’t really normally distributed. You can get the raw data from the Science website. If you try regression using a nonparametric method, like Siegel’s repeated medians, you get basically the same answer, but the 95% confidence range is much wider, especially for the values on the right-hand side where sampling is not as dense.
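For anyone curious what a repeated-medians fit involves, here is a minimal pure-Python sketch of Siegel's estimator. The data points are synthetic, made up to lie on a straight line with one deliberate outlier, and are not the values from the paper; the sketch also does not compute the confidence range mentioned above.

```python
from statistics import median

def repeated_median_line(xs, ys):
    """Siegel's repeated-medians fit: returns (slope, intercept).

    For each point, take the median of the slopes to every other point;
    the fitted slope is the median of those per-point medians.
    Assumes the x values are distinct.
    """
    n = len(xs)
    per_point = []
    for i in range(n):
        slopes = [(ys[j] - ys[i]) / (xs[j] - xs[i]) for j in range(n) if j != i]
        per_point.append(median(slopes))
    slope = median(per_point)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept

# Synthetic example: y = 0.70 - 0.008 * x, with the point at x = 20 replaced by an outlier.
xs = [0, 5, 10, 15, 20, 25]                   # distance, thousands of km (made up)
ys = [0.70, 0.66, 0.62, 0.58, 0.30, 0.50]     # heterozygosity (made up)

slope, intercept = repeated_median_line(xs, ys)
print(round(slope, 4), round(intercept, 3))
```

The single outlier barely moves the fitted line, which is the point of using a robust method here. SciPy users can get the same estimator from `scipy.stats.siegelslopes`.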

  7. Wonderful Graph (despite Lou Jost’s observation).
    I guess the highest peak represents southern Africa which would include the Khoi and Bushmen populations.

    1. I thought the Khoi and/ or San had a significant input of genes from the spread of pastoralist populations from the S/E around a millennium ago.
      People is complex.
