More on genes and geography: diagnosing your ancestry from your DNA

February 29, 2012 • 6:54 am

When discussing the question of human races yesterday, I wondered if anyone had ever tried to diagnose an individual’s geographic origin based on his/her complement of genes (“genotype”), and said that I was unaware of any such attempt.  Clearly I haven’t been keeping up with the human-genetics literature, because several people called my attention to a paper in Nature (2008) by John Novembre et al. (free at the link), which does just that.  It doesn’t really bear on the question of “races” (except in showing that discrete racial groups don’t exist in Europe), but it does show that you can do a pretty good job telling where people came from by looking at their DNA.

I’ll be brief here: Novembre et al. did a “high-throughput” DNA analysis of variable bits of DNA in the genome of 1387 people from every country in Europe, ranging from Russia to Portugal. For each of these individuals they looked at an astounding 197,146 bits of genes: variable nucleotide sites known as “SNPs” (single nucleotide polymorphisms). Yes, this degree of analysis is possible on a single chip; in fact, one can examine variation at 500,568 sites! From the genetic differences among people they used a statistical algorithm to group together individuals with similar genotypes.  They also knew the “ancestry” of each individual, defined as the country of origin of each individual’s grandparents, as well as the place where the sampled individual lived.

The first observation is that statistical analysis clearly showed individuals falling into clusters corresponding to their geographic location.  This figure, a plot from “principal components analysis”, is a way to get the most information out of individuals’ genotypes using two axes of differences.  The plot shows where each individual fell on the combination of axes (the big dots are the median values for individuals from each country):

As the authors note:

The resulting figure bears a notable resemblance to a geographic map of Europe (Fig. 1a). Individuals from the same geographic region cluster together and major populations are distinguishable. Geographically adjacent populations typically abut each other, and recognizable geographical features of Europe such as the Iberian peninsula, the Italian peninsula, southeastern Europe, Cyprus and Turkey are apparent.
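
For readers who want to see what such a plot involves, here is a minimal sketch of principal components analysis on a toy genotype matrix. Everything in it is an assumption made for illustration: the data are invented (rows are individuals, columns are SNPs coded as 0, 1, or 2 copies of an allele), the function names are hypothetical, and power iteration is used only because it is short. It is not the software or the exact procedure used by Novembre et al.

```python
import math

def center_columns(matrix):
    """Subtract each column's mean so every SNP has mean zero."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in matrix]

def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def leading_axis(matrix, iterations=200):
    """Direction of greatest variance, found by power iteration on X^T X."""
    n_cols = len(matrix[0])
    axis = normalize([float(j + 1) for j in range(n_cols)])  # arbitrary start
    for _ in range(iterations):
        scores = [sum(r[j] * axis[j] for j in range(n_cols)) for r in matrix]
        axis = normalize([sum(scores[i] * matrix[i][j] for i in range(len(matrix)))
                          for j in range(n_cols)])
    return axis

def project(matrix, axis):
    """Coordinate of each row (individual) along the given axis."""
    return [sum(r[j] * axis[j] for j in range(len(axis))) for r in matrix]

def first_two_components(genotypes):
    x = center_columns(genotypes)
    axis1 = leading_axis(x)
    pc1 = project(x, axis1)
    # Deflation: remove what axis1 explains, then find the next axis.
    residual = [[x[i][j] - pc1[i] * axis1[j] for j in range(len(axis1))]
                for i in range(len(x))]
    axis2 = leading_axis(residual)
    pc2 = project(residual, axis2)
    return pc1, pc2

# Invented data: six individuals, four SNPs, two hypothetical regions.
labels = ["north", "north", "north", "south", "south", "south"]
genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 0, 2, 1],
    [0, 1, 1, 2],
]

pc1, pc2 = first_two_components(genotypes)
for label, a, b in zip(labels, pc1, pc2):
    print(f"{label}: PC1={a:+.2f}  PC2={b:+.2f}")
```

Plotting PC1 against PC2 for each individual gives the kind of scatter described above. In this toy case the two invented regions fall on opposite sides of PC1; the sign of each axis is arbitrary, so only the relative positions matter.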

In other words, geographically closer populations are more genetically similar, as expected if individuals tend to mate with other individuals from the same country or from close by.  This is an “isolation by distance” model: genetic similarity falls off gradually with distance.  As the authors note, this does not support the existence of “discrete, well-differentiated populations,” i.e., there are no races. None are expected in such a small area, particularly because biological “races” are those populations that (at least at one time) were geographically isolated and genetically differentiated. That geographical isolation never happened in Europe.

The authors also note that “the data reveal structure even among French-, German- and Italian-speaking groups within Switzerland.” Here’s what that small land looks like genetically:

How about using an individual’s DNA to predict his/her ancestry?  The analysis here involved “training” a computer algorithm on the centers of each country of ancestral origin, and then using a multiple-regression approach (presumably based on the decay of genetic similarity with distance observed by the authors) to “predict” the ancestral origin of each individual’s genes (i.e. the location of his/her grandparents).  The prediction did pretty well: here’s a figure showing the “predicted” location of each individual, labeled by actual country of origin. Note the close correspondence between prediction based on genes and actual country of origin based on self-report, and how individuals of the same color group together (meaning that genetically similar individuals tend to have geographically similar origins):

And here’s a bar graph showing how accurately one can predict the geographic origin of each individual from his/her genes.  The “accuracy” shows the discrepancy between the individual’s actual origin and the place of origin predicted by his/her DNA. It’s cumulative (accuracies must sum to 1 over the total distance, 2500 km), but the darkest bar at the bottom, for instance, shows the proportion of individuals in each country whose place of origin was assigned, by genetics, to within 400 km of their actual place of origin.
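
The regression step (train on individuals of known origin, then predict a location from genotype) can be sketched roughly as follows. This is my own simplified illustration, not the authors’ actual model: I am assuming that each individual has already been reduced to two principal-component scores, that latitude and longitude are each fitted as a plain linear function of those scores, and that the made-up training values below are exactly linear so the result is easy to check. All function and variable names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_least_squares(features, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(f) for f in features]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefficients, feature):
    """Apply fitted coefficients (intercept first) to one individual's scores."""
    return coefficients[0] + sum(c * f for c, f in zip(coefficients[1:], feature))

# Invented training set: (PC1, PC2) scores with reported origins in degrees.
training_pcs = [(-2.0, 1.0), (-1.0, -1.5), (0.0, 0.5), (1.5, 2.0), (2.0, -1.0)]
training_lat = [52.0, 46.0, 49.0, 50.5, 44.0]
training_lon = [5.0, 5.5, 10.5, 16.5, 15.0]

lat_model = fit_least_squares(training_pcs, training_lat)
lon_model = fit_least_squares(training_pcs, training_lon)

# A new individual whose origin we pretend not to know.
unknown = (0.5, -0.5)
print("predicted latitude: ", round(predict(lat_model, unknown), 2))
print("predicted longitude:", round(predict(lon_model, unknown), 2))
```

With real data the fit would of course be noisy, which is exactly why the paper reports how far the predictions fall from the reported origins.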

The ability to assign locations from genes alone is damn good; as the authors note:

As the fine-scale spatial structure evident in Fig. 1 [first figure shown above] suggests, European DNA samples can be very informative about the geographical origins of their donors. Using a multiple-regression-based assignment approach, one can place 50% of individuals within 310 km of their reported origin and 90% within 700 km of their origin (Fig. 2 and Supplementary Table 4, results based on populations with n > 6). Across all populations, 50% of individuals are placed within 540 km of their reported origin, and 90% of individuals within 840 km.
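
Statements like “50% of individuals within 310 km” come from measuring, for each person, the distance between reported and predicted origin and then summarizing those distances. Here is a small sketch of that bookkeeping. The coordinates are invented, the function names are hypothetical, and I am assuming great-circle (haversine) distance on a spherical Earth; the paper may have measured distance differently.

```python
import math
import statistics

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def fraction_within(errors_km, thresholds_km):
    """Cumulative share of individuals placed within each distance."""
    n = len(errors_km)
    return {t: sum(e <= t for e in errors_km) / n for t in thresholds_km}

# Invented (reported, predicted) origins as (latitude, longitude) pairs.
pairs = [
    ((48.9, 2.4), (50.1, 4.0)),
    ((41.9, 12.5), (43.5, 11.0)),
    ((52.5, 13.4), (52.0, 19.0)),
    ((40.4, -3.7), (41.0, -1.0)),
    ((59.3, 18.1), (54.0, 12.0)),
]

errors = [haversine_km(*reported, *predicted) for reported, predicted in pairs]
print("median error (km):", round(statistics.median(errors)))
print("share within each distance:", fraction_within(errors, [400, 800, 2500]))
```

The median error corresponds to the “50% within” figure, and the cumulative shares correspond to the stacked bars in the graph described above.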

Obviously, Europe has not been so intermixed genetically that you can’t diagnose where an individual’s ancestors came from.  This also means that if an individual whose ancestors came from Europe, but who was unsure of where, were subjected to this kind of genetic analysis, you could tell that individual with high probability where his ancestors resided.  Unfortunately, this is an expensive procedure, far more accurate than the DNA tests you can buy for about $125, but eventually you’ll be able to do this for a reasonable amount of money.

Given that the genetic differences between worldwide populations are substantially larger than differences among European countries, this method could obviously be used to diagnose an individual’s recent ancestry from any place in the world, assuming of course that one had huge samples of human genotypes, analyzed for many SNPs, from many places on Earth. That hasn’t been done yet, but I’m sure it’s in the works.  When that happens, you’ll be able to plunk down a hundred bucks and find out with pretty good accuracy where your ancestors resided.

As I said, this doesn’t show that there are discrete “races” in Europe, and I don’t think there are obviously discrete “races” anywhere these days, though there is large-scale genetic differentiation among worldwide populations, suggesting that such races once existed as relatively discrete and geographically isolated populations.  The discreteness that once existed, or so I think, is now blurring out as transportation and migration are beginning to mix the discrete groups into not a melting pot, but sort of a lumpy pudding of humanity.

What is clear is that, with considerable accuracy, you can diagnose an individual’s geographic origin from his genes.  Nearly everyone’s DNA contains reliable information about their recent and ancient past.  We are not all genetically alike. If we were, you couldn’t do studies like the one of Novembre et al.  But neither are we radically different genetically, for if we were, you wouldn’t need hundreds of thousands of SNPs for such accurate predictions.


Novembre, J., T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. Indap, K. S. King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Bustamante. 2008. Genes mirror geography within Europe. Nature 456:98-101.

46 thoughts on “More on genes and geography: diagnosing your ancestry from your DNA”

  1. You can also have a good guess just from looking at people. If you looked at Richard Dawkins and Rio Ferdinand and were told one was born in Britain and one in Africa, it’s obvious which one is which!

    For anyone interested there are loads of posts on this sort of topic at Gene Expression on Discover blogs.

    1. I see what you did there!

      Interesting that the Cypriots are sufficiently distinct from both Greeks and Turks. And that their nearest cluster seems to be British (probably from north London).

      Wonder why the Slovaks are clustered among the Italians, rather than their neighbours.

      1. I recall Steve Jones In the Blood talking about the linguistic divide in Pembrokeshire (south west Wales) where populations were separated along linguistic lines. This would always inhibit mixing I suppose to some extent. Modern borders are artificial constructs & they in turn have an effect on gene flow. I suppose you had two post war generations in East Germany where people only married within East Germany. That would be much different today.
        NB The Gene Expression page is what the Ötzi link above is from.

    2. If you were told one of them had an ancestor who was a slave-owner, and the other had ancestors who were slaves, you’d have a better chance of getting it right. 🙂

  2. My husband interviewed someone about some similar research a few years ago. The researchers had made the interesting discovery that, due to historic migration patterns, white people living in Yorkshire were genetically more similar to Asians (in the UK, the term is used for people of Indian or Pakistani origin) living in Yorkshire than they were to white people living in Lancashire, across the Pennines.

    My husband covered it as an interesting story because of the historic enmity between Yorkshire and Lancashire, but it also showed that people of supposedly different ‘races’ could be more closely related than people of the same ‘race’.

    1. That might be the case for one or two genes but I find it highly unlikely for the genome as a whole. The Asian population of northern England has not intermarried to a significant extent with the indigenous population, so what you seem to be describing is a situation where the indigenous population of one county is genetically more similar to a population located several thousand miles away than to the indigenous population in the next county!
      That is incredibly unlikely, especially as there is likely to have been widespread gene transfer between the two counties over thousands of years and virtually no gene transfer between either county and India/Pakistan.

  3. “This also means that if a an individual whose ancestors came from Europe, but who was unsure of where, was subject to this kind of genetic analysis, you could tell that individual with high probability where his ancestors resided.”

    I’m curious about how well this would work if one’s ancestors were not all from the same part of Europe. For example, I am a South African with Dutch, German, French, English and Portuguese ancestors. Would this analysis show that the majority of my European ancestors were Dutch or German?

    1. I had the same question. My mother is from England, and my Dad from Austria. What would this analysis show I wonder? Would it generate two points, conclude my ancestry is French, or go back hundreds of generations to their last common ancestor?

      1. They deliberately eliminated from analysis all those individuals whose grandparents came from different countries, but there were also some outliers among the populations, which may represent parents from different places.

      2. I presume the results would be sufficient to distinguish contiguous regions of alleles associated with particular regions and hence would be able to say you were the offspring of someone from those two countries. I guess it becomes more difficult the more mixed your ancestry is (for instance, figuring out someone who had one grandparent from a different country would be a lot easier than someone whose 4 grandparents all came from different regions).

      3. Principal component analysis assumes that the various “spectra” being deconvoluted will add together in a linear fashion. Here, I guess a spectrum would be a list of absence/presence of various SNPs. Think 10110 + 01100 = 11210.

        So, hypothetically, yes they could take your mixed spectrum and determine which region-specific spectra could be added together to produce yours. However, there’d probably be multiple solutions – i.e., more than one combination of ancestors which would explain your particular pattern of SNPs.
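
The additive picture in this comment, including the point about multiple solutions, can be sketched with a brute-force search. This follows the comment’s simplified model only: the regional “spectra” are invented presence/absence patterns, the names are hypothetical, and real genotypes are not simply the sum of two population patterns.

```python
from itertools import combinations_with_replacement

def add_spectra(a, b):
    """Element-wise sum of two presence/absence patterns."""
    return tuple(x + y for x, y in zip(a, b))

def candidate_parent_pairs(mixed, regional_spectra):
    """Every pair of regional patterns whose sum equals the mixed pattern."""
    matches = []
    pairs = combinations_with_replacement(regional_spectra.items(), 2)
    for (name_a, spec_a), (name_b, spec_b) in pairs:
        if add_spectra(spec_a, spec_b) == tuple(mixed):
            matches.append((name_a, name_b))
    return matches

# Invented region-specific patterns over five SNPs.
regional_spectra = {
    "region_A": (1, 0, 1, 1, 0),
    "region_B": (0, 1, 1, 0, 0),
    "region_C": (1, 1, 1, 0, 0),
    "region_D": (0, 0, 1, 1, 0),
}

# The comment's example: 10110 + 01100 = 11210.
mixed = (1, 1, 2, 1, 0)

for pair in candidate_parent_pairs(mixed, regional_spectra):
    print(" + ".join(pair))
```

Here both region_A + region_B and region_C + region_D reproduce the mixed pattern, which is the ambiguity the comment describes.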

  4. this doesn’t show that there are discrete “races” in Europe

    Why not? The data clearly show statistically distinct discrete clusters among the Europeans. If that weren’t the case, then it would be impossible to predict ancestral geography, which can be done very accurately.

    These European clusters are no different than those observed worldwide, except that the differentiation measures are much smaller (0.4% versus 10–20%). If one says that biological “races” for humans exist worldwide based on discrete clusters, what’s different about saying the same thing locally? Nothing. The language about race already reflects what we observe in the data, it exists, but is “socially constructed” based on how many arbitrary k-means clusters you choose: “human race” (k = 1), “Caucasian race” (k .= 3), “Scottish race” (k .= 100).

    Rabbie Burns:

    They Scotia’s race among them share:
    Some fire the soldier on to dare;
    Some rouse the patriot up to bare
    Corruption’s heart

  5. I went to an interesting talk on human evolution yesterday that shed some light on some of our distant ancestors – or at least the ancestors of some of us. The speaker was Svante Pääbo, one of the leaders of the neanderthal genome project, and who was also involved in the sequence analysis of the Denisova genome.
    Pääbo spoke in some detail about this Denisova genome. This is the ancient hominid genome assembled from the sequencing of the DNA contained within a little fingertip bone found within a cave in Siberia.
    The published details about this find are fascinating – that it is even more distantly related to modern humans than neanderthals – the common ancestor of all three lived perhaps 1 million years ago – and that it contributed about 5% of the genomes of melanesians but not other modern human groups.
    Anyway the unpublished data that Pääbo mentioned involved the fact that they have now discovered several other skeletal parts, namely teeth, of Denisovan individuals from which they were able to extract DNA and check for their relationship to the originally sequenced individual. The really interesting thing Pääbo mentioned was that these teeth looked “scary”. He said that the researchers actually thought they might be teeth from a cave bear until they got the DNA results back!
    Apparently the Denisovans had very archaic looking teeth compared to humans or neanderthals. It’s fascinating stuff and it is worth noting that the original DNA sequence, although very complete, wasn’t enough to give us a picture of how they looked, it took bones to provide that evidence.
    He also mentioned an analysis of both the neanderthal and Denisova genomes to check for the chromosome 2 fusion (the one that distinguishes humans from the great apes) – and it is present in both neanderthals and Denisovans – so the fusion event predates the common ancestor of us, the neanderthals and the Denisovans.

  6. I believe that should be 190,000 SNPs, not genes.

    Thanks for posting this; the methods are very timely as I tackle subspecies issues in tree species!

  7. good stuff.
    small correction:

    In other words, genetically geographically closer populations are more genetically similar,

  8. ‘It doesn’t really bear on the question of “races”—except showing that discrete racial groups don’t exist in Europe—but it does show that you can do a pretty good job telling where people came from by looking at their DNA.’

    I don’t quite understand why people get so hung up on the ‘discreteness’ issue. If there is correlation between genetic structure and geography (as here) then you perforce have between people from different places statistical differences both in genes, and in things like personality, culture, physical and mental attributes, etc etc. The important question – whether in a scientific or in a political sense – is going from correlation to causation: do these genetic differences have any explanatory power in understanding the differences between groups? Any useful science, and any explosive/icky politics comes out right there. Whether the groups so defined are “discrete” seems to be of pretty limited interest by contrast – we *know* the physical attributes of men and women (height, weight, strength, longevity) overlap a lot. That doesn’t stop the differences from existing and mattering. And it matters even more out in the tails – plenty of women sprint faster than me, but not one could outrun a professional male sprinter.

    1. To put it more succinctly, the difference between ‘racial’ and ‘clinal’ variation isn’t that politically relevant. Whichever way you do it, Indians differ from Belgians genetically, and you can start to investigate whether the genetic differences matter, and how much.

  9. The mapping based on nationalities is misleading. The Swiss example shows why.
    Ethnic Italians are separated from the Germanophones and Francophones by the geographic barrier of the Alps. There is no such sharp separation between Francophones and Germanophones. (Although a cultural frontier roughly on the present linguistic boundary can be observed as far back as the Neolithic, 6000 years ago, millennia before the arrival of current ethnic groups.) The linguistic and genetic territorial distribution follows a mixture model, rather than discrete clusters.

    This is hardly the exception. Two examples: native Alsatians are hardly distinguishable from their neighbours across the Rhine. The Basque are even more fun. Linguistically and culturally, they are clearly ‘the odd people out’. Those arguing for particularism point out the phylogeny of mitochondrial haplogroup U8. Those emphasising regional continuity point out the Y-chromosomal similarities to their neighbours.

    Anyhow, national boundaries provide the wrong level of territorial granularity for this type of analysis.

  10. I have seen the first picture many times on the web in the last few years and always wondered how big the sample was if an obvious error like the position of the Slovakians (SK) could occur. Now I see from the later pictures that the Slovakian sample is actually next to zero. They just picked up somebody with unusual genetics for that area.

    Other studies of genetic distribution in Europe have the same problem, or an even worse one, because of insufficient data. This team at least was careful enough not to colour maps with computer models, unlike some others. 😛

    “Only half-joking, have they reversed the labels on Slovaks and Slovenes?”

    That is another explanation. But in this case they should have fixed it a long time ago. It is an old picture.

  11. JAC: “Unfortunately, this is an expensive procedure, far less accurate than the DNA tests you can buy for about $125, but eventually you’ll be able to do this for a reasonable amount of money.”

    I think you mean to say “this is an expensive procedure, far more accurate…”

  12. I would definitely recommend the National Geographic project that lets you send in a DNA sample and they track your deep ancestry (if they’re still doing it). I did it and got a map showing the path my ancestors took out of Africa, through the Middle East, over into Asia somewhat, and then back into Europe and finally more or less into Italy (although it doesn’t show a detailed family tree of that sort). It was worth it.

  13. As a useful complement to the common mental model of thinking of racial groups as smaller and less distinct species, I suggest thinking of racial groups as larger and more distinct extended families. We’re all used to the reality of belonging to multiple extended families, and of using different definitions for different purposes (e.g., who should you send a Christmas card to v. who should you send a Christmas present to). The mechanism that makes racial groups more less diffused than typical extended families is inbreeding.

    1. Steve Sailer said:

      The brutal truth: Obama is a “wigger”. He’s a remarkably exotic variety of the faux African-American, but a wigger nonetheless.

      The funny part: Steve Sailer is a wigger too, a white person descended from black people. Now that we have a confirmed racist at the site, I’d like to ask Steve Sailer how it feels to know that his own ancestors were all niggers. Do racists deny genetics? If they don’t, how do they support their racism? Answers to these questions would be fascinating.

  14. @ Steve Smith,

    You’re showing your ignorance here. Sailer actually is very familiar with genetics. Ask academics Steven Pinker, Greg Cochran, Henry Harpending, or Steve Hsu.

    1. Sailer actually is very familiar with genetics.

      Familiarity with genetics isn’t often accompanied with racist statements like this:

      Steve Sailer said:

      What you won’t hear, except from me, is that ‘Let the good times roll’ is an especially risky message for African-Americans. The plain fact is that they tend to possess poorer native judgment than members of better-educated groups. Thus they need stricter moral guidance from society. … In contrast to New Orleans, there was only minimal looting after the horrendous 1995 earthquake in Kobe, Japan — because, when you get down to it, [the] Japanese aren’t blacks.

      It would be interesting to watch Sailer defend his racist regurgitations in light of the fact that Sailer’s human ancestors are mostly black. I would also be interested to hear Sailer tell us which genes blacks have that cause them to possess “poorer native judgment,” and how it came to be that Sailer doesn’t possess these genes himself.

      He’s posted here twice now and there’s no reason he couldn’t respond to these simple questions, unless of course he’s a kkkoward.

  15. One interesting special case: Atzmon et al., Abraham’s Children in the Genome Era: Major Jewish Diaspora Populations Comprise Distinct Genetic Clusters with Shared Middle Eastern…, The American Journal of Human Genetics (2010), doi:10.1016/j.ajhg.2010.04.015

    Here you have a group that has remained genetically recognisable through endogamy despite wide geographical dispersal.

  16. @ Steve Smith,

    You may also want to look up the definition of wigger. Quoting the Urban Dictionary: ” male caucasion, usually born and raised in the suburbs that displays a strong desire to emulate African American Hip Hop culture and style.”

  17. Greetings,

    A few years ago, Steve Jones did a program on BBC about using DNA to study family histories.

    One “blue-blooded” English woman was NOT AMUSED to discover that she had 40% Middle Eastern DNA in her!

    Another family, whose father was noted for having large hands, discovered that they were descended from a French knight called “Grand Main” (Large Hand)

    Quite fascinating.

    Kindest regards,


  18. What $125 tests do you mean? Tests of hundreds of thousands of SNPs are commercially available from several sources. The $200 23andMe test covers a million SNPs, probably including 90% of the SNPs in the paper.

  19. I wonder if the bad geographic positioning in Sweden, dragging down the average, results from a small sample within a large nation (~ 1600 km in length)?

    1. The spatial distance coefficients in such studies should be seriously corrected for anisotropy. AFAIK, they are not, at least not in the one at hand.

      Lao et al. 2008 (DOI 10.1016/j.cub.2008.07.049), a concurrent study similar to Novembre et al. 2008, based in part on the same data pool, uses kernel density plots (Fig. 1A) of PC1 and PC2 in lieu of the cluster scatters to visualise the spatial distribution. This approach has the merit of at least making the anisotropy explicit, if not adjusting for it.

      Empirical anthropologists learn early on that, until very recently, roughly 70-80% of the population married and settled within a radius of, roughly, 5-30km from their origin. The distribution of the density is usually crater-shaped, with a corresponding decay function. This would suggest a field approach, rather than measurements based on linear distances. Is anyone aware of such studies? If so, could they please weigh in?

    2. I noticed that dip in accuracy. There can be a very large distance between the southern Swedish population and the northern Swedish population.

  20. As I’ve been examining this material, I was introduced (at the suggestion of anthropologist Henry Harpending) to the work of Guido Barbujani. His 2010 paper (co-authored with Vincenza Colonna) is a very careful overview of the scientific literature from someone who has been studying human genetic diversity for a long time. It’s readable and current and answers many of the issues posed here: Human genome diversity: frequently asked questions.

    I discuss related issues in a blog-post titled Race redux: What are people “tilting against”?

    Thank you for your consideration.

  21. I heard a paper at the AAAS Western meetings last year which included some material on the connection between vitamin D deficiency in dark-skinned individuals and rickets, which led to deaths in childbirth, resulting in a shift toward lighter skin color.

