Genomic ancestry tests are not cons, part 2: the problem of ethnicity

The results to the left are from 23andMe for someone whose paternal grandparents were immigrants from southern Germany. Their mother had a father who was of English American background (his father was a Yankee American with an English surname and his mother was an immigrant from England), and grandparents who were German (Rhinelander) and French Canadian respectively on their maternal side.

Looking at the results from 23andMe, one has to wonder: why is this individual only a bit under 25% French & German, when genealogical records show places of birth that indicate they should be 75% French & German (more precisely, 62.5% German and 12.5% French)? And though their ancestry is 25% English, only 13% of it is listed as such.
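The genealogical expectation can be checked with a little arithmetic. Below is a minimal sketch in Python. The labels are my own shorthand, and it assumes each grandparent contributes exactly 1/4 and each great-grandparent exactly 1/8, which is only the expectation; real inheritance varies around those values because of recombination.

```python
from fractions import Fraction

# Expected genealogical contributions for the pedigree described above.
# Assumption: grandparents contribute exactly 1/4 each, great-grandparents 1/8 each.
contributions = [
    ("German", Fraction(1, 4)),   # paternal grandfather (southern Germany)
    ("German", Fraction(1, 4)),   # paternal grandmother (southern Germany)
    ("English", Fraction(1, 4)),  # maternal grandfather (English American)
    ("German", Fraction(1, 8)),   # maternal grandmother's Rhinelander parent
    ("French", Fraction(1, 8)),   # maternal grandmother's French Canadian parent
]

# Sum the contributions per label.
expected = {}
for label, share in contributions:
    expected[label] = expected.get(label, Fraction(0)) + share

for label, share in expected.items():
    print(f"{label}: {float(share) * 100:.1f}%")
```

This gives 62.5% German, 25% English, and 12.5% French, so 75% French & German combined, which is the figure the 23andMe result should be compared against.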

First, notice that nearly half of their ancestry is “Broadly Northwestern European.” Last I checked 23andMe uses phased haplotypes to detect segments of ancestry. This is a very powerful method and is often quite good at zeroing in on people of European ancestry. But with Americans of predominant, but mixed, Northern European background, rather than precise proportions you often obtain results of the form “Broadly…”, presumably because recombination has generated novel haplotypes in white Americans.

But this isn’t the whole story. Why, for example, are many of the Finnish people I know on 23andMe assigned as >90% Finnish, while a Danish friend is 40% Scandinavian?

The issue here is that to be “Finnish” and “Scandinavian” are not equivalent units in terms of population genetics. Finns are a relatively homogeneous ethnic group who seem to have undergone a recent population bottleneck. In contrast, Scandinavia encompasses several different, albeit related, ethnicities which are geographically widely distributed.

Ethnic identities are socially and historically constructed. Additionally, they are often clear and distinct. This is not always the case for population genetic classifications. On a continental scale, racial classification is trivial, and feasible with only a modest number of genetic markers. Why? Because the demographic and evolutionary history of Melanesians and West Africans, to give two concrete examples, are distinct over tens of thousands of years. Population genetic analyses which attempt to identify or differentiate these groups have a lot of raw material to work with.

To the left, you see a PCA plot of Papuans, Yoruba, and Swedes. They are clear and distinct populations. I pruned the marker set down to 750 SNPs. Now, since these were SNPs selected to be variable in human populations, they aren’t just random markers. They are biased toward being informative of population history. That being said, notice how distinct the groups are.

The Yoruba and Swedes and Papuans are separated by 50,000 to 100,000 years of history. That history is reflected in the genetic variation. And the social construct of an ethnocultural identity is nested within that demographic history. The Yoruba people are a coherent cultural unit. Similarly, the Swedes emerged in the last 1,000 years through a fusion of tribes such as the Geats and Svear. The Papuans are a different case, as “Papuan” brackets a whole range of groups. To a great extent, one can argue that a self-conscious Papuan identity is a product of the 20th century, because of political forces (the independence of Papua New Guinea), and large-scale contact with Europeans and Austronesians. Nevertheless, when comparing extremely different groups, an artificial catchall ethnic identity such as “Papuan” is quite informative.

Using the same marker set I plotted individuals from the Yoruba and Esan ethnic groups from the southwest and south of Nigeria, respectively. It is immediately clear that you can barely differentiate the Esan from the Yoruba genetically. At least with 750 SNPs.

The Esan and Yoruba have distinct identities, but culturally they are not too distinct from each other. They even share some traditional deities. Being close neighbors, there has likely been a great deal of gene flow, and their shared common ancestors are much closer in time to the present than in the cases I illustrated above.

But when I increased the marker set to ~250,000 SNPs the Yoruba and Esan were clearly distinct populations. This is not surprising. Often today we are wont to assert that ethnic identities are recent historically contingent creations. The reality is many ethnic identities were assembled out of clear and distinct preexistent elements, which had their own history, and so could be reflected in genetics.
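The effect of marker count can be illustrated with a toy simulation. This is only a sketch, not the analysis behind the plots above: the genotypes are simulated rather than real, the two populations are drawn from a Balding-Nichols model that I am assuming for convenience, and the divergence value `FST = 0.005` and the sample sizes are hypothetical. It runs a PCA on a small subset of markers and then on the full set, and measures how far apart the two groups sit on the first component.

```python
import numpy as np

def simulate_genotypes(n_per_pop, n_snps, fst, rng):
    """Simulate two populations under a Balding-Nichols model (an assumption)."""
    p_anc = rng.uniform(0.1, 0.9, size=n_snps)      # ancestral allele frequencies
    a = p_anc * (1 - fst) / fst
    b = (1 - p_anc) * (1 - fst) / fst
    blocks = []
    for _ in range(2):
        p_pop = rng.beta(a, b)                      # drifted frequencies in one population
        blocks.append(rng.binomial(2, p_pop, size=(n_per_pop, n_snps)))
    labels = np.repeat([0, 1], n_per_pop)
    return np.vstack(blocks).astype(float), labels

def pc1_separation(genotypes, labels):
    """Gap between group means on PC1, in units of pooled within-group SD."""
    sd = genotypes.std(axis=0)
    keep = sd > 0                                   # drop monomorphic markers
    x = (genotypes[:, keep] - genotypes[:, keep].mean(axis=0)) / sd[keep]
    u, s, _ = np.linalg.svd(x, full_matrices=False) # PCA via SVD of standardized genotypes
    pc1 = u[:, 0] * s[0]
    g0, g1 = pc1[labels == 0], pc1[labels == 1]
    pooled_sd = np.sqrt((g0.var(ddof=1) + g1.var(ddof=1)) / 2)
    return abs(g0.mean() - g1.mean()) / pooled_sd

rng = np.random.default_rng(0)
FST = 0.005  # hypothetical divergence between two closely related groups
genotypes, labels = simulate_genotypes(50, 20_000, FST, rng)

sep_few = pc1_separation(genotypes[:, :50], labels)  # pruned marker set
sep_many = pc1_separation(genotypes, labels)         # full marker set

print(f"PC1 separation with 50 SNPs:     {sep_few:.2f}")
print(f"PC1 separation with 20,000 SNPs: {sep_many:.2f}")
```

With only a handful of markers the first component is mostly noise and the groups overlap; with thousands of markers the same small divergence becomes easy to see. The individuals are identical in both runs, only the number of markers changes.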

That being said, the closer two ethnic groups are geographically and socioculturally, the more likely the two groups are to overlap genetically (more precisely, they can be much harder to differentiate). Sometimes though genetics and culture are very different. The Basque people of northern Spain and southwest France are only mildly genetically distinct from their Romance-speaking neighbors, but they are an ethnolinguistic isolate. The cultural chasm in language is huge. But the genetic chasm is much smaller.

Scandinavia is a coherent ethnolinguistic category which encompasses various northern Germanic people who were relatively untouched by Roman cultural influences. This is in contrast to many Germanic tribes to the south, such as the Franks, who emerged in dynamic tension with the rise of the Roman Empire. The final Scandinavian conversion to Christianity, and so admission into the post-Roman European world, began about two centuries after the conversion of the pagan Saxons by Charlemagne.

Later, the two centuries of the Kalmar Union brought all the modern nations of Scandinavia under one ruler. Today, the concept of Norden, which includes non-Scandinavian Finland, expresses the cultural and social connections of the northern peoples.

And yet genetically the reality is more muddled. Looking at samples of Germans, Danes, Swedes and Norwegians, the geographic patterning is clear. Danes occupy a position between Germans on the one hand, and Norwegians and Swedes on the other. Because of Sami ancestry in many Norwegians and Sami and Finnish ancestry in many Swedes they are genetically distinct from continental Germanic peoples to the south, including Danes.

So what is a Scandinavian? A Scandinavian is a Swede, Dane, or Norwegian (or an Icelander). Scandinavians share 1,000 years of history since their integration into the European system. As a cultural category Scandinavians are clear and distinct.

But as a genetic cluster things are not so clear. First, there is the Danish connection to Germany. This is due to both history and geography. People from northern Germany are clearly genetically close to the Danes. While the Angles and Jutes were from modern Denmark, the Saxons were from northern Germany. Yet in Britain, they fused seamlessly into one people. Before the mass conversion of the continental Saxons under the Carolingians, the cultural barriers between the peoples of Jutland and Saxony must have been marginal at best.

Second, an enormous number of Swedes in particular seem to be highly admixed with Finnic peoples. Many Swedes are highly “Finn-shifted”, both due to Sami assimilation in the past few hundred years, and the long history of Finnish migration into Sweden (which dominated Finland either politically or culturally for nearly 1,000 years). But culturally, and in their ethnolinguistic identity, these people are nothing but Scandinavian at this point.

Going back to the results of the 23andMe user above, who genealogically is more than 60% German, but comes back as 25% German, how to make sense of it? Anyone who has looked at German data realizes that it is very difficult to identify a ‘prototypical’ German. Germans are people who speak Germanic languages, whose ancestors emerged out of the European Bronze Age, when much of Northern European population structure was established. But being at the center of Europe means that Germans have been subject to gene flow from peoples in all directions. Also, some ethnic Germans in the eastern regions clearly descend from Slavic tribes, and more recently there were migrations of peoples such as French Huguenots.

A PCA of Danes, English, French, and Germans shows differences across the groups. But Germans overlap a great deal with the English, and a substantial minority overlap with Danes. Also, many more of the Germans are “French-shifted” than the English.

The point is that to be German is to be many things. At least in the context of Northern European peoples.

There are powerful methods of ancestry inference using more information than just genotypes, such as fineSTRUCTURE. And there are methods relying on rare variants, which allow for much more fine-grained distinctions. But all these methods suffer from the fact that one has to define populations with labels in the first place. Genetically Germany has several closely related clusters, and all of them are arguably authentically German.

Because ethnolinguistic categories are constructions of human history and social preferences, they do not always map onto genetic differences at a fine grain. But because ethnolinguistic categories were created by humans to give intelligibility to national and cultural variation, they are incredibly powerful ways in which to communicate classification to the general public.

Some people believe that personal genomics tests are wrong and false because of discrepancies such as the one I highlight in this post. Actually, the issue is that the language we use shapes our preconceptions, and these companies are attempting to leverage categories and classes which are highly informative to give us a general sense of the patterns they are detecting. Language does not shape reality, but it shapes our perception of reality. To say someone is 25% French-German is more informative to the end-user than to say someone is 25% Generic Continental North European, even though really they are basically the same thing. And yet, if you told someone they were 25% Generic Continental North European they might be less likely to cross-reference that result with their genealogy, because the term is so expansive and vague that one does not assume ethnolinguistic precision.

Ultimately I don’t think there is a right answer on this sort of issue. My own preference is clearly to avoid national and ethnic terms to which people bring their own preconceptions. At least when possible.