About 20 percent of the world’s population is Chinese (and since over 90% of Chinese citizens are ethnically Han, by Chinese here I mean Han to a first approximation). In comparison to other non-European groups a fair amount of genetics research has been done with Chinese populations. But in comparison to their overall numbers, not too much has really been done. That will change.
A new preprint, A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese, aims to enrich our knowledge set somewhat. The authors used low coverage next generation sequencing, which is cheaper, to greatly increase their sample size. By low coverage, I mean instead of hitting each genetic position on average 30 times or more, as is the norm in medical genomics, they sampled a position closer to twice.
But while any given genome was usually not given much close attention, their overall sample size of individuals was 11,670 Han Chinese women. Impressive. This means that if they called a position as a variant, they could assess their confidence that it was a variant by looking at how many times it was called as a variant across their data set (as coverage declines, one’s confidence that a call of a variant is a true call declines, because there is a relatively high base rate of error set against the proportion of true expected polymorphisms; in contrast, if you sample 30 times the error rate gets overwhelmed by repeated sampling). Overall they counted 25,057,223 variants, which sounds about right. They also found 548,401 novel variants with at least a count of 10 in the data set (a ~0.04% allele frequency, so a very low cut-off).
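The ~0.04% figure follows from simple arithmetic, sketched below. It assumes autosomal sites, so each of the 11,670 women carries two copies, and it assumes every individual contributes to the count. The helper name is hypothetical.

```python
def allele_frequency(allele_count, n_individuals, ploidy=2):
    """Allele count divided by the number of sampled chromosomes."""
    if n_individuals <= 0 or ploidy <= 0:
        raise ValueError("n_individuals and ploidy must be positive")
    return allele_count / (n_individuals * ploidy)


# 10 copies among 11,670 diploid individuals (23,340 chromosomes).
freq = allele_frequency(10, 11_670)
print(f"{freq:.4%}")  # prints 0.0428%
```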
The most important thing about this preprint is not that the sample size is large enough that they could detect low frequency variants and add to the catalog. No, for me, it is that they sampled so many of the provinces. As you can see in the figure up top, just as in Europe, China’s Han population recapitulates the map of China. That is, populations arrange themselves spatially when projected onto a principal components analysis plot in the same manner that they do geographically. This is a new finding in some ways because previous sampling strategies had not been robust enough to detect the east-west cline (though to be honest if you looked at the Chinese samples in the 1000 Genomes there was a suggestion of this).
All that being said, please note that the PCA is not to scale, insofar as most of the variation is north-south (4 to 5 times more than east-west). Rather like Europe in this regard. Part of this difference is due to the fact that gene flow from non-Han populations, particularly in the South, inflates the genetic variation on the first dimension. Another aspect of interest is that genetic variation between Han populations is rather low to begin with.
One way to visualize this is a matrix like the one to the left. You see pairwise population Fst statistics. The largest is between Guangdong in the south, home to Hong Kong and Guangzhou (Canton), and the northern provinces. The Fst value between Guangdong and Shanxi in the center-north is 0.0029. You may know that the Fst value between Han Chinese and Northern Europeans is ~0.10. That is a factor of about 34, more than one order of magnitude. As a point of comparison you can find Fst tables which show values between English and Croatians and English and Spaniards are about the same as between Guangdong and Shanxi.
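For readers who want to see the arithmetic, the sketch below computes the ratio of the two Fst values quoted above, and also shows one common way pairwise Fst is estimated from allele frequencies (the Hudson estimator, combined across SNPs as a ratio of sums). I am not claiming this is the estimator the preprint used; the function name and the toy inputs in the test are assumptions for illustration.

```python
def hudson_fst(freqs1, freqs2, n1, n2):
    """Hudson Fst across SNPs, combined as a ratio of sums.

    freqs1, freqs2: sample allele frequencies per SNP in each population.
    n1, n2: number of sampled chromosomes in each population (must be > 1).
    """
    num = 0.0
    den = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        # Squared frequency difference, corrected for finite sample size.
        num += ((p1 - p2) ** 2
                - p1 * (1 - p1) / (n1 - 1)
                - p2 * (1 - p2) / (n2 - 1))
        # Probability that one draw from each population differs.
        den += p1 * (1 - p2) + p2 * (1 - p1)
    if den == 0:
        raise ValueError("no polymorphic sites between the two samples")
    return num / den


# Ratio of the two Fst values quoted in the text.
fst_han_vs_north_european = 0.10
fst_guangdong_vs_shanxi = 0.0029
print(round(fst_han_vs_north_european / fst_guangdong_vs_shanxi, 1))  # 34.5
```

Note that with identical frequencies in both samples the sample-size correction pushes the estimate slightly below zero, which is expected behavior for this estimator when there is no real differentiation.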
What is just as interesting is the very low genetic differentiation on the North China plain. Why is this? There are two reasons I can think of. The easy explanation is that across politically unified flat landscapes gene flow occurs so easily that genetic differences disappear over time.
But this presupposes there were genetic differences in the first place. The reason I say this is that though there was an early period of migration from the north to the south (from the Han dynasty onward), and absorption of non-Chinese peoples, there were also periods when much of China north of the Yangtze river valley was under barbarian domination or politically unstable. Elite northern families fled to the south, and eventually, when political stability reemerged, migrated back to the north (similarly, persistent north-south migration occurred, as the Hakka people of South China are clearly of northern provenance).
The low genetic differentiation across northern China may then be thought of as the outcome of structural fixtures of the landscape (no mountains to obstruct gene flow), as well as possibly due to historical instances of copious back-migration from various regions of southern China (or perhaps more accurately Central China, as I’m presuming much of the settlement would come from the lower Yangtze river valley). Both of these dynamics may have led to little intra-regional structure. In contrast you notice that genetic distance between Fujian and Guangdong, two regions adjacent to each other in the South, is still higher than between any of the northern regions.
Again, this is not surprising due to both geography and history. The dialect map of China shows that southeast China is more fragmented than the north (or southwest). These differences are long-standing and date to the initial founding of Han communities in the south via migrants from the north. Unlike North China, South China is a topographically diverse landscape, with beautiful escarpments and deep gorges. Fujian literally hugs the ocean, and has long had a relationship to overseas communities for this reason. Geographic barriers mean there are genetic barriers. Combined with admixture with local populations, this means it is not surprising that there are greater genetic differences between southern regions than in the north.
Additionally, China south of the Yangtze has been relatively shielded from foreign conquest and invasion compared to the North China plain. Obviously events like the Taiping rebellion and famine more generally had impacts on South China, but North China has had more periods of domination in a destabilizing manner by non-Chinese invaders over the past 2,000 years.
Perhaps more intriguing than the modern genetic relationships within China are the relations with non-Chinese populations. It is not surprising that the South Chinese populations show evidence of admixture with Dai and Taiwanese aboriginals (the basal group of the Austronesian migration). The genetics and cultural practices in parts of South China have long suggested relationships to indigenous groups, as well as Sinicization. Honestly I suspect many were surprised how similar North and South Chinese were, indicating either continuous gene flow or descent from a large demographic expansion.
More curious is that some North Chinese seem to show evidence of admixture with West Eurasians. In particular, they show affinities with European populations. Again, this is not surprising. Some earlier analyses have shown evidence of European-like admixture in northern China, and among ethnic groups like Mongolians. More precisely there are strong signals of European-like admixture in the northwestern provinces of Gansu, Shaanxi, and Shanxi.
The details here are important though. The authors note that Hellenthal et al., using haplotype based methods, detected admixture from Northern Europeans into North China and dated it to around 1200 AD. This preprint finds a similar admixture date. But they caution that these admixture dates may only signal the latest of the events.
As for what that event could be, there was clearly turmoil on the Silk Road in the years around 1000 AD. After 750 AD for all practical purposes the Chinese lost control of their portion of the Silk Road, what is now Xinjiang. Turkic groups like Uyghurs and Iranian ones such as Sogdians were prominent in China due to a power vacuum (the Uyghurs were used by the Tang emperors like the Germans were used by the later Roman Emperors, as federates). Later on one saw the emergence of Tanguts, various groups from Manchuria, and finally the Mongols. Since both haplotype based methods and this preprint suggest something around 1000 AD, the most likely candidate was the absorption of Central Asians with some European-like ancestry into the Chinese substrate. The Uyghur conquest of the major cities in the centuries before the rise of the Mongols famously resulted in the assimilation of a European-like population which had earlier spoken Indo-European languages.
But admixture was not a feature of just recent Chinese history. The figure to the right is somewhat difficult to read, but it shows on the y-axis variance in the f3 statistic. In short, how well does the Chinese data set here form a clade with the outgroup, and how much does that statistic vary between groups? The x-axis is for the D statistic, which measures the relationship of four populations, with two clades. On the bottom left you see the Siberian genome from 45,000 years ago. On the y-axis you can see all provinces show very little variation, and that’s because the Siberian genome is old enough that it is basal to all the Chinese and Europeans. The D statistic indicates no gene flow between the Siberian populations and modern groups. Not so with other populations. You see the Pleistocene European populations are shifted to the right, and that’s because they all contribute to later Europeans. The Chinese-European clade is not a good fit. This is true across the Chinese populations (so the variance of the f3 statistic is very low).
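For those unfamiliar with these statistics, here is a minimal sketch of how a D statistic and an outgroup f3 statistic can be computed from per-site allele frequencies. This is the textbook frequency-based form, written by me for illustration; it is not the preprint's pipeline, it skips the finite-sample bias correction and the block jackknife that real tools apply, and sign conventions for D differ between papers. The function names are hypothetical.

```python
def d_statistic(p1, p2, p3, outgroup):
    """ABBA-BABA D for the tree ((P1, P2), P3), Outgroup).

    Each argument is a list of derived allele frequencies, one per site.
    Positive D means P2 shares more derived alleles with P3 than P1 does.
    """
    abba = 0.0
    baba = 0.0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        abba += (1 - a) * b * c * (1 - o)
        baba += a * (1 - b) * c * (1 - o)
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)


def outgroup_f3(outgroup, pop_a, pop_b):
    """Mean of (o - a) * (o - b): drift shared by A and B since the outgroup.

    No finite-sample correction is applied in this sketch.
    """
    products = [(o - a) * (o - b)
                for o, a, b in zip(outgroup, pop_a, pop_b)]
    if not products:
        raise ValueError("no sites supplied")
    return sum(products) / len(products)
```

If P1 and P2 form a clean clade relative to P3, ABBA and BABA sites balance out and D sits near zero. A larger outgroup f3 means the two test populations share more drift with each other after splitting from the outgroup.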
Also in the text they note that there is high shared drift with the three “Ancient North Eurasian” (ANE) samples from Siberia. This is discussed extensively in the supplements to Lazaridis et al. 2016. Another replicated finding is that the Chinese share drift with ancient European hunter-gatherers. The drift declines later on, likely because the Chinese do not share as much drift with the early farmers. This is due in part to the “Basal Eurasian” (BEu) element. But in Fu et al. 2016 they observe that drift between East Eurasians and European hunter-gatherers increases after 15,000 years BP, when there was a genetic turnover, and the Villabruna cluster (in their terminology) came to dominate the landscape.
The most probable, though not certain, explanation for this pattern is that ANE populations contributed ancestry to both antipodes of Eurasia: to European hunter-gatherers, and to the ancestors of the Chinese in Pleistocene East Asia (remember that there was a fusion between a proto-East Asian population and ANE to give rise to the ancestors of Amerindians 15-20,000 years ago). Another explanation could be East Asian gene flow rather early on into Europe, some time after the Last Glacial Maximum ~20,000 years ago. We don’t have the sample density outside of Europe to really say with certainty.
Finally, I have to mention that at SMBE Melinda Yang of Qiaomei Fu’s lab gave a talk about the Tianyuan genome. Their group has found that the Tianyuan individual, who dates to 40,000 years ago, is the likely ancestor of modern East Asians. That is, Tianyuan shares more drift with modern East Asians than Europeans. No huge surprise. What was surprising though is that Tianyuan also shared appreciable drift with GoyetQ116, a 35,000 year old sample from Belgium, whose descendants seem to have played a role in the emergence of the Magdalenian culture. But not later European hunter-gatherer populations. The Tianyuan sample also seemed to share some drift with Australasian samples (a possible resolution for why some Amerindians share drift with Oceanians presents itself here obviously). Overall, the group’s conclusion was that this might be evidence of ancient population structure rather early on in the “Out of Africa” populations, which eventually carried over as the groups dispersed (rather than each geographic region being direct descendants from a single panmictic “Out of Africa” group). The implications here are beyond the purview of Chinese genetics so I’ll address it in a later post.
I have to mention there is a fair amount within this paper on selection as well as medical genetics. I didn’t tackle that in this post since there’s so much phylogenomics one could talk about.