The Tibeto-Burman and Austro-Asiatic ancestry of Bengalis

My father’s mtDNA lineage phylogeography

When I first got my father’s 23andMe results the Y and mtDNA were an interesting contrast. He, and therefore myself, carried Y lineage R1a1a, the lord of the paternal lineages. That was not that great a surprise. In the 1000 Genomes results for the Bangladeshi sample 20% of the men were direct paternal descendants of the R1a1a progenitor.

The mtDNA was a surprise. It was G1a2. This was curious to me since Bangladesh has some of the highest frequencies in the world of haplogroup M, the subhaplogroups in question being mostly restricted to South Asia. I wasn’t surprised that I was R1a1a, but I was even more confident that my maternal lineage was going to be an M, as would my father’s (my own mtDNA is U2b, not common, but not so surprising). As you can see from the map, 23andMe places my father’s maternal lineage somewhere in Northeast Asia. The only information I could get about the geography was for G1a: “G1a has been found in samples from China (Daur, Hui, Kazakh, Korean, Manchu, and a sample of the general population of the city of Shenyang), Japan, Korea, Vietnam, and Siberia (Yakut).”

The biggest sample of mtDNA results from Bangladesh I could find, at N = 240, does not find any G at all, let alone G1a2. So this is clearly a rare haplogroup in the region. But the authors do classify 13% of the Bangladeshis as carrying an “East Eurasian” haplogroup. Haplogroup A is found among Southeast Asians and in Southern China, though not among Austronesians. Haplogroup F seems to have a similar distribution, as do D and B. The other haplogroups also seem “correctly” assigned in terms of modal distribution. They are all mostly East Asian.

Looking at the Y chromosome haplogroups in the 1000 Genomes, there are two each of O2 and O3, and one of C3, which are clearly of Southeast Asian origin. With N = 5 out of 44 samples that is ~11%. O2 is interesting because it is found at very high frequencies among the Austro-Asiatic populations in South Asia, whether it be the Khasi or the Munda groups (generally O2a). O3 seems associated with Tibeto-Burman populations, and C3 with East Asia more generally.
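The arithmetic behind that fraction is simple enough to check by hand, but here is a minimal sketch of it anyway. The counts are the ones quoted in the paragraph above; the variable names are my own.

```python
# Y haplogroup counts as quoted in the text (1000 Genomes Bangladeshi males).
east_asian_counts = {"O2": 2, "O3": 2, "C3": 1}
total_males = 44

n_east_asian = sum(east_asian_counts.values())
fraction = n_east_asian / total_males

# With only 44 men, this is a rough estimate rather than a precise frequency.
print(f"{n_east_asian} of {total_males} = {fraction:.1%}")  # 5 of 44 = 11.4%
```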

If you know much about the ethnolinguistics of South Asia you know that the two major language families are Indo-Aryan and Dravidian. But there are other groups. In the northwest you have various other Indo-European speaking populations, and along the northern and northeastern fringe you have Tibeto-Burman languages being spoken. But most anomalous is the distribution of Austro-Asiatic languages. The most numerous Austro-Asiatic language in the world today is Vietnamese, followed by the language of the Khmers.

But there are numerous other Austro-Asiatic languages in Southeast and South Asia. The indigenous people of the deep forests of the Malay peninsula, including the Negritos, speak Austro-Asiatic languages. As one moves west there are Austro-Asiatic languages in Burma, such as Mon, which used to be far more common. And in India there are two groups: the language of the Khasi of the northeast, which seems to share some affinity with the Palaungic dialects of interior Burma and southern China, and the Munda languages farthest west, which seem very distinct from all the other branches.

The genetics seem to suggest that the Munda tribes do have East Asian ancestry, but it is almost totally male-mediated. Their Y chromosomal lineages are very distinctive, with high proportions of O2a, but their mtDNA lineages are overwhelmingly South Asian macro-haplogroup M. The Khasi of the hills north of Bangladesh occupy a different position, with both maternal and paternal East Asian heritage, as well as much higher genome-wide ancestry that is not South Asian. At this point, I am convinced that the Austro-Asiatic language groups came into South Asia from the east to the west.

The other language family with East Asian connections in South Asia is that of the Tibeto-Burmans. Unlike the Austro-Asiatic group, these peoples tend to occupy only the periphery of South Asia, the far north and east.

Finally, there are historically attested Tai peoples who migrated into South Asia. The most famous of these are the Ahoms of Assam. These were part of the same migrations ~1,000 years ago that led to the shift of Thailand from being a zone dominated by Mon and Khmer Austro-Asiatic peoples to one dominated by Tai peoples. In Burma, the Tai migrations resulted in the Shan states of the uplands, though the Burman and Mon polities were able to fight off the attempts at takeover.

Ultimately the Ahom became totally Indianized. Their traditional language became relegated to ritual, and they adopted the Indo-Aryan Assamese language. Additionally, at some point, they converted to orthodox Hinduism. This became so much a part of their identity that by the 17th century they were checking Islamic expansion to the east by defeating the Mughals.

All of this ultimately goes back to the question: how did my father get his mtDNA? If you read my post from a few years back, How did Bengalis get East Asian?, you will know that it is probably a mix of Austro-Asiatic and Tibeto-Burman ancestry. Can we say any more at this stage?

Some Austronesian data sets have come online. So I thought I’d give it another shot. Additionally, I spent several hours removing outliers and combining populations to generate a full data set. The number of markers was 195,000 SNPs.

Label	N	Notes
AA	17	Munda (outliers removed)
BD	74	Bangladesh, 1K BEB (outliers removed)
Borneo	31	Orang Asli tribes (outliers removed)
Burmese	20	Bamar ethnicity
Cambodians	39	Outliers removed
Dai	40
Han_C	47	Pooled Han from HGDP and 1K
Han_N	28	Pooled Han from HGDP and 1K
Han_S	29	Pooled Han from HGDP and 1K
Japanese	28
Malay	21
Miao	10
Phil	16	Luzon and Visaya
Phil_Highland	15	Igorot tribesman Luzon (outliers removed)
Telugu	34	1K STU (outliers removed)
Viet	18

I ran ADMIXTURE at K = 4 on the full data set. Click on the image if you want details, but the results are straightforward:

yellow = South Asian (modal in Telugu)

green = Northeast Asian (modal in Japan and northern Han)

navy = Southeast Asian/Austro-Asiatic (modal in Cambodians)

red = Austronesian (modal in Igorot tribesman from the highlands of the Philippines)

The two bottom population groups are Bangladeshis and Munda. You can see that both are mostly yellow. That is, they’re mostly South Asian. But the Munda have a much lower South Asian proportion than the Bangladeshis. This is not surprising. The Munda languages and mythology are very distinct from those of other South Asians. Clearly, they have ancient East Asian connections, and this shows in their genome-wide ancestry.

But notice a difference between Bangladeshis and Munda: most of the Bangladeshis have a green component, which is common among Northeast Asians, while none of the Munda do. The total fractions are 38% navy (Austro-Asiatic) for the Munda, and 7% each for navy and green (Northeast Asian) for the Bangladeshis.

The two components also exhibit a negative correlation in the Bangladeshis of -0.47. Why? My own suspicion is that there is some population structure and clinal variation within Bangladesh. As I’ve noted before, my parents are among the most East Asian of Bangladeshis I’ve ever analyzed…and it is no surprise that we are from the east of eastern Bengal. In contrast, when I’ve looked at genotypes from West Bengalis, they tend to have less East Asian ancestry, though still an appreciable amount in a broader South Asian context (in fact, even Bengali Brahmins have East Asian ancestry, though at smaller fractions).
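For anyone who wants to repeat the correlation on their own ADMIXTURE output, here is a minimal sketch of a Pearson correlation between two per-individual ancestry fractions. The function name and the toy numbers are hypothetical; they are not my actual Bangladeshi values and will not reproduce the -0.47 figure.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# HYPOTHETICAL per-individual fractions, invented only to exercise the function.
navy = [0.10, 0.08, 0.07, 0.05, 0.04]   # "Austro-Asiatic" component
green = [0.03, 0.06, 0.07, 0.08, 0.11]  # "Northeast Asian" component

r = pearson_r(navy, green)
print(f"r = {r:.2f}")  # negative: individuals with more navy have less green
```

In practice the two lists would be the matching columns of the ADMIXTURE Q file, restricted to the Bangladeshi individuals.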

This seems to be a pretty clear rejection of the model where Bangladeshis are a two-population mix of Munda tribesmen and a more conventional South Asian group.

Here are the average percentages by population:

Group	Austro-Asiatic	Austronesian	South Asian	Northeast Asian
AA	38%	0%	62%	0%
BD	7%	2%	84%	7%
Borneo	61%	38%	0%	0%
Burmese	29%	0%	23%	48%
Cambodians	73%	1%	15%	11%
Dai	49%	7%	0%	44%
Han_C	16%	5%	0%	79%
Han_N	1%	1%	2%	96%
Han_S	27%	7%	0%	66%
Japanese	0%	1%	2%	97%
Malay	64%	16%	13%	7%
Miao	24%	3%	0%	73%
Phil	34%	37%	6%	22%
Phil_Highland	0%	100%	0%	0%
Telugu	0%	3%	96%	0%
Viet	45%	7%	0%	48%

I’m 99% sure that “South Asian” is in some of these cases a proxy for anything that’s not East Asian. But the Malay and Cambodian results are probably South Asian. And the Burmese certainly are.
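One way to read the table is to ask how the non-South Asian ancestry in each group splits between the components. The sketch below does that for four rows copied from the table; the function name and the choice to lump the three non-yellow components together as "East Asian" are my own simplifications.

```python
# Average ADMIXTURE percentages copied from the table:
# (Austro-Asiatic, Austronesian, South Asian, Northeast Asian)
averages = {
    "AA": (38, 0, 62, 0),
    "BD": (7, 2, 84, 7),
    "Burmese": (29, 0, 23, 48),
    "Cambodians": (73, 1, 15, 11),
}

def east_asian_breakdown(row):
    """Return (total non-South Asian %, Austro-Asiatic share, Northeast Asian share)."""
    austro_asiatic, austronesian, _south_asian, northeast_asian = row
    total_east = austro_asiatic + austronesian + northeast_asian
    if total_east == 0:
        return 0, 0.0, 0.0
    return total_east, austro_asiatic / total_east, northeast_asian / total_east

for pop, row in averages.items():
    total, aa_share, nea_share = east_asian_breakdown(row)
    print(f"{pop}: {total}% non-South Asian "
          f"(Austro-Asiatic share {aa_share:.0%}, Northeast Asian share {nea_share:.0%})")
```

On these numbers the Munda non-South Asian ancestry is entirely Austro-Asiatic, the Bangladeshi portion is split about evenly between Austro-Asiatic and Northeast Asian, and the Burmese portion leans Northeast Asian.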

Click to enlarge the PCA plot to the left, but PC1 is South Asian to East Asian, and PC2 is Northeast Asian to Southeast Asian.

Both the Malays and the Burmese exhibit a “South Asia cline.” This is due to admixture. But the Burmese project toward the position of the central Han, while the Malays are shifted toward a Southeastern Asian population.

Both the Bangladeshis and Munda samples are East Asia shifted, but the Munda sample clearly skews toward the Southeast Asian populations. The Bangladeshi samples do not seem to exhibit this clear pattern.

Then I ran Treemix with blocks of 1,000 SNPs, no migration edges, global rearrangements turned on, and the tree rooted with the Telugu.

The results are absolutely unsurprising. Unfortunately adding migration edges doesn’t really add much value with so many populations, as there is a great deal of complex population history in Southeast Asia.

Removing many of the populations and setting the migration edges to 3, you get:

The Austro-Asiatic connection between Cambodians and Munda is always clear no matter what you do. The Bangladeshis tend to have more complex relationships, but often the edges are toward the Burmese, who are a compound between South Asian, Austro-Asiatic, and Northeast Asian.

At this point I ran a “three population test.” Basically, you take an outgroup, and compare it to a clade of two other populations, and see how good the fit of the data to the model is. If there is “complex population history” you’ll get a negative f3 statistic. Complex population history means that there is almost certainly gene flow between the outgroup and one of the ingroups.
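To make the idea concrete, here is a minimal sketch of the quantity behind the test. In the standard notation the population I call the "outgroup" is the target C, and f3(C; A, B) is the average over SNPs of (c - a)(c - b), where a, b, and c are allele frequencies. The function below is the naive version of that average; real implementations also correct for finite sample size and estimate the standard error with a block jackknife, which this sketch does not do. All frequencies are hypothetical.

```python
def naive_f3(target, ref1, ref2):
    """Naive f3(target; ref1, ref2): mean of (c - a) * (c - b) across SNPs.

    No sample-size bias correction and no standard error; illustration only.
    """
    if not (len(target) == len(ref1) == len(ref2)) or not target:
        raise ValueError("need three non-empty, equal-length frequency lists")
    products = [(c - a) * (c - b) for c, a, b in zip(target, ref1, ref2)]
    return sum(products) / len(products)

# HYPOTHETICAL allele frequencies at five SNPs in two source populations.
pop_a = [0.10, 0.80, 0.30, 0.60, 0.95]
pop_b = [0.70, 0.20, 0.50, 0.10, 0.40]

# A 50/50 mixture of the two sources sits between them at every SNP,
# so every (c - a) * (c - b) term is negative.
mixed = [0.5 * a + 0.5 * b for a, b in zip(pop_a, pop_b)]

f3_mixed = naive_f3(mixed, pop_a, pop_b)
print(f"f3(mixed; A, B) = {f3_mixed:.4f}")  # negative, the admixture signal
```

The intuition is that an admixed population's frequencies fall between those of its sources, which is what drives the statistic below zero.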

Below are results where the Bangladeshis are the outgroup, and f3 statistics are negative (sorted most negative to least).

Outgroup	Pop1	Pop2	f3	f3-error	Z-score
BD	Telugu	Miao	-0.00240554	6.21107e-05	-38.7298
BD	Telugu	Han_S	-0.00238905	5.49332e-05	-43.4901
BD	Telugu	Dai	-0.00238103	5.73977e-05	-41.4831
BD	Telugu	Han_C	-0.00237904	5.74148e-05	-41.4359
BD	Telugu	Viet	-0.0023151	5.63663e-05	-41.0725
BD	Telugu	Han_N	-0.00229979	5.55838e-05	-41.3752
BD	Telugu	Japanese	-0.00225745	5.65642e-05	-39.9095
BD	Telugu	Phil_Highland	-0.00225153	6.87595e-05	-32.745
BD	Telugu	Borneo	-0.00219619	5.91978e-05	-37.0992
BD	Telugu	Phil	-0.00209752	5.97396e-05	-35.1111
BD	Telugu	Cambodians	-0.00198719	4.88719e-05	-40.6613
BD	Telugu	Malay	-0.00195706	5.32466e-05	-36.7547
BD	Telugu	Burmese	-0.00183415	4.79121e-05	-38.2816
BD	AA	Telugu	-0.000744786	4.17995e-05	-17.818

The model where Bangladeshis are a combination of Austro-Asiatic populations and conventional South Asians is not crazy. But observe that there is a jump in the f3 statistics between that row and the previous row. Bangladeshis almost certainly have non-Austro-Asiatic ancestry, which is why the scores are more extreme for cases such as (Bangladesh(Telugu, Vietnamese)).
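The Z-scores in the table are just f3 divided by its standard error, and the jump between the last two rows is easy to quantify. A minimal sketch using three rows copied from the table (the variable names are mine):

```python
# (Pop1, Pop2, f3, f3 standard error, reported Z), copied from the table above.
rows = [
    ("Telugu", "Miao", -0.00240554, 6.21107e-05, -38.7298),
    ("Telugu", "Burmese", -0.00183415, 4.79121e-05, -38.2816),
    ("AA", "Telugu", -0.000744786, 4.17995e-05, -17.818),
]

for pop1, pop2, f3, se, z_reported in rows:
    z = f3 / se  # Z-score = statistic / standard error
    print(f"BD; {pop1}, {pop2}: Z = {z:.2f} (reported {z_reported})")

# Size of the jump: the weakest Telugu + East Asian pair (Burmese)
# against the Munda + Telugu pair.
f3_burmese = rows[1][2]
f3_munda = rows[2][2]
jump_ratio = f3_burmese / f3_munda
print(f"Burmese-pair f3 is {jump_ratio:.1f}x the Munda-pair f3")
```

So even the least negative pairing with an East Asian population is more than twice as negative as the Munda pairing.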

What I’ve established then are:

Bangladeshi East Asian ancestry is not sufficiently explained by Munda ancestry.
A minority of Bangladeshi Y and mtDNA lineages have East Asian connections, and this can not be explained exclusively by Munda ancestry.
Some of these Y and mtDNA lineages seem to be of Tibeto-Burman affinity.
Admixture analysis genome-wide indicates ancestry from non-Munda populations of East Asian origin.
The fraction of Austro-Asiatic ancestry is balanced with more “northern” elements, while in Burma the northern element is a greater proportion than in Bangladesh.
There is a moderate negative correlation between Austro-Asiatic ancestry and Northeast Asian ancestry in the Bangladeshi sample.
Bangladeshis seem to have moderate signatures of gene flow from a wide range of East Asian populations.
In contrast, the Mundas seem to have a connection most strongly with Cambodians.

A paper from several years ago looking at the patterns of genetic ancestry in the Bangladeshi population found that a single pulse of admixture around 500 AD from an East Asian population was a good fit for the origins of the variation they saw. A two-pulse model with more ancient and more recent admixture events did not improve the fit.

I assume that there is a true signal there. But the model may still be too parsimonious.

My own predictions are as follows:

There will be an east-west cline of Tibeto-Burman ancestry.
There will be a more constant fraction of Austro-Asiatic ancestry.
The ratio of Austro-Asiatic ancestry will be reversed from the Tibeto-Burman cline.
Two admixture events will eventually be detected. A strong sex-balanced pulse at 500 AD and later. And an older continuous event that will be more male skewed, as it will involve absorption of Munda substrate.
The Padma river will turn out to be a major differentiator, with much more Tibeto-Burman ancestry to the east (Bengali dialects from east of the Padma show more Tibeto-Burman influence).

Note: a separate issue that I did not want to explore is that the South Asian ancestry of the Munda seems to show almost no Indo-Aryan influence. The Bengali population does have a small, but consistent, “Indo-Aryan” signature that you can not find in the Telugu sample. Naturally this will bias the statistics a touch.