Aryan marauders from the steppe came to India, yes they did!

It seems every post on Indian genetics elicits dissent from loquacious commenters who are woolly on the details of the science but convinced of their opinions (yes, they operate through uncertainty and obfuscation in their rhetoric, but you know where the axe is lodged). This post is an attempt to answer some questions so I don’t have to address this again in the near future, as ancient DNA papers will finally start to come out soon, I hope (at least earlier than Winds of Winter).

In 2001’s The Eurasian Heartland: A continental perspective on Y-chromosome diversity Wells et al. wrote:

The current distribution of the M17 haplotype is likely to represent traces of an ancient population migration originating in southern Russia/Ukraine, where M17 is found at high frequency (>50%). It is possible that the domestication of the horse in this region around 3,000 B.C. may have driven the migration (27). The distribution and age of M17 in Europe (17) and Central/Southern Asia is consistent with the inferred movements of these people, who left a clear pattern of archaeological remains known as the Kurgan culture, and are thought to have spoken an early Indo-European language (27, 28, 29). The decrease in frequency eastward across Siberia to the Altai-Sayan mountains (represented by the Tuvinian population) and Mongolia, and southward into India, overlaps exactly with the inferred migrations of the Indo-Iranians during the period 3,000 to 1,000 B.C. (27). It is worth noting that the Indo-European-speaking Sourashtrans, a population from Tamil Nadu in southern India, have a much higher frequency of M17 than their Dravidian-speaking neighbors, the Yadhavas and Kallars (39% vs. 13% and 4%, respectively), adding to the evidence that M17 is a diagnostic Indo-Iranian marker. The exceptionally high frequencies of this marker in the Kyrgyz, Tajik/Khojant, and Ishkashim populations are likely to be due to drift, as these populations are less diverse, and are characterized by relatively small numbers of individuals living in isolated mountain valleys.

In a 2002 interview with the India site Rediff, the first author was more explicit:

Some people say Aryans are the original inhabitants of India. What is your view on this theory?

The Aryans came from outside India. We actually have genetic evidence for that. Very clear genetic evidence from a marker that arose on the southern steppes of Russia and the Ukraine around 5,000 to 10,000 years ago. And it subsequently spread to the east and south through Central Asia reaching India. It is on the higher frequency in the Indo-European speakers, the people who claim they are descendants of the Aryans, the Hindi speakers, the Bengalis, the other groups. Then it is at a lower frequency in the Dravidians. But there is clear evidence that there was a heavy migration from the steppes down towards India.

But some people claim that the Aryans were the original inhabitants of India. What do you have to say about this?

I don’t agree with them. The Aryans came later, after the Dravidians.

Over the past few years I’ve gotten to know the above first author Spencer Wells as a personal friend, and I think he would be OK with me relaying that to some extent he was under strong pressure to downplay these conclusions. Not only were, and are, these views not popular in India, but the idea of mass migration was in bad odor in much of the academy during this period. Additionally, there was later work which was less clear, and perhaps supported an Indian origin for R1a1a. Spencer himself told me that it was not impossible for R1a to have originated in India, but a branch eventually back-migrated to southern Asia.

But by the mid-2000s even researchers from the group at Stanford where he had done his postdoc did not support this model, as in Polarity and Temporality of High-Resolution Y-Chromosome Distributions in India Identify Both Indigenous and Exogenous Expansions and Reveal Minor Genetic Influence of Central Asian Pastoralists. In 2009 a paper out of an Indian group was even stronger in its conclusion for a South Asian origin of R1a1a: The Indian origin of paternal haplogroup R1a1* substantiates the autochthonous origin of Brahmins and the caste system.

By 2009 one might have admitted that perhaps Spencer was wrong. I was certainly open to that possibility. There was very persuasive evidence that the mtDNA lineages of South Asia had little to do with Europe or the Middle East.

Yet a closer look at the above papers reveals two major systematic problems.

First, ancient DNA has made it clear that there has been major population turnover during the Holocene, but this was not the null hypothesis in the 2000s. Looking at extant distributions of lineages can give one a distorted view of the past. Frankly, the 2009 Indian paper was egregious in this way because they included Turkic groups in their Central Asian data set. Even in 2009 there was a whole lot of evidence that Central Asian Turkic groups were likely very different from the Indo-European Turanian populations which would have been the putative ancestors of Indo-Aryans. Honestly, the authors either consciously loaded the dice to reduce the evidence for gene flow from Central Asia, or they were ignorant (the nature of the samples is much clearer in the supplements than the primary text, for what it’s worth).

Second, Y chromosomal marker sets in the 2000s were constrained to fast-mutating microsatellite regions or fewer than 100 variant SNPs on the Y. Because it is so repetitive, the Y chromosome is hard to sequence, and it really took the technologies of the last ten years to get it done. Both the above papers estimate the coalescence of extant R1a1a lineages to be 10,000-15,000 years before the present. In particular, they suggest that European and South Asian lineages date back to this period, pushing back any possible connection between the groups, and making it possible that European R1a1a descended from a South Asian founder group which was expanding after the retreat of the ice sheets. The conclusions were not unreasonable based on the methods they had. But now we have better methods.*

Whole genome sequencing of the Y, as well as ancient DNA, seems to falsify the above dates. Though microsatellites are good for very coarse-grained phylogenetic inferences, one has to be very careful about them when looking at more fine-grained population relationships (they are still useful in forensics to cheaply differentiate between individuals, since they accumulate variation very quickly). They mutate fast, and their clock may be erratic.
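To make the contrast with sequence-based dating concrete, here is a minimal sketch of the counting logic behind a SNP clock. It is not the pipeline of any paper cited in this post. The mutation rate (0.8e-9 per base pair per year) and the callable length of the Y (10 million base pairs) are assumptions picked as round illustrative values, and the function names are hypothetical.

```python
# ASSUMPTION: round illustrative values, not taken from the papers discussed.
ASSUMED_RATE_PER_BP_PER_YEAR = 0.8e-9
ASSUMED_CALLABLE_BP = 10_000_000


def years_per_mutation(rate, callable_bp):
    """Average waiting time (years) for one new SNP on a single Y lineage."""
    return 1.0 / (rate * callable_bp)


def expected_snps(age_years, rate, callable_bp):
    """Expected number of new SNPs a lineage accumulates over age_years."""
    return age_years * rate * callable_bp


def age_from_snps(mean_derived_snps, rate, callable_bp):
    """Invert the clock: mean SNPs since the common ancestor -> age in years."""
    return mean_derived_snps / (rate * callable_bp)


if __name__ == "__main__":
    rate, length = ASSUMED_RATE_PER_BP_PER_YEAR, ASSUMED_CALLABLE_BP
    print("years per mutation:", years_per_mutation(rate, length))
    for age in (4000, 4500, 10000, 15000):
        print(age, "years ->", expected_snps(age, rate, length), "expected SNPs")
```

Under these assumed values a lineage picks up about one new SNP every 125 years, so an expansion 4,000 to 4,500 years ago and a coalescence 10,000 to 15,000 years ago predict clearly different counts (about 32 to 36 versus 80 to 120 mutations per lineage). A small panel of pre-selected SNPs only genotypes known sites, so it cannot count new mutations this way.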

Additionally, diversity estimates were based on a subset of SNPs that were clearly not robust. R1a1a is not diverse anywhere, though basal lineages seem to be present in ancient DNA on the Pontic steppe in some cases.

To show how lacking in diversity R1a1a is, here are the results of a 2016 paper which performed whole genome sequencing on the Y. Instead of relying on the order of 10 to 100 SNPs, this paper discovered over 65,000 Y variants worldwide. Notice how little difference there is between different South Asian groups below, indicative of a massive population expansion so recent that regional variation has not had time to develop. They note that “The most striking are expansions within R1a-Z93 [the South Asian clade], ~4.0–4.5 kya. This time predates by a few centuries the collapse of the Indus Valley Civilization, associated by some with the historical migration of Indo-European speakers from the western steppes into the Indian sub-continent.”

(BEB = Bengali, GIH = Gujarati, PJL = Punjabi, STU = Sri Lanka Tamil, ITU = Indian Telugu)

The spatial distribution of Z93 lineages of R1a is as you can see to the left. There are branches in South Asia, Central Asia, and in the Altai region. Ancient DNA from Bronze Age Mongolia has found Z93. Modern Mongolians clearly have a small, but appreciable, fraction of West Eurasian ancestry. Some also carry R1a1a. Z93 has also been found in North-Central Asian steppe samples that date to ~4,500 years before the present.

Today with ancient DNA we’re discovering individuals who lived around the time of the massive expansion alluded to above. What are these individuals like? They are a mix of European, Central Eurasian, Near Eastern, and Siberian. Many of them share quite a bit of ancestry with South Asian populations, in particular those from the northwest of the subcontinent, as well as upper castes more generally.

A new paper using ancient DNA from Scythians (Iranian speakers) also shows that they carried Z93. Some of them had East Asian admixture. These were the ones from the eastern steppe. So not entirely surprising. In the supplements of the paper they have an admixture plot with many populations. At K = 15 in supplementary figure 14 you see many ancient Central Eurasian populations run against modern groups. At this K there is a South Asian modal cluster which is found in South Asians as well as nearby Iranian groups from Afghanistan.

It is not light green or dark blue. You see that this salmon color is modal in tribal South Indian populations, or non-Brahmin South Indians. It drops in frequency as you move north and west, and as you move up the caste ladder. Observe that it is present even among the relatively isolated Kalash people of Chitral.

Outside of South Asia-Afghanistan, this salmon component is found among Thai and Cambodians. From talking to various researchers, and recent published findings, it seems clear that this signature is not spurious, but is indicative of some migration from South Asia to Southeast Asia in the historical period, as one might infer based on cultural affinities. It is also found at lower frequencies among the Uyghur of Xinjiang. This is not entirely surprising either. This region of the Tarim basin was connected to Kashmir across the Pamirs. The 4th century Buddhist monk from the Tarim basin city of Kucha, who was instrumental in the translation of texts into Chinese, Kumārajīva, may have had a Kashmiri father.

Even before Islam much of Northwest India and Central Asia were under the rule of the same polity, and after Islam there is extensive record of the enslavement of many Indians in the cities of the eastern Islamic world, as well as the travel of some Indian merchants and intellectuals into these regions.

And yet this South Asia cluster is not present in the ancient steppe samples carrying R1a1a-Z93. None of them to my knowledge. Many ancient samples share ancestry with South Asians. For example it seems that many ancient West Asian samples from Iran share common history as evident in genetic drift patterns with many South Asians. And, there is good evidence that a subset of South Asians, skewed toward northwest and upper caste groups, share drift with steppe Yamna samples. But South Asians are often clearly composites of these exogenous populations and an indigenous component with affinities with Andaman Islanders, and more distantly Southeast Asians and other eastern non-Africans.

How can you reconcile this with migration out of South Asia? The path is found in publications such as Genetic Evidence for Recent Population Mixture in India. Here you have a paper which models mixing between Ancestral North Indians (ANI) and Ancestral South Indians (ASI). The ANI would be the source population for the ancestry shared with West Eurasians. And they would lack ASI ancestry because the mixing had not yet occurred. The admixture dates in the paper are between two and four thousand years before the present.

There is a problem though. These methods detect the last admixture events. Therefore, they are a lower bound on major mixing events, not a record of when there was no mixing. Secondarily, but not less importantly, recent work indicates that because of the pulse admixture simplification these methods likely underestimate the time period of admixture.
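A toy calculation shows why a single-pulse date behaves this way. This is a sketch under stated assumptions, not the method used in the paper: I assume admixture linkage disequilibrium decays with genetic distance d (in Morgans) as exp(-t * d) for a pulse t generations ago, that several pulses simply add, and that the date is read off a straight-line fit to log LD. The pulse dates, weights, and function names are all hypothetical.

```python
import math


def ld_curve(pulses, distances_morgans):
    """Expected admixture LD at each genetic distance.

    pulses: list of (generations_ago, weight) pairs. Each pulse contributes
    weight * exp(-generations_ago * distance); contributions are summed.
    """
    return [
        sum(w * math.exp(-t * d) for t, w in pulses)
        for d in distances_morgans
    ]


def fit_single_pulse(distances_morgans, ld_values):
    """Least-squares line through (distance, log LD).

    Minus the slope is the date (in generations) that a single-pulse model
    would report for this curve.
    """
    n = len(distances_morgans)
    ys = [math.log(v) for v in ld_values]
    mean_x = sum(distances_morgans) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(distances_morgans, ys))
    sxx = sum((x - mean_x) ** 2 for x in distances_morgans)
    return -sxy / sxx


# Genetic distances from 0.5 cM to 20 cM, expressed in Morgans.
distances = [0.005 * k for k in range(1, 41)]

# One clean pulse 100 generations ago: the fit recovers the true date.
one_pulse = fit_single_pulse(distances, ld_curve([(100, 1.0)], distances))

# HYPOTHETICAL history: equal pulses 200 and 60 generations ago.
two_pulses = fit_single_pulse(
    distances, ld_curve([(200, 0.5), (60, 0.5)], distances)
)

if __name__ == "__main__":
    print("single pulse estimate:", one_pulse)
    print("two pulse estimate:", two_pulses)
```

With one pulse the fit returns the true date. With two pulses the reported date falls strictly between them, so it is always younger than the oldest mixing event. Real methods add noise handling and other corrections, but the direction of the bias is the same one described above.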

Another issue for me is the idea that ANI and ASI could be so separate within India. If ANI is the source of gene flow into other parts of Eurasia from South Asia, then I believe that ASI is intrusive to the subcontinent. I don’t think that ASI being intrusive is so implausible. Southeast Asia has undergone massive genetic changes over the Holocene, and it may be that there was much more ASI ancestry in places like Burma before the arrival of Austro-Asiatic rice farmers. The presence of Austro-Asiatic languages in northeast India and central India shows a precedent of migration from Southeast Asia into the subcontinent.

In sum, the balance of evidence suggests male-mediated migration into South Asia from Central Asia roughly 4,000-5,000 years ago. There are lots of details to be worked out, and this is not an assured model in terms of data, but it is the most likely. In the near future ancient DNA will clear up confusions. Writing very long but confused comments just won’t change this state of affairs. New data will.

Addendum: Indian populations have finally been relatively well sampled, thanks to Mait Metspalu’s group in Estonia, David Reich’s lab, the Indian collaborators of both, and the 1000 Genomes (HGDP gave us Pakistanis). Additionally, Zack Ajmal’s Harappa website did some work filling in some holes in the early 2010s.

* A Facebook argument broke out about one of my posts where one interlocutor asserted that he leaned on papers from the late 2000s, not all the new stuff. That’s obviously because the new stuff did not support his preferred position, while the old stuff did. I would prefer that faster-than-light travel were possible, so I’ll just stick to physics before 1910?