Last spring I posted ‘Beyond visualization of data in genetics’ in the hope that people wouldn’t take PCA too far and assume that the method reflects reality in any definitive fashion. Remember, PCA visualizations show you two, and at most three, dimensions of the genetic variation within the data set at any given time. The fine print is important; e.g., “PC 1 15%”, “PC 2 4.5%”, etc., which gives the proportion of the variation in the data that each dimension accounts for. You see the largest, and likely historically most significant on a population-wide scale, components of genetic variance, but there’s still a large remainder left over. Yet when I look at referrals from message boards, people obviously aren’t careful about what PCA is telling them.
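To make the fine print concrete, here is a minimal sketch in Python of how much variation a two-axis plot leaves out. It assumes the axis labels are percentages of total variance, and it uses the example labels quoted above; the function name is just an illustrative choice.

```python
def unexplained_percentage(plotted_pc_percentages):
    """Percentage of total variance NOT captured by the plotted components.

    Assumes each input is the 'PC n x%' label from a plot axis, i.e. the
    share of total variance that component accounts for.
    """
    shown = sum(plotted_pc_percentages)
    if not 0.0 <= shown <= 100.0:
        raise ValueError("plotted percentages must sum to between 0 and 100")
    return 100.0 - shown


# The example labels from the text: "PC 1 15%" and "PC 2 4.5%".
plotted = [15.0, 4.5]
remainder = unexplained_percentage(plotted)
print(f"Shown on the plot: {sum(plotted):.1f}%")
print(f"Not shown:         {remainder:.1f}%")
```

With those two labels the plot displays under a fifth of the variation in the data set, which is why the remainder matters.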
As an illustration, in the 23andMe user interface you can “compare genes” across people with whom you “share genes”. This comparison operates over ~550,000 single nucleotide polymorphisms out of 3 billion base pairs (you can constrain it to traits, but I’m going to talk about the comparison across the whole data set below). For example, a man of European descent shares 83.2% with his daughter, who is Eurasian (the mother is Burmese, with some recent Indian admixture). Another man of European descent shares 84% with his daughter, whose mother is also European (in fact, both parents are western European). The “gene sharing” of these two men with other people of European descent is in the 74-75% range (for reference, a Chinese person is 71%, and a Nigerian 68.5%). On the PCA plot the European and his Eurasian daughter are very far apart, while the European man and his European daughter cluster together. What you’re seeing on the PCA chart is population-level information, not the genetic uniqueness within families and across parents and offspring.
To further explore this issue, I thought it would be interesting to revisit my own genetic data. If you read my previous post, you will know it is not boring. As an ethnic Bengali my ancestry comes from the northeast of the Indian subcontinent, so in addition to the “Asian” fraction which most South Asians have in the 23andMe “ancestry painting” (around 25% on average, with 10-35% probably marking the extremes within two standard deviations, from what I can tell), I likely have some Southeast Asian ancestry from Burma. 23andMe has three “reference” populations it uses from the HapMap:
Asian = Chinese/Japanese
European = Northwest European
African = Yoruba
All of us get an ancestry painting which is a combination of these three. Unfortunately unless you’re a relatively straightforward combination of these three groups it isn’t always too informative. So if you’re African American you should be in luck since the two ancestral populations which you derive from are included as reference populations. On the other hand, unadmixed Native Americans tend to be about 25% European and 75% Asian, while unadmixed South Asians are 75% European and 25% Asian. That’s because the allele frequencies in these two populations have some relationship to both the reference groups, even if there hasn’t been any recent admixture (additionally, the painting presumably misses a lot that is distinctive to these groups, though 23andMe has a feature which allows people to explore possible Native American ancestry specifically).
As I told you before my ancestry is 57% European and 43% Asian. This is a very large Asian fraction for a South Asian, and after comparing notes with other South Asian 23andMe customers I’m pretty sure that my large fraction is due to having admixture from Burmese and/or Tibeto-Burman or Austro-Asiatic “Hill Tribes” to the north, south and east of Bengal. Since my family is from the east of Bengal that is not too surprising.
You know from my previous post that on the PCA plot I am near, but outside of, the main South Asian cluster. But there’s some interesting data from the gene comparison feature too. For reasons of privacy I’m obviously not going to give you names, but I will label people by geographical origin where I know it. Additionally, the comparison below is mostly to Indians, so I’m going to substitute names of Indian states for those individuals where I have that level of specificity. I also restandardized the gene sharing value, so that the individual with whom I share the most is 0, and the one with whom I share the least on the plot is 1 (74.5% to 73.04%, if you’re curious). To add a wrinkle, I’ve put the % Asian calculated from 23andMe’s ancestry painting on the Y axis. The two images below show the results; the first includes some East Asians and a European, while the second includes only South Asians.
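For anyone who wants to reproduce the restandardization, it can be sketched in Python as a reversed min-max rescaling. Only the two endpoints (74.5% and 73.04%) come from my data; the middle value and the function name are hypothetical, included purely to show the mapping.

```python
def rescale_sharing(sharing_percentages):
    """Map raw gene-sharing percentages onto a 0-to-1 distance scale.

    The highest sharing value (nearest individual) becomes 0 and the
    lowest (furthest individual) becomes 1.
    """
    highest = max(sharing_percentages)
    lowest = min(sharing_percentages)
    if highest == lowest:
        raise ValueError("need at least two distinct sharing values")
    span = highest - lowest
    return [(highest - value) / span for value in sharing_percentages]


# 74.5 and 73.04 are the endpoints quoted in the text;
# 73.77 is a made-up midpoint used only for illustration.
raw = [74.5, 73.77, 73.04]
for value, distance in zip(raw, rescale_sharing(raw)):
    print(f"{value:.2f}% shared -> distance {distance:.2f}")
```

Note how compressed the raw scale is: the whole 0-to-1 range on the plots covers less than one and a half percentage points of sharing.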
The first image is of more interest. Two points:
1 – Unlike most South Asians I have greater gene sharing identity with East Asians than with Europeans. The South Asian to whom I am closest does not exhibit my own pattern, as they are closer to some Europeans than they are to some Chinese. In contrast, I not only unequivocally share more genes with East Asians than with Europeans, but I also share more genes with some East Asians than I do with the individual from Iran, one South Asian from the northwest of the subcontinent, and another from southern India. This last pattern is very peculiar from what I’ve been told (the other Bangladeshi has the same tendency, though not to the same extent).
2 – There is a woman from Burma with whom I am sharing genes. Her father, who died when she was young, had Indian ancestry, and reputedly spoke Tamil. She is ~20% European, which would make her father ~40% European. I have not seen a South Indian who is less than 65% European, so I believe that he had native Burmese admixture. If his mother was Burmese that would make his father ~80% European, which I have seen in a few South Indians, though their usual range seems to be 65-75%. Note that I am closer to her than I am to most South Asians. In contrast, the Bangladeshi with whom I am sharing genes, and who has the second highest percentage of Asian in his ancestry, is about as far from this woman as he is from the Punjabis in terms of distance (whereas the Punjabis are about 2.5 times further than she is from my own genetic state).
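The back-of-the-envelope arithmetic in the second point can be written out as a short Python sketch. It rests on two assumptions: that a child's painted fraction is roughly the average of the two parents' fractions, and that an unadmixed Burmese parent would be painted 0% European. Neither is exact, and the function name is only illustrative.

```python
def implied_parent_percentage(child_pct, other_parent_pct):
    """Infer one parent's ancestry percentage from the child's.

    Assumes the child's value is the simple average of the two parents'
    values, so: parent = 2 * child - other_parent.
    """
    implied = 2.0 * child_pct - other_parent_pct
    if not 0.0 <= implied <= 100.0:
        raise ValueError("inputs imply an impossible parental percentage")
    return implied


BURMESE_EUROPEAN_PCT = 0.0  # assumption: unadmixed Burmese parent paints 0% European

woman = 20.0  # ~20% European, from the text
father = implied_parent_percentage(woman, BURMESE_EUROPEAN_PCT)
grandfather = implied_parent_percentage(father, BURMESE_EUROPEAN_PCT)

print(f"Father:               ~{father:.0f}% European")
print(f"Paternal grandfather: ~{grandfather:.0f}% European")
```

Each generation back with a Burmese spouse doubles the implied European fraction, which is how ~20% in the daughter becomes ~40% in her father and ~80% in his father.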
If I did the same plot of % Asian against gene sharing for the European man and his Eurasian daughter, I would see a noticeable linear pattern for most of the data: the more Asian, the less gene sharing. The exception would be his daughter, who would have a large Asian fraction, but would be the closest by this genetic distance measure. Similarly, the Burmese woman with some Indian admixture is an outlier on my plot. The South Asians follow a southeast-to-northwest range of distance from me, with a rough, but not perfect, correspondence with Asian ancestry. Among the South Asians the individual from Bihar is an exception, just as the Burmese woman is. Why? From previous comments I’ve made I have indicated that there is a high probability of recent Burmese ancestry on my paternal lineage (specifically, through my paternal grandfather, whose physical appearance is always described as atypical for a Bengali; my paternal grandmother was from a Hindu family which converted, and she looked stereotypically Bengali). Additionally, I know my mother’s maternal grandfather is from the Indian state of Uttar Pradesh, specifically, the region of Delhi. But I also know that before they were Muslim my maternal grandfather’s family were of the Hindu Kayastha caste. The individual from Bihar is a Kayastha, and for those of you who do not know, Bihar is the state just to the west of Bengal. I do not know if the Kayasthas share any deep genetic affinity or not, but I recall that Reich et al. observed a high degree of genetic evidence of endogamy in South Asia. So, just as I believe that I share Burmese-specific genetic variants with the woman of predominantly Burmese origin which are not showing up in the simple ancestry estimates based on the global reference populations, I may also share Kayastha-specific variants which result in my genetic closeness to the Bihari individual. But my confidence in the latter conjecture is far weaker than in the former case.
In reviewing all I’ve said so far I suppose the moral of the story is not to trust too deeply in one set of data visualizations or summary statistics. Granted, some people have axes to grind and can find what they want in the science; my posts on Jewish genetics indicate that very strongly. But if you’re genuinely interested in patterns of variation, and your own place within the broader framework, you need to open different windows on the same data to get a truly fully fleshed-out understanding of the nature of things. If you are of an understudied population, and of somewhat mixed background, as I am, tread lightly and carefully. If you are of a well studied and characterized population, then learning you are 100% European is basically worthless (though some of the more detailed PCAs can tell you some things).