To classify humanity is not that hard

[Figure: allele frequencies by population at two skin color SNPs, SLC24A5 (top panel) and SLC45A2 (second panel)]

In my post below I quoted my interview with L. L. Cavalli-Sforza because I think it gets to the heart of some confusions which have emerged since the finding that most variation at any given locus is found within populations, rather than between them. The standard figure is that 85% of genetic variance is within continental races, and 15% is between them. You can see some Fst values on Wikipedia to get an intuition. Concretely, at a given locus X in population 1 the frequency of allele A may be 40%, while in population 2 it may be 45%. Obviously the populations differ, but such a small difference is not going to be very informative of population substructure when most of the variation is within populations.
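To put a number on the 40% versus 45% example, here is a minimal sketch of a two-population Fst calculation. It assumes a biallelic locus, two equally weighted populations, and the simple heterozygosity-based definition Fst = (H_T - H_S) / H_T. The function names are my own, and the second pair of frequencies is hypothetical, chosen only for contrast.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1, p2):
    """Fst = (H_T - H_S) / H_T for two equally weighted populations.

    Undefined (division by zero) if the pooled frequency is exactly 0 or 1.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = heterozygosity(p_bar)                             # pooled population
    h_within = (heterozygosity(p1) + heterozygosity(p2)) / 2.0  # mean within populations
    return (h_total - h_within) / h_total


# The 40% vs. 45% example from the text.
print(round(fst_two_populations(0.40, 0.45), 4))
# A hypothetical locus where each population is nearly fixed for a different allele.
print(round(fst_two_populations(0.01, 0.99), 4))
```

Under these assumptions the first locus puts well under 1% of its variance between populations, while the second puts about 96% there. A genome-wide average can hide individual loci of the second kind.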

But there are loci which are much more informative. Interestingly, one controls variation on a trait which you are familiar with, skin color (unless you happen to lack vision). A large fraction (on the order of 25-40%) of the between-population variance in the complexion of Africans and Europeans can be predicted by a substitution at one SNP in the gene SLC24A5. The substitution has a major phenotypic effect, and it exhibits a great deal of between-population variation. One variant is nearly fixed in Europeans, and another is nearly fixed in Africans. In other words, the component of genetic variance at this locus that is between populations is nearly 100%, not 15%. This illustrates that the 15% value is an average across the genome, and that there are in fact significant differences on the genetic level which can be ancestrally informative. You can take this to the next level: increase the number of ancestrally informative markers to obtain a fine-grained picture of population structure. In the illustration above the top panel shows the frequencies at the SNP mentioned earlier on SLC24A5. The second panel shows variation at another SNP controlling skin color, on SLC45A2. This second SNP is useful in separating South and Central Asians from Europeans and Middle Easterners, if not perfectly so. In other words, the more markers you have, the better your resolution of inter-population difference. This is why I found the following comment very interesting:

Razib’s final concession (that genetic variation exists) is revealing because I think that’s as far as the argument can really be taken. It’s a bit of a strawman, in that people who argue that race is entirely a social construct don’t actually deny that human genetic variation exists. What they deny is that there are non-arbitrary and mutually exclusive categories into which humans can be resolved. This is, I think, the point being made by the “Race by Fingerprints” etc. rhetorical device cited earlier.

In other words, it may be possible for any particular phenotypic trait or genetic locus to be resolved into a strictly cladistic system but humans, being an amalgam of such traits and loci, defy such resolution. So while the study of human genetic variation does, indeed, have “instrumental utility” the concept of biological races is, itself, an archaic relic.

As I noted below, the comment doesn’t make sense. Here is a PCA of world populations using 250,000 markers:

[Figure: PCA of world populations using 250,000 markers]

The relationships between individuals are hypothesis-free. That is, the two largest components of variance in the data just happen to produce clusters which neatly map onto geographic realities. If you think about this a little, it makes total sense: populations share a history of intermarriage, so over time they will develop population-specific distinctiveness. It may be true that most of the variance is within populations, but it is not difficult at all to discriminate populations, or to generate clusters which are not arbitrary as a function of geography or social identity.
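The way clusters fall out of many weakly informative markers can be sketched with a toy simulation. This is not the analysis behind the plot above. It is a minimal illustration with hypothetical parameters: two simulated populations of 20 individuals, 600 independent loci, and a frequency difference of 0.2 at each locus, so that most variation at every locus remains within populations. The first principal component is found by plain power iteration in the standard library. A real analysis would use a dedicated numerical package.

```python
import random


def simulate_genotypes(n_per_pop, n_loci, delta, rng):
    """Simulate diploid genotypes (0, 1 or 2 allele copies) for two populations."""
    freq_a = [rng.uniform(0.3, 0.7) for _ in range(n_loci)]
    # Population B differs by delta at every locus, in a random direction.
    freq_b = [p + rng.choice((-delta, delta)) for p in freq_a]
    rows, labels = [], []
    for label, freqs in (("A", freq_a), ("B", freq_b)):
        for _ in range(n_per_pop):
            rows.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
            labels.append(label)
    return rows, labels


def pc1_scores(rows, rng, n_iter=100):
    """PC1 scores (up to sign and scale) by power iteration on the Gram matrix."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    # Gram matrix of individuals: entry (i, k) is the dot product of rows i and k.
    gram = [[sum(x * y for x, y in zip(ri, rk)) for rk in centered] for ri in centered]
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random start vector
    for _ in range(n_iter):
        w = [sum(g * x for g, x in zip(gram_row, v)) for gram_row in gram]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


rng = random.Random(42)
rows, labels = simulate_genotypes(n_per_pop=20, n_loci=600, delta=0.2, rng=rng)
scores = pc1_scores(rows, rng)  # the labels are never passed to the PCA step
scores_a = [s for s, lab in zip(scores, labels) if lab == "A"]
scores_b = [s for s, lab in zip(scores, labels) if lab == "B"]
# True when the two groups do not overlap at all on PC1.
separated = max(scores_a) < min(scores_b) or max(scores_b) < min(scores_a)
print(separated)
```

The point of the design is that the population labels are used only afterwards, to check the result. The separation comes from the correlated signal across loci, not from any single marker.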

There are relationships which do not match intuition. Or at least intuition as it crystallized during the period of the rise of modern taxonomic science. The various phenotypically “black” peoples of the world, Africans, Melanesians, and some South Asians, do not cluster together. Rather, all non-Africans are separated from Africans by the largest component of variance within the data set. The traits used to make inferences of taxonomy in “folk biology” and early scientific attempts to generate a systematic tree of life in relation to the human races were not necessarily representative of total genome variation, which captures the evolutionary history of a population with greater accuracy and precision.

And obviously you don’t need 250,000 markers, let alone all ~3 billion base pairs in the human genome, to distinguish on the level of continental races/populations. A paper in 2002 laid out the parameters. δ is a measure of between population difference on genes.

[Figures: sig1, sig2]

From the paper:

…we can estimate that about 120 unselected SNPs or 20 highly selected SNPs can distinguish group CA from NA, AA from AS and AA from NA. A few hundred random SNPs are required to separate CA from AA, CA from AS and AS from NA, or about 40 highly selected loci. STRP loci are more powerful and have higher effective δ values because they have multiple alleles. Table 3 reveals that fewer than 100 random STRPs, or about 30 highly selected loci, can distinguish the major racial groups. As expected, differentiating Caucasians and Hispanic Americans, who are admixed but mostly of Caucasian ancestry, is more difficult and requires a few hundred random STRPs or about 50 highly selected loci. These results also indicate that many hundreds of markers or more would be required to accurately differentiate more closely related groups, for example populations within the same racial category.
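The dependence on marker number in estimates like these can be illustrated with a toy simulation. This is not the paper's method. It is a sketch under stated assumptions: δ is read as the allele-frequency difference between two groups at a locus, loci are independent, genotypes follow Hardy-Weinberg proportions, and the frequencies are known exactly. Individuals are assigned to whichever population gives their genotype the higher likelihood. All names and parameter values are hypothetical.

```python
import math
import random


def assignment_accuracy(n_loci, delta, n_trials=400, seed=0):
    """Fraction of simulated diploid individuals assigned to their true population."""
    rng = random.Random(seed)
    # Hypothetical allele frequencies: population B differs from A by exactly delta.
    freq_a = [rng.uniform(0.2, 0.8 - delta) for _ in range(n_loci)]
    freq_b = [p + delta for p in freq_a]
    # Per-locus log-likelihood ratios (A vs. B) for an allele copy being present or absent.
    llr_present = [math.log(a / b) for a, b in zip(freq_a, freq_b)]
    llr_absent = [math.log((1 - a) / (1 - b)) for a, b in zip(freq_a, freq_b)]

    correct = 0
    for trial in range(n_trials):
        from_a = trial % 2 == 0          # alternate the true population
        freqs = freq_a if from_a else freq_b
        score = 0.0
        for locus, p in enumerate(freqs):
            for _ in range(2):           # two allele copies per locus
                if rng.random() < p:
                    score += llr_present[locus]
                else:
                    score += llr_absent[locus]
        assigned_a = score > 0           # a positive score favors population A
        correct += assigned_a == from_a
    return correct / n_trials


for n_loci in (1, 10, 100):
    print(n_loci, assignment_accuracy(n_loci, delta=0.2))
```

With a single locus the assignment is only modestly better than a coin flip, and accuracy climbs as loci are added. A smaller δ, as between closely related groups, pushes the required number of markers up, which is the qualitative pattern the quoted passage describes.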

The paper was written in 2002. Since then much has changed. Here is an image from a post from last summer:

[Figure: village1]

People within European villages tend to be relatively closely related. Again, it is totally reasonable that given enough markers you could assign individuals to different villages with high confidence. Concretely, person X may show up in the pedigree of individuals from village 1 ~100 times at a given generation, while the same person may show up in the pedigree of individuals from village 2 ~10 times at a given generation. This isn’t rocket science; the basic logic as to why populations shake out based on geography and endogamy patterns is pretty obvious when you think about it.
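The arithmetic behind one person appearing many times in a pedigree is simple. A minimal sketch follows, assuming a closed ancestral pool. The pool size and generation count are hypothetical round numbers, not estimates for any real village.

```python
def pedigree_slots(generations):
    """Number of ancestor positions in a pedigree a given number of generations back."""
    return 2 ** generations


def mean_appearances(generations, ancestral_pool):
    """Average number of pedigree slots each member of a closed ancestral pool must fill."""
    return pedigree_slots(generations) / ancestral_pool


# Hypothetical numbers: 20 generations back, an ancestral pool of 1,000 people.
print(pedigree_slots(20))          # 1048576
print(mean_appearances(20, 1000))  # 1048.576
```

Once the number of slots exceeds the number of available ancestors, the same people must fill many slots. Under endogamy, local ancestors fill far more of them than ancestors from the village next door.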

At about the same time as the above work, A. W. F. Edwards, a statistical geneticist, published a paper titled Lewontin’s Fallacy which took direct aim at the misunderstanding of the human Fst statistic and its relevance for classification. Here is Edwards answering why he wrote the article in 2002 (my co-blogger at GNXP, David B, is doing the questioning):

4. Your recent article on ‘Lewontin’s Fallacy’ criticises the claim that human geographical races have no biological meaning. As the article itself points out, it could have been written at any time in the last 30 years. So why did it take so long – and have you had any reactions from Lewontin or his supporters? [David B’s question -R]

I can only speak for myself as to why it took me so long. Others closer to the field will have to explain why the penny did not drop earlier, but the principal cause must be the huge gap in communication that exists between anthropology, especially social anthropology, on the one hand, and the humdrum world of population and statistical genetics on the other. When someone like Lewontin bridges the gap, bearing from genetics a message which the other side wants to hear, it spreads fast – on that side. But there was no feedback. Others might have noticed Lewontin’s 1972 paper but I had stopped working in human and population genetics in 1968 on moving to Cambridge because I could not get any support (so I settled down to writing books instead). In the 1990s I began to pick up the message about only 15% of human genetic variation being between, as opposed to within, populations with its non-sequitur that classification was nigh impossible, and started asking my population-genetics colleagues where it came from. Most had not heard of it, and those that had did not know its source. I regret now that in my paper I did not acknowledge the influence of my brother John, Professor of Genetics in Oxford, because he was independently worrying over the question, inventing the phrase ‘the death of phylogeny’ which spurred me on.

Eventually the argument turned up unchallenged in Nature and the New Scientist and I was able to locate its origin. I only started writing about it after lunch one day in Caius during which I had tried to explain the fallacy across the table to a chemist, a physicist, a physiologist and an experimental psychologist – all Fellows of the Royal Society – and found myself faltering. I like to write to clear my mind. Then I met Adam Wilkins, the editor of BioEssays, and he urged me to work my notes up into a paper.

I have had no adverse reaction to it at all, but plenty of plaudits from geneticists, many of whom told me that they too had been perplexed. Perhaps the communication gap is still too large, or just possibly the point has been taken. After all, Fisher made it in 1925 in Statistical Methods which was written for biologists so it is hardly new. [my emphasis -R]

Richard Dawkins repeated Edwards’ argument in The Ancestor’s Tale. You can read Edwards’ full essay online. Also see p-ter’s lucid exposition at GNXP.

[Figure: five individuals]

So far I’ve been talking mostly about genes. But in terms of classification there isn’t anything magical about genes. Biological anthropologists using more robust morphometric traits have discerned an “Out of Africa” movement, just as geneticists have. Above you have five individuals. All of them have dark hair and dark eyes. There’s total overlap on those traits. And yet I’m pretty sure you can assign a rough population identity to each. Why? Because humans look at correlated clusters of traits when assigning population identity intuitively. Some traits are more salient, such as skin color, but early geographers understood that East Asians and Europeans were different populations despite their similarly light complexions. The ancient Greeks understood that Indians and Ethiopians were different groups despite their similar complexions, because they differed on other informative traits.

Let’s bring it back down to earth. Population structure exists. Phylogenetic analyses of humans are trivial in their difficulty. They track geography rather closely, at least before the age of mass migration. Additionally, they tend to follow endogamous social groups, such as Ashkenazi Jews. A South Asian is going to be more genetically related to a South Asian than they are to an African. There are many cosmetic differences between populations. But there are also less cosmetic differences which are very important. You can even assign different regions of a chromosome to different ancestral components.

Where does this leave us? Ultimately, it’s about the “R-word.” “Race is a myth.” Or, as PBS stated, an illusion. Here’s some of the precis of the PBS documentary:

Everyone can tell a Nubian from a Norwegian, so why not divide people into different races? That’s the question explored in “The Difference Between Us,” the first hour of the series. This episode shows that despite what we’ve always believed, the world’s peoples simply don’t come bundled into distinct biological groups. We begin by following a dozen students, including Black athletes and Asian string players, who sequence and compare their own DNA to see who is more genetically similar. The results surprise the students and the viewer, when they discover their closest genetic matches are as likely to be with people from other “races” as their own.

Much of the program is devoted to understanding why. We look at several scientific discoveries that illustrate why humans cannot be subdivided into races and how there isn’t a single characteristic, trait – or even one gene – that can be used to distinguish all members of one race from all members of another.

Modern humans – all of us – emerged in Africa about 150,000 to 200,000 years ago. Bands of humans began migrating out of Africa only about 70,000 years ago. As we spread across the globe, populations continually bumped into one another and mixed their mates and genes. As a species, we’re simply too young and too intermixed to have evolved into separate races or subspecies.

So what about the obvious physical differences we see between people? A closer look helps us understand patterns of human variation:

In a virtual “walk” from the equator to northern Europe, we see that visual characteristics vary gradually and continuously from one population to the next. There are no boundaries, so how can we draw a line between where one race ends and another begins?

We also learn that most traits – whether skin color, hair texture or blood group – are influenced by separate genes and thus inherited independently one from the other. Having one trait does not necessarily imply the existence of others. Racial profiling is as inaccurate on the genetic level as it is on the New Jersey Turnpike.

We also learn that many of our visual characteristics, like different skin colors, appear to have evolved recently, after we left Africa, but the traits we care about – intelligence, musical ability, physical aptitude – are much older, and thus common to all populations. Geneticists have discovered that 85% of all genetic variants can be found within any local population, regardless of whether they’re Poles, Hmong or Fulani. Skin color really is only skin deep. Beneath the skin, we are one of the most similar of all species.

Certainly a few gene forms are more common in some populations than others, such as those controlling skin color and inherited diseases like Tay Sachs and sickle cell. But are these markers of “race?” They reflect ancestry, but as our DNA experiment shows us, that’s not the same thing as race. The mutation that causes sickle cell, we learn, was passed on because it conferred resistance to malaria. It is found among people whose ancestors came from parts of the world where malaria was common: central and western Africa, Turkey, India, Greece, Sicily and even Portugal – but not southern Africa.

This documentary came out in 2003. In late 2005 scientists discovered the role that SLC24A5 plays in skin color. It is the second most ancestrally informative locus typed so far to differentiate Europeans and Africans. It actually does come close to being a single gene which differentiates two populations! It is true that human populations have mixed. I probably have ancestors who were resident in China and Northern Europe within the last 1,000 years. That’s the way genealogy works. All Eurasians may be able to find a genealogical line of ancestry back to Genghis Khan (though not necessarily distinctive genes attributable to him). But that does not negate the fact that some of your ancestors show up in your pedigree orders of magnitude more often than others. The vast majority of my ancestors within the last 1,000 years were South Asian, though a substantial minority were Southeast Asian. The question of our youth as a species and its relation to our differentiation into races and subspecies is an empirical matter, not an a priori one determined by a fixed number of years. Since races and subspecies are fuzzy categories, they’re easy to refute: just pick the definition which is refutable. I have no idea how they adduce that traits like intelligence, musical ability, and physical aptitude are that much older than the “Out of Africa” migration. Humans as a whole have been getting much more gracile over the last 10,000 years, and I don’t know how one can know about the musical abilities of anatomically modern humans in Africa 200,000 years ago. These traits are quantitative, and based on standing genetic variation, so the architecture is qualitatively different from that of skin color (though since in 2003 we didn’t know the architecture of skin color, the confusion is explainable).

The old concept of “race” as outlined by anthropologists in the early 20th century, and accepted broadly, was often unclear, ad hoc, and not empirical. Over the past generation, by way of refuting the concept of race, people have been wont to make unclear, ad hoc, and non-empirical assertions. The reason that scholars discuss race and refute it is to eliminate confusions and misconceptions among the public, but their presentation has produced more confusions and misconceptions. The idea that human phylogeny is impossible is in the air; I have heard it from many intelligent people. I can see no reason why people would be skeptical of it, since the way it is presented by many scholars makes the implication clear: that phylogeny is impossible, that differences are trivial. Both of these are false impressions. I do not believe that the real problems mixed-race people have obtaining organs with the appropriate tissue match are a trivial affair. Human genetic differences have plenty of concrete impacts which are not socially constructed.

Personally I have no problem with abandoning the word race and all the baggage which that entails. But there’s no reason to throw the baby out with the bathwater here. In the “post-genomic” era human population substructure is taken for granted. The outlines of the history of our species, and its various branches, are getting clearer and clearer. There’s no point in replacing old rubbish with new rubbish. We have the possibility for clear and useful thought, if we choose to grasp it.