Localizing recent positive selection in humans using multiple statistics

Online this week in Science, a group presents a method for identifying genes under positive selection in humans, and gives some examples. I have somewhat mixed feelings about this paper, for reasons I’ll get to, but here’s their basic idea:

Readers of this site will likely be familiar with genome-wide scans for loci under positive selection in humans (see, e.g., the links in this post). In such a scan, one decides on a statistic that measures some aspect of the data that should differ between selected loci and neutral loci (for example, extreme allele frequency differences between populations, or long haplotypes at high frequency) and calculates this statistic across the genome. One then chooses a threshold above which a locus counts as “interesting”, and looks at those loci for patterns: are there genes involved in particular phenotypes among those loci? Or protein-coding changes?
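To make the scan idea concrete, here is a minimal sketch in Python. It assumes the statistic is the absolute allele frequency difference between two populations and that "interesting" means the top fraction of loci genome-wide. The function names, the toy frequencies, and the cutoff are all made up for illustration and are not taken from any particular paper.

```python
def frequency_difference(freqs_pop1, freqs_pop2):
    """Per-locus statistic: absolute allele frequency difference
    between two populations (one value per locus)."""
    return [abs(a - b) for a, b in zip(freqs_pop1, freqs_pop2)]

def top_fraction(values, fraction=0.01):
    """Indices of loci whose statistic falls in the top `fraction`
    of the empirical genome-wide distribution (at least one locus)."""
    n_keep = max(1, int(len(values) * fraction))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sorted(order[:n_keep])

# Toy allele frequencies at four hypothetical loci (assumed values).
pop1 = [0.90, 0.50, 0.20, 0.45]
pop2 = [0.05, 0.50, 0.25, 0.40]

stat = frequency_difference(pop1, pop2)
interesting = top_fraction(stat, fraction=0.25)
print(interesting)
```

Note that the threshold here is purely empirical: some loci will always land in the top fraction, whether or not anything was actually under selection.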

In this paper, the authors note that many of these statistics measure different aspects of the data, so combining them should increase power to distinguish “interesting” loci from non-“interesting” loci. That is, if there’s an allele at 90% frequency in Europeans and 5% frequency in Asians, that’s interesting, but if that allele is also surrounded by extensive haplotype structure in one of those populations, that’s even more interesting. The way they combine statistics is pretty straightforward: they essentially just multiply together empirical p-values from different tests as if they were independent. I wouldn’t believe the precise probabilities that come out of this procedure (for one, the statistics aren’t really fully independent), but it seems to work for ranking SNPs in order of probability of being the causal SNP underlying a selection signal, both in simulations of new mutations that arise and are immediately under selection and in examples of selection signals where the causal variant is known (Figures 1-3).
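The "multiply the p-values" idea can be sketched in a few lines of Python. This is my own toy illustration of the description above, not the authors' implementation: the test names, the scores, and the choice of an upper-tail empirical p-value are all assumptions.

```python
import math
from bisect import bisect_left

def empirical_p_values(scores):
    """Upper-tail empirical p-value for each score: the fraction of
    all scores that are at least as large as it."""
    ordered = sorted(scores)
    n = len(ordered)
    return [(n - bisect_left(ordered, s)) / n for s in scores]

def composite_scores(stats_by_test):
    """stats_by_test maps a test name to a list of per-SNP scores
    (same SNP order in every list). Returns, per SNP, the sum of
    -log10 empirical p-values, i.e. -log10 of their product, which
    treats the tests as if they were independent."""
    p_by_test = [empirical_p_values(s) for s in stats_by_test.values()]
    return [-sum(math.log10(p) for p in snp_ps) for snp_ps in zip(*p_by_test)]

# Toy scores for four hypothetical SNPs under two hypothetical tests.
toy = {
    "freq_diff": [0.1, 0.5, 0.9, 0.3],
    "haplotype": [0.2, 0.1, 0.8, 0.4],
}

scores = composite_scores(toy)
# Rank SNPs from most to least "interesting".
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Because the tests are correlated in real data, the composite value is best read as a ranking score rather than as a calibrated probability.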

With this, the authors have a systematic approach for localizing polymorphisms that have experienced recent selection. It’s necessarily somewhat heuristic, sure, but it does the job. They then want to apply this procedure to gain novel insight into recent human evolution. This is sort of the crux of the matter–does this new method actually give us new biological insight?

The novel biology presented consists of a few examples of selection signals where they now think they’ve identified a plausible mechanism for the selection–a protein-coding change in PCDH15, and regulatory changes near PAWR and USF1 (their Figure 4). On reflection, however, these examples aren’t new. Consider PCDH15–this gene was mentioned in a previous paper by the same group, where they called a protein-coding change in the gene one of the 22 strongest candidates for selection in humans (Table 1 here, and main text). It’s unclear what is gained with the new method (except perhaps to confirm their previous result?).

Or consider the regulatory changes near PAWR and USF1. The authors use available gene expression data to show that SNPs near these genes influence gene expression, and that the signals for selection and the signals for association with gene expression overlap. Early last year, a paper examined in detail the overlap between signals of this sort, and indeed, both of these genes are mentioned as examples where this overlap is observed. So using different methods, a different group published the same conclusion about these genes a year ago. Again, it’s unclear what one gains with this new method.

In general, then, this paper has interesting ideas, but puzzlingly fails to really take advantage of them [1]. That said, they’ve taken some preliminary steps down a path that is very likely to yield interesting results in the future.

-----

[1] I wonder if I’m being too harsh on this paper just because it was published in a “big-name” journal. If this were published in Genetics, for example, I certainly wouldn’t be opining about whether or not it contains any novel biology.

-----
Citation: Grossman et al. (2010) A Composite of Multiple Signals Distinguishes Causal Variants in Regions of Positive Selection. Science. DOI: 10.1126/science.1183863

Razib Khan