Wednesday, January 10, 2007

The Scientific Basis for Race

Moving outside the usual realm momentarily, this is from a colleague over in the Physics Department, Steve Hsu:

Metric on the space of genomes and the scientific basis for race, by Steve Hsu: Suppose that the human genome has 30,000 distinct genes, which we will label as i = 1, 2, ..., N, where N = 30k. Next, suppose that there are n_i variants or alleles (mutations) of the i-th gene. Then each human's genetic information can be described as a point on a lattice of size n_1 x n_2 x n_3 x ... x n_N, or equivalently as an N-tuple of integers whose i-th entry ranges from 1 to n_i. For the simplified case where there are exactly 10 variants of each gene, the number of points in this N-dimensional space is 10^N, or 10^{30k}, one for each distinct 30k-digit number. It's a space of very high dimension, but this doesn't stop us from defining a metric, that is, a definition of distance between any two points in the space. (For simplicity we ignore restrictions on this space which might result from incompatibility of certain combinations, etc.)
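As a rough illustration of the lattice picture above, here is a minimal Python sketch. The function names (lattice_size, log10_lattice_size, is_valid_genome) and the toy variant counts are hypothetical and are not part of the original essay; the code only restates the counting argument.

```python
import math

def lattice_size(variant_counts):
    """Number of lattice points: the product of the variant counts of all genes."""
    return math.prod(variant_counts)

def log10_lattice_size(variant_counts):
    """Base-10 logarithm of the lattice size, convenient when the size is huge."""
    return sum(math.log10(n) for n in variant_counts)

def is_valid_genome(genome, variant_counts):
    """A genome is a tuple whose i-th entry lies between 1 and the i-th variant count."""
    return len(genome) == len(variant_counts) and all(
        1 <= g <= n for g, n in zip(genome, variant_counts)
    )

# Toy case (made up): 3 genes with 2, 3 and 4 variants respectively.
toy_counts = [2, 3, 4]
print(lattice_size(toy_counts))                # 24
print(is_valid_genome((1, 3, 4), toy_counts))  # True
print(is_valid_genome((3, 1, 1), toy_counts))  # False: gene 1 has only 2 variants

# Simplified case from the text: N = 30,000 genes with 10 variants each.
N = 30_000
print(lattice_size([10] * N) == 10 ** N)       # True
print(log10_lattice_size([10] * N))            # the exponent, 30000
```

Working with the logarithm avoids printing a 30,001-digit integer while still showing the scale of the space.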

Note that the genomes of all of the humans who have ever lived occupy only a small subset of this space -- most possible variations have never been realized. For this reason, the surprise expressed by biologists that humans have so few genes (not many more than a worm, and far fewer than the 100k of earlier estimates) is no cause for concern -- the number of possible organisms that might result from 30k genes is enormous -- far more than the number of molecules in the visible universe.

To define a metric, we need a notion of how far apart two different alleles are. We can do this by counting base pair differences -- most mutations only alter a few base pairs in the genetic code. We can define the distance between two alleles in terms of the number of base pair changes between them (for two different alleles this is always a positive number). Then we can define the distance between two genomes as the sum, over i = 1, 2, ..., N, of the individual gene distances. It is natural, although perhaps not always possible, to choose the labeling of the n_i alleles of each gene to reflect relative distances, so that variants 1 and 2 are close together and both are very far from variant 10.

The exact definition of the metric and the allele labeling is somewhat arbitrary, but you can see it is easy to define a meaningful measure of how far apart any two individuals are in genome space.
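One way to make this metric concrete is sketched below in Python. It assumes, beyond what the text states, that the two alleles of a gene are strings of equal length, so that "base pair differences" reduces to a position-by-position count (a Hamming distance); insertions and deletions are ignored. The function names and the toy sequences are invented for illustration.

```python
def allele_distance(a, b):
    """Number of positions at which two equal-length alleles differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares alleles of equal length")
    return sum(x != y for x, y in zip(a, b))

def genome_distance(g1, g2):
    """Sum of the per-gene allele distances over all genes."""
    if len(g1) != len(g2):
        raise ValueError("genomes must list the same number of genes")
    return sum(allele_distance(a, b) for a, b in zip(g1, g2))

# Hypothetical toy genomes: 3 genes, each a short made-up base sequence.
person_a = ["ACGT", "GGCA", "TTAG"]
person_b = ["ACGA", "GGCA", "TAAC"]
person_c = ["ACGT", "GGCA", "TTAG"]

print(genome_distance(person_a, person_b))  # 3
print(genome_distance(person_a, person_c))  # 0
```

Because it is a sum of Hamming distances, this toy distance is symmetric, is zero only for identical genomes, and satisfies the triangle inequality, so it is a metric in the mathematical sense.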

Now plot the genome of each human as a point on our lattice. Not surprisingly, there are readily identifiable clusters of points, corresponding to traditional continental ethnic groups: Europeans, Africans, Asians, Native Americans, etc. (See, for example, Risch et al., Am. J. Hum. Genet. 76:268–275, 2005.) Of course, we can get into endless arguments about how we define European or Asian, and of course there is substructure within the clusters, but it is rather obvious that there are identifiable groupings, and as the Risch study shows, they correspond very well to self-identified notions of race.

From the conclusions of the Risch paper (Am. J. Hum. Genet. 76:268–275, 2005):

Attention has recently focused on genetic structure in the human population. Some have argued that the amount of genetic variation within populations dwarfs the variation between populations, suggesting that discrete genetic categories are not useful (Lewontin 1972; Cooper et al. 2003; Haga and Venter 2003). On the other hand, several studies have shown that individuals tend to cluster genetically with others of the same ancestral geographic origins (Mountain and Cavalli-Sforza 1997; Stephens et al. 2001; Bamshad et al. 2003). Prior studies have generally been performed on a relatively small number of individuals and/or markers. A recent study (Rosenberg et al. 2002) examined 377 autosomal microsatellite markers in 1,056 individuals from a global sample of 52 populations and found significant evidence of genetic clustering, largely along geographic (continental) lines. Consistent with prior studies, the major genetic clusters consisted of Europeans/West Asians (whites), sub-Saharan Africans, East Asians, Pacific Islanders, and Native Americans. [...] ethnic groups living in the United States, with a discrepancy rate of only 0.14%.

This clustering is a natural consequence of geographical isolation, inheritance and natural selection operating over the last 50k years since humans left Africa.

Every allele probably occurs in each ethnic group, but with varying frequency. Suppose that for a particular gene there are 3 common variants (v1, v2, v3), all the rest being very rare. Then, for example, one might find that in ethnic group A the distribution is v1 75%, v2 15%, v3 10%, while for ethnic group B the distribution is v1 2%, v2 6%, v3 92%. Suppose this pattern is repeated for several genes, with the common variants in population A being rare in population B, and vice versa. Then one might find a very dramatic difference in expressed phenotype between the two populations. For example, if skin color is determined by (say) 10 genes, and those genes have the distribution pattern given above, nearly all of population A might be fair skinned while nearly all of population B is dark, even though there is complete overlap in the set of common alleles. Perhaps having the third type of variant v3 in 7 out of 10 pigmentation genes makes you dark. This is highly likely for an individual in population B with the given probabilities, but highly unlikely in population A.
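The "7 out of 10" example can be checked with a short binomial calculation. The Python sketch below rests on assumptions the text does not state: the 10 genes are treated as independent, each individual carries a single variant per gene, and "dark" is read as v3 at 7 or more of the 10 genes. The names are hypothetical.

```python
from math import comb

def prob_at_least(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

N_GENES = 10     # pigmentation genes in the example
THRESHOLD = 7    # assumed reading: v3 at 7 or more genes gives the dark phenotype
P_V3_A = 0.10    # frequency of v3 in group A, from the text
P_V3_B = 0.92    # frequency of v3 in group B, from the text

p_dark_a = prob_at_least(THRESHOLD, N_GENES, P_V3_A)
p_dark_b = prob_at_least(THRESHOLD, N_GENES, P_V3_B)

print(f"group A: {p_dark_a:.2e}")  # roughly 9e-06
print(f"group B: {p_dark_b:.4f}")  # roughly 0.994
```

Under these assumptions the phenotype is present in well over 99% of group B and in fewer than one in 100,000 members of group A, even though both groups carry all three variants.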

We see that there can be dramatic group differences in phenotypes even if there is complete allele overlap between two groups -- as long as the frequency or probability distributions are distinct. But it is these distributions that are measured by the metric we defined earlier. Two groups that form distinct clusters are likely to exhibit different frequency distributions over various genes, leading to group differences.
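To see how distinct frequency distributions show up as distance, the following Python sketch computes the chance that two randomly drawn individuals carry different variants at a single gene, using the frequencies of the example above. It simplifies the metric by counting any two different variants as distance 1 rather than counting base pairs, and it treats draws as independent; both are assumptions made only for illustration, and the names are hypothetical.

```python
# Variant frequencies at one gene, taken from the example in the text.
FREQ_A = {"v1": 0.75, "v2": 0.15, "v3": 0.10}
FREQ_B = {"v1": 0.02, "v2": 0.06, "v3": 0.92}

def mismatch_probability(freq_x, freq_y):
    """Chance that one draw from freq_x and one from freq_y are different variants."""
    p_same = sum(p * freq_y.get(variant, 0.0) for variant, p in freq_x.items())
    return 1.0 - p_same

within_a = mismatch_probability(FREQ_A, FREQ_A)
within_b = mismatch_probability(FREQ_B, FREQ_B)
between = mismatch_probability(FREQ_A, FREQ_B)

print(f"within A:  {within_a:.3f}")  # 0.405
print(f"within B:  {within_b:.3f}")  # 0.150
print(f"between:   {between:.3f}")   # 0.884
```

Summed over many genes with the same pattern, the expected distance between a member of A and a member of B is larger than the expected distance within either group, which is what produces separate clusters in this toy setting.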

This leads us to two very distinct possibilities in human genetic variation:

Hypothesis 1: (the PC mantra) The only group differences that exist between the clusters (races) are innocuous and superficial, for example related to skin color, hair color, body type, etc.

Hypothesis 2: (the dangerous one) Group differences exist which might affect important (let us say, deep rather than superficial) and measurable characteristics, such as cognitive abilities, personality, athletic prowess, etc.

Note that H1 is under constant revision, as new genetically driven group differences (particularly in disease resistance) are being discovered. According to the mantra of H1 these must all (by definition) be superficial differences.

A standard argument against H2 is that the 50k years during which groups have been separated is not long enough for differential natural selection to cause any group differences in deep characteristics. I find this argument quite naive, given what we know about animal breeding and how evolution has affected the (ever expanding list of) "superficial" characteristics. Many genes are now suspected of having been subject to strong selection over timescales of order 5k years or less. For further discussion of H2 by Steve Pinker, see here.

The predominant view among social scientists is that H1 is obviously correct and H2 obviously false. However, this is mainly wishful thinking. Official statements by the American Sociological Association and the American Anthropological Association even endorse the view that race is not a valid biological concept, which is clearly incorrect.

As scientists, we don't know whether H1 or H2 is correct, but given the revolution in biotechnology, we will eventually. Let me reiterate, before someone labels me a racist: we don't know with high confidence whether H1 or H2 is correct.

Finally, it is important to note that any group differences are statistical in nature and do not imply anything about particular individuals from any group. Rather than rely on the scientifically unsupported claim that we are all equal, it would be better to emphasize that we all have inalienable human rights regardless of our abilities or genetic makeup.

Posted on Wednesday, January 10, 2007 at 01:10 AM in Economics, Science