Exercise 21.1

Linkage Disequilibrium and the Human Genome

(This exercise is based on Reich, D. E., M. Cargill, S. Bolk, J. Ireland, P. C. Sabeti, D. J. Richter, T. Lavery, R. Kouyoumjian, S. F. Farhadian, R. Ward, and E. S. Lander. 2001. Linkage disequilibrium in the human genome. Nature 411: 199–204.)

(Note: The reference above links directly to the article on the journal’s website. In order to access the full text of the article, you may need to be on your institution’s network [or logged in remotely], so that you can use your institution’s access privileges.)


Population geneticists are keenly interested in characterizing the genetic variation that exists within and among populations because genetic variation is what natural selection operates on to yield evolutionary change. Genetic variation exists not only at the level of variants at a single genetic locus (or nucleotide), but can also be characterized at the level of combinations of alleles from different loci (also known as haplotypes).

Linkage disequilibrium (LD) is a measure that describes how often combinations of alleles appear in a population relative to the frequencies expected if alleles at different loci were associated independently of one another. When the alleles at one locus occur independently of the alleles at another locus, the population is said to be in linkage equilibrium; any deviation from this equilibrium is called linkage disequilibrium. For a numerical example, suppose allele A1 is present at frequency 0.7 and allele B1 (at a different locus) is present at frequency 0.4. If the population is at linkage equilibrium for these two loci, then the combination A1B1 should appear at frequency 0.28 (0.7 times 0.4). Suppose instead that the frequency of A1B1 were 0.37; in this case, nearly all (0.37 of 0.4) of the B1 alleles are associated with the A1 allele, and only a few (0.03 of 0.4) are associated with non-A1 alleles. This case would be an example of strong linkage disequilibrium. Another way to look at LD is as the extent of correlation between alleles at different loci.
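The arithmetic in the numerical example can be sketched in a few lines of Python. The function name and the symbol D (the observed haplotype frequency minus the frequency expected at linkage equilibrium) are introduced here for illustration; D is a standard coefficient of linkage disequilibrium, but it is not named in the text above.

```python
def linkage_disequilibrium(p_a, p_b, p_ab):
    """Return (expected haplotype frequency, D) for one allele at each of two loci.

    p_a  : frequency of the allele at the first locus (e.g., A1)
    p_b  : frequency of the allele at the second locus (e.g., B1)
    p_ab : observed frequency of the two-allele combination (e.g., A1B1)
    """
    expected = p_a * p_b   # frequency expected if the alleles are independent
    d = p_ab - expected    # deviation from linkage equilibrium
    return expected, d


# Values from the A1/B1 example in the text.
expected, d = linkage_disequilibrium(0.7, 0.4, 0.37)
print(f"expected A1B1 = {expected:.2f}, D = {d:.2f}")
```

A value of D = 0 corresponds to linkage equilibrium; here the A1B1 combination is more common than expected, so D is positive.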



Question 1. The frequency of the C3 allele is 0.3 and the frequency of the D8 allele is 0.12. What is the expected frequency of the combination C3D8 at linkage equilibrium (no linkage disequilibrium)?


Although evolutionary geneticists have known about linkage disequilibrium for many decades, its importance was not fully realized until the current genomics era. The implications of LD include its use in mapping genes associated with diseases and other traits, and the information it carries that pertains to the evolutionary and demographic history of the population of study.

Most evolutionary forces that affect allele frequencies will create or maintain LD. These include mutation, random genetic drift, and most forms of natural and sexual selection. The admixture of populations and non-random mating can also create LD. Recombination between loci diminishes LD, but does so at a slow rate (proportional to the recombination distance between the loci). Because the effects of genetic drift on LD are known and the strength of genetic drift is inversely proportional to the effective population size, evolutionary geneticists are able to make inferences about the effective population size of different populations from LD data taken from multiple genes.
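The way recombination erodes LD can be illustrated with the standard textbook decay model, in which the disequilibrium coefficient D shrinks by a factor of (1 - r) each generation, where r is the recombination fraction between the two loci. This is a minimal sketch that assumes random mating and ignores drift, selection, mutation, and admixture; the function name, the starting value of D, and the values of r are hypothetical and chosen only for illustration.

```python
def ld_after_generations(d0, r, generations):
    """LD remaining after a number of generations of recombination alone.

    d0          : initial disequilibrium coefficient D
    r           : recombination fraction between the loci (0 to 0.5)
    generations : number of generations of random mating
    """
    return d0 * (1.0 - r) ** generations


# Compare unlinked loci (r = 0.5) with progressively tighter linkage.
for r in (0.5, 0.01, 0.0001):
    remaining = ld_after_generations(0.09, r, 100)
    print(f"r = {r}: D after 100 generations = {remaining:.4f}")
```

Under these assumptions, LD between unlinked loci vanishes within a few generations, whereas LD between tightly linked loci can persist for hundreds or thousands of generations.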

In addition to providing information about the evolutionary history of a population, LD studies also have practical value. Association mapping, a powerful method used to map genetic variants associated with traits, relies on the existence of linkage disequilibrium between the genetic variant that actually affects the trait and molecular markers. For this reason, association mapping is also called LD mapping. Considerations of the placement and density of markers used in association mapping depend upon the extent of LD seen in a given population.

In 1999, Leonid Kruglyak published results of a simulation study showing expected values of LD for sites in the genome at various distances from one another. According to this simulation, as the distance between markers increases, LD should quickly fall off and be nearly undetectable beyond about 20 kilobases (kb). If this result were true for most regions of the genome in most human populations, then association mapping would require a much greater density of markers (more markers per given length of DNA) than had previously been considered.

In 2001, David Reich, Eric Lander, and other researchers at MIT undertook the first large-scale study of LD across the human genome. They examined 19 randomly selected regions of the genome from a population of European-Americans from Utah. For each of these 19 regions, they sequenced at various distances (1 kb, 5 kb, 20 kb, 40 kb, 80 kb, 160 kb) from the focal single nucleotide polymorphism (SNP).

Figure 1 The extent of LD for both the simulation study by Kruglyak and the average for the actual data. In this study, D´ was used as the index of linkage disequilibrium; D´ = 1 is the maximal LD possible, and D´ = 0 is no LD (linkage equilibrium).
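To make the D´ scale concrete, the sketch below computes D´ using the common Lewontin normalization, in which D (the observed haplotype frequency minus the product of the allele frequencies) is divided by the largest value it could take given those allele frequencies. This is an illustrative assumption about the calculation, not code from the study; the function name is hypothetical, and the inputs are the A1/B1 frequencies from the earlier numerical example.

```python
def d_prime(p_a, p_b, p_ab):
    """Return |D'| for one allele at each of two biallelic loci.

    p_a, p_b : frequencies of the chosen allele at each locus
    p_ab     : observed frequency of the haplotype carrying both alleles
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        # Largest positive D possible for these allele frequencies.
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        # Largest negative D possible (in absolute value).
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0


# A1 at 0.7, B1 at 0.4, A1B1 observed at 0.37.
print(f"D' = {d_prime(0.7, 0.4, 0.37):.2f}")
```

In this example D´ would reach 1 only if every B1 allele were found with A1 (an A1B1 frequency of 0.4), and it would be 0 at the linkage equilibrium frequency of 0.28.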


Question 2. Does LD extend farther in the actual data or in the simulation study? At approximately what distance (in kb) is D´ = 0.5 in the actual data?

Figure 2 The extent of LD for each of the 19 genomic regions selected. The diamond symbols represent actual data points, and the solid line represents the average value for that genomic region. The dashed line represents the LD relationship averaged across all genomic regions. The panel on the lower right shows the relationship between the rank of a genomic region in extent of LD (highest LD on left, lowest LD on right) and the average recombination rate for that genomic region (in centimorgans per Mb).


Question 3. Does the WASL genomic region have a greater-than-average or a less-than-average extent of LD?


Question 4. What is the relationship between the average recombination rate and the extent of LD?

What is the reason for the discrepancy between the predicted results from the simulation data and the actual results? One plausible explanation is a genetic bottleneck that reduced population sizes and thus increased the strength of genetic drift at some time in the past. The authors of the study considered the prospect of a bottleneck, and modeled how bottlenecks at different times in the past would affect LD.

Figure 3 How bottlenecks of F = 0.4 at various times (400, 800, 1600, and 3200 generations in the past) would affect LD, as well as the actual data and the previous prediction.


Question 5. Would a bottleneck that occurred 400 generations ago cause LD to increase more or less than a bottleneck of the same intensity that occurred 1600 generations ago?


Question 6. Assume that the bottleneck was responsible for the change in the extent of LD and that it was of size F = 0.4. Are the actual data most consistent with a bottleneck that took place less than 400 generations ago, one that took place 800–1600 generations ago, one that took place 1600–3200 generations ago, or one that took place more than 3200 generations ago?

Figure 4 Comparison of the relationships of LD with distance between SNPs for the Utah and the Nigerian populations.


Question 7. The researchers also sequenced the same genomic regions from a population in Nigeria. Which population shows the larger extent of LD?



Question 8. Which is more likely: that the Nigerian population suffered a more intense bottleneck than did the Utah population or that the Nigerian population did not suffer an intense bottleneck? Explain your answer.


Question 9. Would the Nigerian population be better or less well suited than the Utah one for association mapping studies when the marker density is fairly low (one marker per 30 kb)? Explain your answer.