Sunday, January 2, 2011

ADMIXTURE and STRUCTURE in Perspective: A Discussion of the Pitfalls of Likelihood Population Structure Analysis

In December, as ADMIXTURE results from the Dodecad Ancestry Project were put online, I noticed a post by one commenter intimating that ADMIXTURE would soon be automatically generating phylogenetic trees for various populations.

I'd been looking at the Dodecad results for a while and hadn't observed an orderly, tree-like progression of results with increasing K, so I had my suspicions about this comment.

In my analysis of the ADMIXTURE results for Middle Eastern populations, the Southwest Asian component seems to correlate with populations where the J1 Y-chromosome HG and R mt-DNA are common. El-Sibai et al and Chiaroni et al note that the J1 haplogroup is isolated and strongly represented in the inland Levant and on the Arabian peninsula. It is thus reassuring that ADMIXTURE was able to differentiate a component for this genetically isolated population.

More puzzling are the West Asian and South European clusters.  Based on spatial distribution data for West Asia, there seems to be a correlation of the J2, L and G y-chromosome haplogroups with the West Asian component and the R1b and E1b1b1 y-haplogroups with the Southern European component.  (See El-Sibai, Table 1 and Figure 2.)

The Eurogenes K10 results for Sinds and Gujaratis, which were run together with the Behar dataset, further elucidate the ADMIXTURE grouping of components with y-chromosome HGs:

----------------------------------------------
ADMIXTURE Eurogenes K10

Component        Gujarat   Sinds

West Asian          0.24    0.37
Central Asian       0.07    0.09
SW Asian            0.00    0.04
South Euro          0.02    0.01
North Euro          0.02    0.02
South Asian         0.63    0.45
----------------------------------------------

Consider these results against y-chromosome HG results for India and Pakistan:
---------------------------------------------------------------
Table 5 (Sengupta et al) lists the y-chromosome HG frequencies for India and Pakistan:

HG              India(%)   Pakistan(%)

G1-M285                        0.57
G2-P15             1.24        4.55
G5-P15                         1.14
J2a-M410           3.57        8.52
J2a1b-M067                     1.14
J2a1e-M158         0.27
J2b2-M241          5.22        2.27
L1-M076            6.32        5.11
L2-M317                        1.14
L3-M231            0.41        6.82
---------------------------------------------------------------
Subtotal          17.15       31.26

J1-M267            0.27        3.41

R*-M207            0.27        3.41
R1*-M173                       0.57
---------------------------------------------------------------
Subtotal           0.27        3.98

R1a1-M017         15.8        24.43

R1b2b-M073                     4.55
R1b3-M269          0.55        2.84
---------------------------------------------------------------
Subtotal           0.55        7.39

R2-M124            9.34        7.39
---------------------------------------------------------------

The authors of the Thangaraj et al paper note that the Neolithic West Asian contributions to the Pakistani and Indian genetic picture are paternal and are composed primarily of the G, J2 and L y-haplogroups. That presents a unique opportunity to discern which West Asian y-haplogroups are grouped into the ADMIXTURE West Asian component.

Comparing the proportions for Gujaratis, in Northwest India, and Sinds, in Southern Pakistan, from ADMIXTURE with the Sengupta results gives an idea of the y-haplogroup-West Asian component correlation:  Haplogroups J2, G and L appear to be grouped into the West Asian component.

Haplogroup R1a1 groups into the Central Asian component.  It is notable that Southern Pakistan Sinds appear to have a lower R1a1 contribution than other parts of Pakistan.

The Gujarati ADMIXTURE Southwest Asian component is 0% (Sengupta J1 HG result for India: 0.27%). The Sind ADMIXTURE Southwest Asian component is 4% (Sengupta J1 HG result for Pakistan: 3.41%).

Due to their low level, it is not clear in which clusters R1* and R2 HG populations group.
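The comparison above can be tabulated with a few lines of Python. This is only a rough sketch: the dictionary names and the `group_total` helper are made up for illustration, blank table cells are assumed to be 0, and single values are assigned to the Pakistan column wherever that is what makes the table's Pakistan subtotals add up. Keep in mind that the Sengupta figures are country-wide y-chromosome frequencies while the ADMIXTURE figures are autosomal proportions for Gujaratis and Sinds, so only the relative ordering is meaningful. The India rows as listed add up to 17.03, slightly under the 17.15 subtotal shown in the table.

```python
# (India %, Pakistan %) as listed in the Sengupta et al. table above.
# Assumption of this sketch: a blank cell means 0.
SENGUPTA = {
    "G1-M285":    (0.00, 0.57),
    "G2-P15":     (1.24, 4.55),
    "G5-P15":     (0.00, 1.14),
    "J2a-M410":   (3.57, 8.52),
    "J2a1b-M067": (0.00, 1.14),
    "J2a1e-M158": (0.27, 0.00),
    "J2b2-M241":  (5.22, 2.27),
    "L1-M076":    (6.32, 5.11),
    "L2-M317":    (0.00, 1.14),
    "L3-M231":    (0.41, 6.82),
    "J1-M267":    (0.27, 3.41),
    "R1a1-M017":  (15.8, 24.43),
}

# (Gujarat, Sinds) proportions from the Eurogenes K10 table above.
ADMIXTURE = {
    "West Asian":      (0.24, 0.37),
    "Central Asian":   (0.07, 0.09),
    "Southwest Asian": (0.00, 0.04),
}

# Haplogroup prefixes paired with each component, following the post's reading.
PAIRING = {
    "West Asian":      ("G", "J2", "L"),
    "Central Asian":   ("R1a1",),
    "Southwest Asian": ("J1",),
}


def group_total(prefixes):
    """Sum the India and Pakistan frequencies of all haplogroups whose
    name starts with one of the given prefixes."""
    india = sum(v[0] for hg, v in SENGUPTA.items() if hg.startswith(prefixes))
    pakistan = sum(v[1] for hg, v in SENGUPTA.items() if hg.startswith(prefixes))
    return round(india, 2), round(pakistan, 2)


for component, prefixes in PAIRING.items():
    india, pakistan = group_total(prefixes)
    gujarat, sinds = ADMIXTURE[component]
    print(f"{component:16s} y-HG India {india:6.2f}%  Pakistan {pakistan:6.2f}%"
          f"  | ADMIXTURE Gujarat {gujarat:.2f}  Sinds {sinds:.2f}")
```

In each pairing the Pakistan/Sinds value is higher than the India/Gujarat value, which is the pattern the argument relies on.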

Returning to the discussion about the limitations of ADMIXTURE, it is notable that ADMIXTURE has grouped the J2, G and L HGs together (but not J1). In retrospect, since men with J2, G and L HGs have been interspersed in West Asia since the Last Glacial Maximum (LGM), it isn't surprising that they are grouped together. What is surprising is that J1 appears separated, even though it sits on a leaf branch of the J-G-L root phylogeny.

Here, we can see that ADMIXTURE is not partitioning populations based on phylogeny, but by the degree to which a population has been isolated over a timescale of thousands of years.

A recent paper investigates some reasons why likelihood-based algorithms such as ADMIXTURE sometimes fail to correctly identify phylogeny and the relationships between clusters. The paper focuses on another computer program, STRUCTURE, but the problems of genetic clustering based on likelihood are encountered by all likelihood-based algorithms:

The computer program STRUCTURE does not reliably identify the main genetic clusters within a species:  simulations and implications for human population structure
ST Kalinowski
(Link)

From the paper:

"The goals of this paper are twofold. First, I will use computer simulation to examine whether STRUCTURE can correctly group individuals into clusters when populations have had a history of fragmentation and isolation. This is one of the simplest types of histories that a set of populations might have, and one of the most commonly used models to describe genetic relationships among natural populations. Second, I will explore two previously published data sets of human genetic diversity to determine whether problems identified in the simulations have influenced depictions of human population structure."

"Results from the simulations showed that the clustering arrangements produced by STRUCTURE were affected by the relative amount of differentiation among the populations, and that in some circumstances, STRUCTURE produced clusters that were not consistent with the main evolutionary divisions within the populations. For example, Figure 1 shows that STRUCTURE created evolutionarily accurate clusters when populations A, B and C were closely related to each other (for example, divergence times: 100/200/800). However, when population C was less closely related to population A and B--but still more related to A and B than to D--STRUCTURE clustered individuals from population C with population D."

"The results above show that if the value of K used to run STRUCTURE is less than the actual number of populations, STRUCTURE will sometimes place individuals from unrelated populations into the same cluster."

"I suspect that the problem is that the probability of the genotypic data is maximized by placing as many individuals as possible into genetically homogeneous clusters--with little regard to how the remaining individuals are clustered."

In the case of the Fertile Crescent analysis, this points to one reason why the J2-, G- and L-related populations may be clustered into the West Asian component. Not only are these populations not isolated, they are also less genetically homogeneous than the more isolated and easily discernible J1-related Southwest Asian component.
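To make the "likelihood" being maximized concrete: as I understand the ADMIXTURE paper (Alexander, Novembre and Lange), the model treats each genotype as two binomial draws with allele probability p_ij = sum over k of q_ik * f_kj, and the log-likelihood adds up g_ij * ln(p_ij) + (2 - g_ij) * ln(1 - p_ij) over individuals i and SNPs j. The Python below is a minimal sketch of that formula only. The function name and the toy data are hypothetical and are not taken from ADMIXTURE or STRUCTURE, and nothing here reproduces how either program searches for its solution.

```python
import math


def admixture_log_likelihood(G, Q, F):
    """Binomial log-likelihood of genotypes (additive constants dropped).

    G[i][j]: copies (0, 1 or 2) of the counted allele for individual i at SNP j
    Q[i][k]: fraction of individual i's ancestry assigned to cluster k
    F[k][j]: frequency of the counted allele at SNP j in cluster k
    Every mixed frequency p must lie strictly between 0 and 1.
    """
    total = 0.0
    for g_row, q_row in zip(G, Q):
        for j, g in enumerate(g_row):
            # Allele frequency expected for this individual at this SNP.
            p = sum(q * f_row[j] for q, f_row in zip(q_row, F))
            total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


# Toy data (made up): two individuals, three SNPs, two clusters.
G = [[2, 2, 0],
     [0, 0, 2]]
F = [[0.9, 0.9, 0.1],
     [0.1, 0.1, 0.9]]
Q_matched = [[1.0, 0.0], [0.0, 1.0]]  # each individual in the cluster it resembles
Q_swapped = [[0.0, 1.0], [1.0, 0.0]]  # assignments reversed

print(admixture_log_likelihood(G, Q_matched, F))
print(admixture_log_likelihood(G, Q_swapped, F))
```

The matched assignment scores far higher than the swapped one. Every term in the sum rewards a tight fit between an individual's genotypes and its cluster's allele frequencies, and no term refers to a tree relating the clusters, which is consistent with Kalinowski's suspicion quoted above.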

The tendency of likelihood algorithms to form the most homogeneous clusters first, while lumping the remaining, less similar populations together, has implications for their ability to analyze populations that are related, but less closely related than the members of a dominant group. For example, they may have trouble correctly capturing population movement between Europe and Africa against the backdrop of the main European and African clusters:

"The genetic similarities between Europeans and some Africans that I found are not evident in the output of STRUCTURE (Rosenberg et al 2005). STRUCTURE clustered all sub-Saharan Africans into a single cluster and all Europeans into another cluster (Rosenberg et al., 2005)—which suggests that the peoples of each of these continents are genetically more similar to each other than to peoples on other continents. Previous analyses of genetic diversity in humans do not seem to have noted the genomic similarity of Europeans and present-day African farmers. It has been shown for mitochondrial DNA (Ingman et al., 2000) and for Y-chromosomes (for example, Underhill and Kivisild, 2007), but apparently has not been recognized for autosomal loci which make up the majority of human genome."

The paper also mentions the problem of sample size. This does appear to be a problem with ADMIXTURE. Clusters that are heavily sampled seem to be over-represented in the results, while under-represented clusters may appear, but only at very low levels.

2 comments:

  1. ADMIXTURE does not have the problem identified by the authors for STRUCTURE. Structure is NOT a maximum likelihood method, but a Bayesian one, with a quite elaborate model that produces bad results if not used correctly.

  2. Hi Dienekes:

    Both ADMIXTURE and STRUCTURE use likelihood methods.

    You can read the paper yourself:

    Fast Model-Based estimation of Ancestry in Unrelated Individuals
    David H Alexander, John Novembre, Kenneth Lange

    Link: http://dalexander.bol.ucla.edu/preprints/admixture-preprint.pdf

    Abstract:
    Population stratification has long been recognized as a confounding factor in genetic association studies. Estimated ancestries, derived from multi-locus genotype data, can be used as covariates to correct for population stratification. One popular technique for estimation of ancestry is the model-based approach embodied by the widely-applied program STRUCTURE. Another approach, implemented in the program EIGENSTRAT, relies on principal component analysis rather than model-based estimation and does not directly deliver admixture fractions. EIGENSTRAT has gained in popularity in part due to its remarkable speed in comparison to structure. We present a new algorithm and a program, ADMIXTURE, for model-based estimation of ancestry in unrelated individuals. ADMIXTURE adopts the likelihood model embedded in structure. However, ADMIXTURE runs considerably faster, solving problems in minutes that take STRUCTURE hours. In many of our experiments we have found that ADMIXTURE is almost as fast as EIGENSTRAT. The runtime improvements of ADMIXTURE rely on a fast block relaxation scheme using sequential quadratic programming for block updates, coupled with a novel quasi-Newton acceleration of convergence. Our algorithm also runs faster and with greater accuracy than the implementation of an Expectation-Maximization (EM) algorithm incorporated in the program FRAPPE. Our simulations show that ADMIXTURE’s maximum likelihood estimates of the underlying admixture coefficients and ancestral allele frequencies are as accurate as STRUCTURE’s Bayesian estimates. On real world datasets, ADMIXTURE’s estimates are directly comparable to those from structure and eigenstrat. Taken together, our results show that ADMIXTURE’s computational speed opens up the possibility of using a much larger set of markers in model-based ancestry estimation and that its estimates are suitable for use in correcting population stratification in association studies.

