submitted 30 Jun 2014
Admixture models are a ubiquitous approach to capture latent population structure in genetic samples. Despite the widespread application of admixture models, little thought has been devoted to the quality of the model fit or the accuracy of the estimates of parameters of interest for a particular study. Here we develop methods for validating admixture models based on posterior predictive checks (PPCs), a Bayesian method for assessing the quality of a statistical model. We develop PPCs for five population-level statistics of interest: within-population genetic variation, background linkage disequilibrium, number of ancestral populations, between-population genetic variation, and the downstream use of admixture parameters to correct for population structure in association studies. Using PPCs, we evaluate the quality of the model estimates for four qualitatively different population genetic data sets: the POPRES European individuals, the HapMap phase 3 individuals, continental Indians, and African American individuals. We found that the same model fitted to different genomic studies resulted in highly study-specific results when evaluated using PPCs, illustrating the utility of PPCs for model-based analyses in large genomic studies.
We have developed posterior predictive checks (PPCs) for analyzing genomic data sets with the admixture model. We have demonstrated that the PPC (estimating the posterior predictive distribution and checking the likelihood of the true observed data under this distribution) gives a valuable perspective on genetic data beyond statistical inference of model parameters. In the research literature, fitted admixture models are often accompanied by a 'just so' story to explain the inferred parameters and how they are reflective of ancestral truth. The model may suggest these hypotheses, but only conditioned on the model being a good fit for the observed data. PPCs check this assumption of good fit, giving weight to the hypotheses by confirming that the underlying assumptions do not oversimplify the existing structure in the observed data. In this paper, we developed PPCs for the admixture model, designing biological discrepancy functions to quantify the effect of the model assumptions on interpreting and using the estimated parameters for downstream analyses.
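As a rough illustration of the general PPC recipe (draw parameters from the posterior, simulate replicated data, and compare a discrepancy on the replicates with the same discrepancy on the observed data), the following Python sketch uses a deliberately simple toy model rather than the admixture model. Everything in it is an assumption made for illustration: the single-SNP binomial genotype model with a Beta(1, 1) prior, the variance discrepancy, and the helper names `posterior_predictive_pvalue`, `draw_posterior`, and `simulate`.

```python
import random


def posterior_predictive_pvalue(observed, draw_posterior, simulate,
                                discrepancy, n_rep=1000, seed=0):
    """Fraction of replicated data sets whose discrepancy is at least
    as large as the discrepancy of the observed data."""
    rng = random.Random(seed)
    d_obs = discrepancy(observed)
    exceed = 0
    for _ in range(n_rep):
        theta = draw_posterior(rng)                   # one posterior draw
        replicate = simulate(theta, len(observed), rng)
        if discrepancy(replicate) >= d_obs:
            exceed += 1
    return exceed / n_rep


def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


# Toy data (hypothetical): genotypes (0, 1, or 2 alternate alleles) at one
# SNP, with no heterozygotes at all, as if two unmixed groups were pooled.
genotypes = [0] * 50 + [2] * 50
alt = sum(genotypes)
total = 2 * len(genotypes)


def draw_posterior(rng):
    # Posterior of the allele frequency under a Beta(1, 1) prior
    # (an assumption of this sketch).
    return rng.betavariate(1 + alt, 1 + total - alt)


def simulate(p, n, rng):
    # Each genotype is the sum of two independent allele draws.
    return [int(rng.random() < p) + int(rng.random() < p) for _ in range(n)]


p_value = posterior_predictive_pvalue(genotypes, draw_posterior,
                                      simulate, variance)
print(p_value)
```

A posterior predictive p-value near 0 or 1 signals that the fitted model rarely reproduces the observed value of the discrepancy. Here the toy model cannot generate the excess genotype variance caused by the missing heterozygotes, so the check flags misfit.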
Statistical modeling of genetic data requires us to balance the complexity of the model with its capacity to capture the data at hand. As examples of limitations, we may not have enough data to support an overly complex model, or the model class that we want to fit may be too complex given our computational constraints. Thus, we support the iterative practice of fitting the simplest model (i.e., the one we fit here), checking whether a higher resolution model is needed, and then improving the model only in the ways that result in more reliable interpretations of the results. PPCs can drive this process of targeted model development, pointing us towards enriched Bayesian admixture models along gradients that quantifiably improve their performance for the exploratory tasks that matter. With this practice in mind, we revisit the PPCs described above and discuss how we might enrich the simple admixture model to address its misspecified assumptions.
Many population studies have applied admixture models to explore and quantify genetic variation between individuals within and across ancestral populations [13,45,46]; these analyses may benefit from the inter-individual PPC. For studies where this PPC indicates misfit, prior work has adapted the admixture model to control admixture LD by explicitly modeling haplotype blocks for each ancestral population instead of modeling each SNP separately. In particular, the SNP-specific ancestry assignment z variables for each individual are modeled by a Markov chain, where the probability of transitioning to a different ancestral population from one position to the next has an exponential distribution. This specifies a Poisson process describing the length of haplotype blocks across the chromosome, with global rate parameter r.
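The Markov chain on the z variables can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not the implementation used in prior work: between adjacent SNPs separated by distance d, the ancestry is kept with probability exp(-r d), and otherwise it is redrawn from the individual's admixture proportions. The function names, the distance units, and the example values of r are hypothetical.

```python
import math
import random


def simulate_local_ancestry(positions, admixture, rate, seed=0):
    """Simulate per-SNP ancestry assignments z along one chromosome.

    positions: increasing SNP positions (arbitrary genetic-distance units)
    admixture: the individual's admixture proportions, one per population
    rate:      global rate r of the Poisson process of ancestry breakpoints
    """
    rng = random.Random(seed)
    populations = list(range(len(admixture)))
    z = [rng.choices(populations, weights=admixture)[0]]
    for prev_pos, pos in zip(positions, positions[1:]):
        # Probability of no breakpoint between the two SNPs.
        stay = math.exp(-rate * (pos - prev_pos))
        if rng.random() < stay:
            z.append(z[-1])
        else:
            # A breakpoint occurred: redraw the ancestry, which may
            # land on the same population again.
            z.append(rng.choices(populations, weights=admixture)[0])
    return z


def count_switches(z):
    """Number of adjacent SNP pairs with different ancestry."""
    return sum(a != b for a, b in zip(z, z[1:]))


positions = [0.01 * i for i in range(500)]
low_rate = simulate_local_ancestry(positions, [0.5, 0.5], rate=0.5)
high_rate = simulate_local_ancestry(positions, [0.5, 0.5], rate=50.0)
print(count_switches(low_rate), count_switches(high_rate))
```

A small r yields long haplotype blocks of constant ancestry, while a large r approaches the original model in which each SNP's ancestry is drawn independently.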
Many studies have noted that background LD may lead to phantom ancestral populations; studies that apply admixture models to genomic data containing background LD may find the SNP autocorrelation PPC useful. After identifying model misspecification using our background LD discrepancy function, we could extend the admixture model to explicitly capture background LD. Above we described a Markov model on the z variables. It assumes that, conditional on ancestral population assignment, genotypes are independent. Extending this idea, SABER implements a Markov hidden Markov model (MHMM) to capture both haplotype blocks and background LD by adding a Markov chain across the population-specific allele frequencies in beta. Others have further extended this model in various ways, including estimating recombination events explicitly in the MHMM.
Methods and statistics have been proposed to evaluate the proper number of latent ancestral populations, often motivated by FST [6, 49]; additionally, nonparametric Bayesian models estimate the posterior probability for each K [50, 51]. We propose a PPC with the FST discrepancy for general use in evaluating appropriate ranges of the number of ancestral populations for a specific study. A simple adaptation of the model to correct for a failure of this PPC is to change the number of ancestral populations K (Figure S3).
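For concreteness, an FST-style discrepancy can be computed directly from population-specific allele frequencies. The sketch below is an illustrative assumption rather than the discrepancy function used in this study: it is a simple heterozygosity-based estimator, (H_T - H_S) / H_T, with equal population weights and a ratio of sums over SNPs. The function name `fst` and the example frequencies are hypothetical.

```python
def fst(pop_freqs):
    """Heterozygosity-based F_ST across SNPs (ratio of sums).

    pop_freqs: one list of allele frequencies per ancestral population,
               all of the same length (one entry per SNP).
    """
    n_pops = len(pop_freqs)
    n_snps = len(pop_freqs[0])
    h_within = 0.0   # sum over SNPs of mean within-population heterozygosity
    h_total = 0.0    # sum over SNPs of heterozygosity of the pooled frequency
    for snp in range(n_snps):
        freqs = [pop[snp] for pop in pop_freqs]
        p_bar = sum(freqs) / n_pops
        h_within += sum(2 * p * (1 - p) for p in freqs) / n_pops
        h_total += 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0   # all SNPs monomorphic: no differentiation to measure
    return (h_total - h_within) / h_total


identical = [[0.2, 0.5], [0.2, 0.5]]
diverged = [[0.1, 0.9], [0.9, 0.1]]
print(fst(identical), fst(diverged))
```

In a PPC, a statistic of this kind would be computed on the observed data and on each replicated data set for a given K, and the observed value would then be located within the replicated distribution.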
There are also explicit model adaptations that will affect the FST of the inferred ancestral populations. For example, one can build hierarchical models that allow the sharing of allele frequencies across populations for some SNPs; this was implemented in the structure 2.0 model, which includes a hierarchical component to allow similar allele frequencies across ancestral populations (the so-called F model). A second example is from the topic model literature (similar models applied to modeling text documents), where the ancestral populations are captured in a tree-structured hierarchy [52, 53]. In the corresponding admixture models, the root node would include SNPs that have shared allele frequencies across all ancestral populations; at the leaves, the population-specific allele frequencies would include SNPs whose frequency in that population differs from the frequency in all other ancestral populations (referred to as ancestry informative markers).
Previous population studies have explored and interpreted the population-specific SNP frequencies estimated by admixture models [54-56]; almost all applications of this admixture model have used MAP estimates of ancestry assignments to determine the proportion of admixture in individuals [14, 20]. The average entropy PPC will check model misspecification for ancestry assignment, and has implications for interpreting estimates of SNP frequencies. To adapt the model to this misspecification, the hyperparameters for the Dirichlet-distributed allele-specific ancestry assignments may be changed. (We and others set alpha = 1, giving equal weight to all possible contributions across ancestries for each SNP.) In particular, we might give higher weight to admixture proportions near 0 and 1 by setting alpha < 1 for studies where we expect low levels of admixture (e.g., the HapMap data). The equivalent change for the hyperparameters in the population-specific allele frequency parameters would encourage allele frequency spectra that more closely match what we find in natural populations. Another relevant model adaptation would be to modify the distribution of a SNP to be not Bernoulli but instead Poisson, normal, or something more sophisticated [60, 61]. We emphasize that, though these extensions seem reasonable, the PPC with this discrepancy found little need to modify the admixture model assumptions in our current studies. The exception to this point is the ASW study, although we hypothesize that correcting for background LD as suggested above will address this misspecification.
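The effect of the Dirichlet hyperparameter on admixture proportions can be seen in a small simulation. The Python sketch below is illustrative only: it draws symmetric Dirichlet vectors by normalizing gamma variates and compares their average entropy for alpha = 1 and alpha < 1. The choices K = 3, alpha = 0.1, and the helper names are assumptions, not settings from this study.

```python
import math
import random


def sample_dirichlet(alpha, k, rng):
    """One draw from a symmetric Dirichlet(alpha) over k populations,
    obtained by normalizing independent Gamma(alpha, 1) variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def entropy(proportions):
    """Shannon entropy (in nats) of a vector of admixture proportions."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def mean_entropy(alpha, k=3, n_draws=2000, seed=0):
    """Average entropy of admixture proportions drawn from the prior."""
    rng = random.Random(seed)
    return sum(entropy(sample_dirichlet(alpha, k, rng))
               for _ in range(n_draws)) / n_draws


flat = mean_entropy(alpha=1.0)     # equal weight to all proportions
sparse = mean_entropy(alpha=0.1)   # favors proportions near 0 and 1
print(flat, sparse)
```

With alpha < 1 the prior mass concentrates near the corners of the simplex, so the average entropy is lower, matching the expectation of low admixture for most individuals.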
We believe that all model-based methods to control for population stratification in association mapping will benefit from application of the mapping PPC, including linear mixed models and non-generative methods such as EIGENSTRAT [2, 62]. Failure of the association mapping PPC indicates that the estimates of population structure are insufficient to correct for the confounding latent structure in the individuals. There are many directions to consider for mitigating this type of model misspecification. As examples, one may use larger numbers of estimated principal components or ancestral populations, use alternative approaches to specifying the latent structure variables, or correct for structure that is estimated on local regions of the genome. This same discrepancy function, with z replaced by the estimated random effect from linear mixed models, would be useful in quantifying model misspecification for these alternative methods for association mapping in the presence of confounding population structure [63-65].
Applied statisticians develop models to capture the biological complexity of their data. To form hypotheses from these models, however, we need assurances that the data can support them. PPCs provide a simple mechanism to quantify when a model is sufficient or when it needs additional structure to support downstream analysis. While we have focused on the admixture model, the PPC methodology applies to any probabilistic model of data. For example, we believe there could be a substantial role for PPCs in evaluating demographic models. As we continue to collect complex genomic data, we continue to develop complex models to explain them. Equally important to building our repertoire of statistical models for analyzing genomic data is to build our repertoire of ways to check those analyses.