Monday, June 2, 2014

High performance computation of landscape genomic models integrating local indices of spatial association

Sylvie Stucki, Pablo Orozco-terWengel, Michael W. Bruford, Licia Colli, Charles Masembe, Riccardo Negrini, Pierre Taberlet, Stéphane Joost, and the NEXTGEN Consortium
May 30, 2014

Introduction

The time interval between Mitton et al.’s (1977) first attempt to correlate allelic frequencies with environmental variables to look for signatures of selection in ponderosa pine, and Joost et al.’s (2007; 2008) application of this concept allowing parallel processing of large numbers of logistic regressions, was otherwise marked by little development. During this period correlative approaches were used in parallel with population genetics outlier-detection methods (e.g. Beaumont and Nichols, 1996; Vitalis et al., 2003; Foll and Gaggiotti, 2008) as cross-validation (e.g. Jones et al., 2013; Henry and Russello, 2013) to detect signatures of selection (see a review in Vitti et al., 2013). However, while such methods are still in vogue (e.g. Colli et al., 2014), there has been a revival of interest in developing new statistical approaches for the emerging field of landscape genomics (e.g. Coop et al., 2010; Günther and Coop, 2013; Frichot et al., 2013; Guillot et al., 2014). For example, BayEnv (Günther and Coop, 2013) implements a Bayesian method to compute correlations between allele frequencies and ecological variables, taking into account differences in sample sizes and shared demographic history. LFMM (Frichot et al., 2013) estimates the influence of population structure on allele frequencies by introducing unobserved variables as latent factors. Finally, SGLMM (Guillot et al., 2014) uses a spatially-explicit computational framework including a random effect to quantify the correlation between genotypes and environmental variables. Yet important functions are still lacking, such as high performance computing capacity to process whole genome data and the integration of spatial statistics to support a distinction between selection and demographic signals. Here we present the software Sambada, which aims at filling these gaps by offering an open source multivariate analysis framework to detect signatures of selection.
Sambada’s use is illustrated with a case study dedicated to the detection of potentially adaptive loci in 813 Bos taurus and Bos indicus individuals in Uganda genotyped for 40,000 SNPs. Lastly, Sambada’s performance is described with respect to other state-of-the-art software to detect signatures of selection.

Methods

Sambada uses logistic regressions to model the probability of presence of an allelic variant for a polymorphic marker given the environmental conditions of the sampling locations (Joost et al., 2007). Since each of the states of a given character is considered independently (i.e. as binary presence/absence in each sample), Sambada can handle many types of molecular data (e.g. SNPs, indels, copy number variants and haplotypes), provided the user formats the input. Specifically, biallelic SNPs are recoded as three distinct genotypes. A maximum likelihood approach is used to fit the models (Dobson and Barnett, 2008).
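
The modelling step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Sambada's code: the function names (`recode_snp`, `fit_logistic`, `g_test`), the Newton-Raphson fitting loop, the choice of a likelihood-ratio (G) test against the constant model, and the simulated data are all assumptions made here for illustration.

```python
import math
import numpy as np

def recode_snp(genotypes, alleles=("A", "G")):
    """Recode one biallelic SNP into three binary presence/absence vectors."""
    a, b = sorted(alleles)
    # Sort each genotype string so that "GA" and "AG" count as the same heterozygote.
    calls = ["".join(sorted(g)) for g in genotypes]
    return {state: np.array([int(c == state) for c in calls])
            for state in (a + a, a + b, b + b)}

def fit_logistic(x, y, n_iter=50, tol=1e-8):
    """Maximum-likelihood fit of logit(p) = b0 + b1*x; x=None fits the constant model."""
    cols = [np.ones(len(y))] + ([] if x is None else [np.asarray(x, dtype=float)])
    X = np.column_stack(cols)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        # Newton-Raphson step: solve (X' W X) step = X' (y - p), with W = diag(p(1-p)).
        step = np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return beta, float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def g_test(x, y):
    """Likelihood-ratio test of the environmental model against the constant model."""
    _, ll_null = fit_logistic(None, y)
    beta, ll_alt = fit_logistic(x, y)
    g = 2.0 * (ll_alt - ll_null)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return beta, g, p_value

# Simulated example (hypothetical data): presence of a genotype driven by one variable.
rng = np.random.default_rng(1)
env = rng.normal(size=500)
presence = (rng.random(500) < 1 / (1 + np.exp(-(-0.5 + 1.5 * env)))).astype(float)
beta, g, p_value = g_test(env, presence)
print(beta, g, p_value)
```

Each of the three vectors returned by `recode_snp` would be tested separately against each environmental variable, which is why the number of models grows quickly with the number of markers.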


Case Study

This study addressed local adaptation in Ankole and Shorthorn zebu cattle in Uganda. Sampling was designed to cover the whole country, including each eco-geographic region, and to obtain a homogeneous distribution of individuals across the country. A regular grid made of 51 cells of 70 × 70 km was produced to this end. On average, four farms were visited in each cell and four unrelated individuals were selected from each farm, for a total of 917 biological samples retrieved from 202 farms. Recorded information also included the location of the farm, the name of the breed, a picture and morphological information on each individual. These elements were stored in a database accessible through a Web interface, enabling real-time monitoring of the sampling campaign.


Discussion

The key features of Sambada are the multivariate modelling and the measure of spatial autocorrelation. Both can help the interpretation of results when the dataset features population structure. Bivariate models may include the global ancestry coefficients provided by a preliminary analysis. This setup can detect which loci are correlated with the environment while taking demography into account. Additionally, the introduction of measurements of spatial autocorrelation into these analyses integrates spatial statistics with landscape genomics. Unlike most current non-spatial models (e.g. Frichot et al., 2013; Coop et al., 2010), the approach integrated in Sambada allows the determination of whether the observed data reflect independent samples, a requirement of the underlying modelling assumptions of such methodologies. Measuring spatial autocorrelation assesses whether the occurrence of a genotype is related to its frequency in the surrounding locations. More specifically, local indices of spatial autocorrelation allow the mapping of areas prone to spatial dependency. On the basis of the present analysis, using spatial statistics in conjunction with correlative models may lower the risk of false positives due to population structure in landscape genomics.
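
As an illustration of what a local index of spatial autocorrelation measures, the following Python sketch computes a local Moran's I for the presence of a genotype at each sampling location. It is a sketch under stated assumptions, not Sambada's implementation: the k-nearest-neighbour weighting scheme, the function names, and the simplified permutation test (which shuffles all values rather than conditioning on each location) are choices made here for illustration.

```python
import numpy as np

def knn_weights(coords, k):
    """Row-standardised weights: each location's k nearest neighbours get weight 1/k."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)               # a location is not its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]
    w = np.zeros(d.shape)
    rows = np.repeat(np.arange(len(coords)), k)
    w[rows, idx.ravel()] = 1.0 / k
    return w

def local_morans_i(values, w):
    """Local Moran's I: I_i = z_i * sum_j(w_ij * z_j) / m2, with z the centred values."""
    z = values - values.mean()
    m2 = np.mean(z ** 2)
    return z * (w @ z) / m2

def pseudo_pvalues(values, w, n_perm=999, seed=0):
    """Simplified permutation test: shuffle values over locations and compare |I_i|."""
    rng = np.random.default_rng(seed)
    z = values - values.mean()
    m2 = np.mean(z ** 2)
    observed = z * (w @ z) / m2
    exceed = np.zeros(len(z))
    for _ in range(n_perm):
        shuffled = rng.permutation(z)
        exceed += np.abs(z * (w @ shuffled) / m2) >= np.abs(observed)
    return (exceed + 1) / (n_perm + 1)

# Hypothetical example: a genotype present in the western half of a 6 x 6 grid only.
coords = np.array([(i, j) for i in range(6) for j in range(6)], dtype=float)
presence = (coords[:, 0] < 3).astype(float)
w = knn_weights(coords, k=4)
local_i = local_morans_i(presence, w)
p_values = pseudo_pvalues(presence, w)
```

With row-standardised weights the mean of the local indices equals the global Moran's I, so the local values can be read as each location's contribution to the overall spatial dependence.
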
In the present study, Sambada detected the highest number of SNPs as potentially subject to selection among the four approaches. However, when comparing the positions of these SNPs, 1,029 of them were less than 100,000 base pairs apart from another detected locus, so some of these detections might refer to the same signature of selection. Sambada’s results partially match those of BayEnv, with 435 common SNPs (i.e. 22% of BayEnv’s detections). Concerning the third correlative approach, LFMM is more conservative than Sambada but the correspondence is better, since 154 loci (out of 280, i.e. 55% of LFMM’s detections) are detected by both methods. Moreover, 25 SNPs detected by LFMM only are less than 100,000 base pairs apart from a locus detected by Sambada, potentially identifying the same selection signature. The order of detections differed between the two methods, as the most significant loci detected by Sambada are ignored by LFMM. Lastly, Arlequin’s best results involved 17 SNPs with p-values lower than 10⁻⁴ (significance threshold = 2.5 · 10⁻⁷), out of which 2 were common with Sambada and 16 were common with BayEnv. This result suggests that population-based methods, whether using outliers or environmental correlations, tend to detect the same selection signatures. On the one hand, Sambada's detection rate may indicate the occurrence of some false positives due to population structure; on the other hand, the discrepancy between the results may indicate that the more conservative approaches have some false negatives. Thus the actual number of loci subject to selection is likely to lie in between. Comparing the results in the light of spatial dependence gives information about the differences between Sambada’s and LFMM’s detections.
Maps of local spatial autocorrelation for ARS-113 (GG) and HM-28 (GG) illustrated a general trend: LFMM discarded SNPs showing significant local spatial autocorrelation for a large proportion of the sampling locations, while Sambada detected them. Thus measuring local autocorrelation of candidate genotypes may help distinguish between the effects of local adaptation and those of population structure among Sambada detections.
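
The comparison of detections by position and by overlap, as reported above, can be expressed as a short Python sketch. The function names and the toy loci are hypothetical; only the 100,000 base pair threshold and the 154-out-of-280 figure come from the text.

```python
from collections import defaultdict

def near_another(loci, max_dist=100_000):
    """Return the loci lying less than max_dist bp from another locus on the same chromosome.

    loci is an iterable of (chromosome, position) pairs.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in loci:
        by_chrom[chrom].append(pos)
    close = set()
    for chrom, positions in by_chrom.items():
        positions.sort()
        # After sorting, a locus's nearest neighbour is always an adjacent entry.
        for left, right in zip(positions, positions[1:]):
            if right - left < max_dist:
                close.add((chrom, left))
                close.add((chrom, right))
    return close

def shared_percentage(detections_a, detections_b):
    """Percentage of detections_b that are also present in detections_a."""
    a, b = set(detections_a), set(detections_b)
    return 100.0 * len(a & b) / len(b)

# Toy example with made-up positions.
toy = [(5, 1_000), (5, 50_000), (5, 500_000), (7, 1_000)]
clustered = near_another(toy)

# The LFMM figure quoted in the text: 154 shared loci out of 280 detections.
lfmm_share = round(100 * 154 / 280)
```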

Regarding common detections, the three SNPs identified by Sambada when population structure was included as a covariate were among the common detections of the correlative approaches. Thus pre-existing knowledge of demography may be used to refine correlation-based detections of selection signatures. One possible approach could consist of computing population structure and then including one variable summarising this structure in the constant model used by Sambada. This way, only genotypes showing a significant correlation with the environment while taking the population structure into account would be detected. Concerning the biological function of the common detections, these three loci are located on chromosome 5, near the gene POLR3B, whose mouse counterpart is involved in limiting infection by intracellular bacteria and DNA viruses (UniProt, www.uniprot.org). Moreover, genotype HM-28 (GG) shows spatial autocorrelation in the North-Western part of Uganda, and this area overlaps with one of the areas where higher loads of tsetse fly (Glossina spp.) occur in the country (Abila et al., 2008; MAAIF et al., 2010). Hence the risk of cattle trypanosomiasis is high in this region and the detected mutations may be involved in parasite resistance.

The increasing availability of large molecular datasets raises challenges regarding their analysis. Correlative approaches in landscape genomics enable fast detection of candidate loci for local adaptation. However, these methods must take into account the effect of population structure (Frichot et al., 2013; Joost et al., 2013; De Mita et al., 2013). Limited dispersal of individuals leads to spatial autocorrelation of marker frequencies, which may cause spurious correlations with the environment. Sambada addresses the first topic by rapidly detecting selection signatures and the second one by measuring the level of spatial autocorrelation for candidate loci.
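
The idea of including a variable summarising population structure in the constant model can be sketched as a comparison of nested logistic models. This is a minimal Python sketch under assumptions, not Sambada's code: the function names, the Newton-Raphson fit, the likelihood-ratio test and the simulated data (a genotype driven by ancestry only, with an environmental variable that merely correlates with ancestry) are all hypothetical.

```python
import math
import numpy as np

def loglik_logistic(columns, y, n_iter=50, tol=1e-8):
    """Maximised log-likelihood of a logistic model with an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        step = np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def g_statistic(null_columns, extra_column, y):
    """G = 2 * (loglik of null + extra column - loglik of null), with its 1-df p-value."""
    g = 2.0 * (loglik_logistic(list(null_columns) + [extra_column], y)
               - loglik_logistic(null_columns, y))
    return g, math.erfc(math.sqrt(max(g, 0.0) / 2.0))

# Simulated data: presence depends on ancestry only; env is correlated with ancestry.
rng = np.random.default_rng(2)
n = 1000
ancestry = rng.random(n)
env = ancestry + rng.normal(scale=0.2, size=n)
y = (rng.random(n) < 1 / (1 + np.exp(-4 * (ancestry - 0.5)))).astype(float)

g_naive, p_naive = g_statistic([], env, y)              # constant model: intercept only
g_adjusted, p_adjusted = g_statistic([ancestry], env, y)  # constant model includes ancestry
print(g_naive, g_adjusted)
```

In this simulation the environmental variable looks strongly associated with the genotype when tested against an intercept-only model, and most of that signal is expected to disappear once ancestry is part of the constant model.
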
The next methodological step involves developing spatially-explicit models that directly include autocorrelation. Guillot et al. (2014) provide such a model; however, the current R-based implementation does not enable whole-genome analysis. Alternatively, Geographically Weighted Regressions (GWR) measure the spatial stationarity of regression coefficients by fitting a distinct model for each sampling location. The number of neighbouring points considered for each sampling location is given by the weighting scheme. These models allow some “local” coefficients to differ between sampling points while some “global” coefficients are common to all points (Fotheringham et al., 2002; Joost et al., 2013). Thus GWR enables building a null model where the constant term may vary in space and then refining it by adding a global environmental effect for all locations. Comparing these two models would enable an assessment of whether the global environmental effect is needed to describe the distribution of the genotype. The key advantage of allowing the constant term to vary in space is to take spatial autocorrelation into account in the models. This way, GWR allows an investigation of the spatial behaviour of loci showing selection signatures with standard logistic regressions and may help to distinguish between local adaptation and population structure in landscape genomics. However, GWR models require fine-tuning of the weighting scheme by the user, which restricts their application to very large datasets.
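
The null model with a spatially varying constant term can be illustrated with a small Python sketch. It covers only the intercept-only case, where the locally weighted maximum-likelihood estimate has a closed form (the logit of the weighted mean presence); it is not a full GWR implementation, and the Gaussian kernel, the `bandwidth` parameter and the function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_weights(coords, bandwidth):
    """Kernel weights between all pairs of locations, decaying with squared distance."""
    d2 = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=2)
    return np.exp(-0.5 * d2 / bandwidth ** 2)

def local_intercepts(coords, y, bandwidth):
    """Locally weighted intercept-only logistic fits, one per sampling location.

    For an intercept-only model the weighted maximum-likelihood estimate of the
    probability is the weighted mean of y, so no iterative fitting is needed.
    """
    w = gaussian_weights(coords, bandwidth)
    p = (w @ y) / w.sum(axis=1)
    p = np.clip(p, 1e-6, 1 - 1e-6)            # keep the logit finite
    return np.log(p / (1 - p))

# Hypothetical transect of 10 locations; the genotype is present in the eastern half only.
coords = np.column_stack([np.arange(10.0), np.zeros(10)])
y = (np.arange(10) >= 5).astype(float)
narrow = local_intercepts(coords, y, bandwidth=1.0)   # constant term varies in space
wide = local_intercepts(coords, y, bandwidth=1e6)     # effectively a single global constant
```

The bandwidth plays the role of the weighting scheme mentioned above: a narrow kernel lets the constant term follow local structure, while a very wide one recovers the ordinary global model.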

Computation time is critical when processing large datasets. In this context, Sambada is able to swiftly analyse high-density SNP-chips and variants from whole-genome sequencing (e.g. the case study presented here is analysed within 69 minutes for univariate models alone and 8.5 hours for both univariate and bivariate models). When considering single-process computations, Sambada is approximately 4.5 times quicker than LFMM and 30 times quicker than BayEnv. Both Sambada and LFMM enable parallelised processing. Sambada’s processing speed, combined with its ability to analyse the spatial autocorrelation in molecular data and to incorporate prior knowledge on population structure, suits a wide range of applications, especially those involving whole genome sequence data.
