Genome-wide association (GWA) studies, in which dense genotypes in a large sample of individuals are tested for disease associations, represent a powerful approach for uncovering disease-susceptibility genes. Genotype imputation is a statistical procedure that enables evaluation of disease associations at markers beyond those experimentally measured, by using chromosomal stretches shared between study and reference individuals to infer unmeasured genotypes in GWA samples. Crucial to the success of imputation procedures is the representation of GWA samples in reference datasets that contain “template” sequences from which the unmeasured genotypes are inferred. In this dissertation, I study the design of reference datasets for use in genetic studies in diverse human populations. First, I devise a mixture approach for selecting panels of reference data. Using genotype data from 29 worldwide populations, I show that nearly all populations benefit from the mixture approach, in that it reduces imputation error. Next, focusing on African populations, whose genotypes are particularly difficult to impute, I investigate haplotype variation and imputation in Africa. Using various statistics on haplotype variation to explain variation in imputation accuracy, I find that simple statistics such as Fst, which measure genetic distance between study and reference populations, are useful metrics for guiding the selection of reference panels. I then quantify the increase in the minimal sample size, due to imperfect imputation, that would be required to provide the same level of statistical evidence of disease predisposition for genetic variants that are imputed rather than experimentally measured. Finally, I develop a coalescent model for evaluating imputation accuracy.
Under this model, use of reference sequences selected based on observed genetic similarity to a study sequence targeted for imputation produces higher imputation accuracy than use of reference sequences selected based on population of origin. This result suggests a reference-selection strategy that chooses template sequences from multiple populations, including the target population itself. Together, results from this dissertation can inform study design for future GWA studies. In particular, they can facilitate the design of reference datasets for use in imputation-based studies, thereby improving the search for genetic determinants that affect human health in populations worldwide.
Genotype Imputation in Diverse Populations: Empirical and Theoretical Approaches.