Next-generation sequencing (NGS) is a technology that advances our knowledge of human medical genetics by producing unprecedented amounts of data. This vast amount of data presents challenges to existing statistical methods. In this dissertation, I present three studies that demonstrate methods for efficiently analyzing NGS data using both simulated and real data. In the first study, I develop an ancestry inference method that uses small amounts of sequence data. Compared with microarray experiments, sequencing produces uneven coverage and genotypes with higher error rates than those traditionally used for principal components analysis (PCA) of genetic ancestry. I overcome some of these challenges with a novel statistical method that models sequence data directly, without relying on intermediate genotype calls. My method achieves high accuracy in simulated data based on the Human Genome Diversity Panel as well as in a targeted sequencing study of age-related macular degeneration. In our age-related macular degeneration study, our approach helps discover a high-risk rare variant in the Complement 3 gene. In the second study, I develop a model-based ancestry inference method that improves upon the work described in the first study. It is based on a likelihood model of ancestral location and takes sequencing data as input. It increases computational efficiency without losing accuracy: for each sample, a parallelizable optimization algorithm can infer ancestry using a fraction of the computational resources required by PCA-based methods. Evaluation using the Human Genome Diversity Panel and the age-related macular degeneration data set demonstrates its accuracy and efficiency.
In the final study, I develop an improved genotype calling method for low-coverage sequencing data. As high-quality reference panels grow, it is helpful to incorporate them into genotype calling of new samples. Using a coalescent-based simulation and real data from the 1000 Genomes Project, I evaluate the utility of my method (which uses a panel of previously sequenced samples) to improve analyses of samples sequenced at various depths. The improvement in accuracy and computation time is measured as a function of reference panel size. This work will be useful to investigators undertaking sequencing and analysis of new human samples.
Statistical Methods and Analysis in Next Generation Sequencing.