Genome Imputation Given Inheritance;Genome-Wide Association Study;Haplotype;Imputation;Pedigrees;Phasing;Biostatistics;Bioinformatics;Genetics;Computer Science
When performing a Genome-Wide Association Study (GWAS), one attempts to associate a phenotype with some genomic information, commonly a gene or set of genes. Often we want finer resolution and attempt to identify one or more Single Nucleotide Polymorphisms (SNPs) that are associated with the phenotype. Sometimes a GWAS is also used to associate other kinds of genetic data, such as methylation or Copy Number Variations (CNVs), with the phenotype. The phenotype in such studies is often a disease, e.g. Type II Diabetes Mellitus (T2D), Coronary Heart Disease (CHD), or cancer, but it can be another trait as well, for instance height, weight, eye color, or intelligence. In order to perform a GWAS, it is necessary to sequence the Deoxyribonucleic Acid (DNA) of the individuals in the study. This sequencing is much cheaper than it once was, but it is still very expensive for large-scale studies, and large-scale studies are needed to achieve the statistical power required to identify associations reliably. By performing imputation we are able to increase the size of studies in two ways. First, an individual study can sequence more individuals on its budget by sequencing each individual at only certain sites and imputing the remaining sites, which recovers part of the power. Second, large-scale meta-studies can impute full sequences for all the individuals in the smaller studies so that the studies become comparable; this is the approach taken by Fuchsberger et al. [33]. Imputation for genetic data is done in two main ways. The first is population-based imputation, which depends on Linkage Disequilibrium (LD) and on known allele frequencies for a reference population that the study population is believed to resemble. The second is Identity By Descent (IBD)-based imputation, in which we infer genotypes from the familial relationships in pedigree data.
In this thesis, we focus on IBD-based imputation. Imputing on pedigree data can be quite time consuming. For instance, the original implementation of GIGI (Genome Imputation Given Inheritance) by Cheung et al. [15] took around 17 days to impute chromosome 2 (2,402,346 SNPs) of a pedigree with 189 members, using 28 GB of RAM [53]. Being able to complete family-based (IBD-based) imputation in a timely manner and with high accuracy is of great value to researchers around the world, especially now that such data is becoming more available to those without large budgets for sheer computing power. This thesis describes the basis for phasing and imputation, the details of the calculations involved, and an exploration of ways to speed up imputation on large pedigree data.
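
As a toy illustration of how familial relationships can constrain an unobserved genotype, the following Python sketch computes the Mendelian genotype distribution of a child at a single biallelic site from the genotypes of its two parents. It is not the GIGI algorithm and makes strong simplifying assumptions: one site, one parent-child trio, both parents fully observed, and no genotyping error. The allele labels `'A'` and `'B'`, the function names, and the `threshold` parameter are all hypothetical choices made for this example.

```python
from itertools import product


def child_genotype_probs(father, mother):
    """Return {genotype: probability} for a child at one biallelic site.

    Genotypes are 2-character strings over the alleles 'A' and 'B'
    (a hypothetical encoding). Each parent is assumed to transmit
    either of its two alleles with probability 1/2.
    """
    probs = {}
    for paternal, maternal in product(father, mother):
        # Sort the allele pair so that 'AB' and 'BA' count as one genotype.
        genotype = "".join(sorted(paternal + maternal))
        probs[genotype] = probs.get(genotype, 0.0) + 0.25
    return probs


def impute_child(father, mother, threshold=0.8):
    """Return the most probable child genotype, or None if it is too uncertain.

    `threshold` is an illustrative cutoff, not a value taken from the thesis.
    """
    probs = child_genotype_probs(father, mother)
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None


if __name__ == "__main__":
    print(child_genotype_probs("AA", "BB"))  # both parents homozygous
    print(child_genotype_probs("AB", "AB"))  # both parents heterozygous
    print(impute_child("AA", "BB"))
    print(impute_child("AB", "AB"))
```

When both parents are homozygous the child's genotype is fully determined, whereas two heterozygous parents leave it uncertain, so the sketch declines to make a call. Practical pedigree-based imputation must additionally handle multi-generation pedigrees, dependence between neighboring sites, and relatives with missing data, none of which is modeled here.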