While technological innovation has dramatically increased the amount and variety of genomic data available to geneticists, no assay is perfect, and both human error and technical artifacts can lead to erroneous data. A proper analysis pipeline must detect errors and, where possible, correct them. One common source of errors in genetic data is sample-to-sample contamination. This dissertation presents methods to address contamination in the most common types of genetic studies.

Chapter 2 focuses on methods for detecting and quantifying contamination in both array-based and next-generation sequencing (NGS) genotype data. For array-based data, we use the observed intensities from the genotyping instruments to quantify contamination with two distinct methods: 1) a regression-based model using intensities and population allele frequencies and 2) a multivariate normal mixture model based on the clustering of intensities. For NGS data, we model the reads using a mixture model to determine the proportion of reads from the true sample and from the contaminating sample.

Chapter 3 outlines a method to make accurate genotype calls from contaminated NGS data. Given an estimated level of contamination, we propose a likelihood that can be maximized to call genotypes and estimate allele frequencies for samples with no previous genotype data. We evaluate the method using data from two common sequencing strategies: 1) low-pass (2-4x depth) genome-wide sequencing and 2) high-depth (50-100x depth) exome sequencing.

Chapter 4 looks at contamination in the context of RNA sequencing (RNA-Seq) data. While the technology used to generate RNA-Seq data is similar to that of exome sequencing, differences in expression between the contaminating and true samples make it more difficult to estimate the contamination proportion accurately. We propose methods to improve the quality of these estimates.
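To make the idea of a read-level mixture model for NGS data concrete, the following is a minimal sketch and not the dissertation's actual model. It assumes biallelic sites with known population allele frequencies, Hardy-Weinberg genotype priors for both the true and the contaminating sample, a fixed per-base error rate, and a simple grid search over the contamination proportion. All function names and parameter values are hypothetical and chosen for illustration only.

```python
import math
import random


def genotype_priors(f):
    """Hardy-Weinberg priors for 0, 1, or 2 copies of the alternate allele."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]


def site_log_likelihood(alt, depth, f, alpha, err=0.01):
    """Log-likelihood of `alt` alternate reads out of `depth` at one site.

    Assumption: a fraction `alpha` of reads comes from a contaminating
    sample; both genotypes are unknown and summed out under HWE priors.
    The binomial coefficient is dropped because it does not depend on alpha.
    """
    priors = genotype_priors(f)
    total = 0.0
    for g1 in range(3):          # genotype of the true sample
        for g2 in range(3):      # genotype of the contaminating sample
            p = (1 - alpha) * g1 / 2 + alpha * g2 / 2
            p_obs = p * (1 - err) + (1 - p) * err  # allow for base errors
            total += (priors[g1] * priors[g2]
                      * p_obs ** alt * (1 - p_obs) ** (depth - alt))
    return math.log(total)


def estimate_contamination(sites, grid_step=0.01, max_alpha=0.5, err=0.01):
    """Grid-search maximum-likelihood estimate of the contamination proportion.

    `sites` is a list of (alt_reads, depth, allele_frequency) tuples.
    """
    n_steps = int(round(max_alpha / grid_step))
    best_alpha, best_ll = 0.0, -math.inf
    for i in range(n_steps + 1):
        alpha = i * grid_step
        ll = sum(site_log_likelihood(a, d, f, alpha, err) for a, d, f in sites)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha


def simulate_sites(n_sites, depth, alpha, err=0.01, seed=0):
    """Simulate read counts under the same assumed model (illustration only)."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        f = rng.uniform(0.1, 0.9)
        g1 = sum(rng.random() < f for _ in range(2))
        g2 = sum(rng.random() < f for _ in range(2))
        p = (1 - alpha) * g1 / 2 + alpha * g2 / 2
        p_obs = p * (1 - err) + (1 - p) * err
        alt = sum(rng.random() < p_obs for _ in range(depth))
        sites.append((alt, depth, f))
    return sites


if __name__ == "__main__":
    simulated = simulate_sites(n_sites=500, depth=20, alpha=0.10, seed=1)
    print("estimated contamination:", estimate_contamination(simulated))
```

A grid search is used here only because it is easy to inspect; a numerical optimizer or an EM algorithm would be natural alternatives under the same assumptions.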
Detecting and Correcting Contamination in Genetic Data.