BMC Genetics
Inferring haplotypes and parental genotypes in larger full sib-ships and other pedigrees with missing or erroneous genotype data
Carl Nettelblad [1]
[1] Division of Scientific Computing, Department of Information Technology, Uppsala University, Box 337, SE-75105, Uppsala, Sweden
Keywords: Hidden Markov models; Nuclear family data; Genotype inference; Phasing; Haplotyping
DOI: 10.1186/1471-2156-13-85
Received 2012-08-14, accepted 2012-10-03, published 2012
【 Abstract 】

Background

In many contexts, pedigrees for individuals are known even though not all individuals have been fully genotyped. In one extreme case, the genotypes for a set of full siblings are known, with no knowledge of parental genotypes. We propose a method for inferring phased haplotypes and genotypes for all individuals, even those with missing data, in such pedigrees, allowing a multitude of classic and recent methods for linkage and genome analysis to be used more efficiently.

Results

The quality of reconstructed founder genotypes was verified by artificially removing the founder generation genotype data from a well-studied simulated dataset. For the full structure of repeated matings, with 15 offspring per mating and 10 dams per sire, 99.89% of all founder markers were phased correctly given only the unphased offspring genotypes. Accuracy fell only slightly, to 99.51%, when a 2% error rate was introduced in the offspring genotypes. With only 5 full-sib offspring from a single sire-dam mating, the corresponding figure was 92.62%, which compares favorably with the 89.28% obtained with the leading Merlin package. Furthermore, Merlin is unable to handle more than approximately 10 sibs, because the number of states it tracks rises exponentially with family size, whereas our approach has no such limit and handled 150 half-sibs with ease in our experiments.
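
To give a sense of the exponential growth in state count, the sketch below counts states under an assumed Lander-Green-style model with one binary inheritance indicator per meiosis, that is, two per sib in a full sib-ship. The helper name `unreduced_inheritance_states` is hypothetical, and the figure is the unreduced count only. It does not describe Merlin's internal representation, which applies its own reductions.

```python
def unreduced_inheritance_states(n_sibs: int) -> int:
    """Count inheritance vectors for a full sib-ship (hypothetical helper).

    Assumption: one binary indicator per meiosis, and each sib results from
    one paternal and one maternal meiosis, giving 2 * n_sibs bits in total.
    """
    if n_sibs < 0:
        raise ValueError("n_sibs must be non-negative")
    return 2 ** (2 * n_sibs)


# Family sizes taken from the abstract: 5, 10 and 15 offspring.
for n in (5, 10, 15):
    print(f"{n} sibs: {unreduced_inheritance_states(n)} states")
```

Each added sib multiplies the unreduced count by four, which is consistent with a practical ceiling being reached at a modest family size.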

Conclusions

Our method can reconstruct genotypes for parents when genotype data are available only for offspring individuals, as well as haplotypes for all individuals. Compared to the Merlin package, we can handle larger pedigrees and produce superior results, mainly because Merlin uses the Viterbi algorithm on the state space to infer the genotype sequence. Tracking of haplotype and allele origin can be used in any application where the marker set does not directly influence genotype variation influencing traits. Inference of genotypes can also reduce the effects of genotyping errors and missing data. The cnF2freq codebase implementing our approach is available under a BSD-style license.
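
The remark about the Viterbi algorithm can be illustrated with a generic toy hidden Markov model. Viterbi returns the single most probable state path, whereas forward-backward decoding gives a posterior distribution over states at every position, summing over all paths. The sketch below is not cnF2freq or Merlin code. The two-state model, its parameters, and the function names are all assumptions chosen for illustration, and the abstract itself does not state which decoding cnF2freq uses.

```python
def path_prob(init, trans, emit, obs, path):
    """Joint probability of one state path and the observations."""
    p = init[path[0]] * emit[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
    return p


def viterbi(init, trans, emit, obs):
    """Single most probable state path (max-product recursion)."""
    n = len(init)
    score = [init[s] * emit[s][obs[0]] for s in range(n)]
    back = []
    for t in range(1, len(obs)):
        # Best predecessor for each state, using the previous scores.
        prev = [max(range(n), key=lambda r: score[r] * trans[r][s])
                for s in range(n)]
        score = [score[prev[s]] * trans[prev[s]][s] * emit[s][obs[t]]
                 for s in range(n)]
        back.append(prev)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    return path[::-1]


def posterior_marginals(init, trans, emit, obs):
    """P(state at t | all observations) via forward-backward (sum-product)."""
    n, T = len(init), len(obs)
    fwd = [[0.0] * n for _ in range(T)]
    bwd = [[1.0] * n for _ in range(T)]
    for s in range(n):
        fwd[0][s] = init[s] * emit[s][obs[0]]
    for t in range(1, T):
        for s in range(n):
            fwd[t][s] = emit[s][obs[t]] * sum(
                fwd[t - 1][r] * trans[r][s] for r in range(n))
    for t in range(T - 2, -1, -1):
        for s in range(n):
            bwd[t][s] = sum(
                trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r]
                for r in range(n))
    post = []
    for t in range(T):
        weights = [fwd[t][s] * bwd[t][s] for s in range(n)]
        total = sum(weights)
        post.append([w / total for w in weights])
    return post


# Toy parameters (assumed, not taken from the paper).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.3, 0.7]]  # emit[state][observed symbol]
obs = [0, 1, 0, 0, 1]

print("Viterbi path:", viterbi(init, trans, emit, obs))
for t, row in enumerate(posterior_marginals(init, trans, emit, obs)):
    print(f"position {t}: posterior {row}")
```

The posterior rows retain the uncertainty at each position, which a single best path discards. This is one plausible reason why per-marker genotype calls based on marginal probabilities could differ from those read off a Viterbi path.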

【 License 】

2012 Nettelblad; licensee BioMed Central Ltd.
