BMC Bioinformatics
Machine learning approach for pooled DNA sample calibration
Andrew D Hellicar [2], Ashfaqur Rahman [2], Daniel V Smith [2], John M Henshall [1]
[1] CSIRO Agriculture Flagship, Armidale, Australia
[2] CSIRO Computational Informatics, Castray Esplanade, Hobart, Australia
Keywords: SNP; Machine learning; Calibration; DNA pooling
Others: 1230998 | DOI: 10.1186/s12859-015-0593-1
Received 2014-07-23, accepted 2015-04-23, published 2015
【 Abstract 】
Background
Despite the ongoing reduction in genotyping costs, genomic studies involving large numbers of species with low economic value (such as Black Tiger prawns) remain cost-prohibitive. In this scenario, DNA pooling is an attractive way to reduce genotyping costs. However, genotyping pooled samples that comprise DNA from many individuals is challenging: the measurement errors exceed the allele frequency quantisation size and therefore cannot be corrected simply by clustering techniques. Calibration addresses this problem by applying a correction to the allele frequency that mitigates errors incurred in the measurement process. We highlight the limitations of existing calibration solutions: they impose assumptions on the variation between allele frequencies 0, 0.5, and 1.0, and they address only a limited set of error types. We propose a novel machine learning method to address these limitations.
Results
The approach is tested on SNPs genotyped with the Sequenom iPLEX platform and compared with existing state-of-the-art calibration methods. The new method reduces the mean square error in allele frequency to half of that achievable with existing approaches. Furthermore, for the first time, we demonstrate the importance of carefully choosing the training data when using calibration approaches built from pooled data.
Conclusion
This paper demonstrates that improvements in pooled allele frequency estimates result if the genotyping platform is characterised at allele frequencies other than the homozygous and heterozygous cases. Techniques capable of incorporating such information are described along with aspects of implementation.
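The contrast drawn above can be illustrated with a minimal sketch. It is not the authors' method: the heterozygote-based step uses the widely used k-correction as a stand-in for a calibration anchored at allele frequency 0.5, and a cubic polynomial fit stands in for any regressor trained on pools of known intermediate frequency. All function names, the gain values, and the synthetic distortion are assumptions made for illustration only.

```python
import numpy as np

def estimate_k(a_het, b_het):
    """Mean signal ratio A/B over heterozygous individuals (true frequency 0.5)."""
    a_het = np.asarray(a_het, dtype=float)
    b_het = np.asarray(b_het, dtype=float)
    return float(np.mean(a_het / b_het))

def k_correct(a, b, k):
    """k-correction: estimated frequency of allele A is A / (A + k * B)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a / (a + k * b)

def fit_curve_calibration(raw_freq, true_freq, degree=3):
    """Fit a polynomial mapping raw frequency to known frequency.

    A stand-in for a learned calibration; training pools are assumed to have
    known allele frequencies spread across the whole interval [0, 1].
    """
    coeffs = np.polyfit(raw_freq, true_freq, degree)

    def calibrate(x):
        # Clip so the calibrated value remains a valid frequency.
        return np.clip(np.polyval(coeffs, np.asarray(x, dtype=float)), 0.0, 1.0)

    return calibrate

def mse(estimate, truth):
    """Mean square error between estimated and true allele frequencies."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.mean(diff ** 2))

# Synthetic pools with known allele frequencies (hypothetical data).
true_freq = np.linspace(0.0, 1.0, 11)

# Case 1: the only error is unequal allele amplification (hypothetical gains).
gain_a, gain_b = 1.5, 1.0
a_signal = gain_a * true_freq
b_signal = gain_b * (1.0 - true_freq)
k = estimate_k([gain_a * 0.5], [gain_b * 0.5])
k_corrected = k_correct(a_signal, b_signal, k)

# Case 2: a nonlinear distortion (hypothetical) that a single anchor at 0.5
# cannot describe, but a curve fitted across many known frequencies can.
raw_freq = true_freq ** 1.3
calibrate = fit_curve_calibration(raw_freq, true_freq)
curve_corrected = calibrate(raw_freq)

print("MSE raw:", mse(raw_freq, true_freq))
print("MSE after curve calibration:", mse(curve_corrected, true_freq))
```

In the first case the heterozygote anchor is sufficient because the error has a single parameter. In the second, the correction needs information from intermediate frequencies, which is the situation the conclusion describes. The errors here are evaluated on the training pools themselves, so the printed figures say nothing about performance on unseen data.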
【 License 】
2015 Hellicar et al.