| BMC Bioinformatics | |
| Comparing a few SNP calling algorithms using low-coverage sequencing data | |
| Xiaoqing Yu [2], Shuying Sun [1] | |
| [1] Department of Mathematics, Texas State University, San Marcos, Texas 78666, USA | |
| [2] Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44106, USA | |
| Keywords: GATK; SAMtools; Atlas-SNP2; SOAPsnp; Single-sample; Low-coverage; SNP calling; Next generation sequencing | |
| Others: 1087762 DOI: 10.1186/1471-2105-14-274 | |
| Received 2013-05-10, accepted 2013-09-12, published 2013 | |
【 Abstract 】
Background
Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validations.
Results
To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms, SOAPsnp, Atlas-SNP2, SAMtools, and GATK, in a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs since it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria; thus it reports the fewest SNVs. The numbers of SNVs called by GATK and SAMtools fall between those of SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNV quality in each algorithm and use them as post-output filtering criteria to filter out low-quality SNVs. Under different coverage cutoff values, we compare the four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially in non-dbSNPs; 2) the agreement of the four algorithms is similar when using different coverage cutoffs, except that the non-dbSNPs agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs compared to non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs.
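As an illustration of how agreement among callers under a coverage cutoff might be measured, the following minimal sketch filters each caller's sites by read depth and reports the fraction of all called sites that every caller shares. The intersection-over-union definition of agreement, the function names, and the toy data are assumptions made for this example; they are not taken from the study.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position)


def filter_by_coverage(calls: Dict[Site, int], min_cov: int) -> Set[Site]:
    """Keep the sites whose read depth is at least min_cov."""
    return {site for site, depth in calls.items() if depth >= min_cov}


def agreement(call_sets: Dict[str, Set[Site]]) -> float:
    """Assumed definition: sites reported by every caller / sites reported by any caller."""
    sets = list(call_sets.values())
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Hypothetical toy input: caller name -> {site: read depth at that site}
raw = {
    "SOAPsnp": {("chr1", 100): 3, ("chr1", 200): 8, ("chr1", 300): 2, ("chr1", 400): 5},
    "Atlas-SNP2": {("chr1", 200): 8},
    "SAMtools": {("chr1", 100): 3, ("chr1", 200): 8},
    "GATK": {("chr1", 200): 8, ("chr1", 400): 5},
}

for cutoff in (1, 4):
    filtered = {name: filter_by_coverage(calls, cutoff) for name, calls in raw.items()}
    print(f"coverage >= {cutoff}: agreement = {agreement(filtered):.2f}")
```

In this toy input, raising the cutoff removes low-depth sites that only some callers report, so the shared fraction rises; real data need not behave this way.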
Conclusions
Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
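One way to act on this recommendation is sketched below: each caller's output is first filtered on coverage and on its own quality metric, and a site is kept only if at least two callers still report it. The `Call` record, the per-caller thresholds, and the two-caller vote are hypothetical choices for illustration, not values or procedures from the study.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Set, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position)


class Call(NamedTuple):
    site: Site
    depth: int      # read coverage at the site
    quality: float  # caller-specific quality score; scales differ between tools


def passing_sites(calls: List[Call], min_depth: int, min_quality: float) -> Set[Site]:
    """Sites that meet both the coverage and the quality threshold."""
    return {c.site for c in calls if c.depth >= min_depth and c.quality >= min_quality}


def consensus(per_caller: Dict[str, Set[Site]], min_callers: int = 2) -> Set[Site]:
    """Sites reported by at least min_callers callers after filtering."""
    votes = Counter(site for sites in per_caller.values() for site in sites)
    return {site for site, n in votes.items() if n >= min_callers}


# Hypothetical calls and thresholds (min_depth, min_quality) per caller
calls = {
    "GATK": [Call(("chr1", 200), 8, 50.0), Call(("chr1", 400), 5, 12.0)],
    "SAMtools": [Call(("chr1", 200), 8, 40.0), Call(("chr1", 100), 3, 35.0)],
}
thresholds = {"GATK": (4, 30.0), "SAMtools": (4, 20.0)}

kept = consensus(
    {name: passing_sites(c, *thresholds[name]) for name, c in calls.items()}
)
print(sorted(kept))
```

Thresholds are set per caller because quality scores from different programs are not on a common scale.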
【 License 】
2013 Yu and Sun; licensee BioMed Central Ltd.