BMC Bioinformatics
Evaluation of variant detection software for pooled next-generation sequence data
Howard W. Huang1, James C. Mullikin1, Nancy F. Hansen1, NISC Comparative Sequencing Program1
[1] National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Keywords: Algorithms; Sequencing; Pooling
DOI: 10.1186/s12859-015-0624-y
Received 2014-11-11; accepted 2015-05-20; published 2015
【 Abstract 】
Background
Despite the tremendous drop in the cost of nucleotide sequencing in recent years, many research projects still sequence pools containing multiple samples as a cost-saving way to detect sequence variants. Various software tools exist to analyze these pooled sequence data, yet little has been reported on the relative accuracy and ease of use of these programs.
Results
In this manuscript we evaluate five variant detection programs (the Genome Analysis Toolkit (GATK), CRISP, LoFreq, VarScan, and SNVer) with regard to their ability to detect variants in synthetically pooled Illumina sequencing data. We created simulated pooled binary alignment/map (BAM) files from single-sample sequencing data, using varying numbers of previously characterized samples at varying depths of coverage per sample. We report the overall runtime and memory usage of each program, as well as each program's sensitivity and specificity in detecting known true variants.
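To make the effect of pool size and per-sample depth concrete, the following is a minimal arithmetic sketch, not the authors' simulation procedure. It assumes diploid samples that contribute equally to the pool and a variant carried as a heterozygote; the function names `expected_alt_fraction` and `expected_alt_reads` are hypothetical.

```python
def expected_alt_fraction(n_samples: int, carriers: int = 1, ploidy: int = 2) -> float:
    """Expected alternate-allele fraction in a pool.

    Assumes (for illustration only) that every sample contributes equally
    and that each carrier holds exactly one copy of the alternate allele.
    """
    if n_samples <= 0 or not 0 <= carriers <= n_samples:
        raise ValueError("need n_samples > 0 and 0 <= carriers <= n_samples")
    return carriers / (ploidy * n_samples)


def expected_alt_reads(n_samples: int, depth_per_sample: float,
                       carriers: int = 1, ploidy: int = 2) -> float:
    """Expected number of reads supporting the alternate allele."""
    total_depth = n_samples * depth_per_sample  # pooled depth at the site
    return total_depth * expected_alt_fraction(n_samples, carriers, ploidy)


if __name__ == "__main__":
    # One heterozygous carrier among 10 samples at 20x per sample.
    print(expected_alt_fraction(10))
    print(expected_alt_reads(10, 20))
```

Under these assumptions, a singleton variant's expected read support depends on per-sample depth rather than on pool size, while its allele fraction shrinks as the pool grows, which is one reason both parameters are worth varying.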
Conclusions
GATK, CRISP, and LoFreq all gave balanced accuracy of 80% or greater for datasets with varying per-sample depth of coverage and numbers of samples per pool. VarScan and SNVer generally had balanced accuracy lower than 80%. CRISP and LoFreq required as little as one quarter of the computational time and one tenth of the physical memory that GATK did and, without filtering, gave results with the highest sensitivity. VarScan and SNVer generally had lower false positive rates, but also significantly lower sensitivity than the other three programs.
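The metrics named above can be sketched as follows. This is an illustrative calculation using the standard definition of balanced accuracy as the mean of sensitivity and specificity; the abstract does not state the exact formula used. The function name `evaluate_calls`, the variant tuples, and the site count are hypothetical.

```python
def evaluate_calls(called: set, truth: set, n_sites: int) -> dict:
    """Compare a caller's variant set against known true variants.

    `n_sites` is the total number of evaluated positions, so every site
    that is neither called nor truly variant counts as a true negative.
    """
    tp = len(called & truth)      # true variants that were called
    fp = len(called - truth)      # calls that are not true variants
    fn = len(truth - called)      # true variants that were missed
    tn = n_sites - tp - fp - fn   # remaining sites, correctly not called
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one variant site and one non-variant site")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


if __name__ == "__main__":
    # Hypothetical variants keyed as (chromosome, position, ref, alt).
    truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")}
    called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
              ("chr2", 50, "G", "A"), ("chr2", 90, "A", "C")}
    print(evaluate_calls(called, truth, n_sites=1000))
```

Because non-variant sites vastly outnumber variant sites, specificity tends to sit close to 1, so differences in balanced accuracy between callers are driven largely by sensitivity.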
【 License 】
2015 Huang et al.
【 Figures 】
Fig. 1.
Fig. 2.
Fig. 3.