BMC Bioinformatics
BAYSIC: a Bayesian method for combining sets of genome variants with improved specificity and sensitivity
Brandi L Cantarel [4], Daniel Weaver [2], Nathan McNeill [4], Jianhua Zhang [1], Aaron J Mackey [3], Justin Reese [2]
[1] Institute for Applied Cancer Science, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
[2] Genformatic, LLC, Austin, TX 78731, USA
[3] Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA 22908, USA
[4] Baylor Health, Baylor Institute for Immunology Research, Dallas, TX 75204, USA
Keywords: Somatic mutation; Cancer; Latent class analysis; Bayesian; Genome variants; SNP
DOI: 10.1186/1471-2105-15-104
Received 2013-10-10, accepted 2014-03-31, published 2014
【 Abstract 】
Background
Accurate genomic variant detection is an essential step in gleaning medically useful information from genome data. However, low concordance among variant-calling methods reduces confidence in the clinical validity of whole genome and exome sequence data, and confounds downstream analysis for applications in genome medicine.
Here we describe BAYSIC (BAYeSian Integrated Caller), which combines SNP variant calls produced by different methods (e.g. GATK, FreeBayes, Atlas, SAMtools) into a more accurate set of variant calls. BAYSIC differs from majority voting, consensus, or other ad hoc intersection-based schemes for combining sets of genome variant calls. Unlike other classification methods, the underlying BAYSIC model does not require training on a “gold standard” set of true positives. Rather, with each new dataset, BAYSIC performs an unsupervised, fully Bayesian latent class analysis to estimate false positive and false negative error rates for each input method. The user specifies a posterior probability threshold according to their tolerance for false positive and false negative errors: lowering the threshold trades specificity for sensitivity, while raising it trades sensitivity for specificity.
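As a rough illustration of the latent class idea described above, and not BAYSIC's actual implementation, the sketch below fits a two-class model (true variant vs. not a variant) to binary caller votes. It uses expectation-maximization, which gives point estimates rather than the fully Bayesian posterior the abstract describes. The function names (`fit_latent_class`, `simulate_calls`, `select_sites`), the simulated error rates, and the assumption that callers err independently given the true class are all hypothetical choices made for this example. The toy data also includes sites that no caller reported, which a real union of call sets would not contain.

```python
from collections import Counter
import random

EPS = 1e-6  # keeps probabilities away from exactly 0 or 1


def _clamp(p):
    return min(1.0 - EPS, max(EPS, p))


def simulate_calls(n_sites, prior, sens, fpr, seed=0):
    """Simulate 0/1 votes from each caller at n_sites candidate sites."""
    rng = random.Random(seed)
    calls = []
    for _ in range(n_sites):
        is_variant = rng.random() < prior
        rates = sens if is_variant else fpr
        calls.append(tuple(int(rng.random() < r) for r in rates))
    return calls


def fit_latent_class(calls, n_iter=200):
    """EM for a two-class latent class model; no gold standard is used.

    Returns (prior, sens, fpr, post), where post maps each observed call
    pattern to the estimated probability that the site is a true variant.
    """
    counts = Counter(tuple(row) for row in calls)  # sites per call pattern
    n_sites = sum(counts.values())
    n_callers = len(next(iter(counts)))
    # Asymmetric start (sens > fpr) so the "true variant" class keeps its label.
    prior, sens, fpr = 0.5, [0.8] * n_callers, [0.2] * n_callers
    post = {}
    for _ in range(n_iter):
        # E-step: P(true variant | pattern) under the current parameters.
        post = {}
        for pattern in counts:
            p_true, p_false = prior, 1.0 - prior
            for k, called in enumerate(pattern):
                p_true *= sens[k] if called else 1.0 - sens[k]
                p_false *= fpr[k] if called else 1.0 - fpr[k]
            post[pattern] = p_true / (p_true + p_false)
        # M-step: re-estimate class prior and per-caller error rates.
        w_true = max(sum(n * post[p] for p, n in counts.items()), EPS)
        w_false = max(n_sites - w_true, EPS)
        prior = _clamp(w_true / n_sites)
        for k in range(n_callers):
            hit_true = sum(n * post[p] for p, n in counts.items() if p[k])
            hit_false = sum(n * (1.0 - post[p]) for p, n in counts.items() if p[k])
            sens[k] = _clamp(hit_true / w_true)    # 1 - false negative rate
            fpr[k] = _clamp(hit_false / w_false)   # false positive rate
    return prior, sens, fpr, post


def select_sites(calls, post, threshold):
    """Keep sites whose posterior probability meets the user's threshold."""
    return [row for row in calls if post[tuple(row)] >= threshold]


if __name__ == "__main__":
    true_sens = [0.90, 0.80, 0.70, 0.85]  # made-up values for the demo
    true_fpr = [0.05, 0.10, 0.02, 0.08]   # made-up values for the demo
    calls = simulate_calls(5000, 0.3, true_sens, true_fpr)
    prior, sens, fpr, post = fit_latent_class(calls)
    print("estimated sensitivity:", [round(s, 2) for s in sens])
    print("estimated false positive rate:", [round(f, 2) for f in fpr])
    for threshold in (0.5, 0.8, 0.99):
        print(threshold, len(select_sites(calls, post, threshold)))
```

Raising `threshold` in `select_sites` can only shrink the accepted set, which is the specificity-for-sensitivity trade the abstract describes. Unlike a majority vote, a site reported by a single caller can still be accepted if that caller's estimated false positive rate is low enough.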
Results
We assessed the performance of BAYSIC against other variant detection methods using ten low-coverage (~5X) samples from The 1000 Genomes Project, a tumor/normal exome pair (40X), and exome sequences (40X) from positive control samples previously identified to contain clinically relevant SNPs. We demonstrated BAYSIC’s superior variant-calling accuracy for both somatic mutation detection and germline variant detection.
Conclusions
BAYSIC provides a method for combining sets of SNP variant calls produced by different variant-calling programs. The integrated set of SNP variant calls produced by BAYSIC improves on the sensitivity and specificity of the individual call sets used as input. In addition to combining sets of germline variants, BAYSIC can also be used to combine sets of somatic mutations detected in tumor/normal sequencing experiments.
【 License 】
2014 Cantarel et al.; licensee BioMed Central Ltd.