BMC Bioinformatics | |
Comparing somatic mutation-callers: beyond Venn diagrams | |
Su Yeon Kim2  Terence P Speed1  | |
[1] Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia | |
[2] Department of Statistics, University of California at Berkeley, 367 Evans Hall, Berkeley, CA 94720 USA | |
关键词: Validation; Methods comparison; Somatic mutation-calling; Next-generation sequencing; Cancer genome; | |
Others : 1087846 DOI : 10.1186/1471-2105-14-189 |
|
received in 2013-02-19, accepted in 2013-05-30, 发布年份 2013 | |
【 摘 要 】
Background
Somatic mutation-calling based on DNA from matched tumor-normal patient samples is one of the key tasks carried by many cancer genome projects. One such large-scale project is The Cancer Genome Atlas (TCGA), which is now routinely compiling catalogs of somatic mutations from hundreds of paired tumor-normal DNA exome-sequence data. Nonetheless, mutation calling is still very challenging. TCGA benchmark studies revealed that even relatively recent mutation callers from major centers showed substantial discrepancies. Evaluation of the mutation callers or understanding the sources of discrepancies is not straightforward, since for most tumor studies, validation data based on independent whole-exome DNA sequencing is not available, only partial validation data for a selected (ascertained) subset of sites.
Results
To provide guidelines to comparing outputs from multiple callers, we have analyzed two sets of mutation-calling data from the TCGA benchmark studies and their partial validation data. Various aspects of the mutation-calling outputs were explored to characterize the discrepancies in detail. To assess the performances of multiple callers, we introduce four approaches utilizing the external sequence data to varying degrees, ranging from having independent DNA-seq pairs, RNA-seq for tumor samples only, the original exome-seq pairs only, or none of those.
Conclusions
Our analyses provide guidelines to visualizing and understanding the discrepancies among the outputs from multiple callers. Furthermore, applying the four evaluation approaches to the whole exome data, we illustrate the challenges and highlight the various circumstances that require extra caution in assessing the performances of multiple callers.
【 授权许可】
2013 Kim and Speed; licensee BioMed Central Ltd.
【 预 览 】
Files | Size | Format | View |
---|---|---|---|
20150117051342109.pdf | 1808KB | download | |
Figure 7. | 59KB | Image | download |
Figure 6. | 44KB | Image | download |
Figure 5. | 55KB | Image | download |
Figure 4. | 94KB | Image | download |
Figure 3. | 44KB | Image | download |
Figure 2. | 92KB | Image | download |
Figure 1. | 99KB | Image | download |
【 图 表 】
Figure 1.
Figure 2.
Figure 3.
Figure 4.
Figure 5.
Figure 6.
Figure 7.
【 参考文献 】
- [1]Wong KM, Hudson TJ, McPherson JD: Unraveling the genetics of cancer: genome sequencing and beyond. Annu Rev Genomics Hum Genet 2011, 12:407-430.
- [2]Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet 2010, 11(10):685-696.
- [3]Stratton MR: Exploring the genomes of cancer cells: progress and promise. Science 2011, 331(6024):1553-1558.
- [4]Ding L, Wendl MC, Koboldt DC, Mardis ER: Analysis of next-generation genomic data in cancer: accomplishments and challenges. Hum Mol Genet 2010, 19(R2):R188-196.
- [5]Le Gallo M, O’Hara AJ, Rudd ML, Urick ME, Hansen NF, O’Neil NJ, Price JC, Zhang S, England BM, Godwin AK, Sgroi DC, Hieter P, Mullikin JC, Merino MJ, Bell DW, NIH Intramural Sequencing Center (NISC) Comparative Sequencing Program: Exome sequencing of serous endometrial tumors identifies recurrent somatic mutations in chromatin-remodeling and ubiquitin ligase complex genes. Nat Genet 2012, 44(12):1310-1315.
- [6]Zang ZJ, Cutcutache I, Poon SL, Zhang SL, McPherson JR, Tao J, Rajasegaran V, Heng HL, Deng N, Gan A, Lim KH, Ong CK, Huang D, Chin SY, Tan IB, Ng CCY, Yu W, Wu Y, Lee M, Wu J, Poh D, Wan WK, Rha SY, So J, Salto-Tellez M, Yeoh KG, Wong WK, Zhu YJ, Futreal PA, Pang B, et al.: Exome sequencing of gastric adenocarcinoma identifies recurrent somatic mutations in cell adhesion and chromatin remodeling genes. Nat Genet 2012, 44(5):570-574.
- [7]Puente XS, Pinyol M, Quesada V, Conde L, Ordóñez GR, Villamor N, Escaramis G, Jares P, Beà S, González-Díaz M, Bassaganyas L, Baumann T, Juan M, López-Guerra M, Colomer D, Tubío JMC, López C, Navarro A, Tornador C, Aymerich M, Rozman M, Hernández JM, Puente DA, Freije JMP, Velasco G, Gutiérrez-Fernández A, Costa D, Carrió A, Guijarro S, Enjuanes A, et al.: Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature 2011, 475(7354):101-105.
- [8]Varela I, Tarpey P, Raine K, Huang D, Ong CK, Stephens P, Davies H, Jones D, Lin ML, Teague J, Bignell G, Butler A, Cho J, Dalgliesh GL, Galappaththige D, Greenman C, Hardy C, Jia M, Latimer C, Lau KW, Marshall J, McLaren S, Menzies A, Mudie L, Stebbings L, Largaespada DA, Wessels LFA, Richard S, Kahnoski RJ, Anema J, et al.: Exome sequencing identifies frequent mutation of the SWI/SNF complex gene PBRM1 in renal carcinoma. Nature 2011, 469(7331):539-542.
- [9]Biankin AV, Waddell N, Kassahn KS, Gingras MC, Muthuswamy LB, Johns AL, Miller DK, Wilson PJ, Patch AM, Wu J, Chang DK, Cowley MJ, Gardiner BB, Song S, Harliwong I, Idrisoglu S, Nourse C, Nourbakhsh E, Manning S, Wani S, Gongora M, Pajic M, Scarlett CJ, Gill AJ, Pinho AV, Rooman I, Anderson M, Holmes O, Leonard C, Taylor D, et al.: Pancreatic cancer genomes reveal aberrations in axon guidance pathway genes. Nature 2012, 491(7424):399-405.
- [10]Ding L, Ley TJ, Larson DE, Miller CA, Koboldt DC, Welch JS, Ritchey JK, Young MA, Lamprecht T, McLellan MD, McMichael JF, Wallis JW, Lu C, Shen D, Harris CC, Dooling DJ, Fulton RS, Fulton LL, Chen K, Schmidt H, Kalicki-Veizer J, Magrini VJ, Cook L, McGrath SD, Vickery TL, Wendl MC, Heath S, Watson MA, Link DC, Tomasson MH, et al.: Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 2012, 481(7382):506-510.
- [11]Shah SP, Roth A, Goya R, Oloumi A, Ha G, Zhao Y, Turashvili G, Ding J, Tse K, Haffari G, Bashashati A, Prentice LM, Khattra J, Burleigh A, Yap D, Bernard V, McPherson A, Shumansky K, Crisan A, Giuliany R, Heravi-Moussavi A, Rosner J, Lai D, Birol I, Varhol R, Tam A, Dhalla N, Zeng T, Ma K, Chan SK, et al.: The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 2012, 486(7403):395-399.
- [12]Chapman MA, Lawrence MS, Keats JJ, Cibulskis K, Sougnez C, Schinzel AC, Harview CL, Brunet JP, Ahmann GJ, Adli M, Anderson KC, Ardlie KG, Auclair D, Baker A, Bergsagel PL, Bernstein BE, Drier Y, Fonseca R, Gabriel SB, Hofmeister CC, Jagannath S, Jakubowiak AJ, Krishnan A, Levy J, Liefeld T, Lonial S, Mahan S, Mfuko B, Monti S, Perkins LM, et al.: Initial genome sequencing and analysis of multiple myeloma. Nature 2011, 471(7339):467-472.
- [13]Stransky N, Egloff AM, Tward AD, Kostic AD, Cibulskis K, Sivachenko A, Kryukov GV, Lawrence M, Sougnez C, McKenna A, Shefler E, Ramos AH, Stojanov P, Carter SL, Voet D, Cortés ML, Auclair D, Berger MF, Saksena G, Guiducci C, Onofrio R, Parkin M, Romkes M, Weissfeld JL, Seethala RR, Wang L, Rangel-Escareño C, Fernandez-Lopez JC, Hidalgo-Miranda A, Melendez-Zajgla J, et al.: The mutational landscape of head and neck squamous cell carcinoma. Science 2011, 333(6046):1157-1160.
- [14]Lee W, Jiang Z, Liu J, Haverty PM, Guan Y, Stinson J, Yue P, Zhang Y, Pant KP, Bhatt D, Ha C, Johnson S, Kennemer MI, Mohan S, Nazarenko I, Watanabe C, Sparks AB, Shames DS, Gentleman R, de Sauvage FJ, Stern H, Pandita A, Ballinger DG, Drmanac R, Modrusan Z, Seshagiri S, Zhang Z: The mutation spectrum revealed by paired genome sequences from a lung cancer patient. Nature 2010, 465(7297):473-477.
- [15]Forster M, Forster P, Elsharawy A, Hemmrich G, Kreck B, Wittig M, Thomsen I, Stade B, Barann M, Ellinghaus D, Petersen BS, May S, Melum E, Schilhabel MB, Keller A, Schreiber S, Rosenstiel P, Franke A: From next-generation sequencing alignments to accurate comparison and validation of single-nucleotide variants: the pibase software. Nucleic Acids Res 2013, 41:e16.
- [16]Nielsen R, Paul JS, Albrechtsen A, Song YS: Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 2011, 12(6):443-451.
- [17]Meacham F, Boffelli D, Dhahbi J, Martin DI, Singer M, Pachter L: Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics 2011, 12:451. BioMed Central Full Text
- [18]Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, Altaf-Ul-Amin M, Ogasawara N, Kanaya S: Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res 2011, 39(13):e90.
- [19]Depristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, Del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011, 43(5):491-498.
- [20]Qu Y, Tan M, Kutner MH: Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics 1996, 52(3):797-810.
- [21]Menten J, Boelaert M, Lesaffre E: Bayesian latent class models with conditionally dependent diagnostic tests: a case study. Stat Med 2008, 27(22):4469-4488.
- [22]Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAMtools. Bioinformatics 2009, 25(16):2078-2079.
- [23]Danecek P, Auton A, Abecasis G, Albers CA, Banks E, Depristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group: The variant call format and VCFtools. Bioinformatics 2011, 27(15):2156-2158.
- [24]Ledergerber C, Dessimoz C: Base-calling for next-generation sequencing platforms. Brief Bioinform 2011, 12(5):489-497.
- [25]Hui SL, Zhou XH: Evaluation of diagnostic tests without gold standards. Stat Methods Med Res 1998, 7(4):354-370.
- [26]Garrett ES, Eaton WW, Zeger S: Methods for evaluating the performance of diagnostic tests in the absence of a gold standard: a latent class model approach. Stat Med 2002, 21(9):1289-1307.
- [27]Pepe MS, Janes H: Insights into latent class analysis of diagnostic test performance. Biostatistics 2007, 8(2):474-484.
- [28]Xu H, Craig BA: A probit latent class model with general correlation structures for evaluating accuracy of diagnostic tests. Biometrics 2009, 65(4):1145-1155.
- [29]Roth A, Ding J, Morin R, Crisan A, Ha G, Giuliany R, Bashashati A, Hirst M, Turashvili G, Oloumi A, Marra MA, Aparicio S, Shah SP: JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics 2012, 28(7):907-913.
- [30]Larson DE, Harris CC, Chen K, Koboldt DC, Abbott TE, Dooling DJ, Ley TJ, Mardis ER, Wilson RK, Ding L: SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 2012, 28(3):311-317.
- [31]Saunders CT, Wong WSW, Swamy S, Becq J, Murray LJ, Cheetham RK: Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 2012, 28(14):1811-1817.
- [32]Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, Getz G: Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013, 31(3):213-219.
- [33]Löwer M, Renard BY, de Graaf J Wagner, Paret C, Kneip C, Türeci O, Diken M, Britten C, Kreiter S, Koslowski M, Castle JC, Sahin U: Confidence-based somatic mutation evaluation and prioritization. PLoS Comput Biol 2012, 8(9):e1002714.