BMC Medical Genomics
Scalable and cost-effective NGS genotyping in the cloud
Dennis P. Wall5  Peter J. Tonellato1  Hassan Ghazal6  Saaïd Amzazi3  Ryan Powles7  Jared B. Hawkins7  Ettore Rizzo4  Jae-Yoon Jung7  Alex K. Lancaster2  Yassine Souilmi3
[1] Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston 02215, MA, USA; Department of Pathology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston 02215, MA, USA; Department of Biology, Mohamed Vth University, 4 Ibn Battouta Avenue, Rabat, Morocco; Department of Electrical, Computer and Biomedical Engineering, University of Pavia, via Ferrata 1, Pavia 27100, Italy; Department of Pediatrics and Psychiatry (by courtesy), Division of Systems Medicine & Program in Biomedical Informatics, Stanford University, Stanford 94305, CA, USA; Department of Biology, Mohamed First University, Oujda, Nador, Morocco; Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston 02115, MA, USA
Keywords: Parallel computing; Bioinformatics; Software; Medical genomics; Cloud computing; Clinical sequencing; Next-generation sequencing
DOI: 10.1186/s12920-015-0134-9
Received 2015-04-16, accepted 2015-09-11, published 2015
【 Abstract 】
Background
While next-generation sequencing (NGS) costs have plummeted in recent years, the cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until whole genome sequencing data can be robustly, routinely, and accurately rendered into medically actionable reports within hours and at a cost in the tens of dollars.
Results
We take a step towards addressing this challenge by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows, making optimal use of high-performance compute clusters. Here we show that the Amazon Web Services (AWS) implementation of GenomeKey via COSMOS provides fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets.
Conclusions
Our systematic benchmarking reveals important new insights and considerations for achieving clinical turnaround of whole genome analysis through optimization and workflow management, including strategic batching of individual genomes and efficient cluster resource configuration.
【 License 】
2015 Souilmi et al.