Journal article details
GigaScience
A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data
Ola Spjuth [1], Mikhail Voznesenskiy [2], Tore Sundqvist [3], Alexey Siretskiy [3]
[1] Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, P.O. Box 541, Uppsala SE-75124, Sweden
[2] Department of Physical Chemistry, Institute of Chemistry, St-Petersburg State University, Saint-Petersburg, Russia
[3] Department of Information Technology, Uppsala University, P.O. Box 337, Uppsala SE-75105, Sweden
Keywords: Bioinformatics; DNA-seq; High-performance computing; Hadoop; Massively parallel sequencing; Next generation sequencing
Others  :  1211833
DOI  :  10.1186/s13742-015-0058-5
Received 2014-09-04, accepted 2015-04-09, published 2015
【 Abstract 】

Background

New high-throughput technologies, such as massively parallel sequencing, have transformed the life sciences into a data-intensive field. The most common e-infrastructure for analyzing this data consists of batch systems that are based on high-performance computing resources; however, the bioinformatics software that is built on this platform does not scale well in the general case. Recently, the Hadoop platform has emerged as an interesting option to address the challenges of increasingly large datasets with distributed storage, distributed processing, built-in data locality, fault tolerance, and an appealing programming methodology.

Results

In this work we introduce metrics and report on a quantitative comparison between Hadoop and a single node of conventional high-performance computing resources for the tasks of short read mapping and variant calling. We calculate efficiency as a function of data size and observe that the Hadoop platform is more efficient for biologically relevant data sizes in terms of computing hours for both split and un-split data files. We also quantify the advantages of the data locality provided by Hadoop for NGS problems, and show that a classical architecture with network-attached storage will not scale when computing resources increase in numbers. Measurements were performed using ten datasets of different sizes, up to 100 gigabases, using the pipeline implemented in Crossbow. To make a fair comparison, we implemented an improved preprocessor for Hadoop with better performance for splittable data files. For improved usability, we implemented a graphical user interface for Crossbow in a private cloud environment using the CloudGene platform. All of the code and data in this study are freely available as open source in public repositories.

Conclusions

From our experiments we conclude that the improved Hadoop pipeline scales better than the same pipeline on high-performance computing resources. We also conclude that Hadoop is an economically viable option for the data sizes currently common in massively parallel sequencing. Given that datasets are expected to grow over time, we envision that Hadoop will play an increasingly important role in future biological data analysis.

【 License 】

2015 Siretskiy et al.; licensee BioMed Central.

【 Figures 】

Figures 1 to 6 (images not reproduced).
