Biology Direct
Experiences with workflows for automating data-intensive bioinformatics
Ola Spjuth [9], Erik Bongcam-Rudloff [6], Guillermo Carrasco Hernández [2], Lukas Forer [1], Mario Giovacchini [2], Roman Valls Guimera [2], Aleksi Kallio [8], Eija Korpelainen [8], Maciej M Kańduła [5], Milko Krachunov [7], David P Kreil [5], Ognyan Kulev [7], Paweł P. Łabaj [5], Samuel Lampa [9], Luca Pireddu [3], Sebastian Schönherr [1], Alexey Siretskiy [4], Dimitar Vassilev [10]
[1] Division of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck 6020, Austria
[2] Science for Life Laboratory, Karolinska Institutet, SE-17121, Stockholm P.O. Box 1031, Sweden
[3] CRS4 Polaris, Pula, Italy
[4] Department of Information Technology, Uppsala University, SE-75105, Uppsala P.O. Box 337, Sweden
[5] Chair of Bioinformatics Research Group, Boku University, Vienna, Austria
[6] SLU-Global Bioinformatics Centre, Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden
[7] Faculty of Mathematics and Informatics, Sofia University, Sofia, Bulgaria
[8] CSC - IT Center for Science Ltd., FI-02101, Espoo P.O. Box 405, Finland
[9] Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, SE-75124, Uppsala P.O. Box 591, Sweden
[10] AgroBioInstitute and Joint Genomic Centre, Sofia, Bulgaria
Keywords: Reproducibility; Big data; High-performance computing; Data-intensive; Automation; Workflow
DOI: 10.1186/s13062-015-0071-8
Received 2015-02-26, accepted 2015-08-03, published 2015
【 Abstract 】
High-throughput technologies, such as next-generation sequencing, have turned molecular biology into a data-intensive discipline, requiring bioinformaticians to use high-performance computing resources and to carry out data management and analysis tasks at large scale. Workflow systems can simplify the construction of analysis pipelines that automate tasks, support reproducibility, and provide measures for fault tolerance. However, workflow systems can incur significant development and administration overhead, so bioinformatics pipelines are often still built without them. We present the experiences with workflows and workflow systems within the bioinformatics community participating in a series of hackathons and workshops of the EU COST action SeqAhead. The participating organizations are working on similar problems but have addressed them with different strategies and solutions. This fragmentation of efforts is inefficient and leads to redundant and incompatible solutions. Based on our experiences, we define a set of recommendations for future systems to enable efficient yet simple bioinformatics workflow construction and execution.
Reviewers
This article was reviewed by Dr Andrew Clark.
【 License 】
2015 Spjuth et al.
【 Figures 】
Fig. 1.
Fig. 2.
Fig. 3.