Journal article

【Abstract】

Background

In the life sciences there is significant demand for pipelines, or workflows, that chain a number of discrete compute- and data-intensive analysis tasks into sophisticated analysis procedures. This need has led to the development of general as well as domain-specific workflow environments, which are either complex desktop applications or Internet-based applications. Complexities can arise when configuring these applications in heterogeneous compute and storage environments if the execution and data access models are not designed appropriately. These complexities manifest themselves as limited access to available HPC resources, significant overhead in configuring tools, and the inability of users to easily manage files across heterogeneous HPC storage infrastructure.

Results

In this paper, we describe the architecture of a software system that is adaptable to a range of pluggable execution and data backends, in an open source implementation called Yabi. With seamless and transparent access to heterogeneous HPC environments at its core, Yabi provides an analysis workflow environment that can create and reuse workflows as well as manage large amounts of both raw and processed data in a secure and flexible way across geographically distributed compute resources. Yabi can be used via a web-based environment in which tools are dragged and dropped to create sophisticated workflows. Yabi can also be accessed through the Yabi command line, which is designed for users who are more comfortable writing scripts and for external workflow environments that want to leverage the features in Yabi. Configuring tools can be a significant overhead in workflow environments. Yabi greatly simplifies this task by enabling system administrators to configure and manage running tools via a web-based environment, without the need to write or edit software programs or scripts. We highlight Yabi's capabilities through a range of bioinformatics use cases that arise from large-scale biomedical data analysis.
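The idea of pluggable execution and data backends can be illustrated with a minimal sketch. This is not Yabi's actual code or API: every class, function, and registry name below (`ExecutionBackend`, `DataBackend`, `run_step`, and the `"local"` and `"file"` keys) is hypothetical, and only a local backend is shown. The sketch assumes that a workflow step stages its input files through a data backend and then runs a command through an execution backend, with both chosen by name.

```python
import shutil
import subprocess
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Sequence, Tuple


class ExecutionBackend(ABC):
    """Hypothetical interface: something that can run a command."""

    @abstractmethod
    def run(self, command: Sequence[str]) -> int:
        """Run the command and return its exit code."""


class DataBackend(ABC):
    """Hypothetical interface: something that can move files around."""

    @abstractmethod
    def copy(self, src: str, dst: str) -> None:
        """Copy a file from src to dst."""


class LocalExecution(ExecutionBackend):
    """Runs the command as a local subprocess."""

    def run(self, command: Sequence[str]) -> int:
        return subprocess.run(list(command), check=False).returncode


class LocalFiles(DataBackend):
    """Copies files on the local filesystem."""

    def copy(self, src: str, dst: str) -> None:
        Path(dst).parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)


# Registries keyed by backend name; a scheduler or remote-storage backend
# would be added here without changing run_step.
EXECUTION_BACKENDS = {"local": LocalExecution}
DATA_BACKENDS = {"file": LocalFiles}


def run_step(
    exec_name: str,
    data_name: str,
    command: Sequence[str],
    stage_in: Sequence[Tuple[str, str]],
) -> int:
    """Stage input files, then run the command on the chosen backend."""
    data = DATA_BACKENDS[data_name]()
    for src, dst in stage_in:
        data.copy(src, dst)
    return EXECUTION_BACKENDS[exec_name]().run(command)
```

Under these assumptions, supporting another compute or storage system means registering one more class; the code that describes the workflow step itself does not change.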

Conclusion

The Yabi system encapsulates a considered design of both execution and data models. It abstracts technical details away from users who are not skilled in HPC and provides an intuitive, scalable, drag-and-drop web-based workflow environment in which the same tools can also be accessed via a command line. Yabi is currently deployed and in use at multiple institutions and is available at http://ccg.murdoch.edu.au/yabi.

【License】

2012 Hunter et al; licensee BioMed Central Ltd.

Source Code for Biology and Medicine
Yabi: An online research environment for grid, high performance and cloud computing

Matthew I Bellgard¹ Crispin A Wellington¹ Tamas O Szabo¹ Andrew B Macgregor¹ Adam A Hunter¹
[1] Centre for Comparative Genomics, Murdoch, Western Australia, 6150
Keywords: high performance computing; Internet; workflows; Bioinformatics
Others: 806341; DOI: 10.1186/1751-0473-7-1

Received 2011-12-12, accepted 2012-02-15, published 2012