GigaScience | |
Think globally and solve locally: secondary memory-based network learning for automated multi-species function prediction | |
Giorgio Valentini1  Matteo Re1  Marco Mesiti1  | |
[1] AnacletoLab - Department of Computer Science, University of Milano, Via Comelico 39/41, 20135 Milano, Italy | |
关键词: Network-based learning; Big data analysis; Biomolecular networks; | |
Others : 861385 DOI : 10.1186/2047-217X-3-5 |
|
received in 2013-12-04, accepted in 2014-04-01, 发布年份 2014 | |
【 摘 要 】
Background
Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raises further algorithmic problems and require resources not satisfiable with simple off-the-shelf computers.
Results
We propose a novel framework for scalable network-based learning of multi-species protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memory-based technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to the analysis of a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondary-memory based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins.
Conclusions
The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines.
【 授权许可】
2014 Mesiti et al.; licensee BioMed Central Ltd.
【 预 览 】
Files | Size | Format | View |
---|---|---|---|
20140725000945269.pdf | 911KB | download | |
34KB | Image | download | |
64KB | Image | download | |
45KB | Image | download | |
19KB | Image | download | |
42KB | Image | download |
【 图 表 】
【 参考文献 】
- [1]Friedberg I: Automated protein function prediction-the genomic challenge. Brief Bioinform 2006, 7:225-242.
- [2]Gillis J, Pavlidis P: Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA). BMC Bioinformatics 2013, 14(Suppl 3):S15. BioMed Central Full Text
- [3]Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Törönen P, Nokso-Koivisto J, Holm L, Cozzetto D, Buchan DWA, Bryson K, Jones DT, Limaye B, et al.: A large-scale evaluation of computational protein function prediction. Nat Methods 2013, 10(3):221-227.
- [4]Wong AK, Park CY, Greene CS, Bongo LA, Guan Y, Troyanskaya OG: IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res 2012, 40(W1):W484—W490.
- [5]Kuzniar A, van Ham RC, Pongor S, Leunissen JA: The quest for orthologs: finding the corresponding gene across genomes. Trends Genet 2008, 24(11):539-551.
- [6]Koonin EV: Orthologs, paralogs, and evolutionary genomics 1. Annu Rev Genet 2005, 39:309-338.
- [7]Hamp T, Kassner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Rost B: Homology-based inference sets the bar high for protein function prediction. BMC Bioinformatics 2013, 14(Suppl 3):S7. BioMed Central Full Text
- [8]Lovasz L: Random walks on graphs: a survey. Combinatorics, Paul Erdos is Eighty 1993, 2:1-46.
- [9]Zhou D, Bousquet O, Lal NT, Weston J, Schölkopf B: Learning with local and global consistency. In Advances in Neural Information Processing Systems 16. Cambridge: MIT Press; 2004:321-328.
- [10]Bengio Y, Delalleau O, Le Roux N: Label propagation and quadratic Criterion. In Semi-Supervised Learning. Edited by Zien A, Schölkopf B, Chapelle O. Cambridge: MIT Press; 2006:193-216.
- [11]Liu W, Wang J, Chang SF: Robust and scalable graph-based Semisupervised learning. Proc IEEE 2012, 100(9):2624-2638.
- [12]Foster J: Designing and Building Parallel Programs. Boston: Addison Wesley; 1995.
- [13]Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C: PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI’12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. Hollywood, CA: USENIX Association Berkeley; 2012:17-30.
- [14]Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM: GraphLab: a new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI). Catalina Island: AUAI Press; 2010.
- [15]Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G: Pregel: a system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010. Indianapolis, Indiana, USA, New York: ACM Press; 2010:135-146.
- [16]Kyrola A, Blelloch G, Guestrin C: GraphChi: large-scale graph computation on just a PC. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. CA, USA: Hollywood, CA, USA, OSDI’12: USENIX Association Berkeley; 2012:31-46.
- [17]Webber J: A programmatic introduction to Neo4j. In Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity. Tucson: ACM; 2012:217-218.
- [18]Han WS, Lee S, Park K, Lee JH, Kim MS, Kim J, Yu H: TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2013:77-85.
- [19]Robinson I, Webber J: Eifrem E: Graph Databases. 2013.
- [20]Karedla R, Love J, Wherry B: Caching strategies to improve disk system performance. Computer 1994, 27:38-46.
- [21]Boldi P, Vigna S: The WebGraph framework I: compression techniques. In In Proc. of the Thirteenth International World Wide Web Conference. New York: ACM Press; 2003:595-601.
- [22]Have C: Jensen L: Are graph databases ready for bioinformatics? Bioinformatics 2013, 29(24):3107.
- [23]Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q: GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol 2008., 9(S4)
- [24]Kohler S, Bauer S, Horn D, Robinson P: Walking the Interactome for prioritization of candiyear disease genes. Am J Human Genet 2008, 82(4):948-958.
- [25]Re M, Mesiti M, Valentini G: A fast ranking algorithm for predicting gene functions in biomolecular networks. IEEE ACM Trans Comput Biol Bioinform 2012, 9(6):1812-1818.
- [26]Malewicz G, Austern M, Bik AJ, Dehnert J, Horn I, Leiser N, Czajkowski G: Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10. Indianapolis, Indiana, USA. New York: ACM Press; 2010:135-146.
- [27]Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin JM, Sherlock G: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25:25.
- [28]Angles R, Gutierrez C: Survey of graph database models. ACM Comput Surv 2008, 40(1):Article 1.
- [29]Friedberg I, Linial M, Mooney S, Radivojac P: Critical assessment of function annotation experiment. 2013. http://biofunctionprediction.org webcite
- [30]Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer ELL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34(suppl 1):D247—D251.
- [31]Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 2001, 313(4):903-919.
- [32]Attwood TK, Bradley P, Flower DR, Gaulton A, Maudling N, Mitchell AL, Moulton G, Nordle A, Paine K, Taylor P, Uddin A, Zygouri C: PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res 2003, 31:400-402.
- [33]Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ: The PROSITE database. Nucleic Acids Res 2006, 34(suppl 1):D227—D230.
- [34]Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, Courcelle E, Das U, Daugherty L, Dibley M, Finn R, Fleischmann W, Gough J, Haft D, Hulo N, Hunter S, Kahn D, Kanapin A, Kejariwal A, Labarga A, Langendijk-Genevaux PS, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, et al.: New developments in the InterPro database. Nucleic Acids Res 2007, 35(suppl 1):D224—D228.
- [35]Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, Kuhn M, Powell S, Von Mering C, Doerks T, Jensen LJ, Bork P: eggNOG v2. 0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Res 2010, 38(suppl 1):D190—D195.
- [36]Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P: SMART 5: domains in the context of genomes and networks. Nucleic Acids Res 2006, 34(suppl 1):D257—D260.
- [37]Re M, Valentini G: Network-based drug ranking and repositioning with respect to DrugBank therapeutic categories. IEEE/ACM Trans Comput Biol Bioinform 2013, 10(6):1359-1371.
- [38]STRING database http://string-db.org webcite
- [39]Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, Von Mering C, Jensen LJ: STRING v9. 1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2013, 41(D1):D808—D815.
- [40]Von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Krüger B, Snel B, Bork P: STRING 7 recent developments in the integration and prediction of protein interactions. Nucleic Acids Res 2007, 35(suppl 1):D358—D362.
- [41]Marcotte E, Pellegrini M, Thompson M, Yeates T, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402:83-86.
- [42]Bumgarner McDermottRJand, Samudrala R: Functional annotation from predicted protein interaction networks. Bioinformatics 2005, 21(15):3217-3226.
- [43]Lippert G, Ghahramani Z, Borgwardt K: Gene function prediction form synthetic leathality networks via ranking on demand. Bioinformatics 2010, 26(7):912-918.
- [44]Re M, Valentini G: Cancer module genes ranking using kernelized score functions. BMC Bioinformatics 2012, 13(S14):S3.
- [45]Frasca M, Bertoni A, Re M, Valentini G: A neural network algorithm for semi-supervised node label learning from unbalanced data. Neural Netw 2013, 43:84-98.
- [46]Barabasi A, Gulbahce N, Loscalzo J: Network medicine: a network-based approach to human disease. Nat Rev Genet 2011, 12:56-68.
- [47]Dudley J, Desphonde T, Butte A: Exploiting Drug-disease relationships for computational drug repositioning. Brief Bioinform 2011, 12(4):303-311.
- [48]Mesiti M, Re M, Valentini G: Supporting materials from ‘Think globally and solve locally: secondary memory-based network learning for automated multi-species function prediction’ GigaScience Database. 2014. http://dx.doi.org/10.5524/100090 webcite