BMC Bioinformatics | |
On finding bicliques in bipartite graphs: a novel algorithm and its application to the integration of diverse biological data types | |
Yun Zhang4  Charles A Phillips2  Gary L Rogers2  Erich J Baker3  Elissa J Chesler1  Michael A Langston2  | |
[1] The Jackson Laboratory, Bar Harbor, ME 04609, USA | |
[2] Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996, USA | |
[3] Department of Computer Science, Baylor University, Waco, TX 76798, USA | |
[4] Dupont Pioneer, Johnston, IA 50131, USA | |
关键词: Graph algorithms; Bipartite graphs; Bioinformatics databases; Biclustering; Biclique; | |
Others : 818666 DOI : 10.1186/1471-2105-15-110 |
|
received in 2013-07-27, accepted in 2014-03-29, 发布年份 2014 | |
【 摘 要 】
Background
Integrating and analyzing heterogeneous genome-scale data is a huge algorithmic challenge for modern systems biology. Bipartite graphs can be useful for representing relationships across pairs of disparate data types, with the interpretation of these relationships accomplished through an enumeration of maximal bicliques. Most previously-known techniques are generally ill-suited to this foundational task, because they are relatively inefficient and without effective scaling. In this paper, a powerful new algorithm is described that produces all maximal bicliques in a bipartite graph. Unlike most previous approaches, the new method neither places undue restrictions on its input nor inflates the problem size. Efficiency is achieved through an innovative exploitation of bipartite graph structure, and through computational reductions that rapidly eliminate non-maximal candidates from the search space. An iterative selection of vertices for consideration based on non-decreasing common neighborhood sizes boosts efficiency and leads to more balanced recursion trees.
Results
The new technique is implemented and compared to previously published approaches from graph theory and data mining. Formal time and space bounds are derived. Experiments are performed on both random graphs and graphs constructed from functional genomics data. It is shown that the new method substantially outperforms the best previous alternatives.
Conclusions
The new method is streamlined, efficient, and particularly well-suited to the study of huge and diverse biological data. A robust implementation has been incorporated into GeneWeaver, an online tool for integrating and analyzing functional genomics experiments, available athttp://geneweaver.org webcite. The enormous increase in scalability it provides empowers users to study complex and previously unassailable gene-set associations between genes and their biological functions in a hierarchical fashion and on a genome-wide scale. This practical computational resource is adaptable to almost any applications environment in which bipartite graphs can be used to model relationships between pairs of heterogeneous entities.
【 授权许可】
2014 Zhang et al.; licensee BioMed Central Ltd.
【 预 览 】
Files | Size | Format | View |
---|---|---|---|
20140711131645175.pdf | 3083KB | download | |
Figure 12. | 86KB | Image | download |
Figure 11. | 99KB | Image | download |
Figure 10. | 119KB | Image | download |
Figure 1. | 32KB | Image | download |
Figure 8. | 35KB | Image | download |
Figure 7. | 50KB | Image | download |
Figure 6. | 25KB | Image | download |
Figure 5. | 55KB | Image | download |
Figure 4. | 23KB | Image | download |
Figure 3. | 27KB | Image | download |
Figure 2. | 22KB | Image | download |
Figure 2. | 22KB | Image | download |
Figure 2. | 22KB | Image | download |
Figure 1. | 21KB | Image | download |
【 图 表 】
Figure 1.
Figure 2.
Figure 2.
Figure 2.
Figure 3.
Figure 4.
Figure 5.
Figure 6.
Figure 7.
Figure 8.
Figure 1.
Figure 10.
Figure 11.
Figure 12.
【 参考文献 】
- [1]Malgrange Y: Recherche des sous-matrices premières d’une matrice à coefficients binaires. Applications à certains problèmes de graphe. Proceedings of the Deuxième Congrès de l’AFCALTIParis: Gauthier-Villars; 1962
- [2]Berry A, Bordat JP, Sigayret A: A local approach to concept generation. Ann Math Artif Intell 2007, 49(1–4):117-136.
- [3]Kuznetsov SO, Obiedkov S: Comparing performance of algorithms for generating concept lattices. J Exp Theor Artif Intell 2002, 14:189-216.
- [4]Kaytoue-Uberall M, Duplessis S, Napoli A: Using formal concept analysis for the extraction of groups of co-expressed genes. In Modelling, Computation and Optimization in Information Systems and Management Sciences, Volume 14 of Communications in Computer and Information Science. Edited by Le Thi H, Bouvry P, Pham Dinh T. Springer Berlin Heidelberg; 2008:439-449.
- [5]Kaytoue M, Kuznetsovb SO, Napoli A, Duplessis S: Mining gene expression data with pattern structures in formal concept analysis. Inform Sci 2011, 181:1989-2001.
- [6]Cheng Y, Church GM: Biclustering of expression data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. La Jolla: AAAI Press; 2000:93-103.
- [7]Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002, 18:S136-S144.
- [8]Wang H, Wang W, Yang J, Yu PS: Clustering by pattern similarity in large data sets. In SIGMOD ‘02: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. Madison: ACM Press; 2002:394-405.
- [9]Sanderson MJ, Driskell AC, Ree RH, Eulenstein O, Langley S: Obtaining maximal concatenated phylogenetic data sets from large sequence databases. Mol Biol Evol 2003, 20(7):1036-1042.
- [10]Chesler EJ, Langston MA: Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic data. Report 575, University of Tennessee 2006.
- [11]Baker EJ, Jay J, Philip V, Zhang Y, Li Z, Kirova R, Langston MA, Chesler EJ: Ontological discovery environment: A system for integrating gene-phenotype associations. Genomics 2009, 94(6):377-387.
- [12]Kirova R, Langston MA, Peng X, Perkins AD, Chesler EJ: A systems genetic analysis of chronic fatigue syndrome: combinatorial data integration from SNPs to differential diagnosis of disease. Proceedings, International Conference for the Critical Assessment of Microarray Data Analysis (CAMDA06)Durham, North Carolina; June 2006
- [13]Mushlin RA, Kershenbaum A, Gallagher ST, Rebbeck TR: A graph-theoretical approach for pattern discovery in epidemiological research. IBM Syst J 2007, 46:135-149.
- [14]Liu J, Wang W: OP-Cluster: clustering by tendency in high dimensional space. In ICDM ‘03: Proceedings of the Third IEEE International Conference on Data Mining. Washington, DC: IEEE Computer Society; 2003:187-187.
- [15]Garey MR, Johnson DS: Computers and Intractability. New York: W. H. Freeman; 1979.
- [16]Peeters R: The maximum edge biclique problem is NP-complete. Discrete Appl Math 2003, 131(3):651-654.
- [17]Eppstein D: Arboricity and bipartite subgraph listing algorithms. Inf Process Lett 1994, 51(4):207-211.
- [18]Makino K, Uno T: New algorithms for enumerating all maximal cliques. In Proceedings, 9th Scandinavian Workshop on Algorithm Theory. Humlebaek: Springer; 2004:260-272.
- [19]Zaki MJ, Ogihara M: Theoretical foundations of association rules. In Proceedings, 3rd SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. Seattle, Washington: ACM; 1998.
- [20]Li J, Li H, Soh D, Wong L: A correspondence between maximal complete bipartite subgraphs and closed patterns. In PKDD. Berlin Heidelberg: Springer-Verlag; 2005:146-156.
- [21]Zaki MJ, Hsiao C: Charm: An efficient algorithm for closed itemset mining. In Proceedings, 2nd SIAM International Conference on Data Mining. Arlington, Virginia; 2002:398-416.
- [22]Wang J, Pei J, Han J: Closet+: Searching for the best strategies for mining frequent closed itemsets. In Proceedings, 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, DC; 2003:236-245.
- [23]Grahne G, Zhu J: Efficiently using prefix-trees in mining frequent itemsets. In Proceedings, FIMI’03: Workshop on Frequent Itemset Mining Implementations. Melbourne, Florida: CEUR-WS.org; 2003.
- [24]Zhu J, Grahne G: Reducing the main memory consumptions of FPmax* and FPclose. In Proceedings, FIMI’04: Workshop on Frequent Itemset Mining Implementations. Brighton, UK; November 2004
- [25]Uno T, Kiyomi M, Arimura H: LCM ver.2: Efficient mining algorithms for frequent/closed/maximal itemsets. In Proceedings, FIMI’04: Workshop on Frequent Itemset Mining Implementations. Brighton, UK: CEUR-WS.org; 2004.
- [26]Li J, Liu G, Li H, Wong L: Maximal Biclique subgraphs and closed pattern pairs of the adjacency matrix: a one-to-one correspondence and mining algorithms. IEEE Trans Knowl Data Eng 2007, 19(12):1625-1637.
- [27]Alexe G, Alexe S, Crama Y, Foldes S, Hammer PL, Simeone B: Consensus algorithms for the generation of all maximal bicliques. Discrete Appl Math 2004, 145:11-21.
- [28]Liu G, Sim K, Li J: Efficient mining of large maximal Bicliques. In The 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2006). Krakow, Poland; 2006:437-448.
- [29]Bron C, Kerbosch J: Algorithm 457: finding all cliques of an undirected graph. Commun ACM 1973, 16(9):575-577.
- [30]Tomita E, Tanaka A, Takahashi H: The worst-case time complexity for generating all maximal cliques and computational experiments. Theor Comput Sci 2006, 363:28-42.
- [31]Johnson DS, Papadimitriou CH: On generating all maximal independent sets. Inform Process Lett 1988, 27(3):119-123.
- [32]Chesler E, Wang J, Lu L, Qu Y, Manly K, Williams RW: Genetic correlates of gene expression in recombinant inbred strains: a relational model system to explore neurobehavioral phenotypes. Neuroinformatics 2003, 1(4):343-357.
- [33]Kreek M, Nielsen D, LaForge K: Genes associated with addiction: alcoholism, opiate, and cocaine addiction. Neuromolecular Med 2004, 5:85-108.
- [34]Albertson D, Schmidt C, Kapatos G, Bannon M: Distinctive profiles of gene expression in the human nucleus accumbens associated with cocaine and heroin abuse. Neuropsychopharmacology 2006, 31(10):2304-2312.
- [35]Mash D, Ffrench-Mullen J, Adi N, Qin Y, Buck A, Pablo J: Gene expression in human hippocampus from cocaine abusers identifies genes which regulate extracellular matrix remodeling. PLoS ONE 2007, 2(11):e1187.