Journal of Biomedical Semantics
Sequential pattern mining for discovering gene interactions and their contextual information from biomedical texts
Authors: Jean-Luc Manguin [7], Jiří Kléma [1], Olivier Gandrillon [6], Bruno Crémilleux [7], Christophe Rigotti [4], Marc Plantevit [3], Thierry Charnois [5], Peggy Cellier [2]
Affiliations: [1] Faculty of Electrical Engineering, Czech Technical University, Prague, Czech Republic; [2] INSA de Rennes, IRISA, UMR6074, Rennes F-35042, France; [3] Université Lyon 1, LIRIS, UMR5205, Lyon F-69622, France; [4] INSA de Lyon, LIRIS, UMR5205, Lyon F-69621, France; [5] Université de Paris 13, LIPN, UMR7030, Villetaneuse F-93430, France; [6] Université Lyon 1, CGMC, UMR5534, Lyon F-69622, France; [7] Université de Caen, GREYC, UMR6072, Caen F-14032, France
Keywords: Gene interactions; Information extraction; Natural language processing; Sequential pattern mining; Data mining
Others: 1209188; DOI: 10.1186/s13326-015-0023-3
Received 2013-07-25; accepted 2015-04-22; published 2015
【 Abstract 】
Background
Discovering gene interactions and their characterizations from biological text collections is a crucial issue in bioinformatics. These text collections are large, and it is very difficult for biologists to benefit fully from this amount of knowledge. Natural Language Processing (NLP) methods have been applied to extract background knowledge from biomedical texts. Some existing NLP approaches are based on handcrafted rules; they are therefore time-consuming to build and often tied to a specific corpus. Machine-learning-based NLP methods give good results but produce outcomes that are not readily understandable by a user.
Results
We take advantage of a hybridization of data mining and natural language processing to propose an original symbolic method that automatically produces patterns conveying gene interactions and their characterizations. Our method thus detects not only gene interactions but also semantic information about the extracted interactions (e.g., modalities, biological contexts, interaction types). Only a limited resource is required: the text collection that is used as a training corpus. Our approach gives results comparable to those of state-of-the-art methods and performs even better for gene interaction detection on AIMed.
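As a rough illustration of what mining sequential patterns from sentences means, the sketch below counts, level by level, which ordered token subsequences (gaps allowed) occur in at least a minimum number of sentences. It is a minimal toy under stated assumptions, not the authors' method or software: the placeholder token `AGENE`, the example sentences, and the names `contains`, `support`, and `frequent_patterns` are all hypothetical and do not come from the article.

```python
from typing import Dict, List, Tuple

Sentence = List[str]
Pattern = Tuple[str, ...]


def contains(sentence: Sentence, pattern: Pattern) -> bool:
    """Return True if `pattern` occurs in `sentence` in order, gaps allowed."""
    remaining = iter(sentence)
    # `tok in remaining` advances the iterator, so order is respected.
    return all(tok in remaining for tok in pattern)


def support(sentences: List[Sentence], pattern: Pattern) -> int:
    """Number of sentences that contain the pattern."""
    return sum(contains(s, pattern) for s in sentences)


def frequent_patterns(
    sentences: List[Sentence], min_support: int, max_len: int = 4
) -> Dict[Pattern, int]:
    """Level-wise search: extend each frequent pattern by one token at the end.

    A pattern can only be frequent if its prefix is frequent, so growing
    frequent prefixes is enough to reach every frequent pattern.
    """
    vocabulary = sorted({tok for s in sentences for tok in s})
    found: Dict[Pattern, int] = {}
    frontier: List[Pattern] = [()]
    for _ in range(max_len):
        next_frontier: List[Pattern] = []
        for prefix in frontier:
            for tok in vocabulary:
                candidate = prefix + (tok,)
                count = support(sentences, candidate)
                if count >= min_support:
                    found[candidate] = count
                    next_frontier.append(candidate)
        if not next_frontier:
            break
        frontier = next_frontier
    return found


# Hypothetical toy corpus: gene names replaced by the placeholder AGENE.
toy_sentences = [
    ["AGENE", "interacts", "with", "AGENE"],
    ["AGENE", "strongly", "interacts", "with", "AGENE"],
    ["AGENE", "inhibits", "AGENE"],
]
patterns = frequent_patterns(toy_sentences, min_support=2)
```

In this toy corpus, the pattern `("AGENE", "interacts", "with", "AGENE")` is kept because two sentences contain it, while anything involving `inhibits` is discarded at a minimum support of 2. Real miners add constraints and far more efficient search than this exhaustive enumeration.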
Conclusions
Experiments show how our approach enables the discovery of interactions and their characterizations. To the best of our knowledge, there are few methods that automatically extract both the interactions and the associated semantic information. The gene interactions extracted from PubMed are available through a simple web interface at https://bingotexte.greyc.fr/. The software is available at https://bingo2.greyc.fr/?q=node/22.
【 License 】
2015 Cellier et al.; licensee BioMed Central.