Journal article details
Journal of Biomedical Semantics
Generalising semantic category disambiguation with large lexical resources for fun and profit
Jun’ichi Tsujii [2], Sophia Ananiadou [3], Sampo Pyysalo [3], Pontus Stenetorp [1]
[1] Department of Computer Science, University of Tokyo, Tokyo, Japan
[2] Microsoft Research Asia, Beijing, People’s Republic of China
[3] National Centre for Text Mining, University of Manchester, Manchester, UK
Keywords: Freebase; Domain adaptation; Named entity recognition; Lexical resources; Approximate string matching; Semantic category disambiguation
Others  :  1135956
DOI  :  10.1186/2041-1480-5-26
Received 2012-10-19, accepted 2014-04-03, published 2014
【 Abstract 】

Background

Semantic Category Disambiguation (SCD) is the task of assigning the appropriate semantic category to given spans of text from a fixed set of candidate categories, for example PROTEIN to “Fibrin”. SCD is relevant to Natural Language Processing tasks such as Named Entity Recognition, coreference resolution and coordination resolution. In this work, we study machine learning-based SCD methods using large lexical resources and approximate string matching, aiming to generalise these methods with regard to domains, lexical resources and the composition of data sets. We specifically consider the applicability of SCD for the purposes of supporting human annotators and acting as a pipeline component for other Natural Language Processing systems.
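
As a rough illustration of how approximate string matching against lexical resources could produce features for an SCD classifier, the sketch below scores a text span against each lexicon by character trigram cosine similarity. The lexicons, the function names and the choice of similarity measure are assumptions made for this example; they are not taken from the paper or from the SimSem code base.

```python
from math import sqrt


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lower-cased string."""
    padded = "$" * (n - 1) + text.lower() + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def cosine(a, b):
    """Set-based cosine similarity between two n-gram sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# Hypothetical toy lexicons; real resources would hold far more entries.
LEXICONS = {
    "PROTEIN": ["fibrin", "fibrinogen", "thrombin"],
    "ORGANISM": ["homo sapiens", "mus musculus"],
}


def lexicon_features(span, lexicons=LEXICONS):
    """Map each lexicon name to the best similarity between the span and any entry."""
    grams = char_ngrams(span)
    return {
        name: max(cosine(grams, char_ngrams(entry)) for entry in entries)
        for name, entries in lexicons.items()
    }


features = lexicon_features("Fibrin")
```

Here `features` maps each lexicon name to a score in [0, 1], so a misspelled span such as "Fibrine" still scores well against the PROTEIN lexicon. A linear scan like this is only workable for toy lexicons; large resources would need an indexed approximate matcher.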

Results

While previous research has mostly cast SCD purely as a classification task, we consider a task setting that allows for multiple semantic categories to be suggested, aiming to minimise the number of suggestions while maintaining high recall. We argue that this setting reflects aspects that are essential both for a pipeline component and for supporting human annotators. We introduce an SCD method based on a recently introduced machine learning-based system and evaluate it on 15 corpora covering biomedical, clinical and newswire texts and ranging in the number of semantic categories from 2 to 91.

With appropriate settings, our system maintains an average recall of 99% while reducing the number of candidate semantic categories on average by 65% over all data sets.
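
The abstract does not say how the suggested categories are selected, so the following is only one plausible reading of the task setting: rank categories by classifier probability and keep the smallest prefix whose cumulative probability reaches a threshold. The probabilities, the threshold value and the function names are hypothetical. The sketch also shows how recall and the average reduction in candidate categories could be measured.

```python
def suggest(probabilities, threshold):
    """Return the top-ranked categories whose cumulative probability reaches the threshold."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    chosen, mass = [], 0.0
    for category, p in ranked:
        chosen.append(category)
        mass += p
        if mass >= threshold:
            break
    return chosen


def evaluate(predictions, gold, threshold):
    """Return (recall, mean fraction of candidate categories removed)."""
    hits, reduction = 0, 0.0
    for probs, answer in zip(predictions, gold):
        kept = suggest(probs, threshold)
        hits += answer in kept
        reduction += 1.0 - len(kept) / len(probs)
    n = len(gold)
    return hits / n, reduction / n


# Hypothetical classifier outputs for two spans over four categories.
predictions = [
    {"PROTEIN": 0.90, "CHEMICAL": 0.07, "ORGANISM": 0.02, "CELL": 0.01},
    {"PROTEIN": 0.40, "CHEMICAL": 0.35, "ORGANISM": 0.20, "CELL": 0.05},
]
gold = ["PROTEIN", "ORGANISM"]

recall, reduction = evaluate(predictions, gold, threshold=0.85)
```

Raising the threshold keeps more categories per span, which trades a smaller reduction for higher recall. Confident predictions shrink to a single suggestion, while ambiguous ones retain several.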

Conclusions

Machine learning-based SCD using large lexical resources and approximate string matching is sensitive to the selection and granularity of lexical resources, but generalises well to a wide range of text domains and data sets given appropriate resources and parameter settings. By substantially reducing the number of candidate categories while only very rarely excluding the correct one, our method is shown to be applicable to manual annotation support tasks and to use as a high-recall component in text processing pipelines. The introduced system and all related resources are freely available for research purposes at: https://github.com/ninjin/simsem.

【 License 】

   
2014 Stenetorp et al.; licensee BioMed Central Ltd.
