| BMC Bioinformatics | |
| Semantic similarity in the biomedical domain: an evaluation across knowledge sources | |
| Vijay N Garla [2], Cynthia Brandt [1] | |
| [1] Connecticut VA Healthcare System, Bldg. 35A, Room 213 (11-ACSLG), 950 Campbell Avenue, West Haven, CT 06516, USA | |
| [2] Yale Center for Medical Informatics, Yale University, 300 George Street, Suite 501, New Haven, CT 06520-8009, USA | |
| Keywords: Biomedical ontologies; Information theory; Information content; Semantic similarity | |
| Others: 1088109; DOI: 10.1186/1471-2105-13-261 | |
| Received 2012-06-18, accepted 2012-10-02, published 2012 | |
【 Abstract 】
Background
Semantic similarity measures estimate the similarity between concepts and play an important role in many text processing tasks. Approaches to semantic similarity in the biomedical domain can be roughly divided into knowledge-based and distributional methods. Knowledge-based approaches utilize knowledge sources such as dictionaries, taxonomies, and semantic networks, and include path finding measures and intrinsic information content (IC) measures. Distributional measures utilize, in addition to a knowledge source, the distribution of concepts within a corpus to compute similarity; these include corpus IC and context vector methods. Prior evaluations of these measures in the biomedical domain showed that distributional measures outperform knowledge-based path finding methods, but more recent studies suggested that intrinsic IC based measures exceed the accuracy of distributional approaches. Limitations of previous evaluations of similarity measures in the biomedical domain include their focus on the SNOMED CT ontology and their reliance on small benchmarks not powered to detect significant differences in measure accuracy. There have been few evaluations of the relative performance of these measures on other biomedical knowledge sources such as the UMLS, and on larger, recently developed semantic similarity benchmarks.
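To make the two knowledge-based families named above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical six-concept is-a tree (the concept names and the `PARENT` mapping are invented for illustration and are not taken from SNOMED CT, MeSH, or the UMLS). It also assumes one common intrinsic IC formulation, 1 - log(descendants + 1) / log(total concepts), combined with Lin's similarity; the study may use different formulations.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); an assumption for illustration only.
PARENT = {
    "disease": None,
    "heart disease": "disease",
    "lung disease": "disease",
    "myocardial infarction": "heart disease",
    "angina": "heart disease",
    "pneumonia": "lung disease",
}


def ancestors(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lcs(a, b):
    """Least common subsumer: the deepest concept that subsumes both a and b."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):  # walks upward from b, so the first hit is the deepest
        if candidate in seen:
            return candidate
    raise ValueError("concepts share no common subsumer")


def path_similarity(a, b):
    """Path finding measure: 1 / (number of nodes on the shortest is-a path)."""
    common = lcs(a, b)
    edges = ancestors(a).index(common) + ancestors(b).index(common)
    return 1.0 / (edges + 1)


def descendant_count(concept):
    """Number of proper descendants of a concept in the toy taxonomy."""
    return sum(1 for c in PARENT if c != concept and concept in ancestors(c))


def intrinsic_ic(concept):
    """Intrinsic IC from taxonomy structure alone; no corpus is consulted."""
    return 1.0 - math.log(descendant_count(concept) + 1) / math.log(len(PARENT))


def lin_similarity(a, b):
    """Lin similarity using intrinsic IC: 2 * IC(lcs) / (IC(a) + IC(b))."""
    denom = intrinsic_ic(a) + intrinsic_ic(b)
    return 2.0 * intrinsic_ic(lcs(a, b)) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    for pair in [("myocardial infarction", "angina"),
                 ("myocardial infarction", "pneumonia")]:
        print(pair, path_similarity(*pair), lin_similarity(*pair))
```

In this toy tree, both measures rate the two heart conditions as more similar to each other than either is to pneumonia. A corpus IC measure would replace `intrinsic_ic` with an estimate based on concept frequencies in an external corpus, which is why it is classed as distributional.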
Results
We evaluated knowledge-based and corpus IC based semantic similarity measures derived from SNOMED CT, MeSH, and the UMLS on recently developed semantic similarity benchmarks. Semantic similarity measures based on the UMLS, which contains SNOMED CT and MeSH, significantly outperformed those based solely on SNOMED CT or MeSH across evaluations. Intrinsic IC based measures significantly outperformed path-based and distributional measures. We released all code required to reproduce our results and all tools developed as part of this study as open source, available at http://code.google.com/p/ytex. We provide a publicly accessible web service to compute semantic similarity, available at http://informatics.med.yale.edu/ytex.web/.
Conclusions
Knowledge-based semantic similarity measures are more practical to compute than distributional measures, as they do not require an external corpus. Furthermore, knowledge-based measures significantly and meaningfully outperformed distributional measures on large semantic similarity benchmarks, suggesting that they are a practical alternative to distributional measures. Future evaluations of semantic similarity measures should utilize benchmarks powered to detect significant differences in measure accuracy.
【 License 】
2012 Garla and Brandt; licensee BioMed Central Ltd.