Journal of Biomedical Semantics
A framework for ontology-based question answering with application to parasite immunology
Rick L. Tarleton2, Prashant Doshi1, Todd Minning2, Amir H. Asiaee1
[1] THINC Lab, Department of Computer Science, University of Georgia, Athens, GA, USA; [2] Tarleton Research Group, Department of Cellular Biology, University of Georgia, Athens, GA, USA
Keywords: Question answering; Parasite data; Ontology; Natural language; Chagas
DOI: 10.1186/s13326-015-0029-x
Received 2013-12-15, accepted 2015-06-19, published 2015
【 Abstract 】
Background
Large quantities of biomedical data are being produced at a rapid pace for a variety of organisms. With ontologies proliferating, data is increasingly being stored using the RDF data model and queried using RDF-based query languages. While existing systems facilitate querying in various ways, the scientist must still map the question in his or her mind to the interface used by the system. The field of natural language processing has long investigated the challenges of designing natural-language-based retrieval systems. Recent efforts seek to bring the ability to pose natural language questions to RDF data querying systems while leveraging the associated ontologies. These systems analyze the input question, extract triples (subject, relationship, object) where possible, and map them to RDF triples in the data. However, in the biomedical context, relationships between entities are not always explicit in the question, and they are often complex, involving many intermediate concepts.
Results
We present a new framework, OntoNLQA, for querying RDF data annotated using ontologies, which allows questions to be posed in natural language. OntoNLQA answers a natural language question in five steps. It differs from previous systems in how some of these steps are realized. In particular, it introduces a novel approach for discovering the sophisticated semantic associations that may exist between the key terms of a natural language question, in order to build an intuitive query and retrieve precise answers. We apply this framework to parasite immunology data, leading to a system called AskCuebee that allows parasitologists to pose genomic, proteomic and pathway questions in natural language about the parasite Trypanosoma cruzi. We separately evaluate the accuracy of each component of OntoNLQA as implemented in AskCuebee and the accuracy of the whole system. AskCuebee answers 68% of the questions in a corpus of 125 questions, and 60% of the questions in a new, previously unseen corpus. If we allow simple corrections by the scientists, this proportion increases to 92%.
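To make the idea of discovering associations between key terms more concrete, the following is a minimal sketch, not the method used by OntoNLQA. It assumes the ontology can be viewed as a directed graph of classes linked by properties and uses a plain breadth-first search to find the shortest chain of intermediate concepts between two classes. The class names, property names, and the `find_association` helper are all hypothetical and are not taken from the paper.

```python
from collections import deque

# Hypothetical toy ontology: class -> list of (property, neighbouring class).
# These names are illustrative only; they are not from the paper's ontology.
ONTOLOGY = {
    "Gene": [("encodes", "Protein")],
    "Protein": [
        ("participates_in", "Pathway"),
        ("has_expression", "ProteinExpression"),
    ],
    "ProteinExpression": [("observed_in", "LifeCycleStage")],
    "Pathway": [],
    "LifeCycleStage": [],
}


def find_association(start, goal, graph=ONTOLOGY):
    """Return the shortest chain of (subject, property, object) edges
    leading from `start` to `goal`, or None if no chain exists."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for prop, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(node, prop, neighbour)]))
    return None


if __name__ == "__main__":
    # A question mentioning only "gene" and "life cycle stage" leaves the
    # intermediate concepts implicit; the search recovers them.
    for triple in find_association("Gene", "LifeCycleStage"):
        print(triple)
```

Each edge in the returned chain could then serve as one triple pattern of an RDF query, which is one way the implicit intermediate concepts in a question might be filled in.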
Conclusions
We introduce a novel framework for question answering and apply it to parasite immunology data. The approach translates questions to RDF triple queries by combining machine learning, lexical similarity matching against ontology classes, properties and instances for specificity, and the discovery of associations between them. Our evaluations demonstrate that it performs well and improves on previous systems. Consequently, OntoNLQA offers a viable framework for building question answering systems in other biomedical domains.
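As an illustration of lexical similarity matching between question terms and ontology labels, here is a small sketch under stated assumptions. The abstract does not say which similarity measure the framework uses, so the character-based ratio from Python's standard `difflib` module is a stand-in. The label list, the `best_label` helper, and the 0.7 threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical ontology labels; not taken from the paper.
ONTOLOGY_LABELS = [
    "gene",
    "protein",
    "pathway",
    "life cycle stage",
    "microarray analysis",
]


def best_label(term, labels=ONTOLOGY_LABELS, threshold=0.7):
    """Return the label most similar to `term`, or None when even the
    best candidate scores below `threshold` (an assumed cut-off)."""
    scored = [
        (SequenceMatcher(None, term.lower(), label.lower()).ratio(), label)
        for label in labels
    ]
    score, label = max(scored)
    return label if score >= threshold else None


if __name__ == "__main__":
    for term in ["proteins", "pathways", "xyz"]:
        print(term, "->", best_label(term))
```

A threshold of this kind trades recall for precision: a lower value maps more question terms to ontology entities but risks wrong matches, which is where user corrections could help.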
【 License 】
2015 Asiaee et al.