Thesis details
Part of speech N-grams for information retrieval
QA75 Electronic computers. Computer science
Lioma, Christina Amalia ; Van Rijsbergen, C.J.
University: University of Glasgow
Department: School of Computing Science
Keywords: Information Retrieval, Computational Linguistics, Natural Language Processing
Others: http://theses.gla.ac.uk/340/1/2008LiomaPhD.pdf
Source: University of Glasgow
【 Abstract 】

The increasing availability of information on the World Wide Web (Web), and the need to access relevant aspects of this information, provide an important impetus for the development of automatic intelligent Information Retrieval (IR) technology. IR systems convert human-authored language into representations that can be processed by computers, with the aim of providing humans with access to knowledge. Specifically, IR applications locate and quantify informative content in data, and make statistical decisions on the topical similarity, or relevance, between different items of data. The wide popularity of IR applications in the last decades has driven intensive research and development into theoretical models of information and relevance, and their implementation into usable applications, such as commercial search engines.

The majority of IR systems today rely on statistical manipulations of individual lexical frequencies (i.e., single word counts) to estimate the relevance of a document to a user request, on the assumption that such lexical statistics can be sufficiently representative of informative content. Such estimations implicitly assume that words occur independently of each other, and as such ignore the compositional semantics of language. This assumption, however, is not entirely true, and can cause several problems, such as ambiguity in understanding textual information, misinterpreting or falsifying the original informative intent, and limiting the semantic scope of text. These problems can hinder the accurate estimation of relevance between texts, and hence harm the performance of an IR application.

This thesis investigates the use of non-lexical statistics by IR models, with the goal of enhancing the estimation of relevance between a document and a user request. These non-lexical statistics consist of part of speech information. The parts of speech are the grammatical classes of words (e.g., noun, verb). Part of speech statistics are modelled in the form of part of speech (POS) n-grams, which are contiguous sequences of parts of speech extracted from text. The distribution of POS n-grams in language is statistically analysed. It is shown that there exists a relationship between the frequency and informative content of POS n-grams. Based on this, different applications of POS n-grams to IR technology are described and evaluated with state-of-the-art systems. Experimental results show that POS n-grams can assist the retrieval process.
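
To make the notion of a POS n-gram concrete, the following is a minimal sketch in Python and not the thesis's actual method. It assumes the text has already been tagged; the two example sentences, their Penn Treebank style tags, the function name pos_ngrams, and the choice of n = 2 are all illustrative assumptions, since the abstract does not specify a tagger, a tag set, or a value of n.

```python
from collections import Counter

def pos_ngrams(tags, n):
    """Return every contiguous n-gram over a sequence of POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Hypothetical, hand-tagged sentences (the tags are assigned by hand for
# illustration only; a real system would obtain them from a POS tagger).
tagged_sentences = [
    [("information", "NN"), ("retrieval", "NN"), ("systems", "NNS"),
     ("rank", "VBP"), ("relevant", "JJ"), ("documents", "NNS")],
    [("search", "NN"), ("engines", "NNS"), ("index", "VBP"),
     ("web", "NN"), ("pages", "NNS")],
]

# Count POS bigrams sentence by sentence, so that no n-gram
# spans a sentence boundary.
counts = Counter()
for sentence in tagged_sentences:
    tags = [tag for _word, tag in sentence]
    counts.update(pos_ngrams(tags, 2))

# List the POS bigrams from most to least frequent.
for ngram, freq in counts.most_common():
    print(" ".join(ngram), freq)
```

Frequency counts of this kind are the raw material for the statistical analysis the abstract mentions; how the thesis relates those frequencies to informative content is not described here and is not reproduced by this sketch.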

【 Attachments 】
File: Part of speech N-grams for information retrieval (PDF, 95626 KB)