Technical Report

【Abstract】

InfoBoxes in Wikipedia pages were originally meant as quick references for readers. However, a range of knowledge bases built from these InfoBoxes now plays a crucial role in important applications, including review summarization, document categorization, question answering, and semantic search. Unfortunately, current InfoBoxes suffer from incompleteness, inconsistencies, and inaccuracies, largely because they are created manually. Previous attempts to correct these problems have relied on text mining approaches that mostly exploit structured information in Wikipedia, such as internal links, redirects, and disambiguation pages, but not the morphological information in the text. In this paper, we present a novel system, IBminer, that derives structured information (in the form of InfoBoxes) from the free text of Wikipedia pages using Natural Language Processing (NLP). IBminer builds on our SemScape text mining framework, which uses morphological information to convert free text into graph structures called TextGraphs. These TextGraphs capture the information in the text as categorical, semantic, and grammatical relations between words and multi-word terms. Two novel features of the SemScape framework are: a) a common-sense knowledge base containing categorical information and b) simple pronoun and co-reference resolution. IBminer generates subject-attribute-value triples from TextGraphs using a set of predefined SPARQL-like queries. After resolving the pronouns and co-references that appear in the subject and value parts, IBminer matches the attribute names to those in existing InfoBox triples. Using information from our knowledge base and from WordNet, IBminer can suggest new InfoBox triples, flag existing triples that appear incorrect, and propose attribute synonyms.
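The triple-generation step described in the abstract can be illustrated with a minimal sketch. It is not IBminer or SemScape code: the edge labels (`subj`, `prep_in`, `type`), the toy graph, the pattern syntax, and the names `match` and `extract_triples` are all hypothetical, chosen only to show how a SPARQL-like graph pattern might yield a subject-attribute-value triple.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Edge = Tuple[str, str, str]  # (source node, link label, target node)

# Hypothetical TextGraph for the sentence "Mozart was born in Salzburg."
# The labels are illustrative and are not SemScape's actual link names.
TEXT_GRAPH: List[Edge] = [
    ("born", "subj", "Mozart"),
    ("born", "prep_in", "Salzburg"),
    ("Mozart", "type", "person"),
]

# A SPARQL-like basic graph pattern: terms starting with "?" are variables.
BORN_IN_PATTERN: List[Edge] = [
    ("?verb", "subj", "?s"),
    ("?verb", "prep_in", "?v"),
]


def match(pattern: List[Edge], graph: List[Edge],
          binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield each variable binding under which every pattern edge is in graph."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    head, rest = pattern[0], pattern[1:]
    for edge in graph:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(head, edge):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield from match(rest, graph, candidate)


def extract_triples(graph: List[Edge]) -> List[Tuple[str, str, str]]:
    """Turn each match of the pattern into a (subject, attribute, value) triple."""
    return [(b["?s"], b["?verb"] + " in", b["?v"])
            for b in match(BORN_IN_PATTERN, graph)]


print(extract_triples(TEXT_GRAPH))
```

In this sketch the attribute name is simply the verb plus its preposition. Per the abstract, IBminer then matches such names against the attribute names of existing InfoBox triples, a step this example does not attempt.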

【Preview】

Attachments
File	Size	Format
RO201804090001223LZ	1210KB	PDF


Deducing InfoBoxes from Unstructured Text in Wikipedia Pages

Hamid Mousavi ; Deirdre Kerr ; Markus Iseli ; Carlo Zaniolo
UCLA Henry Samueli School of Engineering and Applied Science
RP-ID : 130001
Subject classification: Computer Science (General)
United States | English
Source: UCLA Computer Science Technical Reports Database


