Thesis details
Truth finding in databases
Zhao, Bo
Keywords: data integration; truth finding; data fusion; data quality; entity matching; data mining; probabilistic graphical models
Others  :  https://www.ideals.illinois.edu/bitstream/handle/2142/42470/Bo_Zhao.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】

In practical data integration systems, it is common for the data sources being integrated to provide conflicting information about the same entity. Consequently, a major challenge for data integration is to derive the most complete and accurate integrated records from diverse and sometimes conflicting sources. We term this challenge the truth finding problem. We observe that some sources are generally more reliable than others, and therefore a good model of source quality is the key to solving the truth finding problem. In this thesis, we propose probabilistic models that can automatically infer true records and source quality, without any supervision, on both categorical and numerical data. We further develop a new entity matching framework that considers source quality based on truth-finding models.

On categorical data, in contrast to previous methods, our principled approach leverages a generative process of two types of errors (false positives and false negatives) by modeling two different aspects of source quality. In so doing, ours is also the first approach designed to merge multi-valued attribute types. Our method is scalable, owing to an efficient sampling-based inference algorithm that needs very few iterations in practice and enjoys linear time complexity, with an even faster incremental variant. Experiments on two real-world datasets show that our new method outperforms existing state-of-the-art approaches to the truth finding problem on categorical data.

In practice, numerical data is not only ubiquitous but also of high value, e.g., prices, weather, census, polls, and economic statistics. Quality issues can also be even more common and severe on numerical data than on categorical data due to its characteristics. Therefore, in this thesis we propose a new truth-finding method specially designed for handling numerical data. Based on Bayesian probabilistic models, our method leverages the characteristics of numerical data in a principled way when modeling the dependencies among source quality, truth, and claimed values. Experiments on two real-world datasets show that our new method outperforms existing state-of-the-art approaches in both effectiveness and efficiency.

We further observe that modeling source quality can help not only to decide the truth but also to match entities across different sources. Therefore, as a natural next step, we integrate truth finding with entity matching so that we can infer the matching of entities, the true attributes of entities, and source quality jointly. This is the first entity matching approach that involves modeling source quality and truth finding. Experiments show that our approach can outperform state-of-the-art baselines.
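
The idea of two-sided source quality on categorical data can be illustrated with a small sketch. It is not the thesis's algorithm: the abstract describes a sampling-based inference procedure, whereas the code below swaps in a simple EM-style iteration over a naive-Bayes-like model. The function name `estimate_truth`, the parameters `prior` and `pseudo`, and the toy data are all hypothetical. The sketch also assumes that every source implicitly denies each fact it does not assert, which is only reasonable when all sources cover the same entities.

```python
from math import prod


def estimate_truth(claims, n_iter=20, prior=0.5, pseudo=1.0):
    """Jointly estimate fact truth and two-sided source quality.

    claims: dict mapping a source name to the set of (entity, value)
            facts that the source asserts.
    Returns (p, sens, spec): the probability that each fact is true, and
    each source's estimated sensitivity and specificity.
    Illustrative EM-style sketch, not the sampling method of the thesis.
    """
    sources = sorted(claims)
    facts = sorted(set().union(*claims.values()))

    # Initialise each fact's truth probability with its vote share.
    p = {f: sum(f in claims[s] for s in sources) / len(sources) for f in facts}
    sens, spec = {}, {}

    for _ in range(n_iter):
        total_true = sum(p.values())
        total_false = len(facts) - total_true

        # Source quality: sensitivity is tied to false negatives,
        # specificity to false positives. Pseudo-counts smooth both.
        for s in sources:
            tp = sum(p[f] for f in claims[s])
            tn = sum(1 - p[f] for f in facts if f not in claims[s])
            sens[s] = (tp + pseudo) / (total_true + 2 * pseudo)
            spec[s] = (tn + pseudo) / (total_false + 2 * pseudo)

        # Fact truth: combine every source's assertion or silence.
        for f in facts:
            like_true = prior * prod(
                sens[s] if f in claims[s] else 1 - sens[s] for s in sources
            )
            like_false = (1 - prior) * prod(
                1 - spec[s] if f in claims[s] else spec[s] for s in sources
            )
            p[f] = like_true / (like_true + like_false)

    return p, sens, spec


# Toy multi-valued attribute: the authors of two books (made-up data).
claims = {
    "s1": {("book1", "Alice"), ("book1", "Bob"), ("book2", "Carol"), ("book2", "Dan")},
    "s2": {("book1", "Alice"), ("book1", "Bob"), ("book2", "Carol"), ("book2", "Dan")},
    "s3": {("book1", "Alice"), ("book2", "Carol")},  # omits co-authors
    "s4": {("book1", "Alice"), ("book1", "Zed"), ("book2", "Carol"), ("book2", "Yan")},
}
p, sens, spec = estimate_truth(claims)
for fact in sorted(p):
    print(fact, round(p[fact], 3))
```

Keeping sensitivity and specificity separate lets the sketch treat a source that merely omits values (`s3`) differently from one that asserts unsupported values (`s4`), which is what allows several values of the same attribute to be accepted at once.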

【 Preview 】
Attachments
Files | Size | Format | View
Truth finding in databases | 2713 KB | PDF | download