Technical Report

【Abstract】

Traditional data mining methodologies have focused on "flat" data, i.e., a collection of identically structured entities assumed to be independent and identically distributed. However, many real-world datasets are innately relational: they consist of multi-modal entities and multi-relational links, where each entity or link type is characterized by a different set of attributes. Link structure is an important characteristic of a dataset and should not be ignored in modeling efforts, especially when statistical dependencies exist between related entities. These dependencies can in fact significantly improve the accuracy of inference and prediction results if the relational structure is appropriately leveraged (Figure 1). The need for models that can incorporate relational structure has been accentuated by new technological developments that allow us to easily track, store, and make accessible large amounts of data. Recently, there has been a surge of interest in statistical models for dealing with richly interconnected, heterogeneous data, fueled largely by information mining of web/hypertext data, social networks, bibliographic citation data, epidemiological data, and communication networks. Graphical models provide a natural formalism for representing complex relational data and for predicting the underlying evolving system in a dynamic framework. The present survey provides an overview of probabilistic methods and techniques that have been developed over the last few years for dealing with relational data. Particular emphasis is placed on approaches pertinent to the research areas of pattern recognition, group discovery, entity/node classification, and anomaly detection. We start with supervised learning tasks, where two basic modeling approaches are discussed: discriminative and generative. Several discriminative techniques are reviewed and performance results are presented. Generative methods are discussed in a separate survey.
A special section is devoted to latent variable models because of their unique characteristics and usefulness in static and dynamic frameworks and in both supervised and unsupervised learning processes. Section 4 contains a brief discussion of unsupervised learning techniques with an emphasis on computational efficiency and large networks. Finally, Section 5 discusses performance metrics with an emphasis on classification problems.

【Preview】

Attachments
Files	Size	Format	View
900137.pdf	354KB	PDF	download


A Survey of Probabilistic Models for Relational Data

Koutsourelakis, P S
Lawrence Livermore National Laboratory
Keywords: Efficiency; Mining; Forecasting; Classification; Surges
DOI : 10.2172/900137 RP-ID : UCRL-TR-225637 RP-ID : W-7405-ENG-48 RP-ID : 900137
United States | English
Source: UNT Digital Library


