Dissertation details
Toward accurate and efficient outlier detection in high dimensional and large data sets
Nguyen, Minh Quoc; Computing
University: Georgia Institute of Technology
Department: Computing
Keywords: Database; Data mining; Outlier detection
Others: https://smartech.gatech.edu/bitstream/1853/34657/1/nguyen_minh_q_201008_phd.pdf
United States | English
Source: SMARTech Repository
【 Abstract 】

This thesis proposes an efficient method for computing local density-based outliers in high-dimensional data. We show that outliers of this type remain outliers in any subset of the data set. This property allows the data set to be partitioned into random subsets in which the outliers are computed locally; the outliers from the different subsets are then combined, so local density-based outliers can be computed efficiently. Another challenge in high-dimensional outlier detection is that outliers are often suppressed when the majority of dimensions do not exhibit outliers. To address this, we introduce a filtering method in which outlier scores are computed in sub-dimensions: the low sub-dimensional scores are filtered out and the high scores are aggregated into the final score. Aggregation with filtering eliminates the effect of accumulating small (delta) deviations across multiple dimensions, so the outliers are identified correctly. In some cases, sets of outliers that form micro patterns are more interesting than individual outliers. These micro patterns are considered anomalous with respect to the dominant patterns in the data set. Anomalous pattern detection poses two challenges. The first is that, with existing clustering techniques, the anomalous patterns are often overshadowed by the dominant patterns. A common approach is to cluster the data set using the k-nearest neighbor algorithm. We introduce the adaptive nearest neighbor and the concept of the dual-neighbor to detect micro patterns more accurately. The second challenge is to compute the anomalous patterns quickly. Our contribution is to compute the patterns based on the correlation between the attributes: the correlation implies that the data can be partitioned into groups based on each attribute, and the candidate patterns can be learned within the groups. Thus, we develop a feature-based method that computes these patterns efficiently.
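
The filtering idea in the abstract can be illustrated with a small sketch. It is not the thesis's algorithm: the per-dimension score (absolute deviation from the median, scaled by the median absolute deviation), the threshold of 3.0, the use of a plain sum for aggregation, and the toy data are all assumptions chosen only to show how many small deviations can outweigh one large deviation unless low scores are filtered out first.

```python
from statistics import median

def per_dimension_scores(data):
    """Score each value by its deviation from the column median, in MAD units.

    This scoring rule is an assumption for illustration only.
    """
    n_dims = len(data[0])
    scores = [[0.0] * n_dims for _ in data]
    for j in range(n_dims):
        column = [row[j] for row in data]
        med = median(column)
        # Fall back to 1.0 if the MAD is zero, to avoid dividing by zero.
        mad = median(abs(x - med) for x in column) or 1.0
        for i, x in enumerate(column):
            scores[i][j] = abs(x - med) / mad
    return scores

def filtered_score(point_scores, threshold=3.0):
    """Aggregate only the per-dimension scores that reach the threshold."""
    return sum(s for s in point_scores if s >= threshold)

# Toy data: 7 ordinary points in 6 dimensions with values in {-1, 0, 1}.
pattern = [0, 0, -1, 1, -1, 1, -1]
n_dims = 6
data = [[pattern[(i + j) % 7] for j in range(n_dims)] for i in range(7)]
data.append([10, 0, 0, 0, 0, 0])  # index 7: strong deviation in one dimension
data.append([2] * n_dims)         # index 8: mild deviation in every dimension

scores = per_dimension_scores(data)
naive = [sum(s) for s in scores]                 # aggregate without filtering
filtered = [filtered_score(s) for s in scores]   # aggregate with filtering

print("naive:   ", naive)
print("filtered:", filtered)
```

Without filtering, the point at index 8 receives the highest aggregate score because its mild deviations accumulate over six dimensions. With filtering, only the point at index 7 keeps a non-zero score.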

【 Preview 】
Attachments
Files Size Format View
Toward accurate and efficient outlier detection in high dimensional and large data sets 1520KB PDF download
Document metrics
Downloads: 17  Views: 17