科技报告

【摘要】

Data clustering is one of the fundamental techniques in scientific data analysis and data mining. It partitions a data set into groups of similar items, as measured by some distance metric. Over the years, data set sizes have grown rapidly with the exponential growth of computer storage and increasingly automated business and manufacturing processes. Many of these datasets are geographically distributed across multiple sites, e.g. different sales or warehouse locations. To cluster such large and distributed data sets, efficient distributed algorithms are called for to reduce the communication overhead, central storage requirements, and computation time, as well as to bring the resources of multiple machines to bear on a given problem as the data set sizes scale-up. We describe a technique for parallelizing a family of center-based data clustering algorithms. The central idea is to communicate only sufficient statistics, yielding linear speed-up with excellent efficiency. The technique does not involve approximation and may be used orthogonally in conjunction with sampling or aggregation-based methods, such as BIRCH, to lessen the quality degradation of their approximation or to handle larger data sets. We demonstrate in this paper that even for relatively small problem sizes, it can be more cost effective to cluster the data in-place using an exact distributed algorithm than to collect the data in one central location for clustering . 6 Pages

【预览】

附件列表
Files	Size	Format	View
RO201804100002140LZ	64KB	PDF	download


Distributed Data Clustering can be Efficient and Exact

Forman, George ; Zhang, Bin
HP Development Company
关键词: multidimensional data clustering; data mining; very large databases; parallel algorithms; distributed computing;
RP-ID : HPL-2000-158
学科分类：计算机科学（综合）
美国\|英语
来源: HP Labs
PDF


	文献评价指标
	下载次数：14次	浏览次数：44次

【 摘 要 】

【 预 览 】

【摘要】

【预览】