科技报告

【摘要】

Data clustering is one of the fundamental techniques in scientific data analysis and data mining. It partitions a data set into groups of similar items, as measured by some distance metric. Over the years, data set sizes have grown rapidly with the exponential growth of computer storage and increasingly automated business and manufacturing processes. To cluster such large data sets, efficient parallel algorithms are called for, both to reduce the computation time, and to bring the resources of multiple machines to bear on a given large problem in order to scale up the largest problem size one can handle. We describe a technique for parallelizing center-based data clustering algorithms. The central idea is to communicate only sufficient statistics, yielding linear speed-up with excellent efficiency. The technique does not involve approximation and may be used orthogonally in conjunction with sampling or aggregation-based methods, such as BIRCH, to lessen the quality degradation of their approximation or to handle larger data sets. We have demonstrated elsewhere the decomposition and theoretical speed-up behaviors for three clustering algorithms: K-Means, K-Harmonic Means and Expectation Maximization (EM) [Zhang, Hsu, Forman '00]. Here we present experimental measurements of their parallel behaviors, e.g., 93% speed-up efficiency for EM on a million data points spread over 128 computers connected via 100baseT Ethernet. 6 Pages

【预览】

附件列表
Files	Size	Format	View
RO201804100002252LZ	75KB	PDF	download


Linear Speed-Up for a Parallel Non-Approximate Recasting of Center-Based Clustering

Forman, George ; Zhang, Bin
HP Development Company
关键词: Expectation-Maximization; multidimensional data clustering; data mining; very large databases; parallel algorithms; scale-up;
RP-ID : HPL-2000-93
学科分类：计算机科学（综合）
美国\|英语
来源: HP Labs
PDF


	文献评价指标
	下载次数：28次	浏览次数：40次

【 摘 要 】

【 预 览 】

【摘要】

【预览】