Technical Report

【Abstract】

As data collection increases at an accelerating rate with advances in computer and networking technology, analyzing the data (data mining) becomes very important. Data clustering is one of the basic tools widely used as a component in many data mining solutions. Even though many data clustering algorithms have been developed in the last few decades, they face new challenges when confronted with huge data sets. Algorithms with quadratic (or higher order) computational complexity, like agglomerative algorithms, drop out very quickly. More efficient algorithms like K-Means and EM, which have linear cost per iteration, also need to be scaled up before they can be applied to very large data sets. This paper shows that many parameter estimation algorithms, including clustering algorithms like K-Means, K-Harmonic Means and EM, have an intrinsic parallel structure. Many workstations over a LAN, or a multiple-processor computer, can be used efficiently to run this class of algorithms in parallel. With 60 workstations running in parallel (on a fast LAN), clustering 28.8 GBytes of 40-dimensional data into 100 clusters, the utilization of the computing units is above 80%. 23 pages.
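
The parallel structure mentioned in the abstract can be illustrated for K-Means with a minimal sketch. It assumes one common decomposition, which this record does not confirm as the report's exact method: each data partition computes per-cluster coordinate sums and counts against the current centers, and only those small summaries are merged to update the centers. The names `nearest`, `partial_stats` and `kmeans_step` are hypothetical, and a thread pool stands in for the workstations on a LAN.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))


def partial_stats(chunk, centers):
    """Per-cluster coordinate sums and counts for one data partition."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in chunk:
        k = nearest(point, centers)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    return sums, counts


def kmeans_step(chunks, centers, workers=2):
    """One K-Means iteration: compute partial_stats per chunk, then merge."""
    # Assumption: threads stand in for separate machines holding each chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda ch: partial_stats(ch, centers), chunks))
    dim = len(centers[0])
    new_centers = []
    for k, old in enumerate(centers):
        total = sum(counts[k] for _, counts in results)
        if total == 0:
            new_centers.append(list(old))  # keep the center of an empty cluster
            continue
        new_centers.append([sum(sums[k][d] for sums, _ in results) / total
                            for d in range(dim)])
    return new_centers
```

Because the merge step only adds sums and counts, splitting the data into more partitions does not change the updated centers; only the per-partition summaries, not the raw data, need to cross the network in each iteration.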

Tech Report: HPL-2000-6: Scale Up Center-Based Data

Zhang, Bin ; Hsu, Meichun
HP Development Company
Keywords: parallel algorithms; data mining; data clustering; K-Means; K-Harmonic Means; Expectation Maximization
RP-ID: HPL-2000-6
Subject classification: Computer Science (General)
United States | English
Source: HP Labs


