Technical Report Details
Tech Report: HPL-2000-6: Scale Up Center-Based Data
Zhang, Bin ; Hsu, Meichun
HP Development Company
Keywords: parallel algorithms; data mining; data clustering; K-Means; K-Harmonic Means; Expectation Maximization
RP-ID  :  HPL-2000-6
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【 Abstract 】

As data collection increases at an accelerating rate with advances in computer and networking technology, analyzing the data (data mining) becomes very important. Data clustering is one of the basic tools widely used as a component in many data mining solutions. Even though many data clustering algorithms have been developed in the last few decades, they face new challenges when confronted with huge data sets. Algorithms with quadratic (or higher order) computational complexity, like agglomerative algorithms, drop out very quickly. More efficient algorithms like K-Means and EM, which have linear cost per iteration, also need to be scaled up before they can be applied to very large data sets. This paper shows that many parameter estimation algorithms, including clustering algorithms like K-Means, K-Harmonic Means and EM, have an intrinsic parallel structure. Many workstations over a LAN, or a multiple-processor computer, can be used efficiently to run this class of algorithms in parallel. With 60 workstations running in parallel (on a fast LAN), clustering 28.8 GBytes of 40-dimensional data into 100 clusters, the utilization of the computing units is above 80%.

23 Pages
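
The parallel structure the abstract refers to can be illustrated with K-Means: each iteration only needs, per cluster, the sum of the assigned points and their count, and those quantities can be computed independently on each data partition and then added together. The following is a minimal sketch under stated assumptions, not the report's implementation: the function names (`partial_stats`, `merge_and_update`, `parallel_kmeans`) are hypothetical, the data partitions are plain Python lists, and a thread pool stands in for the workstations on a LAN.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk, centers):
    """Compute per-cluster coordinate sums and point counts for one data partition."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for x in chunk:
        # Assign the point to its nearest center (squared Euclidean distance).
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    return sums, counts


def merge_and_update(stats, centers):
    """Add up the partial statistics from all partitions and recompute the centers."""
    k, dim = len(centers), len(centers[0])
    new_centers = []
    for j in range(k):
        n = sum(counts[j] for _, counts in stats)
        if n == 0:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(list(centers[j]))
        else:
            new_centers.append(
                [sum(sums[j][d] for sums, _ in stats) / n for d in range(dim)])
    return new_centers


def parallel_kmeans(chunks, centers, iterations=10):
    """Run K-Means with one worker per partition (threads stand in for machines)."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        for _ in range(iterations):
            # Each worker sees only its own partition plus the current centers.
            stats = list(pool.map(lambda ch: partial_stats(ch, centers), chunks))
            centers = merge_and_update(stats, centers)
    return centers
```

In this sketch only the small per-cluster statistics cross partition boundaries in each iteration, never the data points themselves, which is why the communication cost stays low relative to the per-partition computation. Calling `parallel_kmeans(chunks, initial_centers)` with the same points split across one partition or several should give the same centers, up to floating-point summation order.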
