科技报告详细信息
Sampling Within k-Means Algorithm to Cluster Large Datasets
Bejarano, Jeremy1  Bose, Koushiki2  Brannan, Tyler3  Thomas, Anita4  Adragni, Kofi5  Neerchal, Nagaraj6 
[1] Brigham Young University;Brown University;North Carolina State University;Illinois Institute of Technology;University of Maryland;University of Ma
关键词: ACCURACY;    ALGORITHMS;    BENCHMARKS;    PERFORMANCE;    SAMPLING;    SIMULATION;    TOLERANCE;    VELOCITY;    WEATHER;   
DOI  :  10.2172/1025410
RP-ID  :  ORNL/TM-2011/394
PID  :  OSTI ID: 1025410
Others  :  Other: KJ0502000
Others  :  ERKJZN1
Others  :  HPCF 2011 12
Others  :  TRN: US201120%%760
学科分类:社会科学、人文和艺术(综合)
美国|英语
来源: SciTech Connect
PDF
【 摘 要 】

Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm uses a sample from a dataset to decrease runtime by reducing the amount of data analyzed. We perform a simulation study to compare our sampling based k-means to the standard k-means algorithm by analyzing both the speed and accuracy of the two methods. Results show that our algorithm is significantly more efficient than the existing algorithm with comparable accuracy. Further work on this project might include a more comprehensive study both on more varied test datasets as well as on real weather datasets. This is especially important considering that this preliminary study was performed on rather tame datasets. Also, these datasets should analyze the performance of the algorithm on varied values of k. Lastly, this paper showed that the algorithm was accurate for relatively low sample sizes. We would like to analyze this further to see how accurate the algorithm is for even lower sample sizes. We could find the lowest sample sizes, by manipulating width and confidence level, for which the algorithm would be acceptably accurate. In order for our algorithm to be a success, it needs to meet two benchmarks: match the accuracy of the standard k-means algorithm and significantly reduce runtime. Both goals are accomplished for all six datasets analyzed. However, on datasets of three and four dimension, as the data becomes more difficult to cluster, both algorithms fail to obtain the correct classifications on some trials. Nevertheless, our algorithm consistently matches the performance of the standard algorithm while becoming remarkably more efficient with time. Therefore, we conclude that analysts can use our algorithm, expecting accurate results in considerably less time.

【 预 览 】
附件列表
Files Size Format View
RO201704210001602LZ 138KB PDF download
  文献评价指标  
  下载次数:10次 浏览次数:16次