科技报告

【摘要】

Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm uses a sample from a dataset to decrease runtime by reducing the amount of data analyzed. We perform a simulation study to compare our sampling based k-means to the standard k-means algorithm by analyzing both the speed and accuracy of the two methods. Results show that our algorithm is significantly more efficient than the existing algorithm with comparable accuracy. Further work on this project might include a more comprehensive study both on more varied test datasets as well as on real weather datasets. This is especially important considering that this preliminary study was performed on rather tame datasets. Also, these datasets should analyze the performance of the algorithm on varied values of k. Lastly, this paper showed that the algorithm was accurate for relatively low sample sizes. We would like to analyze this further to see how accurate the algorithm is for even lower sample sizes. We could find the lowest sample sizes, by manipulating width and confidence level, for which the algorithm would be acceptably accurate. In order for our algorithm to be a success, it needs to meet two benchmarks: match the accuracy of the standard k-means algorithm and significantly reduce runtime. Both goals are accomplished for all six datasets analyzed. However, on datasets of three and four dimension, as the data becomes more difficult to cluster, both algorithms fail to obtain the correct classifications on some trials. Nevertheless, our algorithm consistently matches the performance of the standard algorithm while becoming remarkably more efficient with time. Therefore, we conclude that analysts can use our algorithm, expecting accurate results in considerably less time.

【预览】

附件列表
Files	Size	Format	View
RO201704210001602LZ	138KB	PDF	download


Sampling Within k-Means Algorithm to Cluster Large Datasets

Bejarano, Jeremy¹ Bose, Koushiki² Brannan, Tyler³ Thomas, Anita⁴ Adragni, Kofi⁵ Neerchal, Nagaraj⁶
[1] Brigham Young University;Brown University;North Carolina State University;Illinois Institute of Technology;University of Maryland;University of Ma
关键词: ACCURACY; ALGORITHMS; BENCHMARKS; PERFORMANCE; SAMPLING; SIMULATION; TOLERANCE; VELOCITY; WEATHER;
DOI : 10.2172/1025410 RP-ID : ORNL/TM-2011/394 PID : OSTI ID: 1025410 Others : Other: KJ0502000 Others : ERKJZN1 Others : HPCF 2011 12 Others : TRN: US201120%%760
学科分类：社会科学、人文和艺术（综合）
美国\|英语
来源: SciTech Connect
PDF


	文献评价指标
	下载次数：10次	浏览次数：16次

【 摘 要 】

【 预 览 】

【摘要】

【预览】