Sampling Within k-Means Algorithm to Cluster Large Datasets
Bejarano, Jeremy; Bose, Koushiki; Brannan, Tyler; Thomas, Anita; Adragni, Kofi; Neerchal, Nagaraj; Ostrouchov, George
Oak Ridge National Laboratory
Keywords: 99 General And Miscellaneous//Mathematics, Computing, And Information Science; Accuracy; Sampling; Algorithms; Benchmarks
DOI: 10.2172/1025410 | Report ID: ORNL/TM-2011/394 | Report ID: DE-AC05-00OR22725 | Report ID: 1025410
USA | English
Source: UNT Digital Library
【 Abstract 】
Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm decreases runtime by clustering a sample drawn from the dataset, reducing the amount of data analyzed. We perform a simulation study comparing our sampling-based k-means to the standard k-means algorithm in both speed and accuracy. Results show that our algorithm is significantly more efficient than the existing algorithm, with comparable accuracy.

Further work on this project might include a more comprehensive study, both on more varied test datasets and on real weather datasets. This is especially important because this preliminary study was performed on rather tame datasets. Future studies should also analyze the algorithm's performance for varied values of k. Lastly, this paper showed that the algorithm remains accurate at relatively low sample sizes. We would like to analyze this further to see how accurate the algorithm is at even lower sample sizes: by manipulating the width and confidence level, we could find the lowest sample sizes for which the algorithm remains acceptably accurate.

For our algorithm to be a success, it must meet two benchmarks: match the accuracy of the standard k-means algorithm and significantly reduce runtime. Both goals are accomplished for all six datasets analyzed. However, on datasets of three and four dimensions, as the data becomes more difficult to cluster, both algorithms fail to obtain the correct classifications on some trials. Nevertheless, our algorithm consistently matches the accuracy of the standard algorithm while running in considerably less time. We therefore conclude that analysts can use our algorithm and expect accurate results in considerably less time.
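The core idea described above — cluster only a uniform random sample, then make a single pass over the full dataset to assign every point to the sample-derived centroids — can be sketched as follows. This is a minimal illustration, not the report's exact procedure: the function names, the farthest-point initialization, and the fixed `sample_frac` parameter are our assumptions (the report instead derives the sample size from a chosen confidence width and level).

```python
import numpy as np

def _farthest_point_init(X, k, rng):
    """Pick one random centroid, then repeatedly take the point
    farthest from all chosen centroids (a simple way to spread seeds)."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(
            X[:, None, :] - np.asarray(centroids)[None, :, :], axis=2), axis=1)
        centroids.append(X[int(d.argmax())])
    return np.asarray(centroids)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = _farthest_point_init(X, k, rng)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster empties.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

def sampled_kmeans(X, k, sample_frac=0.1, seed=0):
    """Cluster a uniform random sample, then make one pass over the
    full dataset to assign each point to a sample-derived centroid."""
    rng = np.random.default_rng(seed)
    m = max(k, int(sample_frac * len(X)))
    sample = X[rng.choice(len(X), size=m, replace=False)]
    centroids, _ = kmeans(sample, k, seed=seed)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)
```

The speedup comes from running the iterative Lloyd's loop on only `m` points; the full dataset is touched just once, in the final assignment step.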
【 Preview 】
Files | Size | Format | View
---|---|---|---
1025410.pdf | 138KB | PDF | download