| NEUROCOMPUTING | Volume: 463 |
| An LSH-based k-representatives clustering method for large categorical data | |
| Article | |
| Mau, Toan Nguyen [1]; Huynh, Van-Nam [1] | |
| [1] Japan Adv Inst Sci & Technol, Sch Adv Sci & Technol, Nomi, Ishikawa, Japan | |
| Keywords: Categorical data; Clustering; Dissimilarity measure; k-Means like algorithm; Locality-Sensitive Hashing | |
| DOI : 10.1016/j.neucom.2021.08.050 | |
| Source: Elsevier | |
【 Abstract 】
Clustering categorical data remains a challenging problem in the era of big data, for two reasons: dissimilarity between categorical values is hard to measure meaningfully, and the high computational complexity of existing clustering algorithms makes them difficult to apply in practical big data mining applications. In this paper, we propose an integrated approach that incorporates the Locality-Sensitive Hashing (LSH) technique into k-means-like clustering so that it can predict better initial clusters and thereby boost clustering effectiveness. To this end, we first use a data-driven dissimilarity measure for categorical data to construct a family of binary hash functions, which are then used to generate the initial clusters. We also propose using a nearest neighbor search at each iteration when reassigning data objects to clusters, which reduces the computational cost of clustering. These solutions are incorporated into the k-representatives algorithm, resulting in the LSH-k-representatives algorithm. Extensive experiments on multiple real-world and synthetic datasets demonstrate the effectiveness of the proposed method: the new algorithm yields clustering results comparable to or better than those of closely related existing methods, while being 2x to 32x more efficient. © 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
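To make the initialization idea concrete, the following is a minimal sketch of bucketing categorical objects with binary hash functions and seeding cluster representatives from the largest buckets. It is not the authors' implementation. The hash family here is a random stand-in (each bit splits one attribute's values into two random groups), whereas the abstract describes a family derived from a data-driven dissimilarity measure. The frequency-based representative and dissimilarity, and all function names (`make_hash`, `lsh_init`, `assign`), are assumptions made for illustration. The `assign` step is a plain exhaustive scan rather than the nearest neighbor search mentioned above.

```python
import random
from collections import Counter, defaultdict


def make_hash(data, n_bits, rng):
    """Build n_bits binary hash functions (hypothetical random family).

    Each function picks one attribute and marks a random half of its
    observed values as bit 1; every other value maps to bit 0.
    """
    n_attrs = len(data[0])
    funcs = []
    for _ in range(n_bits):
        attr = rng.randrange(n_attrs)
        values = sorted({row[attr] for row in data})
        rng.shuffle(values)
        ones = set(values[: len(values) // 2])
        funcs.append((attr, ones))
    return funcs


def hash_key(row, funcs):
    """Concatenate the bits of all hash functions into one bucket key."""
    return tuple(int(row[attr] in ones) for attr, ones in funcs)


def representative(rows):
    """Per-attribute relative frequencies of the category values in rows."""
    n = len(rows)
    return [{v: c / n for v, c in Counter(col).items()} for col in zip(*rows)]


def dissimilarity(row, rep):
    """Assumed measure: sum over attributes of 1 - frequency of the row's value."""
    return sum(1.0 - freqs.get(v, 0.0) for v, freqs in zip(row, rep))


def lsh_init(data, k, n_bits=3, seed=0):
    """Seed up to k representatives from the k largest hash buckets.

    Fewer than k representatives are returned if fewer buckets are occupied.
    """
    rng = random.Random(seed)
    funcs = make_hash(data, n_bits, rng)
    buckets = defaultdict(list)
    for row in data:
        buckets[hash_key(row, funcs)].append(row)
    largest = sorted(buckets.values(), key=len, reverse=True)[:k]
    return [representative(bucket) for bucket in largest]


def assign(data, reps):
    """Assign each object to its least dissimilar representative (exhaustive scan)."""
    return [
        min(range(len(reps)), key=lambda j: dissimilarity(row, reps[j]))
        for row in data
    ]
```

A full k-means-like loop would alternate `assign` with recomputing each `representative` from its members until the assignments stop changing. The sketch covers only the seeding step and one assignment pass.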
【 License 】
Free
【 Preview 】
| Files | Size | Format | View |
|---|---|---|---|
| 10_1016_j_neucom_2021_08_050.pdf | 1468KB | PDF | |