Journal Article Details
Statistical Analysis and Data Mining
The next‐generation K‐means algorithm
Eugene Demidenko [1]
[1] Department of Biomedical Data Science and Department of Mathematics, Dartmouth College, Hanover, New Hampshire
Keywords: clusterwise regression; hard classification; K‐medians; maximum likelihood; multilevel data; robust clustering; SigClust
DOI: 10.1002/sam.11379
Subject classification: Social Sciences, Humanities and Arts (General)
Source: John Wiley & Sons, Inc.
【 Abstract 】

Model‐based classification is usually understood to mean the mixture distribution approach. In contrast, we revive the hard‐classification model‐based approach developed by Banfield and Raftery (1993), for which K‐means is equivalent to maximum likelihood (ML) estimation. The next‐generation K‐means algorithm does not end once the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters? How many clusters are there? What are the statistical properties of the estimated means and index sets? What is the distribution of the coefficients in clusterwise regression? How should multilevel data be classified? The statistical model‐based approach to the K‐means algorithm is the key, because it allows statistical simulations and the study of classification properties along the lines of classical statistics. This paper illustrates the application of ML classification to testing the no‐clusters hypothesis, to comparing by simulation various methods for selecting the number of clusters, to robust clustering using the Laplace distribution, to studying the properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K‐means.
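
The equivalence between K‐means and ML estimation mentioned in the abstract can be illustrated with a minimal sketch. It assumes spherical Gaussian clusters with a common, known variance, in which case the hard‐classification log‐likelihood is a decreasing linear function of the within‐cluster sum of squares, so maximizing one minimizes the other. The function names (`kmeans`, `within_ss`, `log_likelihood`), the toy data, and the fixed starting means are illustrative assumptions, not taken from the paper, whose formulation may differ in detail.

```python
import math

def sq_dist(x, m):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, m))

def within_ss(points, labels, means):
    """Total within-cluster sum of squares (the K-means criterion)."""
    return sum(sq_dist(p, means[k]) for p, k in zip(points, labels))

def log_likelihood(wss, n, d, sigma2=1.0):
    """Hard-classification log-likelihood, ASSUMING spherical Gaussian
    clusters with common known variance sigma2 (an illustrative choice)."""
    return -0.5 * n * d * math.log(2 * math.pi * sigma2) - wss / (2 * sigma2)

def kmeans(points, init_means, n_iter=100):
    """Lloyd's iterations: alternate hard assignment and mean update."""
    means = [list(m) for m in init_means]
    k = len(means)
    labels = None
    history = []  # criterion value after each completed iteration
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest current mean.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, means[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the means are too
        labels = new_labels
        # Update step: each mean becomes the centroid of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old mean if a cluster becomes empty
                means[j] = [sum(c) / len(members) for c in zip(*members)]
        history.append(within_ss(points, labels, means))
    return means, labels, history

# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, labels, history = kmeans(points, init_means=[(0, 0), (0, 1)])
loglik = [log_likelihood(w, n=len(points), d=2) for w in history]
```

In this sketch `history` never increases from one iteration to the next, so `loglik` never decreases: each Lloyd step is a coordinate‐wise ascent on the assumed likelihood. Replacing the squared distance with the absolute distance and the centroid with the coordinate‐wise median would give a K‐medians variant, which corresponds to the Laplace‐based robust clustering named in the abstract.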

【 License 】

Unknown   
