Journal article

【Abstract】

Background: Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides. This makes inference challenging: distance metrics are often not well defined on categorical data; running times for computations on high dimensional data can be considerable; and the Curse of Dimensionality often impedes interpretation of the results. To date, the literature and software addressing clustering for categorical data have not led to a standard approach.

Results: We present software for an ensemble method that performs well in comparison with other methods regardless of the dimensionality of the data. In an ensemble method, a variety of instantiations of a statistical object are found and then combined into a consensus value. It has been known for decades that, in many settings, an ensemble generally outperforms the components that comprise it. Here, we apply this ensembling principle to clustering.

We begin by generating many hierarchical clusterings with different clustering sizes. When the dimension of the data is high, we also randomly select subspaces, likewise of variable size, to generate clusterings. Then, we combine these clusterings into a single membership matrix and use this to obtain a new, ensembled dissimilarity matrix using Hamming distance.

Conclusions: Ensemble clustering, as implemented in R and called EnsCat, gives more clearly separated clusters than other clustering techniques for categorical data. The latest version with manual and examples is available at https://github.com/jlp2duke/EnsCat.
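The ensembling procedure outlined in the abstract can be illustrated with a minimal sketch. This is not the EnsCat package or its API (EnsCat is written in R); it is a pure-Python illustration under stated assumptions. The function names (`hamming`, `average_linkage`, `pairwise`, `ensemble_dissimilarity`) and all parameters are hypothetical. Average linkage is assumed as the hierarchical method, the number of clusters is cycled deterministically rather than drawn at random, and every run uses a random subspace containing between half and all of the positions, whereas the abstract applies subspace selection only when the dimension is high.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise(rows):
    """Symmetric matrix of Hamming distances between all pairs of rows."""
    n = len(rows)
    return [[hamming(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def average_linkage(dist, k):
    """Agglomerative clustering with average linkage, stopped at k clusters.

    dist is a symmetric n-by-n list of lists; returns a list of n labels.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def ensemble_dissimilarity(data, n_runs=30, k_values=(2, 3, 4), seed=0):
    """Build a membership matrix from many clusterings, then compare its rows.

    Assumptions (not taken from the source): k cycles through k_values, and
    each run uses a random subspace of between p // 2 and p positions.
    """
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    membership = [[] for _ in range(n)]  # one row per observation
    for run in range(n_runs):
        k = min(k_values[run % len(k_values)], n)
        size = rng.randint(max(1, p // 2), p)
        cols = rng.sample(range(p), size)
        sub = [[row[c] for c in cols] for row in data]
        labels = average_linkage(pairwise(sub), k)
        for i in range(n):
            membership[i].append(labels[i])
    # Ensembled dissimilarity: number of runs in which two observations
    # were assigned to different clusters.
    return pairwise(membership)


# Toy data: two groups of short nucleotide sequences.
seqs = ["AAAAAAAA", "AAAAAAAT", "AAAAAATA",
        "CCCCGGGG", "CCCCGGGT", "CCCCGGTG"]
D = ensemble_dissimilarity(seqs)
final = average_linkage(D, 2)
print(final)
```

Because Hamming distance on the membership matrix compares labels only within the same run, cluster labels do not need to be aligned across runs. The final call simply reclusters the ensembled dissimilarity matrix; any method that accepts a dissimilarity matrix could be used in its place.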

【License】

CC BY
© The Author(s) 2016

BMC Bioinformatics
EnsCat: clustering of categorical data via ensembling
Software
Saeid Amiri¹ Bertrand S. Clarke² Jennifer L. Clarke³
[1] Department of Natural and Applied Sciences, University of Wisconsin Madison, Iowa City, IA, USA; Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA; Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA; Department of Food Science and Technology, University of Nebraska-Lincoln, Lincoln, NE, USA
Keywords: Categorical data; Clustering; Ensembling methods; High dimensional data
DOI : 10.1186/s12859-016-1245-9
Received 2016-05-21, accepted 2016-09-08, published 2016
Source: Springer