Journal Article

【Abstract】

Background
Successfully modeling high-dimensional data involving thousands of variables is challenging. This is especially true for gene expression profiling experiments, given the large number of genes involved and the small number of samples available. Random Forests (RF) is a popular and widely used approach to feature selection for such "small n, large p problems." However, Random Forests suffers from instability, especially in the presence of noisy and/or unbalanced inputs.

Results
We present RKNN-FS, an innovative feature selection procedure for "small n, large p problems." RKNN-FS is based on Random KNN (RKNN), a novel generalization of traditional nearest-neighbor modeling. RKNN consists of an ensemble of base k-nearest neighbor models, each constructed from a random subset of the input variables. To rank the importance of the variables, we define a criterion on the RKNN framework, using the notion of support. A two-stage backward model selection method is then developed based on this criterion. Empirical results on microarray data sets with thousands of variables and relatively few samples show that RKNN-FS is an effective feature selection approach for high-dimensional data. RKNN is similar to Random Forests in terms of classification accuracy without feature selection. However, RKNN provides much better classification accuracy than RF when each method incorporates a feature-selection step. Our results show that RKNN is significantly more stable and more robust than Random Forests for feature selection when the input data are noisy and/or unbalanced. Further, RKNN-FS is much faster than the Random Forests feature selection method (RF-FS), especially for large-scale problems involving thousands of variables and multiple classes.

Conclusions
Given the superiority of Random KNN in classification performance when compared with Random Forests, RKNN-FS's simplicity and ease of implementation, and its superiority in speed and stability, we propose RKNN-FS as a faster and more stable alternative to Random Forests in classification problems involving feature selection for high-dimensional datasets.
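
The ensemble idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not define "support", so the sketch assumes one plausible reading: a feature's support is the mean accuracy of the base KNN classifiers that happened to include it. The function names (`knn_predict`, `loo_accuracy`, `rknn_support`), the parameters `r` (number of base classifiers), `m` (features per classifier), and `k`, the use of leave-one-out accuracy, and the toy data are all assumptions made for illustration. The two-stage backward selection step is not shown.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k, features):
    """Majority vote among the k nearest neighbors of x, using squared
    Euclidean distance restricted to the given feature subset."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k, features):
    """Leave-one-out accuracy of one base KNN on one feature subset."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k, features) == y[i]
    return hits / len(X)


def rknn_support(X, y, r=200, m=2, k=3, seed=0):
    """Assumed support criterion: for each feature, the mean accuracy of
    the base classifiers whose random subset contained that feature."""
    rng = random.Random(seed)
    p = len(X[0])
    acc = [[] for _ in range(p)]
    for _ in range(r):
        features = rng.sample(range(p), m)  # random subset of input variables
        a = loo_accuracy(X, y, k, features)
        for f in features:
            acc[f].append(a)
    return [sum(a) / len(a) if a else 0.0 for a in acc]


# Toy data (hypothetical): feature 0 separates the two classes,
# features 1 to 5 are pure noise.
rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [
    [rng.gauss(4.0 * label, 0.5)] + [rng.gauss(0.0, 1.0) for _ in range(5)]
    for label in y
]

support = rknn_support(X, y)
ranking = sorted(range(len(support)), key=lambda j: support[j], reverse=True)
print(ranking)  # feature indices, highest assumed support first
```

Under these assumptions, a backward selection procedure would repeatedly drop the lowest-ranked features and recompute the support on the remaining ones.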

【License】

CC BY
© Li et al; licensee BioMed Central Ltd. 2011

BMC Bioinformatics
Random KNN feature selection - a fast and stable alternative to Random Forests
Methodology Article
E James Harner¹ Shengqiao Li² Donald A Adjeroh³
[1] The Department of Statistics, West Virginia University, 26506, Morgantown, WV, USA; [2] Health Effects Laboratory Division, the National Institute for Occupational Safety and Health, 26505, Morgantown, WV, USA; [3] The Lane Department of Computer Science and Electrical Engineering, West Virginia University, 26506, Morgantown, WV, USA
Keywords: Feature Selection; Classification Accuracy; Random Forest; Base Classifier; High Dimensional Dataset
DOI: 10.1186/1471-2105-12-450
Received 2011-01-31, accepted 2011-11-18, published 2011
Source: Springer