| BMC Bioinformatics | |
| McTwo: a two-step feature selection algorithm based on maximal information coefficient | |
| Methodology Article | |
| Guoqing Wang1  Dongli Ma2  Fengfeng Zhou3  Guoqin Mai3  Youxi Luo4  Manli Zhou5  Qinghan Meng5  Ruiquan Ge5  | |
| [1] Department of Pathogenobiology, Basic Medical College of Jilin University, Changchun, Jilin, China;Shenzhen Children’s Hospital, 518026, Shenzhen, Guangdong, P.R. China;Shenzhen Institutes of Advanced Technology, and Key Lab for Health Informatics, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, 518055, Shenzhen, Guangdong, P.R. China;Shenzhen Institutes of Advanced Technology, and Key Lab for Health Informatics, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, 518055, Shenzhen, Guangdong, P.R. China;School of Science, Hubei University of Technology, 430068, Wuhan, Hubei, P.R. China;Shenzhen Institutes of Advanced Technology, and Key Lab for Health Informatics, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, 518055, Shenzhen, Guangdong, P.R. China;Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, 518055, Shenzhen, Guangdong, P.R. China; | |
| Keywords: Maximal information coefficient (MIC); Heuristic algorithm; Feature selection; Filter algorithm; Wrapper algorithm; | |
| DOI : 10.1186/s12859-016-0990-0 | |
| Received 2015-12-02, accepted 2016-03-14, published 2016 | |
| Source: Springer | |
【 Abstract 】
Background: High-throughput bio-OMIC technologies are producing high-dimensional data from bio-samples at an ever-increasing rate, whereas the number of training samples in a traditional experiment remains small due to various difficulties. This “large p, small n” paradigm in the area of biomedical “big data” may be at least partly solved by feature selection algorithms, which select only features significantly associated with phenotypes. Feature selection is an NP-hard problem. Because the time required to find the globally optimal solution grows exponentially, all existing feature selection algorithms employ heuristic rules to find locally optimal solutions, and their solutions perform differently on different datasets.

Results: This work describes a feature selection algorithm based on a recently published correlation measurement, the Maximal Information Coefficient (MIC). The proposed algorithm, McTwo, aims to select features that are associated with phenotypes, are independent of each other, and achieve high classification performance with the nearest neighbor algorithm. In a comparative study on 17 datasets, McTwo performs about as well as or better than existing algorithms, with significantly reduced numbers of selected features. According to the literature, the features selected by McTwo also appear to have particular biomedical relevance to the phenotypes.

Conclusion: McTwo selects a feature subset with very good classification performance and a small number of features. McTwo may therefore serve as a complementary feature selection algorithm for high-dimensional biomedical datasets.
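The abstract names three goals (features associated with the phenotype, independent of each other, and good nearest neighbor performance) but does not spell out the individual steps. The following is therefore only a minimal, generic filter-then-wrapper sketch of that idea, not the McTwo implementation. Everything in it is an assumption made for illustration: the function names, the two thresholds, the toy data, and in particular the use of absolute Pearson correlation as a simple stand-in for MIC.

```python
import math


def abs_pearson(x, y):
    """Placeholder association score in [0, 1]. ASSUMPTION: stands in for MIC."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def filter_step(X, y, relevance_min=0.3, redundancy_max=0.9, score=abs_pearson):
    """Filter: keep label-associated features that are not redundant with kept ones.

    Both thresholds are hypothetical values chosen for this toy example.
    """
    columns = [list(col) for col in zip(*X)]
    # Visit features from most to least associated with the label.
    ranked = sorted(range(len(columns)),
                    key=lambda j: score(columns[j], y), reverse=True)
    kept = []
    for j in ranked:
        if score(columns[j], y) < relevance_min:
            break  # all remaining features are even less relevant
        if all(score(columns[j], columns[k]) < redundancy_max for k in kept):
            kept.append(j)
    return kept


def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, row in enumerate(X):
        best_dist, best_label = None, None
        for k, other in enumerate(X):
            if k == i:
                continue
            dist = sum((row[j] - other[j]) ** 2 for j in features)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def wrapper_step(X, y, candidates):
    """Wrapper: greedy forward selection, stopping when accuracy stops improving."""
    selected, best_acc = [], 0.0
    remaining = list(candidates)
    while remaining:
        best_j, best_j_acc = None, best_acc
        for j in remaining:
            acc = loo_1nn_accuracy(X, y, selected + [j])
            if acc > best_j_acc:
                best_j, best_j_acc = j, acc
        if best_j is None:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_acc = best_j_acc
    return selected, best_acc


# Toy data: feature 0 is informative, feature 1 nearly duplicates it,
# feature 2 is unrelated to the label.
X = [
    [0.1, 0.3, 5.0],
    [0.2, 0.3, 1.0],
    [0.3, 0.7, 4.0],
    [0.9, 1.7, 2.0],
    [1.0, 2.1, 5.5],
    [1.1, 2.1, 1.5],
]
y = [0, 0, 0, 1, 1, 1]

candidates = filter_step(X, y)
selected, accuracy = wrapper_step(X, y, candidates)
print(candidates, selected, accuracy)
```

The `score` argument of `filter_step` is kept pluggable so that an actual MIC implementation could be substituted for the placeholder; the greedy wrapper is one common heuristic and returns a locally optimal subset, in line with the abstract's remark that globally optimal selection is intractable.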
【 License 】
CC BY
© Ge et al. 2016
【 Preview 】
| Files | Size | Format | View |
|---|---|---|---|
| RO202311105822486ZK.pdf | 1984KB | PDF | |