Journal Article Details
BMC Bioinformatics
An experimental study of the intrinsic stability of random forest variable importance measures
Research Article
Fan Yang1  Huazhen Wang2  Zhiyuan Luo3 
[1] Automation Department, Xiamen University, Siming South Road, 361005, Xiamen, China; College of Computer Science and Technology, Huaqiao University, Jimei Avenue, 361021, Xiamen, China; Computer Learning Research Centre, Royal Holloway, University of London, Egham, TW20 0EX, Surrey, UK; Computer Learning Research Centre, Royal Holloway, University of London, Egham, TW20 0EX, Surrey, UK
Keywords: Random forest; Variable importance measure; Stability; Feature selection
DOI: 10.1186/s12859-016-0900-5
Received 2015-06-16, accepted 2015-12-15, published 2016
Source: Springer
【 Abstract 】

Background: The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite extensive attention to traditional stability under data perturbations or parameter variations, few studies consider the influence of the intrinsic randomness in generating VIMs, i.e., bagging, randomization and permutation. To address this influence, we introduce a new concept, the intrinsic stability of VIMs, defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations or parameter variations. Two widely used VIMs, Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets, to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests, we comprehensively investigate how different factors influence the intrinsic stability.

Results: The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional, small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (sample size) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both a negative monotonic correlation and a negative linear correlation with the intrinsic stability, while OOB accuracy has a monotonic correlation with intrinsic stability. This indicates that high-dimensional, small-sample and high-complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to the parameter settings of random forest, a large number of trees is preferred. No significant correlations are seen between intrinsic stability and the other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability.

Conclusion: First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs comes not only from data perturbations or parameter variations but also from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users can be more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high-complexity datasets.
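
The abstract defines intrinsic stability as the self-consistency among feature rankings obtained from repeated runs on unchanged data and parameters. As a minimal sketch of that idea, the Python below scores a set of repeated runs by their mean pairwise Spearman rank correlation. This choice of similarity measure is an assumption made for illustration, not necessarily the measure used in the paper, and the function names (`ranks`, `pearson`, `spearman`, `intrinsic_stability`) and the toy importance vectors are hypothetical.

```python
from itertools import combinations


def ranks(scores):
    """Return average ranks (1 = smallest score); tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j over the block of scores tied with position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def intrinsic_stability(runs):
    """Mean pairwise Spearman correlation over importance vectors from repeated runs.

    Assumption: each element of `runs` is one importance vector (e.g. MDA or MDG
    scores) produced by refitting on the same data with the same parameters,
    with only the random seed changed. At least two runs are required.
    """
    pairs = list(combinations(runs, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical importance scores for four features across three repeated runs.
    toy_runs = [
        [0.40, 0.30, 0.20, 0.10],
        [0.38, 0.33, 0.12, 0.17],
        [0.41, 0.25, 0.22, 0.12],
    ]
    print(intrinsic_stability(toy_runs))
```

A value near 1 would indicate that repeated runs rank the features almost identically, while lower values indicate that the ranking changes from run to run even though the data and parameters do not.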

【 License 】

CC BY   
© Wang et al. 2016
