Journal Article Details
BMC Bioinformatics
Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms
Research Article
Yu Guo [1], Raji Balasubramanian [2], Armin Graber [3], Robert N McBurney [4]
[1] BG Medicine, Inc., 610 Lincoln St., 02451, Waltham, MA, USA
[2] Division of Biostatistics and Epidemiology, University of Massachusetts - Amherst, 715 North Pleasant Street, 01003, Amherst, MA, USA
[3] Institute for Bioinformatics and Translational Research, UMIT, Eduard Wallnoefer Zentrum 1, 6060, Hall in Tyrol, Austria
[4] Optimal Medicine Ltd., Warwick Enterprise Park, CV35 9EF, Wellesbourne, Warwick, UK
Keywords: Support Vector Machine; Random Forest; Simulated Dataset; Average Classification Accuracy; Recursive Feature Elimination
DOI: 10.1186/1471-2105-11-447
Received 2010-02-18, accepted 2010-09-03, published 2010
Source: Springer
【 Abstract 】

Background: Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques.

Results: The analysis of data from seven 'omics' studies revealed that the average magnitude of effect size observed in human studies was markedly lower when compared to that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper.

Conclusion: No single classifier had optimal performance under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.
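
To make the idea of a simulation-based accuracy estimate concrete, the following is a minimal Python sketch and not the MVpower library or its API. It assumes Gaussian class-conditional features with unit variance, a small block of informative features shifted by a fixed effect size, and a plain nearest-centroid classifier used as a simplified stand-in for PAM (no centroid shrinkage). All function names, parameter names and default values here are hypothetical and chosen only for illustration.

```python
import random
import statistics


def simulate_dataset(n_per_class, n_features, n_informative, effect_size, rng):
    """Draw unit-variance Gaussian features for two classes.

    The first n_informative features have mean effect_size in class 1;
    every other feature (and all of class 0) has mean 0.
    """
    samples, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class[label]):
            row = [
                rng.gauss(effect_size if (label == 1 and j < n_informative) else 0.0, 1.0)
                for j in range(n_features)
            ]
            samples.append(row)
            labels.append(label)
    return samples, labels


def fit_centroids(samples, labels):
    """Compute the per-class mean vector (no shrinkage, unlike PAM)."""
    centroids = {}
    for label in (0, 1):
        rows = [s for s, y in zip(samples, labels) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def average_accuracy(n_per_class, effect_size, n_features=200,
                     n_informative=10, n_reps=20, n_test=50, seed=0):
    """Average test accuracy over n_reps simulated training sets.

    n_per_class is a (class 0, class 1) pair of training sizes, so class
    imbalance can be explored; the test set is balanced with n_test per class.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_reps):
        train_x, train_y = simulate_dataset(
            n_per_class, n_features, n_informative, effect_size, rng)
        test_x, test_y = simulate_dataset(
            (n_test, n_test), n_features, n_informative, effect_size, rng)
        centroids = fit_centroids(train_x, train_y)
        correct = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
        accuracies.append(correct / len(test_y))
    return statistics.mean(accuracies)


if __name__ == "__main__":
    # Compare a null setting with a setting that carries signal.
    for effect in (0.0, 0.5, 1.5):
        print(effect, average_accuracy((10, 10), effect_size=effect))
```

Calling `average_accuracy` over a grid of training sizes and effect sizes gives a rough picture of how many subjects a study might need to reach a target accuracy under these assumptions. Skewed feature distributions, other classifiers or other performance metrics would need their own data generator and scoring function.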

【 License 】

© Guo et al; licensee BioMed Central Ltd. 2010. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
