Journal Article Details
BMC Medical Research Methodology
A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization
Anne-Laure Boulesteix [1], Thomas Stadler [4], Rory Wilson [1], Caroline Truntzer [2], Christoph Bernau [3], Roman Hornung [1]
[1] Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich D-81377, Germany
[2] Clinical and Innovation Proteomic Platform, Pôle de Recherche Université de Bourgogne, 15 Bd Maréchal de Lattre de Tassigny, Dijon F-21000, France
[3] Leibniz Supercomputing Center, Boltzmannstr. 1, Garching D-85748, Germany
[4] Department of Urology, University of Munich, Marchioninistr. 15, Munich D-81377, Germany
Keywords: Supervised learning; Practical guidelines; Over-optimism; Error estimation; Cross-validation
DOI  :  10.1186/s12874-015-0088-9
Received: 2015-06-24; Accepted: 2015-10-19; Published: 2015
【 Abstract 】

Background

In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset in its entirety before training/test-set-based prediction error estimation by cross-validation (CV), an approach referred to as "incomplete CV". Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis (PCA) for dimension reduction of the covariate space. Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance and imputation of missing values.
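The distinction can be made concrete with a small sketch. The following Python snippet (a toy illustration using scikit-learn and synthetic data, not the paper's actual analysis) computes a CV error estimate once with PCA fit on the entire dataset before CV ("incomplete CV") and once with PCA re-estimated within each training fold ("full CV").

```python
# Toy illustration (not the paper's analysis): contrast "incomplete" CV,
# where PCA is fit once on the whole dataset before CV, with "full" CV,
# where PCA is re-fit on the training portion of every split.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2000))      # high-dimensional covariates (synthetic)
y = rng.integers(0, 2, size=100)      # binary outcome

cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Incomplete CV: the dimension-reduction step has already seen the test observations.
X_reduced = PCA(n_components=10).fit_transform(X)
err_incomplete = 1 - cross_val_score(LogisticRegression(max_iter=1000),
                                     X_reduced, y, cv=cv).mean()

# Full CV: PCA is part of the pipeline and is re-estimated in each CV split.
pipe = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
err_full = 1 - cross_val_score(pipe, X, y, cv=cv).mean()

print(f"incomplete CV error: {err_incomplete:.3f}")
print(f"full CV error:       {err_full:.3f}")
```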

Methods

We devise the easily interpretable and general measure CVIIM (“CV Incompleteness Impact Measure”) to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA.
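As a rough sketch of how such a measure can be computed from the two error estimates, the function below assumes a ratio-based form, 1 - e_incomplete / e_full truncated at zero; the exact definition of CVIIM is given in the paper, so this is only an illustrative approximation.

```python
# Illustrative sketch only: an assumed ratio-based impact measure computed from
# the full-CV and incomplete-CV error estimates (see the paper for the exact
# definition of CVIIM).
def cv_incompleteness_impact(e_incomplete: float, e_full: float) -> float:
    """Relative optimistic bias of incomplete CV: 0 means no bias, values
    near 1 mean the incomplete-CV error is far below the full-CV error."""
    if e_full <= 0 or e_incomplete >= e_full:
        return 0.0
    return 1.0 - e_incomplete / e_full

# Example: full CV estimates a 30% error, incomplete CV only 21%.
print(round(cv_incompleteness_impact(0.21, 0.30), 3))  # 0.3
```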

Results

Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings.

Conclusions

While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.

【 License 】

   
2015 Hornung et al.

Attachments
File     Size    Format
Fig. 1   60KB    Image
Fig. 2   72KB    Image
Fig. 3   106KB   Image
Fig. 4   156KB   Image
Fig. 5   113KB   Image

【 References 】
  • [1]Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst. 2003; 95:14-8.
  • [2]Daumer M, Held U, Ickstadt K, Heinz M, Schach S, Ebers G. Reducing the probability of false positive research findings by pre-publication validation – experience with a large multiple sclerosis database. BMC Med Res Methodol. 2008; 8:18.
  • [3]Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA. 2002; 99:6562-6.
  • [4]Wood IA, Visscher PM, Mengersen KL. Classification based upon gene expression data: bias and precision of error rates. Bioinformatics. 2007; 23:1363-70.
  • [5]Zhu JX, McLachlan GJ, Jones LB-T, Wood IA. On selection biases with prediction rules formed from gene expression data. J Stat Plann Inference. 2008; 138:374-86.
  • [6]Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics. 2006; 7:91.
  • [7]Bernau C, Augustin T, Boulesteix AL. Correcting the optimal resampling-based error rate by estimating the error rate of wrapper algorithms. Biometrics. 2013; 69:693-702.
  • [8]Boulesteix AL, Strobl C. Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction. BMC Med Res Methodol. 2009; 9:85.
  • [9]Westerhuis JA, Hoefsloot HCJ, Smit S, Vis DJ, Smilde AK, van Velzen EJJ, van Duijnhoven JPM, van Dorsten FA. Assessment of PLSDA cross validation. Metabolomics. 2008; 4:81-9.
  • [10]Hastie T, Tibshirani R, Friedman J. The Elements of statistical learning: data mining, inference and prediction. Springer, New York; 2009.
  • [11]Zhu X, Ambroise C, McLachlan GJ. Selection bias in working with the top genes in supervised classification of tissue samples. Stat Methodol. 2006; 3:29-41.
  • [12]Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U et al.. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003; 4:249-64.
  • [13]Kolesnikov N, Hastings E, Keays M, Melnichuk O, Tang YA, Williams E, et al. ArrayExpress update – simplifying data submissions. Nucleic Acids Res. 2015. doi:10.1093/nar/gku1057.
  • [14]Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets – update. Nucleic Acids Res. 2013; 41:991-5.
  • [15]Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C et al.. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002; 1:203-9.
  • [16]De Bin R, Herold T, Boulesteix AL. Added predictive value of omics data: specific issues related to validation illustrated by two case studies. BMC Med Res Methodol. 2014; 14:117.
  • [17]Kostka D, Spang R. Microarray based diagnosis profits from better documentation of gene expression signatures. PLoS Comput Biol. 2008; 4:22.
  • [18]Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002; 18:96-104.
  • [19]Huber W. Introduction to robust calibration and variance stabilisation with VSN. Vignette. 2014. http://www.bioconductor.org/packages/release/bioc/vignettes/vsn/inst/doc/vsn.pdf/. Accessed 13 Feb 2015.
  • [20]Dai JJ, Lieu L, Rocke D. Dimension reduction for classification with gene expression microarray data. Stat Appl Genet Mol Biol. 2006; 5:6.
  • [21]Boulesteix AL, Hable R, Lauer S, Eugster MJE. A statistical framework for hypothesis testing in real data comparison studies. Am Stat. 2015; 69:201-212.
  • [22]Boulesteix AL. On representative and illustrative comparisons with real data in bioinformatics: response to the letter to the editor by Smith et al. Bioinformatics. 2013; 29:2664-6.
  • [23]Bengio Y, Grandvalet Y. No unbiased estimator of the variance of k-fold cross-validation. J Mach Learn Res. 2004; 5:1089-105.
  • [24]Bernau C, Riester M, Boulesteix AL, Parmigiani G, Huttenhower C, Waldron L et al.. Cross-study validation for the assessment of prediction algorithms. Bioinformatics. 2014; 30:105-12.
  • [25]Simon R. When is a genomic classifier ready for prime time? Nat Clin Pract Oncol. 2004; 1:4-5.
  • [26]Collins GS, de Groot JA, Dutton S, Omar O, Shanyinde M, Tajar A, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014; 14:40.