BMC Medical Research Methodology
On the assessment of the added value of new predictive biomarkers
Nicholas Petrick [1], Berkman Sahiner [1], Le Kang [1], Brandon D Gallas [1], Frank W Samuelson [1], Weijie Chen [1]
[1] Division of Imaging and Applied Mathematics, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993, USA
Keywords: Area under the ROC curve; Classification; Biomarkers
DOI: 10.1186/1471-2288-13-98
Received 2013-02-09, accepted 2013-07-24, published 2013
【 Abstract 】
Background
The surge in biomarker development calls for research on statistical evaluation methodology to rigorously assess emerging biomarkers and classification models. Recently, several authors reported the puzzling observation that, in assessing the added value of new biomarkers to existing ones in a logistic regression model, statistical significance of new predictor variables does not necessarily translate into a statistically significant increase in the area under the ROC curve (AUC). Vickers et al. concluded that this inconsistency is because AUC “has vastly inferior statistical properties,” i.e., it is extremely conservative. This statement is based on simulations that misuse the DeLong et al. method. Our purpose is to provide a fair comparison of the likelihood ratio (LR) test and the Wald test versus diagnostic accuracy (AUC) tests.
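For readers unfamiliar with the quantity being compared, the following is a minimal sketch of the empirical AUC, computed as the fraction of (non-diseased, diseased) score pairs that a model ranks correctly, with ties counted as one half. The function name `empirical_auc` and the risk scores are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def empirical_auc(scores_neg, scores_pos):
    """Empirical AUC (Mann-Whitney form): share of (negative, positive)
    pairs in which the positive case receives the higher score.
    Tied pairs contribute 1/2."""
    total = 0.0
    for s_neg in scores_neg:
        for s_pos in scores_pos:
            if s_pos > s_neg:
                total += 1.0
            elif s_pos == s_neg:
                total += 0.5
    return total / (len(scores_neg) * len(scores_pos))


# Hypothetical predicted risks from a fitted model (illustrative values only).
neg_scores = [0.10, 0.40, 0.35]   # subjects without the condition
pos_scores = [0.30, 0.80, 0.60]   # subjects with the condition

auc = empirical_auc(neg_scores, pos_scores)
print(round(auc, 4))  # 7 of 9 pairs are ranked correctly
```

Assessing the added value of a new biomarker in this framework amounts to asking whether the AUC of the model that includes the new marker exceeds the AUC of the model without it.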
Discussion
We present a test to compare ideal AUCs of nested linear discriminant functions via an F test. We compare it with the LR test and the Wald test for the logistic regression model. The null hypotheses of these three tests are equivalent; however, the F test is an exact test whereas the LR test and the Wald test are asymptotic tests. Our simulation shows that the F test maintains the nominal type I error rate even with a small sample size. Our results also indicate that the LR test and the Wald test have inflated type I error rates when the sample size is small, while their type I error rates converge to the nominal value asymptotically with increasing sample size, as expected. We further show that the DeLong et al. method tests a different hypothesis and maintains the nominal type I error rate when it is used within its designed scope. Finally, we summarize the pros and cons of all four methods we consider in this paper.
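To make the equivalence of the null hypotheses concrete, here is a minimal sketch at the population level. It assumes two multivariate normal classes with a common covariance matrix, for which the ideal linear discriminant has AUC equal to Φ(Δ/√2), where Δ² is the squared Mahalanobis distance between the class means. The function names and parameter values are hypothetical, and this is not the F test itself, only an illustration of the quantity it compares.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mahalanobis_sq_2d(delta, var1, var2, rho):
    """delta' Sigma^{-1} delta for a 2x2 covariance matrix with
    variances var1, var2 and correlation rho (closed-form inverse)."""
    d1, d2 = delta
    cov = rho * math.sqrt(var1 * var2)
    det = var1 * var2 - cov * cov
    return (var2 * d1 * d1 - 2.0 * cov * d1 * d2 + var1 * d2 * d2) / det


def ideal_auc(mahal_sq):
    """Ideal AUC of the best linear discriminant under the assumed
    equal-covariance normal model: Phi(sqrt(Delta^2 / 2))."""
    return normal_cdf(math.sqrt(mahal_sq / 2.0))


# Existing marker alone: mean difference 1, variance 1, so Delta^2 = 1.
auc_old = ideal_auc(1.0)

# New marker correlated with the old one (rho = 0.5) whose mean difference
# (0.5) is fully explained by that correlation: its discriminant
# coefficient is zero, so Delta^2 and the ideal AUC do not change.
auc_null = ideal_auc(mahalanobis_sq_2d((1.0, 0.5), 1.0, 1.0, 0.5))

# New marker carrying additional information: Delta^2 and the AUC increase.
auc_alt = ideal_auc(mahalanobis_sq_2d((1.0, 1.0), 1.0, 1.0, 0.5))

print(auc_old, auc_null, auc_alt)
```

Under these assumptions, a zero coefficient for the new marker and a zero increase in ideal AUC describe the same situation, which is why tests on the coefficient and a test on the ideal AUC can share a null hypothesis.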
Summary
We show that there is nothing inherently less powerful or disagreeable about ROC analysis for showing the usefulness of new biomarkers or characterizing the performance of classification models. Each statistical method for assessing biomarkers and classification models has its own strengths and weaknesses. Investigators need to choose methods based on the assessment purpose, the biomarker development phase at which the assessment is being performed, the available patient data, and the validity of assumptions behind the methodologies.
【 License 】
2013 Chen et al.; licensee BioMed Central Ltd.
【 Figures and Tables 】
Figure 1.
【 References 】
- [1]Begg C, Vickers A: One statistical test is sufficient for assessing new predictive markers. BMC Med Res Methodol 2011, 11(13):1-7.
- [2]Demler OV, Pencina MJ, D’Agostino R: Misuse of DeLong test to compare AUCs for nested models. Stat Med 2012, 31:2577-2587.
- [3]DeLong ER, DeLong DM, Clarke-Pearson DL: Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 1988, 44(3):837-845.
- [4]Efron B, Tibshirani R: Improvements on cross-validation: the .632+ bootstrap method. J Am Stat Assoc 1997, 92(438):548-560.
- [5]Hosmer DW, Lemeshow S: Applied Logistic Regression. 2nd edition. New York, NY: John Wiley & Sons; 2004.
- [6]Su JQ, Liu JS: Linear combinations of multiple diagnostic markers. J Am Stat Assoc 1993, 88(424):1350-1355.
- [7]Rao CR: Tests of significance in multivariate analysis. Biometrika 1948, 35:58-79.
- [8]Demler OV, Pencina MJ, D’Agostino R: Equivalence of improvement in area under ROC curve and linear discriminant analysis coefficient under assumption of normality. Stat Med 2011, 30:1410-1418.
- [9]Chen W, Gallas BD, Yousef WA: Classifier variability: accounting for training and testing. Pattern Recognit 2012, 45(7):2661-2671.
- [10]Efron B: The efficiency of logistic regression compared to normal discriminant analysis. J Am Stat Assoc 1975, 70(352):892-898.
- [11]Hoeffding W: A class of statistics with asymptotically normal distribution. Ann Math Stat 1948, 19(3):293-325.
- [12]Pepe MS: Testing for improvement in prediction model performance. Stat Med 2013, 32:1467-1482.
- [13]Kerr KF, McClelland RL, Brown ER, Lumley T: Evaluating the incremental value of new biomarkers with integrated discrimination improvement. Am J Epidemiol 2011, 174:364-374.
- [14]Chen W, Wagner RF, Yousef WA, Gallas BD: Comparison of classifier performance estimators: a simulation study. 2009.
- [15]Sahiner B, Chan HP, Hadjiiski L: Classifier performance prediction for computer-aided diagnosis using a limited dataset. Med Phys 2008, 35(4):1559.
- [16]Pepe MS, Etzioni R, Feng Z, Potter JD, Thompson ML, Thornquist M, Winget M, Yasui Y: Phases of biomarker development for early detection of cancer. J Natl Cancer Inst 2001, 93(14):1054-1061.
- [17]Baker SG, Kramer BS, McIntosh M, Patterson BH, Shyr Y, Skates S: Evaluating markers for the early detection of cancer: overview of study designs and methods. Clin Trials 2006, 3:43-56.
- [18]Pepe MS, Feng Z, Janes H, Bossuyt PM, Potter JD: Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. J Natl Cancer Inst 2008, 100(20):1432-1438.
- [19]Sen PK: On some convergence properties of U-statistics. Calcutta Stat Assoc Bull 1960, 10:1-18.