Journal article details
BMC Medical Research Methodology
Validation of prediction models based on lasso regression with multiply imputed data
Ronald B Geskus [1], Gerben ter Riet [2], Milo A Puhan [3], Aeilko H Zwinderman [1], Jammbe Z Musoro [1]
[1] Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, University of Amsterdam, Meibergdreef 9, 1105 Amsterdam, the Netherlands
[2] Department of General Practice, Academic Medical Center, University of Amsterdam, Meibergdreef 9, 1105 Amsterdam, the Netherlands
[3] Institute for Social and Preventive Medicine, University of Zurich, Hirschengraben 84, CH-8001 Zurich, Switzerland
Keywords: Shrinkage; Quality of life; Multiple imputation; Model validation; Clinical prediction models
DOI: 10.1186/1471-2288-14-116
Received 2014-03-13, accepted 2014-10-10, published 2014
【 Abstract 】

Background

In prognostic studies, the lasso technique is attractive since it improves the quality of predictions by shrinking regression coefficients, compared to predictions based on a model fitted via unpenalized maximum likelihood. Since some coefficients are set to zero, parsimony is achieved as well. It is unclear whether the performance of a model fitted using the lasso still shows some optimism. Bootstrap methods have been advocated to quantify optimism and generalize model performance to new subjects. It is unclear how resampling should be performed in the presence of multiply imputed data.
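For orientation, the two quantities discussed in the Background can be written out as follows. This is a standard textbook formulation, not notation taken from the paper: the symbols $\lambda$, $B$, $P$, $M_b$ and $D_b$ are assumptions introduced here, with $P(M, D)$ standing for any performance measure of model $M$ evaluated on data $D$.

```latex
% Lasso estimate for a continuous outcome: least squares plus an L1 penalty.
% Larger \lambda shrinks more and sets more coefficients exactly to zero.
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

% Bootstrap estimate of optimism: M_b is the model refitted on bootstrap
% sample D_b, and D is the original data set.
\hat{O} = \frac{1}{B}\sum_{b=1}^{B}\bigl[\,P(M_b, D_b) - P(M_b, D)\,\bigr],
\qquad
P_{\text{corrected}} = P(M, D) - \hat{O}
```

A positive $\hat{O}$ means the apparent performance overstates what can be expected in new subjects.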

Method

The data came from a cohort of patients with chronic obstructive pulmonary disease (COPD). We constructed models to predict Chronic Respiratory Questionnaire (CRQ) dyspnea 6 months ahead. Optimism of the lasso model was investigated by comparing four approaches to handling multiply imputed data in the bootstrap procedure, using the study data and simulated data sets. In the first three approaches, data sets that had been completed via multiple imputation (MI) were resampled, while the fourth approach resampled the incomplete data set and then performed MI.
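The fourth approach, resampling the incomplete data and imputing inside each bootstrap step, can be sketched roughly as below. This is an illustrative sketch under several assumptions, not the authors' implementation: the function names are hypothetical, the penalty `lam` is held fixed rather than tuned in every bootstrap step, performance is measured by R squared, and `impute` is a crude stochastic stand-in (draws from each column's observed mean and standard deviation) for a proper MI procedure.

```python
import random

def soft(z, g):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return z - g if z > g else z + g if z < -g else 0.0

def lasso_fit(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*RSS + lam*||b||_1; returns (intercept, coefs)."""
    n, p = len(X), len(X[0])
    xm = [sum(r[j] for r in X) / n for j in range(p)]
    ym = sum(y) / n
    cols = [[r[j] - xm[j] for r in X] for j in range(p)]  # centered columns
    resid = [v - ym for v in y]                           # residual at b = 0
    b = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            ss = sum(c * c for c in col) / n
            if ss == 0.0:
                continue
            rho = sum(c * (e + c * b[j]) for c, e in zip(col, resid)) / n
            new = soft(rho, lam) / ss
            if new != b[j]:
                resid = [e - c * (new - b[j]) for c, e in zip(col, resid)]
                b[j] = new
    return ym - sum(bj * m for bj, m in zip(b, xm)), b

def predict(model, X):
    b0, b = model
    return [b0 + sum(bj * xj for bj, xj in zip(b, r)) for r in X]

def r_squared(y, yhat):
    ym = sum(y) / len(y)
    sst = sum((v - ym) ** 2 for v in y)
    sse = sum((v - h) ** 2 for v, h in zip(y, yhat))
    return 1.0 - sse / sst

def impute(X, rng):
    """ASSUMPTION: crude stand-in for one MI draw; missing values are None."""
    out = [r[:] for r in X]
    for j in range(len(X[0])):
        obs = [r[j] for r in X if r[j] is not None]
        mu = sum(obs) / len(obs)
        sd = (sum((v - mu) ** 2 for v in obs) / max(len(obs) - 1, 1)) ** 0.5
        for r in out:
            if r[j] is None:
                r[j] = rng.gauss(mu, sd)
    return out

def optimism_resample_then_impute(X_miss, y, lam, n_boot, m, rng):
    """Resample the incomplete data, impute within each bootstrap step,
    and average (apparent - test) performance over steps and imputations."""
    n = len(y)
    completed = [impute(X_miss, rng) for _ in range(m)]  # original data, completed m times
    total = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]       # bootstrap rows, with replacement
        Xb, yb = [X_miss[i] for i in idx], [y[i] for i in idx]
        for X_test in completed:
            Xb_imp = impute(Xb, rng)                     # imputation happens after resampling
            model = lasso_fit(Xb_imp, yb, lam)
            apparent = r_squared(yb, predict(model, Xb_imp))
            test = r_squared(y, predict(model, X_test))
            total += apparent - test
    return total / (n_boot * m)

def make_data(n, rng, miss=0.2):
    """Simulate 3 predictors (the third is noise) and delete values at random."""
    X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    y = [1.0 * r[0] + 0.5 * r[1] + rng.gauss(0, 1) for r in X]
    return [[None if rng.random() < miss else v for v in r] for r in X], y

if __name__ == "__main__":
    rng = random.Random(0)
    X_miss, y = make_data(60, rng)
    print(optimism_resample_then_impute(X_miss, y, lam=0.1, n_boot=10, m=2, rng=rng))
```

The first three approaches would instead call `impute` once on the full data and draw bootstrap rows from the completed sets, so the imputation model never sees the resampling variability. In practice the penalty would be re-tuned (for example by cross-validation) within every bootstrap step, which this sketch omits for brevity.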

Results

The discriminative model performance of the lasso was optimistic. There was suboptimal calibration due to over-shrinkage. The estimate of optimism was sensitive to how imputed data were handled in the bootstrap resampling procedure. Resampling the completed data sets underestimated optimism, especially if, within a bootstrap step, the selected individuals differed over the imputed data sets. Incorporating the MI procedure in the validation yielded estimates of optimism that were closer to the true value, albeit slightly too large.
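Over-shrinkage is commonly diagnosed with a calibration slope, obtained by regressing observed outcomes on predictions. The sketch below illustrates that general idea for a continuous outcome; the function name and the toy numbers are assumptions for illustration and are not taken from the study.

```python
def calibration_slope(y, yhat):
    """Least-squares slope from regressing observed outcomes y on predictions yhat.

    A slope near 1 indicates good calibration. A slope above 1 means the
    predictions are too close together (over-shrinkage); a slope below 1
    means they are too extreme (overfitting).
    """
    n = len(y)
    ym = sum(y) / n
    hm = sum(yhat) / n
    sxy = sum((h - hm) * (v - ym) for h, v in zip(yhat, y))
    sxx = sum((h - hm) ** 2 for h in yhat)
    return sxy / sxx

if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0, 5.0]
    # Hypothetical predictions pulled halfway toward the mean of 3.0.
    over_shrunk = [2.0, 2.5, 3.0, 3.5, 4.0]
    print(calibration_slope(observed, over_shrunk))  # 2.0
```

Here the predictions span half the range of the observed values, so the slope is 2, the signature of coefficients that were shrunk too far.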

Conclusion

Performance of prognostic models constructed using the lasso technique can be optimistic as well. Results of the internal validation are sensitive to how bootstrap resampling is performed.

【 License 】

   
2014 Musoro et al.; licensee BioMed Central Ltd.
