Model averaging has been proposed as an alternative to model selection, intended to overcome the underestimation of standard errors that is a consequence of model selection. Model selection and model averaging become more complicated in the presence of missing data. Three different model selection approaches (RR, STACK and M-STACK) and model averaging using three model-building strategies (non-overlapping variable sets, inclusive and restrictive strategies) were explored to combine results from multiply-imputed data sets, using a Monte Carlo simulation study on some simple linear and generalized linear models. Imputation was carried out using chained equations (via the "norm" method in the R package MICE). The simulation results showed that the STACK method performs better than RR and M-STACK in terms of model selection and prediction, whereas model averaging performs slightly better than STACK in terms of prediction. The inclusive and restrictive strategies perform better in terms of prediction, but the non-overlapping variable sets strategy performs better for model selection. STACK and model averaging using all three model-building strategies were proposed to combine the results from a multiply-imputed data set from the Gateshead Millennium Study (GMS). The performance of STACK and model averaging was compared using mean square error of prediction (MSE(P)) in a 10% cross-validation test. The results showed that STACK using an inclusive strategy provided better prediction than model averaging. This agrees with the results obtained from a simulation study designed to mimic the GMS data. In addition, the inclusive strategy for building imputation and prediction models was better than the non-overlapping variable sets and restrictive strategies. The presence of highly correlated covariates and response is believed to have led to better prediction in this particular context. Model averaging using non-overlapping variable sets performs better only if an auxiliary variable is available. However, STACK using an inclusive strategy performs well when there is no auxiliary variable available. Therefore, it is advisable to use STACK with an inclusive model-building strategy and highly correlated covariates (where available) to make predictions in the presence of missing data. Alternatively, model averaging with non-overlapping variable sets can be used if an auxiliary variable is available.
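As an illustration of what combining results from multiply-imputed data sets involves, the sketch below pools one coefficient across m imputed data sets using Rubin's rules. It rests on an assumption: the abstract does not expand "RR", and it is taken here to denote Rubin's rules. The function name `pool_rubin` and the example numbers are hypothetical and are not taken from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    estimates: the m point estimates, one per imputed data set
    variances: the m squared standard errors, one per imputed data set
    Assumption: "RR" in the abstract is read as Rubin's rules.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates of one regression coefficient from m = 3 imputations.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
pooled_se = total_var ** 0.5
```

The between-imputation term is what carries the uncertainty due to the missing values into the pooled standard error; a single completed data set analysed on its own would omit it.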
Model selection and model averaging in the presence of missing values