Journal of Data Science
Variable Selection with Scalable Bootstrapping in Generalized Linear Model for Massive Data
Article
Zhang Zhang [1], Zhibing He [2], Yichen Qin [3], Ye Shen [4], Ben-Chang Shia [5], Yang Li [1]
[1] Center for Applied Statistics and School of Statistics, Renmin University of China
[2] School of Mathematical and Statistical Sciences, Arizona State University
[3] Department of Operations
[4] College of Public Health, University of Georgia
[5] Graduate Institute of Business Administration and College of Management, Fu Jen Catholic University
[6] RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China
Keywords: distributed computing; large-scale dataset; scalable bootstrap; variable selection
DOI: 10.6339/22-JDS1052
Subject classification: Civil and Structural Engineering
Source: JDS
Abstract
Bootstrapping is commonly used as a tool for non-parametric statistical inference to assess the quality of estimators in variable selection models. For a massive dataset, however, the computational cost of bootstrapping in variable selection models (BootVS) can be prohibitive. In this study, we propose a novel framework, the bag of little bootstraps variable selection (BLBVS) method with a ridge hybrid procedure, to assess the quality of estimators in generalized linear models with a regularized term, such as lasso and group lasso penalties. The proposed method can be easily and naturally implemented with distributed computing, and thus offers significant computational advantages for massive datasets. Simulation results show that BLBVS performs well in both accuracy and efficiency compared with BootVS. Real data analyses, including regression on a bike sharing dataset and classification on a lending club dataset, are presented to illustrate the computational superiority of BLBVS on large-scale datasets.
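The BLBVS idea sketched in the abstract — draw a few small subsamples of size b = n^γ, reweight each with multinomial counts that sum to the full sample size n, run a penalized fit on each reweighted subsample, and average the variable-selection indicators — can be illustrated as follows. This is a minimal numpy sketch for a linear model with a hand-rolled weighted lasso, not the authors' implementation; the function names, the `lam_scale` tuning, and the choice of s, γ, and B are all illustrative assumptions.

```python
import numpy as np

def weighted_lasso_cd(X, y, w, lam, n_iter=60):
    """Weighted lasso via coordinate descent:
    minimize 0.5 * sum_i w_i (y_i - x_i'beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta  # running residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]            # partial residual excluding j
            rho = np.dot(w * X[:, j], r)
            z = np.dot(w, X[:, j] ** 2)
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
            r -= X[:, j] * beta[j]            # restore j's contribution
    return beta

def blbvs(X, y, s=4, gamma=0.7, B=20, lam_scale=0.2, rng=None):
    """Bag of little bootstraps variable selection (illustrative sketch):
    s subsamples of size b = n**gamma; within each, B multinomial
    reweightings mimic full-size-n bootstrap samples; the selection
    frequency of each variable is averaged across subsamples."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    b = int(n ** gamma)
    freq = np.zeros(p)
    for _ in range(s):
        idx = rng.choice(n, size=b, replace=False)   # little subsample
        Xs, ys = X[idx], y[idx]
        sel = np.zeros(p)
        for _ in range(B):
            # bootstrap weights: b cells with counts summing to n
            w = rng.multinomial(n, np.full(b, 1.0 / b)).astype(float)
            beta = weighted_lasso_cd(Xs, ys, w, lam=lam_scale * n)
            sel += np.abs(beta) > 1e-8
        freq += sel / B
    return freq / s

# demo: 2000 observations, 8 covariates, only x0 and x2 truly active
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 8))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(2000)
freq = blbvs(X, y, rng=rng)
print(np.round(freq, 2))  # active variables should have frequency near 1
```

Because each subsample is processed independently, the outer loop over subsamples is the unit that would be farmed out to worker nodes in the distributed setting the abstract describes; only the p-vector of selection frequencies needs to be communicated back.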
License: CC BY
Preview
File: RO202307150000502ZK.pdf (794 KB, PDF)