BioData Mining
Risk estimation using probability machines
Abhijit Dasgupta [4], Silke Szymczak [5], Jason H Moore [2], Joan E Bailey-Wilson [3], James D Malley [1]
[1] Mathematical and Statistical Computing Laboratory, Center for Information Technology, National Institutes of Health, Bldg 12A, Room 2039, Bethesda, MD 20892-5620, USA
[2] Department of Genetics, Dartmouth College, HB 7937, Dartmouth-Hitchcock Medical Center, One Medical Center Drive, Lebanon, NH 03756, USA
[3] Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive, Suite 1200, Baltimore, MD 21224, USA
[4] Clinical Trials and Outcomes Branch, National Institute of Arthritis, Musculoskeletal and Skin Diseases, National Institutes of Health, Room 4-1350, Bldg 10 CRC, 10 Center Drive, Bethesda, MD 20892-1468, USA
[5] Current address: Institute of Clinical Molecular Biology, Christian-Albrechts-University Kiel, Am Botanischen Garten 11, 24118 Kiel, Germany
Keywords: Interactions; Counterfactuals; Odds ratio; Probability machine; Logistic regression; Consistent nonparametric regression
DOI: 10.1186/1756-0381-7-2
Received: 2013-06-20; Accepted: 2014-02-19; Published: 2014
Abstract
Background
Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features. It is widely used to obtain the conditional probabilities of the outcome given predictors, as well as predictor effect size estimates using conditional odds ratios.
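As a concrete point of reference for this background, the sketch below simulates data from a logistic model and recovers the conditional odds ratio as exp(β̂). It is an illustrative reconstruction in plain NumPy, not code from the paper; the data-generating coefficients (`true_beta = 0.7`, intercept −0.5) and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
true_beta = 0.7                     # true log odds ratio per unit of x (arbitrary)
p = 1 / (1 + np.exp(-(-0.5 + true_beta * x)))
y = rng.binomial(1, p)

# Fit logistic regression by Newton-Raphson (Fisher scoring)
# on the parameter vector [intercept, slope].
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(X @ beta)))   # fitted P(Y = 1 | x)
    W = mu * (1 - mu)                    # IRLS weights
    grad = X.T @ (y - mu)
    hess = X.T @ (X * W[:, None])
    beta += np.linalg.solve(hess, grad)

odds_ratio = np.exp(beta[1])  # conditional odds ratio for a unit increase in x
print(beta, odds_ratio)
```

With a correctly specified model and n = 5000, the fitted odds ratio lands close to exp(0.7) ≈ 2.0, which is the benchmark against which the learning-machine estimates below are compared.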
Results
We show how statistical learning machines for binary outcomes, provably consistent for the nonparametric regression problem, can be used to provide both consistent conditional probability estimation and conditional effect size estimates. Effect size estimates from learning machines leverage our understanding of counterfactual arguments central to the interpretation of such estimates. We show that, if the data generating model is logistic, we can recover accurate probability predictions and effect size estimates with nearly the same efficiency as a correct logistic model, both for main effects and interactions. We also propose a method using learning machines to scan for possible interaction effects quickly and efficiently. Simulations using random forest probability machines are presented.
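The probability-machine idea above can be sketched as follows: a regression forest is fit to the 0/1 outcome, so its predictions estimate P(Y = 1 | x) directly, and a counterfactual effect size is obtained by predicting with the exposure forced to 1 and then to 0 for every subject. This is a hypothetical illustration using scikit-learn's `RandomForestRegressor`; the simulated logistic model, sample size, and forest settings are our own choices, not the paper's simulation design.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 4000
x1 = rng.binomial(1, 0.5, size=n)      # binary exposure of interest
x2 = rng.normal(size=n)                # continuous covariate
logit = -0.5 + 0.9 * x1 + 0.6 * x2     # illustrative logistic data-generating model
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Probability machine: a regression (not classification) forest on the
# 0/1 outcome, whose predictions estimate P(Y = 1 | x1, x2).
pm = RandomForestRegressor(n_estimators=200, min_samples_leaf=25, random_state=0)
pm.fit(np.column_stack([x1, x2]), y)

# Counterfactual effect size: predict with x1 set to 1, then to 0,
# for every subject, holding x2 at its observed value.
p1 = np.clip(pm.predict(np.column_stack([np.ones(n), x2])), 1e-6, 1 - 1e-6)
p0 = np.clip(pm.predict(np.column_stack([np.zeros(n), x2])), 1e-6, 1 - 1e-6)

risk_diff = (p1 - p0).mean()                                   # average risk difference
odds_ratio = np.median((p1 / (1 - p1)) / (p0 / (1 - p0)))      # per-subject conditional OR
print(risk_diff, odds_ratio)
```

Because the forest makes no structural assumptions, the same two lines of counterfactual prediction also yield effect sizes within subgroups (e.g. restricted to strata of x2), which is the basis of the interaction scan proposed above.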
Conclusions
The models we propose make no assumptions about the data-generating structure: they capture patterns in the data given only the set of predictors, without specifying a particular model form. They therefore avoid the risks of model mis-specification, and the resulting estimation biases, that a logistic model runs. This methodology, which we call a “risk machine”, inherits its properties from the statistical learning machine it is derived from.
License
2014 Dasgupta et al.; licensee BioMed Central Ltd.