BMC Medical Research Methodology
Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections
Fabrice Carrat [1], Yohann Mansiaux [2]
[1] Public Health Unit, Saint-Antoine Hospital, 75012 Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1136, Institut Pierre Louis d’Epidémiologie et de Santé Publique, F-75013 Paris, France
Keywords: Influenza; Logistic regression; LASSO; Boosted regression trees; Random forest; Data mining
DOI: 10.1186/1471-2288-14-99
Received 2014-04-24, accepted 2014-08-14, published 2014
【 Abstract 】
Background
The use of big data is growing steadily in epidemiology. We explored the performance of methods dedicated to big data analysis in detecting independent associations between exposures and a health outcome.
Methods
We searched for associations between 303 covariates and influenza infection in 498 subjects (14% infected) sampled from a dedicated cohort. Independent associations were detected using four approaches: two data mining methods, Random Forests (RF) and Boosted Regression Trees (BRT); the conventional logistic regression framework (Univariate Followed by Multivariate Logistic Regression, UFMLR); and the Least Absolute Shrinkage and Selection Operator (LASSO), a penalized multivariate logistic regression that achieves a sparse selection of covariates. We developed permutation tests to assess the statistical significance of associations. We simulated 500 similar-sized datasets to estimate the true positive rate (TPR) and false positive rate (FPR) associated with these methods.
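The permutation principle mentioned in the Methods can be illustrated with a minimal Python sketch. The abstract does not state the test statistic, the number of permutations, or how each method's measure of association was used, so everything below is an assumption made for illustration: the statistic (absolute difference in covariate means between infected and non-infected subjects), the function names `abs_mean_difference` and `permutation_p_value`, and the toy data. It is not the authors' implementation.

```python
import random


def abs_mean_difference(x, y):
    """Hypothetical association statistic: absolute difference between the
    mean of covariate x in infected (y == 1) and non-infected (y == 0) subjects."""
    cases = [xi for xi, yi in zip(x, y) if yi == 1]
    controls = [xi for xi, yi in zip(x, y) if yi == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(x, y, statistic=abs_mean_difference, n_perm=999, seed=0):
    """Permutation p-value: shuffle the outcome to break any association,
    recompute the statistic, and count how often it reaches the observed value."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    y_perm = list(y)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # the outcome is permuted, the covariate is left untouched
        if statistic(x, y_perm) >= observed:
            n_extreme += 1
    # The +1 terms count the observed data as one of the permutations,
    # so the p-value can never be exactly zero.
    return (n_extreme + 1) / (n_perm + 1)


# Toy data (assumed): 10 infected and 40 non-infected subjects.
y = [1] * 10 + [0] * 40
x_associated = [5.0 + 0.1 * i for i in range(10)] + [0.1 * i for i in range(40)]
x_constant = [1.0] * 50

p_associated = permutation_p_value(x_associated, y)
p_constant = permutation_p_value(x_constant, y)
```

In this toy setting the strongly associated covariate receives a small p-value, whereas a covariate that carries no information receives a p-value of 1. In a setting like the one described here, the statistic would presumably be replaced by each method's own measure of association, such as a variable importance score or a regression coefficient.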
Results
Depending on the method, between 3 and 24 covariates (1%-8%) were identified as associated with influenza infection. The pre-seasonal haemagglutination inhibition antibody titer was the only covariate selected by all methods, while 266 (87%) covariates were not selected by any method. At the 5% nominal significance level, the TPR was 85% with RF, 80% with BRT, 26% to 49% with UFMLR, and 71% to 78% with LASSO. The FPR was 4% with RF and BRT, 9% to 2% with UFMLR, and 9% to 4% with LASSO.
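To make the TPR and FPR figures concrete, the following Python sketch shows one way such rates could be computed from simulated datasets in which the truly associated covariates are known. The abstract does not say how the rates were aggregated across the 500 simulations, so pooling the counts over datasets, the function name `selection_rates`, and the toy selections are all assumptions.

```python
def selection_rates(selected_per_dataset, true_covariates, n_covariates):
    """Return (TPR, FPR) pooled over simulated datasets.

    selected_per_dataset: one collection of selected covariate indices per dataset
    true_covariates: indices of covariates truly associated with the outcome
    n_covariates: total number of candidate covariates
    """
    true_set = set(true_covariates)
    n_true = len(true_set)
    n_null = n_covariates - n_true
    true_positives = 0
    false_positives = 0
    for selected in selected_per_dataset:
        chosen = set(selected)
        true_positives += len(chosen & true_set)   # truly associated and selected
        false_positives += len(chosen - true_set)  # selected but not associated
    n_datasets = len(selected_per_dataset)
    tpr = true_positives / (n_true * n_datasets)
    fpr = false_positives / (n_null * n_datasets)
    return tpr, fpr


# Toy example (assumed): 10 candidate covariates, covariates 0 and 1 truly
# associated, and the selections made by one method on three simulated datasets.
tpr, fpr = selection_rates([{0, 1, 5}, {0}, {1, 7, 8}], {0, 1}, 10)
```

Here 4 of the 6 possible true selections are made (TPR of about 67%) and 3 of the 24 possible false selections are made (FPR of 12.5%).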
Conclusions
Data mining methods and LASSO should be considered valuable tools for detecting independent associations in large epidemiologic datasets.
【 License 】
2014 Mansiaux and Carrat; licensee BioMed Central Ltd.