Journal article details
BMC Bioinformatics
Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data
Harald Binder [1], Isabell Hoffmann [1], Murat Sariyar [2]
[1]Institute of Medical Biostatistics, Epidemiology and Informatics, Medical Center of the Johannes Gutenberg University, Mainz 55131, Germany
[2]Institute of Pathology, Charite – University Medicine Berlin, Campus Benjamin Franklin, Berlin 12200, Germany
Keywords: Time to event settings; Random forest; Prediction error curves; Model complexity; Model selection; High-dimensional data; Boosting
Others  :  1087610
DOI  :  10.1186/1471-2105-15-58
Received 2013-10-22, accepted 2014-01-28, published 2014
【 Abstract 】

Background

Molecular data, e.g. arising from microarray technology, are often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. A major challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which checking all possible interactions is computationally too expensive.

Results

We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is detecting interactions composed of variables that do not themselves have main effects, but our findings are also promising in this regard. Results on real-world data illustrate that effect sizes of interactions are frequently not large enough to improve prediction performance, even though the interactions are potentially of biological relevance.
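
The abstract does not say how the orthogonalization step is carried out. As a minimal sketch of one common reading, assuming it means removing from a candidate interaction term whatever is already linearly explained by its two constituent variables, the idea could look as follows. The function name `orthogonalized_interaction` and the simulated covariates are hypothetical and are not taken from the paper.

```python
import numpy as np


def orthogonalized_interaction(x_j, x_k):
    """Return x_j * x_k with its least-squares projection onto
    [1, x_j, x_k] removed (illustrative assumption, not the paper's code)."""
    x_j = np.asarray(x_j, dtype=float)
    x_k = np.asarray(x_k, dtype=float)
    product = x_j * x_k
    # Design matrix with intercept and the two main-effect columns.
    design = np.column_stack([np.ones_like(x_j), x_j, x_k])
    coef, *_ = np.linalg.lstsq(design, product, rcond=None)
    # The residual is orthogonal to every column of the design matrix.
    return product - design @ coef


# Simulated covariates standing in for two expression features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
z = orthogonalized_interaction(x1, x2)
```

Under this assumption, the residual `z` carries only the part of the product term that the main effects cannot explain, so a later selection step would not credit the interaction for signal that belongs to `x1` or `x2` alone.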

Conclusion

Screening interactions through random forests is feasible and useful when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.

【 License 】

   
2014 Sariyar et al.; licensee BioMed Central Ltd.
