Journal Article Details
BMC Bioinformatics
Multi-model inference using mixed effects from a linear regression based genetic algorithm
Koen Van der Borght [2], Geert Verbeke [3], Herman van Vlijmen [1]
[1] Janssen Infectious Diseases-Diagnostics BVBA, B-2340 Beerse, Belgium
[2] Interuniversity Institute for Biostatistics and statistical Bioinformatics, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium
[3] Interuniversity Institute for Biostatistics and statistical Bioinformatics, Universiteit Hasselt, B-3590 Diepenbeek, Belgium
Keywords: Multi-model inference; Mixed-effects model; Genetic algorithm; Linear regression; Variable selection
DOI  :  10.1186/1471-2105-15-88
Received: 2013-10-11; Accepted: 2014-03-21; Published: 2014
Abstract

Background

Several high-dimensional regression methodologies exist for selecting variables to predict a continuous outcome. To improve variable selection when clustered observations are present in the training data, an extension towards mixed-effects modeling (MM) is required, but such an extension may not always be straightforward to implement.

In this article, we developed such an MM extension (GA-MM-MMI) for automated variable selection by a linear-regression-based genetic algorithm (GA) using multi-model inference (MMI). We exemplify our approach by training a linear regression model to predict resistance to the integrase inhibitor Raltegravir (RAL) on a genotype-phenotype database, with many integrase mutations as candidate covariates. The genotype-phenotype pairs in this database were derived from a limited number of subjects, with multiple data points per subject and an intra-class correlation of 0.92.
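The intra-class correlation quoted above can be estimated from a one-way random-effects ANOVA on the clustered phenotype values. A minimal sketch, assuming a continuous phenotype and subject identifiers as clusters (the function name and toy data are illustrative; the abstract does not state which ICC estimator the authors used):

```python
import numpy as np

def icc_oneway(values, groups):
    """Intra-class correlation from a one-way random-effects ANOVA.

    values: 1-D sequence of phenotype measurements
    groups: same-length sequence of subject identifiers (the clusters)
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    k, n = len(labels), len(values)
    grand = values.mean()
    # Between- and within-subject sums of squares
    ss_between = sum(len(values[groups == g]) * (values[groups == g].mean() - grand) ** 2
                     for g in labels)
    ss_within = sum(((values[groups == g] - values[groups == g].mean()) ** 2).sum()
                    for g in labels)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    # Effective (average) cluster size; exact for balanced designs
    n0 = (n - sum(len(values[groups == g]) ** 2 for g in labels) / n) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)
```

With strongly clustered data (small within-subject spread, large between-subject spread) the estimate approaches 1, consistent with the 0.92 reported for this database.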

Results

In generating the RAL model, we took computational efficiency into account by optimizing the GA parameters one by one and by using tournament selection. To derive the main GA parameters we used three repeats of 5-fold cross-validation. The number of integrase mutations used as covariates in the mixed-effects models was 25 (chrom.size), and a GA solution was accepted when R2MM > 0.95 (goal.fitness). We tested three different MMI approaches to combine the results of 100 GA solutions into one GA-MM-MMI model. When evaluating GA-MM-MMI performance on two unseen data sets, a more parsimonious and interpretable model was found (GA-MM-MMI TOP18: a mixed-effects model containing the 18 mutations most prevalent in the GA solutions, refitted on the training data), with better predictive accuracy (R2) than GA-ordinary least squares (GA-OLS) and the Least Absolute Shrinkage and Selection Operator (LASSO).
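The TOP18 step described above, ranking mutations by how often they appear across the 100 GA solutions and keeping the most prevalent ones, can be sketched in a few lines. A minimal illustration with hypothetical mutation names and a reduced k (the subsequent refit of the mixed-effects model on the training data is omitted):

```python
from collections import Counter

def top_k_mutations(ga_solutions, k=18):
    """Multi-model inference by prevalence: count how often each mutation
    appears across the GA solutions and keep the k most frequent ones.

    ga_solutions: list of mutation-name lists, one per GA run
    """
    counts = Counter(m for sol in ga_solutions for m in sol)
    return [m for m, _ in counts.most_common(k)]

# Toy example with three GA runs (mutation names are illustrative):
solutions = [["N155H", "Q148H", "Y143R"],
             ["N155H", "Q148H"],
             ["N155H", "E92Q"]]
print(top_k_mutations(solutions, 2))  # → ['N155H', 'Q148H']
```

The retained mutation set then defines the fixed effects of the final mixed-effects model, which is refitted once on the full training data.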

Conclusions

We have demonstrated improved performance when using GA-MM-MMI for the selection of mutations on a genotype-phenotype data set. As we largely automated the setting of the GA parameters, the method should be applicable to similar data sets with clustered observations.

License

2014 Van der Borght et al.; licensee BioMed Central Ltd.

