Journal Article Details
BMC Bioinformatics
SNP interaction detection with Random Forests in high-dimensional genetic data
Stacey J Winham [2], Colin L Colby [2], Robert R Freimuth [2], Xin Wang [2], Mariza de Andrade [2], Marianne Huebner [3], Joanna M Biernacka [1]
[1] Department of Psychiatry and Psychology, Mayo Clinic, 200 First Street Southwest, Rochester, MN, 55905, USA
[2] Department of Health Sciences Research, Mayo Clinic, 200 First Street Southwest, Rochester, MN, 55905, USA
[3] Department of Statistics and Probability, Michigan State University, A413 Wells Hall, East Lansing, MI, 48824, USA
DOI  :  10.1186/1471-2105-13-164
Received 2011-12-21, accepted 2012-04-30, published 2012
【 Abstract 】

Background

Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional.

Results

RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions.

Conclusions

While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
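The phrase "interaction effects in the absence of a strong marginal component" can be made concrete with a small worked example. The sketch below is an illustration only and is not the simulation design used in the paper: it assumes a hypothetical two-locus model in which each SNP is reduced to a binary carrier status with frequency 0.5, the loci are independent, and disease risk is high only when exactly one locus is carried. The names `penetrance`, `genotype_prob`, and `marginal_penetrance`, and the risk values 0.8 and 0.2, are assumptions chosen for clarity.

```python
from itertools import product

# Hypothetical two-locus model (not from the paper): each SNP is reduced to a
# binary carrier status with carrier frequency 0.5, and the loci are independent.
CARRIER_FREQ = 0.5


def penetrance(a, b):
    """P(disease | carrier statuses a, b): high risk only when exactly one locus is carried."""
    return 0.8 if a != b else 0.2


def genotype_prob(a, b):
    """Joint probability of carrier statuses (a, b) under independence."""
    pa = CARRIER_FREQ if a else 1 - CARRIER_FREQ
    pb = CARRIER_FREQ if b else 1 - CARRIER_FREQ
    return pa * pb


def marginal_penetrance(locus, value):
    """P(disease | one locus fixed at value), averaged over the other locus."""
    num = den = 0.0
    for a, b in product((0, 1), repeat=2):
        if (a, b)[locus] != value:
            continue
        p = genotype_prob(a, b)
        num += p * penetrance(a, b)
        den += p
    return num / den


# Risk difference seen when each SNP is examined on its own.
marginal_effect_a = marginal_penetrance(0, 1) - marginal_penetrance(0, 0)
marginal_effect_b = marginal_penetrance(1, 1) - marginal_penetrance(1, 0)

# Risk difference seen when both SNPs are examined jointly.
joint_effect = penetrance(1, 0) - penetrance(0, 0)

print(marginal_effect_a, marginal_effect_b, joint_effect)
```

Under these assumptions each SNP considered alone shows no risk difference, while the joint genotype changes risk substantially. A method that effectively ranks predictors by their marginal signal would therefore have nothing to detect for such a pair.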

【 License 】

   
2012 Winham et al.; licensee BioMed Central Ltd.
