Journal article details
BMC Bioinformatics
A comparison of feature selection and classification methods in DNA methylation studies using the Illumina Infinium platform
Joanna Zhuang [1], Martin Widschwendter [1], Andrew E Teschendorff [2]
[1] Department of Women's Cancer, UCL Elizabeth Garrett Anderson Institute for Women's Health, University College London, Room 340, 74 Huntley Street, London WC1E 6AU, UK
[2] Statistical Genomics Group, Paul O'Gorman Building, UCL Cancer Institute, University College London, 72 Huntley Street, London WC1E 6BT, UK
Keywords: Beadarrays; Feature selection; Classification; DNA methylation
DOI: 10.1186/1471-2105-13-59
Received 2011-12-20, accepted 2012-04-24, published 2012
【 Abstract 】

Background

The 27k Illumina Infinium Methylation Beadchip is a popular high-throughput technology that allows the methylation state of over 27,000 CpGs to be assayed. While feature selection and classification methods have been comprehensively explored in the context of gene expression data, relatively little is known about how best to perform feature selection or classification on Illumina Infinium methylation data. Given the rising importance of epigenomics in cancer and other complex genetic diseases, and in view of the upcoming epigenome-wide association studies, it is critical to identify the statistical methods that offer improved inference in this novel context.

Results

Using a total of 7 large Illumina Infinium 27k Methylation data sets, encompassing over 1,000 samples from a wide range of tissues, we evaluate popular feature selection, dimensionality reduction and classification methods on DNA methylation data. Specifically, we evaluate the effects of variance filtering, supervised principal components (SPCA) and the choice of DNA methylation quantification measure on downstream statistical inference. We show that for relatively large sample sizes feature selection using test statistics is similar for M and β-values, but that in the limit of small sample sizes, M-values allow more reliable identification of true positives. We also show that the effect of variance filtering on feature selection is study-specific and dependent on the phenotype of interest and tissue type profiled. Specifically, we find that variance filtering improves the detection of true positives in studies with large effect sizes, but that it may lead to worse performance in studies with smaller yet significant effect sizes. In contrast, supervised principal components improves the statistical power, especially in studies with small effect sizes. We also demonstrate that classification using the Elastic Net and Support Vector Machine (SVM) clearly outperforms competing methods like LASSO and SPCA. Finally, in unsupervised modelling of cancer diagnosis, we find that non-negative matrix factorisation (NMF) clearly outperforms principal components analysis.
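
For readers unfamiliar with the two quantification measures and with variance filtering, the following is a minimal illustrative sketch, not the authors' pipeline. It assumes the commonly used relation in which the M-value is the base-2 logit of the β-value; the clamping constant `eps`, the function names and the toy data are all hypothetical and chosen only for illustration.

```python
import math
from statistics import pvariance


def beta_to_m(beta, eps=1e-6):
    """Map a beta-value in [0, 1] to an M-value via the base-2 logit.

    `eps` is an assumed clamping constant that keeps the logarithm finite
    when beta is exactly 0 or 1.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: map an M-value back to a beta-value."""
    return 2.0 ** m / (2.0 ** m + 1.0)


def variance_filter(profiles, keep):
    """Return the ids of the `keep` CpGs with the largest variance.

    `profiles` maps a CpG id to its list of values across samples.
    The filter is unsupervised: it never looks at the phenotype.
    """
    ranked = sorted(profiles, key=lambda cpg: pvariance(profiles[cpg]), reverse=True)
    return ranked[:keep]


# Toy beta-values for three hypothetical CpGs across four samples (made up).
toy = {
    "cpg_a": [0.10, 0.12, 0.11, 0.09],
    "cpg_b": [0.20, 0.80, 0.25, 0.75],
    "cpg_c": [0.50, 0.55, 0.45, 0.52],
}
top = variance_filter(toy, keep=1)
toy_m = {cpg: [beta_to_m(b) for b in values] for cpg, values in toy.items()}
```

Because the filter ranks CpGs by variance alone, a CpG with a small but consistent difference between phenotypes can be discarded, which is one way to read the study-specific behaviour reported above.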

Conclusions

Our results highlight the importance of tailoring the feature selection and classification methodology to the sample size and biological context of the DNA methylation study. The Elastic Net emerges as a powerful classification algorithm for large-scale DNA methylation studies, while NMF does well in the unsupervised context. The insights presented here will be useful to any study embarking on large-scale DNA methylation profiling using Illumina Infinium beadarrays.

【 License 】

2012 Zhuang et al; licensee BioMed Central Ltd.
