GigaScience
Applying compressed sensing to genome-wide association studies
Carson C Chow [2], Stephen D H Hsu [1], Christopher C Chang [1], James J Lee [1], Shashaank Vattikuti [2]
[1] Cognitive Genomics Lab, BGI Shenzhen, Yantian District, Shenzhen, China; [2] Mathematical Biology Section, Laboratory of Biological Modeling, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, South Drive, Bethesda, MD 20814, USA
Keywords: Phase transition; Sparsity; Underdetermined system; Lasso; Compressed sensing; Genomic selection; GWAS
Others: 861291; DOI: 10.1186/2047-217X-3-10
Received 2014-01-08, accepted 2014-05-23, published 2014
【 Abstract 】
Background
The aim of a genome-wide association study (GWAS) is to isolate DNA markers for variants affecting phenotypes of interest. This is constrained by the fact that the number of markers often far exceeds the number of samples. Compressed sensing (CS) is a body of theory regarding signal recovery when the number of predictor variables (i.e., genotyped markers) exceeds the sample size. Its applicability to GWAS has not been investigated.
Results
Using CS theory, we show that all markers with nonzero coefficients can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability equal to one (h² = 1), there is a sharp phase transition from poor performance to complete selection as the sample size is increased. For heritability below one, complete selection still occurs, but the transition is smoothed. We find for h² ∼ 0.5 that a sample size of approximately thirty times the number of markers with nonzero coefficients is sufficient for full selection. This boundary is only weakly dependent on the number of genotyped markers.
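The selection behaviour described above can be illustrated with a small simulation. The sketch below is not the authors' software: it assumes L1-penalized regression (the lasso, listed in the keywords) solved by plain cyclic coordinate descent, Gaussian random predictors in place of real genotypes, equal effect sizes, and no noise (the h² = 1 case). The function names, the penalty `lam`, and the problem sizes are all hypothetical choices made for illustration.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    col_sq = [sum(v * v for v in c) / n for c in cols]
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date after every update
    for _ in range(n_sweeps):
        for j in range(p):
            c = cols[j]
            # Covariance of column j with the residual that excludes column j.
            rho = sum(c[i] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * c[i]
                beta[j] = new
    return beta


def simulate(n, p=50, s=3, lam=0.05, seed=0):
    """Return (true support, selected markers) for one noise-free simulated data set."""
    rng = random.Random(seed)
    # Assumption: i.i.d. Gaussian predictors stand in for standardized genotypes.
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    support = set(range(s))  # the first s markers carry effects equal to 1
    y = [sum(X[i][j] for j in support) for i in range(n)]  # no noise: h2 = 1
    beta = lasso_cd(X, y, lam)
    selected = {j for j, b in enumerate(beta) if abs(b) > 1e-8}
    return support, selected


if __name__ == "__main__":
    # Sweep the sample size to see how selection changes as n grows.
    for n in (5, 10, 20, 40):
        support, selected = simulate(n)
        print(n, "all true markers selected:", support <= selected,
              "| number selected:", len(selected))
```

Running the sweep over `n` for fixed sparsity `s` is one way to look for the transition from poor to complete selection; adding noise to `y` would correspond to heritability below one.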
Conclusion
Practical measures of signal recovery are robust to linkage disequilibrium between a true causal variant and markers residing in the same genomic region. Given a limited sample size, it is possible to discover a phase transition by increasing the penalization; in this case a subset of the support may be recovered. Applying this approach to the GWAS analysis of height, we show that 70-100% of the selected markers are strongly correlated with height-associated markers identified by the GIANT Consortium.
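A figure such as "70-100% of the selected markers are strongly correlated with height-associated markers" can be read as the fraction of selected markers that tag at least one reference marker. The sketch below shows one way such a fraction might be computed, assuming squared Pearson correlation between genotype vectors as the measure of linkage disequilibrium. The threshold `r2_min = 0.8`, the marker names, and the toy genotypes are hypothetical and are not taken from the paper.

```python
def pearson_r2(a, b):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic marker carries no correlation information
    return cov * cov / (var_a * var_b)


def fraction_tagging(selected, reference, genotypes, r2_min=0.8):
    """Fraction of selected markers with r^2 >= r2_min to any reference marker."""
    if not selected:
        return 0.0
    hits = sum(
        1
        for m in selected
        if any(pearson_r2(genotypes[m], genotypes[r]) >= r2_min for r in reference)
    )
    return hits / len(selected)


# Toy genotypes coded as minor-allele counts (0, 1, 2) for six individuals.
genotypes = {
    "snpB": [0, 1, 2, 1, 0, 2],  # identical to ref1, so r^2 = 1
    "snpC": [2, 0, 1, 1, 2, 0],  # r^2 with ref1 is 9/16, below the threshold
    "ref1": [0, 1, 2, 1, 0, 2],  # stands in for a previously reported marker
}

print(fraction_tagging(["snpB", "snpC"], ["ref1"], genotypes))  # prints 0.5
```

In practice the correlation would be estimated from a reference panel and might be restricted to markers in the same genomic region; both choices are left out of this sketch.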
【 License 】
2014 Vattikuti et al.; licensee BioMed Central Ltd.
【 Figures 】
Figure 1.