Journal article details
BMC Bioinformatics
Robust methods for population stratification in genome wide association studies
Li Liu [2], Donghui Zhang [2], Hong Liu [1], Christopher Arendt [1]
[1] Bio-Innovation Group of Sanofi Biotherapeutics, 38 Sidney Street, Sanofi, Cambridge, MA, 02142, USA
[2] Department of Biostatistics and Programming, Mail Stop 55C-305A, 55 Corporate Drive, Sanofi, Bridgewater, NJ, 08807, USA
Keywords: GWA studies; Outlier detection; Resampling by half means; Robust principal component analysis; Population stratification; Population structure
DOI: 10.1186/1471-2105-14-132
Received 2012-10-22; accepted 2013-03-26; published 2013
【 Abstract 】

Background

Genome-wide association (GWA) studies can provide novel insights into diseases of interest, as well as into the responsiveness of an individual to specific treatments. In such studies, it is very important to correct for population stratification, which refers to allele frequency differences between cases and controls due to systematic ancestry differences. Population stratification can cause spurious associations if it is not properly adjusted for. Principal component analysis (PCA) has been widely relied upon to adjust for population stratification in these large-scale studies. Recently, the linear mixed model (LMM) has also been proposed to account for family structure or cryptic relatedness. However, neither approach may be optimal for correcting sample structure in the presence of subject outliers.
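To make the notion of a spurious association concrete, the following minimal sketch uses invented allele counts (not data from the study) for two hypothetical subpopulations, `pop_A` and `pop_B`. Within each subpopulation the allele frequency is identical in cases and controls, yet the subpopulations differ in both allele frequency and case fraction, so the pooled allelic chi-square test reports a strong association.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical allele counts per subpopulation:
# (case allele 1, case allele 2, control allele 1, control allele 2)
strata = {
    "pop_A": (640, 160, 160, 40),   # allele 1 frequency 0.8 in cases and in controls
    "pop_B": (40, 160, 160, 640),   # allele 1 frequency 0.2 in cases and in controls
}

# Within each subpopulation there is no association at all.
per_stratum = {name: chi_square_2x2(*counts) for name, counts in strata.items()}

# Ignoring ancestry and pooling the counts creates one.
pooled_counts = tuple(sum(column) for column in zip(*strata.values()))
pooled = chi_square_2x2(*pooled_counts)

print(per_stratum)   # both statistics are 0.0
print(pooled)        # about 259.2, far beyond any usual significance threshold
```

The association in the pooled table comes entirely from ancestry, which is why an adjustment such as PCA or clustering is applied before testing.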

Results

We propose using robust PCA combined with k-medoids clustering to deal with population stratification. This approach can adjust for population stratification in both continuous and discrete populations with subject outliers, and it can be considered an extension of the PCA method and the multidimensional scaling (MDS) method. Through simulation studies, we compare the performance of our proposed methods with that of several widely used stratification methods, including PCA and MDS. We show that subject outliers can greatly influence the results of several existing methods, while our proposed robust population stratification methods perform very well for both discrete and admixed populations with subject outliers. We illustrate the new method using data from a rheumatoid arthritis study.
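The clustering step can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it runs a basic alternating k-medoids (not necessarily the PAM algorithm) on a handful of invented two-dimensional component scores, with a naive start from the first `k` points. The names `k_medoids` and `scores` and all values are assumptions made for the example; the robust PCA step that would produce such scores is omitted.

```python
import math

def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids on a list of coordinate tuples.

    Returns (medoid indices, cluster label per point).
    """
    medoids = list(range(k))  # naive deterministic start: the first k points
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, points[medoids[c]]))
            for p in points
        ]
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:          # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            best = min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels

# Hypothetical component scores: two tight groups plus one subject outlier (last point).
scores = [(-2.0, 0.1), (2.0, -0.1), (-2.2, -0.1), (2.1, 0.2),
          (-1.9, 0.0), (1.8, 0.0), (9.0, 9.0)]

medoids, labels = k_medoids(scores, k=2)

# For contrast: the mean of the cluster containing the outlier is dragged towards it.
cluster1 = [scores[i] for i, lab in enumerate(labels) if lab == 1]
mean1 = tuple(sum(axis) / len(cluster1) for axis in zip(*cluster1))

print(medoids, labels)
print(scores[medoids[1]], mean1)
```

Because a medoid must be one of the observed subjects, the centre of the second cluster stays inside the tight group, whereas the cluster mean is pulled well away from it by the single outlier. This is the intuition behind preferring medoids when subject outliers are present.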

Conclusions

We demonstrate that subject outliers can greatly influence the analysis results in GWA studies, and we propose robust methods for dealing with population stratification that outperform existing population stratification methods in the presence of subject outliers.

【 License 】

2013 Liu et al.; licensee BioMed Central Ltd.
