Journal article details
BMC Bioinformatics
Effective filtering strategies to improve data quality from population-based whole exome sequencing studies
Andrew R Carson [2], Erin N Smith [2], Hiroko Matsui [2], Sigrid K Brækkan [1], Kristen Jepsen [2], John-Bjarne Hansen [1], Kelly A Frazer [3]
[1] Division of Internal Medicine, University Hospital of North Norway, Tromsø, Norway
[2] Department of Pediatrics and Rady Children’s Hospital, University of California San Diego, San Diego, USA
[3] Moores UCSD Cancer Center, University of California San Diego, La Jolla, CA, USA
Keywords: Genomics; Imputation; Genotyping; Single nucleotide variants; Next generation sequencing
Others  :  818628
DOI  :  10.1186/1471-2105-15-125
Received 2013-10-25, accepted 2014-04-16, published 2014
【 Abstract 】

Background

Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and thus remain a source of errors in whole exome sequencing (WES) projects that follow GATK’s recommended best practices. Therefore, additional data filtering methods are required to effectively remove these errors before performing association analyses with complex phenotypes. Here we empirically derive thresholds for genotype and variant filters that, when used in conjunction with the VQSR tool, achieve higher data quality than when using VQSR alone.

Results

The detailed filtering strategies improve the concordance of sequenced genotypes with array genotypes from 99.33% to 99.77%; improve the percent of discordant genotypes removed from 10.5% to 69.5%; and improve the Ti/Tv ratio from 2.63 to 2.75. We also demonstrate that managing batch effects by separating samples based on different target capture and sequencing chemistry protocols results in a final data set containing 40.9% more high-quality variants. In addition, imputation is an important component of WES studies and is used to estimate common variant genotypes to generate additional markers for association analyses. As such, we demonstrate filtering methods for imputed data that improve genotype concordance from 79.3% to 99.8% while removing 99.5% of discordant genotypes.
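
The three quality metrics reported above (genotype concordance with array data, the share of discordant genotypes removed by filtering, and the Ti/Tv ratio) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the encoding of genotypes as alternate-allele counts (0, 1, 2) with `None` for a missing or filtered call, and the toy data are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes;
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

Genotype = Optional[int]  # assumed encoding: alt-allele count, None = missing/filtered


def concordance(seq_calls: Sequence[Genotype], array_calls: Sequence[Genotype]) -> float:
    """Fraction of genotypes called on both platforms that agree."""
    compared = 0
    matches = 0
    for seq_gt, array_gt in zip(seq_calls, array_calls):
        if seq_gt is None or array_gt is None:
            continue  # skip genotypes missing on either platform
        compared += 1
        if seq_gt == array_gt:
            matches += 1
    if compared == 0:
        raise ValueError("no genotypes called on both platforms")
    return matches / compared


def discordant_removed(before: Sequence[Genotype], after: Sequence[Genotype],
                       array_calls: Sequence[Genotype]) -> float:
    """Fraction of genotypes discordant before filtering that are missing afterwards."""
    discordant: List[int] = [
        i for i, (seq_gt, array_gt) in enumerate(zip(before, array_calls))
        if seq_gt is not None and array_gt is not None and seq_gt != array_gt
    ]
    if not discordant:
        raise ValueError("no discordant genotypes before filtering")
    removed = sum(1 for i in discordant if after[i] is None)
    return removed / len(discordant)


def ti_tv_ratio(snvs: Sequence[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) pairs of biallelic SNVs."""
    ti = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ti
    if tv == 0:
        raise ValueError("no transversions in input")
    return ti / tv


if __name__ == "__main__":
    # Toy data (hypothetical): six genotypes for one sample.
    array_gts = [0, 1, 2, 1, 0, 2]
    seq_before = [0, 1, 1, 1, 2, 2]      # two calls disagree with the array
    seq_after = [0, 1, None, 1, 2, 2]    # a filter removed one of the two

    print(concordance(seq_before, array_gts))
    print(concordance(seq_after, array_gts))
    print(discordant_removed(seq_before, seq_after, array_gts))
    print(ti_tv_ratio([("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]))
```

Concordance here is computed only over genotypes called on both platforms, so a filter raises it by removing discordant calls faster than concordant ones; reporting the fraction of discordant genotypes removed alongside it shows how much of the gain comes from targeted removal.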

Conclusions

The described filtering methods are advantageous for large population-based WES studies designed to identify common and rare variation associated with complex diseases. Compared to data processed through standard practices, these strategies result in substantially higher quality data for common and rare association analyses.

【 License 】

   
2014 Carson et al.; licensee BioMed Central Ltd.
