Journal Article Details
BMC Genomics
Evaluation of variant identification methods for whole genome sequencing data in dairy cattle
Birgit Gredler [3], James M Reecy [2], Dorian J Garrick [2], Juerg Moll [3], Ruedi Fries [5], Rohan Fernando [2], Christian Stricker [4], Heidi Signer-Hasler [1], Christine Flury [1], Sandra Jansen [5], Eric Fritz-Waters [2], Beat Bapst [3], James E Koltes [2], Marlies A Dolezal [6], Christine F Baes [3]
[1] Bern University of Applied Sciences, School of Agricultural, Forest and Food Sciences HAFL, Länggasse 85, CH-3052 Zollikofen, Switzerland
[2] Department of Animal Science, Iowa State University, 1221 Kildee Hall, 50011-3150 Ames, IA, USA
[3] Qualitas AG, Chamerstrasse 56a, CH-6300 Zug, Switzerland
[4] agn Genetics GmbH, 8b Börtjistrasse, CH-7260 Davos, Switzerland
[5] Technische Universität München, Liesel-Beckmann-Str. 1, D-85354 Freising, Germany
[6] University of Veterinary Medicine Vienna, Veterinärplatz 1, A-1210 Vienna, Austria
Keywords: Pipeline; Single nucleotide variant identification; Next-generation sequencing analysis
DOI  :  10.1186/1471-2164-15-948
Received 2014-06-02, accepted 2014-10-14, published 2014
【 Abstract 】

Background

Advances in human genomics have allowed unprecedented productivity in terms of algorithms, software, and literature available for translating raw next-generation sequence data into high-quality information. The challenges of variant identification in organisms with lower-quality reference genomes are less well documented. We explored the consequences of commonly recommended preparatory steps and the effects of single- and multi-sample variant identification methods using four publicly available software applications (Platypus, HaplotypeCaller, Samtools and UnifiedGenotyper) on whole genome sequence data of 65 key ancestors of Swiss dairy cattle populations. Accuracy of calling next-generation sequence variants was assessed by comparison to the same loci from medium- and high-density single nucleotide variant (SNV) arrays.
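The comparison of sequence calls with array genotypes at the same loci can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not define its concordance measures, so the two definitions used here (the share of array variant loci also called from sequence, and the share of those shared loci with identical genotypes) are assumptions, as are the function name `concordance` and the encoding of genotypes as alternate-allele counts.

```python
def concordance(array_genotypes, sequence_genotypes):
    """Compare array and sequence genotypes for one animal.

    Both arguments map (chromosome, position) -> alternate-allele count (0, 1, 2).
    Assumed definitions (not taken from the paper):
      site concordance     = array variant loci also called from sequence
                             / all array variant loci
      genotype concordance = shared loci with identical genotype / shared loci
    """
    # Array loci where the animal carries at least one alternate allele.
    array_variants = {locus for locus, gt in array_genotypes.items() if gt > 0}
    if not array_variants:
        raise ValueError("no variant loci on the array for this animal")

    # Loci found by both the array and the sequence-based caller.
    shared = array_variants & set(sequence_genotypes)
    site = len(shared) / len(array_variants)

    if not shared:
        return site, float("nan")  # genotype concordance is undefined

    matches = sum(
        1 for locus in shared
        if array_genotypes[locus] == sequence_genotypes[locus]
    )
    return site, matches / len(shared)
```

Keeping the two measures separate mirrors the way the results are reported: one figure for whether a locus was found at all, another for whether the genotype agrees once it was found.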

Results

The total number of SNVs identified varied by software and method, with single-sample (multi-sample) results ranging from 17.7 to 22.0 (16.9 to 22.0) million variants. Computing time varied considerably between software. Preparatory realignment of insertions and deletions and subsequent base quality score recalibration had only minor effects on the number and quality of SNVs identified by different software, but increased computing time considerably. Average concordance for single-sample (multi-sample) results with high-density chip data was 58.3% (87.0%) and average genotype concordance in correctly identified SNVs was 99.2% (99.2%) across software. The average quality of SNVs identified, measured as the ratio of transitions to transversions, was higher using single-sample methods than multi-sample methods. A consensus approach using results of different software generally provided the highest variant quality in terms of transition/transversion ratio.
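The transition/transversion ratio used as a quality measure above can be computed from reference and alternate alleles alone. The sketch below assumes biallelic SNVs supplied as `(ref, alt)` pairs; the function name `ti_tv_ratio` and the input format are illustrative and not taken from the paper's pipeline code.

```python
# Transitions swap a purine for a purine (A<->G) or a pyrimidine for a
# pyrimidine (C<->T); every other single-base change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Return the transition/transversion ratio for an iterable of (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip indels, non-variant records, and ambiguous bases.
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv
```

Under this definition a call set with two transitions and one transversion has a ratio of 2.0, and indel records are ignored rather than counted as transversions.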

Conclusions

Our findings serve as a reference for variant identification pipeline development in non-human organisms and help assess the implications of preparatory steps in next-generation sequencing pipelines for organisms with incomplete reference genomes (pipeline code is included). Benchmarking this information should prove particularly useful in processing next-generation sequencing data for use in genome-wide association studies and genomic selection.

【 License 】

   
2014 Baes et al.; licensee BioMed Central Ltd.
