Journal Article Details
BMC Bioinformatics
Masking as an effective quality control method for next-generation sequencing data analysis
Sajung Yun2  Sijung Yun1 
[1] Laboratory of Molecular Biology, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
[2] John A. Burns School of Medicine, University of Hawai‘i at Manoa, Honolulu, HI, USA
Keywords: Trimming; Masking; Preprocessing; NGS
Others  :  1084390
DOI  :  10.1186/s12859-014-0382-2
Received 2014-03-11, accepted 2014-11-10, published 2014
【 Abstract 】

Background

Next generation sequencing produces base calls with low quality scores that can reduce the accuracy of simple nucleotide variation calls, including single nucleotide polymorphisms and small insertions and deletions. Here we compare two data preprocessing methods, masking and trimming, and their effect on the accuracy of simple nucleotide variation calls on whole-genome sequence data from Caenorhabditis elegans. Masking substitutes low-quality base calls with ‘N’s (undetermined bases), whereas trimming removes low-quality bases, which results in shorter read lengths.
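The difference between the two preprocessing methods can be illustrated with a minimal Python sketch. This is not the authors' SubN Perl script. The function names, the Phred+33 quality encoding, the threshold of 20, and the choice to trim only from the 3' end are all assumptions made for illustration.

```python
from __future__ import annotations


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality]


def mask_read(seq: str, quality: str, threshold: int = 20) -> tuple[str, str]:
    """Masking: replace each base scoring below `threshold` with 'N'.

    The read keeps its original length, and the quality string is unchanged.
    """
    scores = phred_scores(quality)
    masked = "".join(
        base if score >= threshold else "N" for base, score in zip(seq, scores)
    )
    return masked, quality


def trim_read(seq: str, quality: str, threshold: int = 20) -> tuple[str, str]:
    """Trimming (hypothetical 3'-end variant): drop trailing low-quality bases.

    The read becomes shorter; low-quality bases inside the read are kept.
    """
    scores = phred_scores(quality)
    end = len(seq)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], quality[:end]


if __name__ == "__main__":
    # 'I' decodes to Phred 40 and '#' to Phred 2 under Phred+33.
    seq, qual = "ACGTACGT", "IIII#II#"
    print(mask_read(seq, qual))  # ('ACGTNCGN', 'IIII#II#')
    print(trim_read(seq, qual))  # ('ACGTACG', 'IIII#II')
```

In this example, masking hides both low-quality positions while preserving the read length, whereas the trimming variant shortens the read by one base and leaves the internal low-quality call in place. Real trimming tools differ in their exact rules, so the trimming function should be read only as one possible scheme.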

Results

We demonstrate that masking is more effective than trimming in reducing the false-positive rate in single nucleotide polymorphism (SNP) calling. However, neither preprocessing method changed the false-negative rate in SNP calling with statistical significance compared to analysis without preprocessing. For small insertions and deletions, the false-positive and false-negative rates did not differ between masking and trimming.

Conclusions

We recommend masking over trimming as the more effective preprocessing method for next generation sequencing data analysis. Although trimming is currently more commonly used in the field, masking reduces the false-positive rate in SNP calling without sacrificing the false-negative rate. The Perl script for masking is available at http://code.google.com/p/subn/. The sequencing data used in the study were deposited in the Sequence Read Archive (SRX450968 and SRX451773).

【 License 】

   
2014 Yun and Yun; licensee BioMed Central Ltd.
