BMC Bioinformatics
Rapid evaluation and quality control of next generation sequencing data with FaQCs
Chien-Chi Lo [1], Patrick S G Chain [2]
[1] Bioenergy and Biome Sciences Group, Los Alamos National Laboratory, Los Alamos 87545, NM, USA
[2] Genome Science Group, Bioscience Division, Los Alamos National Laboratory, Los Alamos 87545, NM, USA
Keywords: Data preprocessing; Next generation sequencing analysis; Trimming; Quality control
Others: 1085074  DOI: 10.1186/s12859-014-0366-2
Received 2014-07-03, accepted 2014-10-29, published 2014
【 Abstract 】
Background
Next generation sequencing (NGS) technologies that parallelize the sequencing process and produce thousands to millions, or even hundreds of millions of sequences in a single sequencing run, have revolutionized genomic and genetic research. Because of the vagaries of any platform’s sequencing chemistry, the experimental processing, machine failure, and so on, the quality of sequencing reads is never perfect, and often declines as the read is extended. These errors invariably affect downstream analysis/application and should therefore be identified early on to mitigate any unforeseen effects.
Results
Here we present a novel FastQ Quality Control Software (FaQCs) that can rapidly process large volumes of data, and which improves upon previous solutions to monitor the quality and remove poor quality data from sequencing runs. Both the speed of processing and the memory footprint of storing all required information have been optimized via algorithmic and parallel processing solutions. The trimmed output compared side-by-side with the original data is part of the automated PDF output. We show how this tool can help data analysis by providing a few examples, including an increased percentage of reads recruited to references, improved single nucleotide polymorphism identification as well as de novo sequence assembly metrics.
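As a rough illustration of what removing poor-quality data from a read can look like, the following is a minimal Python sketch of 3'-end quality trimming. It is not the FaQCs algorithm, which the abstract does not describe; the function names, the quality threshold of 20, the minimum length of 5, and the Phred+33 encoding are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ ASCII encoding


def decode_qualities(qual: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


def trim_3prime(
    seq: str,
    qual: str,
    min_q: int = 20,    # hypothetical threshold, not a FaQCs default
    min_len: int = 5,   # hypothetical minimum length, not a FaQCs default
) -> Optional[Tuple[str, str]]:
    """Drop trailing bases whose score is below min_q.

    Returns the trimmed (sequence, quality) pair, or None when the
    surviving read is shorter than min_len and should be discarded.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    scores = decode_qualities(qual)
    end = len(scores)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(trim_3prime("ACGTACGTAC", "IIIIIIII##"))
```

Trimming only from the 3' end reflects the observation that quality often declines as the read is extended; a production tool would typically also handle the 5' end, ambiguous bases, and paired reads.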
Conclusion
FaQCs combines several features of currently available applications into a single, user-friendly process, and includes additional unique capabilities such as filtering the PhiX control sequences, conversion of FASTQ formats, and multi-threading. The original data and trimmed summaries are reported within a variety of graphics and reports, providing a simple way to do data quality control and assurance.
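The FASTQ format conversion mentioned above usually amounts to re-encoding quality strings between ASCII offsets. The sketch below converts a Phred+64 quality string (used by Illumina 1.3 to 1.7 pipelines) to Phred+33 (Sanger and Illumina 1.8+). It is an illustrative assumption of how such a conversion can be written, not FaQCs code; the function name is hypothetical, and the older Solexa log-odds scale is deliberately not handled.

```python
def phred64_to_phred33(qual: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33.

    Assumes true Phred scores with offset 64; a character below '@'
    would imply a negative score, so it is rejected as invalid input.
    """
    converted = []
    for ch in qual:
        score = ord(ch) - 64
        if score < 0:
            raise ValueError(f"character {ch!r} is not valid Phred+64")
        converted.append(chr(score + 33))
    return "".join(converted)


if __name__ == "__main__":
    # 'h' is Q40, 'B' is Q2 and '@' is Q0 under Phred+64.
    print(phred64_to_phred33("hhB@"))
```

Rejecting characters below `@` is a simple guard against feeding Phred+33 data to the converter by mistake, since low Phred+33 scores fall in that range.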
【 License 】
2014 Lo and Chain; licensee BioMed Central Ltd.
【 Figures 】
Figure 1.
Figure 2.
Figure 3.