Source Code for Biology and Medicine
Software for pre-processing Illumina next-generation sequencing short read sequences
Cathy H Wu [2], Hongzhan Huang [2], Sari S Khaleel [1], Chuming Chen [2]
[1] Geisel School of Medicine, Dartmouth College, Hanover, NH, USA; [2] Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
Keywords: Perl; Reference-based assembly; De novo assembly; Trimming; Illumina; Next-generation sequencing
DOI: 10.1186/1751-0473-9-8
Received 2013-08-08, accepted 2014-04-22, published 2014
【 Abstract 】
Background
When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets.
Methods
We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7.
Results
Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacterial genomic short read sequences, with a focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that, across three organisms and three sequencing platforms, trimming improved the mean quality scores of the trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly, as measured by assembly contiguity and correctness.
Conclusions
Trimming of short read sequences can improve the quality of de novo and reference-based assembly and assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves memory efficiency when dealing with large datasets. We recommend combining sequencing artifact removal with quality-score-based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies.
ngsShoRT source code, user guide and tutorial are available at http://research.bioinformatics.udel.edu/genomics/ngsShoRT/. ngsShoRT can be incorporated as a pre-processing step in genome and transcriptome assembly projects.
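As an illustration of the recommended combination (artifact removal, then base trimming, then read filtering), the following is a minimal Python sketch. It is not ngsShoRT code and does not reproduce its algorithms or options. It assumes Phred+33 quality encoding and exact-match adapter clipping, and all function names, thresholds, and the example adapter string are hypothetical choices made for this example.

```python
# Illustrative sketch only: not ngsShoRT code. Assumes Phred+33 qualities.

def phred33(qual):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in qual]


def clip_adapter(seq, qual, adapter):
    """Artifact removal (simplified): cut the read at the first exact
    occurrence of the adapter. Real tools usually allow mismatches."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_3prime(seq, qual, min_q=20):
    """Base trimming: drop 3' bases until one with score >= min_q is found."""
    scores = phred33(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(qual, min_mean_q=20, min_len=5):
    """Read filtering: keep reads that are long enough and whose mean
    quality score reaches min_mean_q."""
    scores = phred33(qual)
    if len(scores) < min_len:
        return False
    return sum(scores) / len(scores) >= min_mean_q


def preprocess(records, adapter, min_q=20, min_mean_q=20, min_len=5):
    """Apply the three steps in order to (name, sequence, quality) tuples."""
    kept = []
    for name, seq, qual in records:
        seq, qual = clip_adapter(seq, qual, adapter)
        seq, qual = trim_3prime(seq, qual, min_q)
        if keep_read(qual, min_mean_q, min_len):
            kept.append((name, seq, qual))
    return kept


if __name__ == "__main__":
    example_adapter = "AGATCGGAAGAGC"  # example value, chosen for illustration
    reads = [
        ("r1", "ACGTACGTAC", "IIIIIIII##"),           # low-quality 3' tail
        ("r2", "ACGTACGTAC", "##########"),           # low quality throughout
        ("r3", "ACGTACAGATCGGAAGAGC", "I" * 19),      # adapter read-through
    ]
    for record in preprocess(reads, example_adapter):
        print(record)
```

Clipping the adapter before quality trimming means the mean-quality filter is computed only on bases that would actually reach the assembler; the thresholds here are placeholders and would need tuning per dataset.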
【 License 】
2014 Chen et al.; licensee BioMed Central Ltd.