BMC Research Notes
FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets
Anna Shcherbina [1]
[1] Department of Bioengineering Systems and Technologies, MIT Lincoln Laboratory, 244 Wood St, 02421 Lexington, MA, USA
Keywords: FASTQ; Next generation sequencing; Algorithm; Simulator
DOI: 10.1186/1756-0500-7-533
Received 2014-04-07, accepted 2014-08-04, published 2014
【 Abstract 】
Background
High-throughput next generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms, so the true composition of a real sample is uncertain, whereas in silico samples provide exact ground truth. To be of most value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible.
Results
FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing platform-independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim has the ability to convert target sequences into in silico reads with specific error profiles obtained in the characterization step.
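As an illustration of the characterization step described above, the following minimal sketch tallies read-length and quality-score distributions from FASTQ records. It is not FASTQSim's code or interface: the function name `characterize_fastq`, the `phred_offset` parameter, and the assumption of four-line records with Phred+33 quality encoding are hypothetical choices made for this example.

```python
from collections import Counter
import io


def characterize_fastq(handle, phred_offset=33):
    """Return (read-length histogram, quality-score histogram) for a FASTQ stream.

    Assumes four-line records and Phred+33 quality encoding by default
    (illustrative assumptions, not FASTQSim's actual behavior).
    """
    length_counts = Counter()
    quality_counts = Counter()
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record")
        length_counts[len(sequence)] += 1
        # Convert each quality character to its numeric Phred score.
        quality_counts.update(ord(ch) - phred_offset for ch in quality)
    return length_counts, quality_counts


# Two toy reads: 'I' encodes Phred 40 and '!' encodes Phred 0 under Phred+33.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nII!!II\n")
lengths, qualities = characterize_fastq(example)
```

The two `Counter` objects can be normalized into empirical distributions. Indel and substitution rates additionally require aligning reads to a reference, which this sketch does not attempt.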
Conclusions
FASTQSim enables users to assess the quality of NGS datasets. The tool provides information about read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with the FASTQSim tool hold several advantages over natural datasets: they are sequencing platform independent, extremely well characterized, and less expensive to generate. Such datasets are valuable in a number of applications, including training assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge datasets for detecting genetic engineering toolmarks.
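The read-generation idea summarized above can be sketched as follows, under strong simplifying assumptions: a single read is drawn from a random position in a target sequence and per-base substitution, insertion, and deletion rates are applied uniformly. The function `simulate_read` and its parameters are hypothetical and do not reproduce FASTQSim's error model, which derives position-dependent profiles from the characterization step.

```python
import random


def simulate_read(reference, read_length, sub_rate, ins_rate, del_rate, rng):
    """Draw one read from `reference` and apply uniform per-base errors.

    All rates are illustrative per-base probabilities supplied by the caller.
    """
    start = rng.randrange(len(reference) - read_length + 1)
    read = []
    for base in reference[start:start + read_length]:
        draw = rng.random()
        if draw < del_rate:
            continue  # deletion: skip this reference base
        if draw < del_rate + sub_rate:
            # substitution: replace with a different nucleotide
            read.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            read.append(base)
        if rng.random() < ins_rate:
            # insertion: add a random nucleotide after this base
            read.append(rng.choice("ACGT"))
    return "".join(read)


rng = random.Random(0)
target = "ACGTTGCAAGCTTAGGCTAACGGATCCA"
# With all rates set to zero the read is an exact substring of the target.
clean_read = simulate_read(target, 10, 0.0, 0.0, 0.0, rng)
# Hypothetical error rates; real values would come from a characterized dataset.
noisy_read = simulate_read(target, 10, 0.02, 0.01, 0.01, rng)
```

Because the true origin and the applied errors of every read are known, datasets generated this way carry the exact ground truth needed for benchmarking.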
【 License 】
2014 Shcherbina; licensee BioMed Central Ltd.