Journal article

【Abstract】

Background

Despite significant advancement in alignment algorithms, the exponential growth of nucleotide sequencing throughput threatens to outpace bioinformatic analysis. Computation may become the bottleneck of genome analysis if growing alignment costs are not mitigated by further improvement in algorithms. Much gain has been gleaned from indexing and compressing alignment databases, but many widely used alignment tools process input reads sequentially and are oblivious to any underlying redundancy in the reads themselves.

Results

Here we present Oculus, a software package that attaches to standard aligners and exploits read redundancy by performing streaming compression, alignment, and decompression of input sequences. This nearly lossless process (> 99.9%) led to alignment speedups of up to 270% across a variety of data sets, while requiring a modest amount of memory. We expect that streaming read compressors such as Oculus could become a standard addition to existing RNA-Seq and ChIP-Seq alignment pipelines, and potentially other applications in the future as throughput increases.

Conclusions

Oculus efficiently condenses redundant input reads and wraps existing aligners to provide nearly identical SAM output in a fraction of the aligner runtime. It includes a number of useful features, such as tunable performance and fidelity options, compatibility with FASTA or FASTQ files, and adherence to the SAM format. The platform-independent C++ source code is freely available online at http://code.google.com/p/oculus-bio.
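
The compress, align, decompress flow described in the abstract can be illustrated with a minimal Python sketch. It is not Oculus's implementation (which is written in C++ and whose internals are not described here). It assumes exact-match deduplication only, so it is strictly lossless and does not model the tunable fidelity options. The names compress_reads, align_with_compression, toy_align, and REFERENCE are hypothetical, and toy_align is a stand-in substring search rather than a real aligner.

```python
from collections import OrderedDict
from typing import Callable, Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read name, sequence)

# Hypothetical toy reference, used only by the stand-in aligner below.
REFERENCE = "ACGTACGTTTGACCA"


def compress_reads(reads: Iterable[Read]) -> "OrderedDict[str, List[str]]":
    """Group read names by identical sequence in a single streaming pass."""
    groups: "OrderedDict[str, List[str]]" = OrderedDict()
    for name, seq in reads:
        groups.setdefault(seq, []).append(name)
    return groups


def align_with_compression(
    reads: Iterable[Read],
    align_unique: Callable[[List[str]], Dict[str, int]],
) -> List[Tuple[str, int]]:
    """Align each distinct sequence once, then expand hits back to every read."""
    groups = compress_reads(reads)
    hits = align_unique(list(groups))  # the aligner sees only unique sequences
    results: List[Tuple[str, int]] = []
    for seq, names in groups.items():
        for name in names:  # "decompression": copy the hit to each duplicate
            results.append((name, hits[seq]))
    return results


def toy_align(seqs: List[str]) -> Dict[str, int]:
    """Stand-in aligner: leftmost exact match position in REFERENCE, or -1."""
    return {seq: REFERENCE.find(seq) for seq in seqs}


reads = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "ACGT"), ("r4", "GGGG")]
for name, pos in align_with_compression(reads, toy_align):
    print(name, pos)
```

In this sketch the four input reads contain three distinct sequences, so the stand-in aligner does three lookups instead of four; the saving grows with the redundancy of the input. Results come out grouped by sequence rather than in input order, a simplification that a real wrapper may or may not share.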

【License】

2012 Veeneman et al.; licensee BioMed Central Ltd.

BMC Bioinformatics
Oculus: faster sequence alignment by streaming read compression

Arul M Chinnaiyan², Matthew K Iyer¹, Brendan A Veeneman¹
[1]Michigan Center for Translational Pathology, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
[2]Department of Urology, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
Keywords: DNA nucleotide sequence alignment streaming identity redundancy compression software algorithm
Others : 1088073 DOI : 10.1186/1471-2105-13-297

Received 2012-04-10, accepted 2012-11-01, published 2012