Journal article

【Abstract】

Background

Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data.
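To make the "about half" figure concrete, the following is a minimal sketch of reading one FASTQ record and decoding its quality string. It assumes the Sanger convention (Phred+33: score = ASCII code minus 33); older Illumina files use a different offset. The function name `parse_fastq_record` and the sample read are hypothetical and are not taken from the paper.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (name, sequence, quality scores)."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Sanger/Phred+33 encoding, one ASCII character per base.
    scores = [ord(c) - 33 for c in qual]
    return header[1:], seq, scores


# Hypothetical record, for illustration only.
record = ["@read1\n", "GATTACA\n", "+\n", "IIIIH#5\n"]
name, seq, scores = parse_fastq_record(record)

# One quality character is stored per base, so (ignoring headers) the
# quality string takes as many bytes as the nucleotide string.
quality_share = len(scores) / (len(seq) + len(scores))
```

Because each base carries exactly one quality character, the two strings are the same length, which is why the uncompressed quality scores take roughly half of the file.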

Results

In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it is possible to achieve a significant reduction in size with little compromise in performance on downstream applications (e.g., alignment).
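The idea of fixing a rate in advance and then minimizing MSE can be illustrated with the textbook reverse water-filling allocation for independent Gaussian components. This is a sketch under stated assumptions, not QualComp's actual implementation: the Gaussian model, the function name `reverse_water_filling`, and the example variances are all introduced here for illustration.

```python
import math


def reverse_water_filling(variances, total_rate_bits, iters=200):
    """Split a bit budget across independent Gaussian components.

    Returns (rates, distortions): rates[i] is the number of bits given to
    component i and distortions[i] its resulting MSE. The water level
    theta is found by bisection.
    """
    if total_rate_bits < 0 or any(v <= 0 for v in variances):
        raise ValueError("rate must be >= 0 and variances must be > 0")

    def rate_at(theta):
        # Components with variance <= theta receive no bits.
        return sum(0.5 * math.log2(v / theta) for v in variances if v > theta)

    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if rate_at(mid) > total_rate_bits:
            lo = mid  # too many bits spent: raise the water level
        else:
            hi = mid  # within budget: try a lower water level
    theta = hi  # upper end of the bracket, so the budget is not exceeded
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    distortions = [min(v, theta) for v in variances]
    return rates, distortions


# Hypothetical example: three components and a budget of 2 bits in total,
# i.e. 2/3 bit per component on average.
rates, distortions = reverse_water_filling([4.0, 1.0, 0.25], 2.0)
total_mse = sum(distortions)
```

In this example the water level settles at 0.5, so the lowest-variance component receives no bits and contributes its full variance to the distortion. The budget can be any non-negative real number, which mirrors the claim that the rate is chosen before compression and independently of the data.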

Conclusions

QualComp is an open source software package, written in C and freely available for download at https://sourceforge.net/projects/qualcomp.

【License】

2013 Ochoa et al.; licensee BioMed Central Ltd.

BMC Bioinformatics
QualComp: a new lossy compressor for quality scores based on rate distortion theory

Idoia Ochoa¹, Himanshu Asnani¹, Dinesh Bharadia¹, Mainak Chowdhury¹, Tsachy Weissman¹, Golan Yona¹
[1] Department of Electrical Engineering, Stanford University, Stanford, CA, USA
Keywords: Mean squared error; Rate distortion; FASTQ format; Compression; Quality scores; Next generation sequencing
Others: 1087848; DOI: 10.1186/1471-2105-14-187

Received 2012-11-28, accepted 2013-06-01, published 2013