期刊论文

【摘要】

Background

Data volumes generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip method.

Results

We present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring kmer and a list of bifurcations. The same probabilistic de Bruijn Graph is used to perform a lossy transformation of the quality scores, which allows to obtain higher compression rates without losing pertinent information for downstream analyses.

Conclusions

LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20.

LEON is an open source software, distributed under GNU affero GPL License, available for download at http://gatb.inria.fr/software/leon/.

【授权许可】

2015 Benoit et al.

【预览】

附件列表
Files	Size	Format	View
20151030015329703.pdf	1683KB	PDF	download
Fig. 6.	66KB	Image	download
Fig. 5.	24KB	Image	download
Fig. 4.	64KB	Image	download
Fig. 3.	35KB	Image	download
Fig. 2.	39KB	Image	download
Fig. 1.	22KB	Image	download

【图表】

Fig. 1.

Fig. 2.

Fig. 3.

Fig. 4.

Fig. 5.

Fig. 6.

【参考文献】

[1]Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2010; 39:1019.
[2]Jones DC, Ruzzo WL, Peng X, Katze MG. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 2012; 40(22):171.
[3]Fritz MHY, Leinonen R, Cochrane G, Birney E. Efficient storage of high throughput sequencing data using reference-based compression. Genome Res. 2011; 21:734-40.
[4]Kingsford C, Patro R. Reference-based compression of short-read sequences using path encoding. Bioinformatics. 2015; 31:071.
[5]Bonfield JK, Mahoney MV. Compression of fastq and sam format sequencing data. PLoS One. 2013; 8(3):59190.
[6]Hach F, Numanagic I, Alkan C, Sahinalp SC. Scalce: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics. 2012; 28(23):3051-057.
[7]Deorowicz S, Grabowski S. Compression of dna sequence reads in fastq format. Bioinformatics. 2011; 27(6):860-2.
[8]Grabowski S, Deorowicz S, Roguski Ł. Disk-based compression of data from genome sequencing. Bioinformatics. 2014; 31:844.
[9]Janin L, Schulz-Trieglaff O, Cox AJ. Beetl-fastq: a searchable compressed archive for dna reads. Bioinformatics. 2014; 30:387.
[10]Patro R, Kingsford C. Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics. 2015; 31:248.
[11]Cox AJ, Bauer MJ, Jakobi T, Rosone G. Large-scale compression of genomic sequence databases with the burrows–wheeler transform. Bioinformatics. 2012; 28(11):1415-9.
[12]Wan R, Anh VN, Asai K. Transformations for the compression of fastq quality scores of next-generation sequencing data. Bioinformatics. 2012; 28(5):628-35.
[13]Cánovas R, Moffat A, Turpin A. Lossy compression of quality scores in genomic data. Bioinformatics. 2014; 30(15):2130-136.
[14]Janin L, Rosone G, Cox AJ. Adaptive reference-free compression of sequence quality scores. Bioinformatics. 2013; 30:257.
[15]Yu YW, Yorukoglu D, Berger B. Traversing the k-mer landscape of ngs read datasets for quality score sparsification. In: Research in computational molecular biology. Springer: 2014. p. 385–99.
[16]Kirsch A, Mitzenmacher M. Less hashing, same performance: Building a better bloom filter. Algorithms-ESA 2006. 2006:456–67.
[17]Chikhi R, Medvedev P. Informed and automated k-mer size selection for genome assembly. Bioinformatics. 2013; 30:310.
[18]Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de bruijn graphs. Nat Genet. 2012; 44(2):226-32.
[19]Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT. Scaling metagenome sequence assembly with probabilistic de bruijn graphs. Proc Natl Acad Sci. 2012; 109(33):13272-7.
[20]Chikhi R, Rizk G. Space-efficient and exact de bruijn graph representation based on a bloom filter. Algorithms Mol Biol. 2013; 8(1):22.
[21]Salikhov K, Sacomoto G, Kucherov G. Using cascading bloom filters to improve the memory usage for de brujin graphs. Algoritm Bioinforma. 2013; 9:364-76.
[22]Witten I, Neal R, Cleary J. Arithmetic coding for data compression. Commun ACM. 1987; 30:520-540.
[23]Drezen E, Rizk G, Chikhi R, Deltel C, Lemaitre C, Peterlongo P, et al. Gatb: Genome assembly and analysis tool box. Bioinformatics. 2014. doi:10.1093/bioinformatics/btu406.
[24]Rizk G, Lavenier D, Chikhi R. Dsk: k-mer counting with very low memory usage. Bioinformatics. 2013; 29(5):652-3.
[25]Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A. Kmc 2: Fast and resource-frugal k-mer counting. Bioinformatics. 2015; 31:022.
[26]Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al.. The sequence alignment/map format and samtools. Bioinformatics. 2009; 25(16):2078-079.
[27]Li H, Durbin R. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics. 2009; 25(14):1754-60.
[28]Yu YW, Yorukoglu D, Peng J, Berger B. Quality score compression improves genotyping accuracy. Nat Biotechnol. 2015; 33(3):240-3.
[29]Lemaitre C, Ciortuz L, Peterlongo P. Mapping-free and assembly-free discovery of inversion breakpoints from raw ngs reads. Algoritm Comput Biol. 2014; 8542:119-30.
[30]Uricaru R, Rizk G, Lacroix V, Quillery E, Plantard O, Chikhi R et al.. Reference-free detection of isolated snps. Nucleic Acids Res. 2015; 43(2):11.

BMC Bioinformatics
Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

Gaëtan Benoit¹ Claire Lemaitre¹ Dominique Lavenier¹ Erwan Drezen¹ Thibault Dayris³ Raluca Uricaru² Guillaume Rizk¹
[1] INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes 35042, France
[2] University of Bordeaux, CBiB, Bordeaux F-33000, France
[3] University of Bordeaux, CNRS/LaBRI, Talence F-33405, France
关键词: Bloom filter; NGS; de Bruijn Graph; Compression;
Others : 1229464 DOI : 10.1186/s12859-015-0709-7

received in 2015-04-16, accepted in 2015-08-17, 发布年份 2015
PDF


	文献评价指标
	下载次数：167次	浏览次数：52次

【 摘 要 】

Background

Results

Conclusions

【 授权许可】

【 预 览 】

【 图 表 】

【 参考文献 】

【摘要】

【授权许可】

【预览】

【图表】

【参考文献】