期刊论文详细信息
BMC Bioinformatics
Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph
Gaëtan Benoit1  Claire Lemaitre1  Dominique Lavenier1  Erwan Drezen1  Thibault Dayris3  Raluca Uricaru2  Guillaume Rizk1 
[1] INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes 35042, France
[2] University of Bordeaux, CBiB, Bordeaux F-33000, France
[3] University of Bordeaux, CNRS/LaBRI, Talence F-33405, France
关键词: Bloom filter;    NGS;    de Bruijn Graph;    Compression;   
Others  :  1229464
DOI  :  10.1186/s12859-015-0709-7
 received in 2015-04-16, accepted in 2015-08-17,  发布年份 2015
PDF
【 摘 要 】

Background

Data volumes generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip method.

Results

We present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring kmer and a list of bifurcations. The same probabilistic de Bruijn Graph is used to perform a lossy transformation of the quality scores, which allows to obtain higher compression rates without losing pertinent information for downstream analyses.

Conclusions

LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20.

LEON is an open source software, distributed under GNU affero GPL License, available for download at http://gatb.inria.fr/software/leon/.

【 授权许可】

   
2015 Benoit et al.

【 预 览 】
附件列表
Files Size Format View
20151030015329703.pdf 1683KB PDF download
Fig. 6. 66KB Image download
Fig. 5. 24KB Image download
Fig. 4. 64KB Image download
Fig. 3. 35KB Image download
Fig. 2. 39KB Image download
Fig. 1. 22KB Image download
【 图 表 】

Fig. 1.

Fig. 2.

Fig. 3.

Fig. 4.

Fig. 5.

Fig. 6.

【 参考文献 】
  • [1]Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2010; 39:1019.
  • [2]Jones DC, Ruzzo WL, Peng X, Katze MG. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 2012; 40(22):171.
  • [3]Fritz MHY, Leinonen R, Cochrane G, Birney E. Efficient storage of high throughput sequencing data using reference-based compression. Genome Res. 2011; 21:734-40.
  • [4]Kingsford C, Patro R. Reference-based compression of short-read sequences using path encoding. Bioinformatics. 2015; 31:071.
  • [5]Bonfield JK, Mahoney MV. Compression of fastq and sam format sequencing data. PLoS One. 2013; 8(3):59190.
  • [6]Hach F, Numanagic I, Alkan C, Sahinalp SC. Scalce: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics. 2012; 28(23):3051-057.
  • [7]Deorowicz S, Grabowski S. Compression of dna sequence reads in fastq format. Bioinformatics. 2011; 27(6):860-2.
  • [8]Grabowski S, Deorowicz S, Roguski Ł. Disk-based compression of data from genome sequencing. Bioinformatics. 2014; 31:844.
  • [9]Janin L, Schulz-Trieglaff O, Cox AJ. Beetl-fastq: a searchable compressed archive for dna reads. Bioinformatics. 2014; 30:387.
  • [10]Patro R, Kingsford C. Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics. 2015; 31:248.
  • [11]Cox AJ, Bauer MJ, Jakobi T, Rosone G. Large-scale compression of genomic sequence databases with the burrows–wheeler transform. Bioinformatics. 2012; 28(11):1415-9.
  • [12]Wan R, Anh VN, Asai K. Transformations for the compression of fastq quality scores of next-generation sequencing data. Bioinformatics. 2012; 28(5):628-35.
  • [13]Cánovas R, Moffat A, Turpin A. Lossy compression of quality scores in genomic data. Bioinformatics. 2014; 30(15):2130-136.
  • [14]Janin L, Rosone G, Cox AJ. Adaptive reference-free compression of sequence quality scores. Bioinformatics. 2013; 30:257.
  • [15]Yu YW, Yorukoglu D, Berger B. Traversing the k-mer landscape of ngs read datasets for quality score sparsification. In: Research in computational molecular biology. Springer: 2014. p. 385–99.
  • [16]Kirsch A, Mitzenmacher M. Less hashing, same performance: Building a better bloom filter. Algorithms-ESA 2006. 2006:456–67.
  • [17]Chikhi R, Medvedev P. Informed and automated k-mer size selection for genome assembly. Bioinformatics. 2013; 30:310.
  • [18]Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de bruijn graphs. Nat Genet. 2012; 44(2):226-32.
  • [19]Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT. Scaling metagenome sequence assembly with probabilistic de bruijn graphs. Proc Natl Acad Sci. 2012; 109(33):13272-7.
  • [20]Chikhi R, Rizk G. Space-efficient and exact de bruijn graph representation based on a bloom filter. Algorithms Mol Biol. 2013; 8(1):22.
  • [21]Salikhov K, Sacomoto G, Kucherov G. Using cascading bloom filters to improve the memory usage for de brujin graphs. Algoritm Bioinforma. 2013; 9:364-76.
  • [22]Witten I, Neal R, Cleary J. Arithmetic coding for data compression. Commun ACM. 1987; 30:520-540.
  • [23]Drezen E, Rizk G, Chikhi R, Deltel C, Lemaitre C, Peterlongo P, et al. Gatb: Genome assembly and analysis tool box. Bioinformatics. 2014. doi:10.1093/bioinformatics/btu406.
  • [24]Rizk G, Lavenier D, Chikhi R. Dsk: k-mer counting with very low memory usage. Bioinformatics. 2013; 29(5):652-3.
  • [25]Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A. Kmc 2: Fast and resource-frugal k-mer counting. Bioinformatics. 2015; 31:022.
  • [26]Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al.. The sequence alignment/map format and samtools. Bioinformatics. 2009; 25(16):2078-079.
  • [27]Li H, Durbin R. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics. 2009; 25(14):1754-60.
  • [28]Yu YW, Yorukoglu D, Peng J, Berger B. Quality score compression improves genotyping accuracy. Nat Biotechnol. 2015; 33(3):240-3.
  • [29]Lemaitre C, Ciortuz L, Peterlongo P. Mapping-free and assembly-free discovery of inversion breakpoints from raw ngs reads. Algoritm Comput Biol. 2014; 8542:119-30.
  • [30]Uricaru R, Rizk G, Lacroix V, Quillery E, Plantard O, Chikhi R et al.. Reference-free detection of isolated snps. Nucleic Acids Res. 2015; 43(2):11.
  文献评价指标  
  下载次数:92次 浏览次数:52次