Journal article details
BMC Bioinformatics
Sealer: a scalable gap-closing application for finishing draft genomes
Daniel Paulino1  René L. Warren1  Benjamin P. Vandervalk1  Anthony Raymond1  Shaun D. Jackman1  Inanç Birol2 
[1] Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver V5Z 4S6, BC, Canada
[2] Department of Medical Genetics, University of British Columbia, Vancouver V6H 3N1, BC, Canada
Keywords: Bloom filters; Next-generation sequencing; Sealer; Genome finishing; Gap closing
DOI  :  10.1186/s12859-015-0663-4
Received 2015-02-24, accepted 2015-07-07, published 2015
【 Abstract 】

Background

While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment “gaps” – uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes.
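
To make the notion of a gap concrete, the following minimal sketch locates runs of N in a scaffold string and reports the flanking sequence on each side. It is an illustration only, not Sealer's code: the function name `find_gaps`, the flank length `k`, and the toy scaffold are all assumptions introduced here.

```python
import re

def find_gaps(scaffold, k):
    """Return (start, end, left_flank, right_flank) for each run of N.

    start/end are 0-based, end-exclusive coordinates of the gap.
    The flanks are the up-to-k bases immediately before and after it.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", scaffold):
        start, end = match.start(), match.end()
        left_flank = scaffold[max(0, start - k):start]
        right_flank = scaffold[end:end + k]
        gaps.append((start, end, left_flank, right_flank))
    return gaps

# Toy scaffold (hypothetical): two sequenced stretches separated by a 5-base gap.
scaffold = "ACGTACGTNNNNNTTGACCA"
for gap in find_gaps(scaffold, k=4):
    print(gap)  # (8, 13, 'ACGT', 'TTGA')
```

The flanking sequences are the natural starting points for any attempt to bridge the gap using the raw reads.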

Results

Here we describe Sealer, a tool designed to close gaps within assembly scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures. We demonstrate how it scales to successfully close 50.8 % and 13.8 % of gaps in human (3 Gbp) and white spruce (20 Gbp) draft assemblies in under 30 and 27 h, respectively – a feat that is not possible with other leading tools with the breadth of data used in our study.
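
The general idea of navigating a de Bruijn graph stored in a Bloom filter can be sketched as follows. This is a simplified illustration under stated assumptions, not Sealer's implementation: the class `BloomFilter`, the functions `build_filter` and `close_gap`, the hashing scheme, and the single-strand breadth-first search are all choices made here for brevity. The graph is implicit: a k-mer's successors are found by querying the filter for each of the four possible one-base extensions.

```python
import hashlib
from collections import deque

class BloomFilter:
    """Fixed-size bit array with several hash functions (may give false positives)."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive independent hashes by salting BLAKE2b with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

def build_filter(reads, k, size_bits=1 << 16, num_hashes=3):
    """Insert every k-mer of every read into a Bloom filter."""
    bf = BloomFilter(size_bits, num_hashes)
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return bf

def close_gap(bf, left_kmer, right_kmer, max_len):
    """Breadth-first search from left_kmer to right_kmer.

    Returns the shortest connecting sequence (flanking k-mers included),
    or None if no path of at most max_len bases is found.
    """
    k = len(left_kmer)
    queue = deque([left_kmer])
    seen = {left_kmer}
    while queue:
        path = queue.popleft()
        last = path[-k:]
        if last == right_kmer:
            return path
        if len(path) >= max_len:
            continue
        for base in "ACGT":
            successor = last[1:] + base
            if successor in bf and successor not in seen:
                seen.add(successor)
                queue.append(path + base)
    return None

# Toy reads (hypothetical) that overlap across the region to be bridged.
reads = ["ACGTTGCA", "TTGCATGG", "GCATGGA"]
bf = build_filter(reads, k=5)
print(close_gap(bf, "ACGTT", "ATGGA", max_len=20))
```

The filter stores only bits rather than the k-mers themselves, which is where the memory savings come from; the cost is occasional false-positive k-mers, so a practical tool needs safeguards that this sketch leaves out, such as limits on branching and handling of reverse complements.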

Conclusion

Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including those of very large genomes. We expect Sealer to have broad utility for finishing genomes across the tree of life, from bacterial genomes to large plant genomes and beyond. Sealer is available for download at https://github.com/bcgsc/abyss/tree/sealer-release.

【 License 】

   
2015 Paulino et al.

