Journal article details
BMC Bioinformatics
NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly
Barbara A Methé [1], Roger S Lasken [1], Derrick E Fouts [2], Indresh Singh [3], Pratap Venepally [3], Jamison M McCorrison [3]
[1] Department of Microbial & Environmental Genomics, The J. Craig Venter Institute (JCVI), 9704 Medical Center Drive, Rockville, MD 20850, USA
[2] Department of Genomic Medicine, The J. Craig Venter Institute (JCVI), 9704 Medical Center Drive, Rockville, MD 20850, USA
[3] Informatics Core Services, The J. Craig Venter Institute (JCVI), 9704 Medical Center Drive, Rockville, MD 20850, USA
Keywords: Multiple displacement amplification; Transcriptomics; SISPA; Single cell; Normalization; Coverage reduction; de novo assembly; Bioinformatics
DOI: 10.1186/s12859-014-0357-3
Received 2013-12-02; accepted 2014-10-22; published 2014
【 Abstract 】

Background

Deep shotgun sequencing on next-generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and high sequencing error rates of modern sequencers present new computational challenges in data interpretation, including mapping and de novo assembly. New lab techniques such as multiple displacement amplification (MDA) of single cells and sequence-independent single-primer amplification (SISPA) allow for sequencing of organisms that cannot be cultured, but generate highly variable coverage due to amplification biases.

Results

Here we introduce NeatFreq, a software tool that reduces a data set to more uniform coverage by clustering and selecting from reads binned by their median k-mer frequency (RMKF) and uniqueness. Previous algorithms normalize read coverage based on RMKF, but do not include methods for the preferred selection of (1) extremely low coverage regions produced by extremely variable sequencing of random-primed products and (2) 2-sided paired-end sequences. The algorithm increases the incorporation of the most unique, lowest-coverage segments of a genome using an error-corrected data set. NeatFreq was applied to bacterial, viral plaque, and single-cell sequencing data. The algorithm showed an increase in the rate at which the most unique reads in a genome were included in the assembled consensus while also reducing the count of duplicative and erroneous contigs (strings of high confidence overlaps) in the deliverable consensus. The results obtained from conventional Overlap-Layout-Consensus (OLC) assembly were compared to simulated multi-de Bruijn graph assembly alternatives trained for variable coverage input, using sequence data before and after normalization of coverage. Coverage reduction was shown to increase processing speed and reduce memory requirements when using conventional bacterial assembly algorithms.
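
To make the idea of binning reads by median k-mer frequency concrete, the following is a minimal Python sketch and not NeatFreq's actual implementation. The function names, the `target` coverage parameter, and the rule of keeping a `target / RMKF` fraction of each high-frequency bin are all assumptions made for illustration. The sketch counts k-mers on the forward strand only and omits the uniqueness clustering, error correction, and paired-end handling described above.

```python
import random
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_median_kmer_freq(read, counts, k):
    """Median frequency of the read's k-mers, truncated to an integer bin."""
    freqs = [counts[km] for km in kmers(read, k)]
    return int(median(freqs)) if freqs else 0


def normalize(reads, k=5, target=2, seed=0):
    """Bin reads by RMKF, keep low-frequency bins whole, subsample the rest.

    Hypothetical selection rule: a bin whose RMKF exceeds `target`
    keeps roughly a target/RMKF fraction of its reads (at least one).
    """
    counts = count_kmers(reads, k)
    bins = {}
    for read in reads:
        bins.setdefault(read_median_kmer_freq(read, counts, k), []).append(read)

    rng = random.Random(seed)
    kept = []
    for rmkf in sorted(bins):
        members = bins[rmkf]
        if rmkf <= target:
            kept.extend(members)  # rare regions are always retained
        else:
            n_keep = max(1, round(len(members) * target / rmkf))
            kept.extend(rng.sample(members, n_keep))
    return kept


if __name__ == "__main__":
    # Ten copies of an over-sampled read plus one read from a rare region.
    reads = ["ACGTACGTAC"] * 10 + ["TTGCAGGCTA"]
    for read in normalize(reads):
        print(read)
```

In this toy input the over-sampled read falls into a high-RMKF bin and is subsampled, while the single read from the rare region is always retained, which mirrors the abstract's emphasis on preserving the lowest-coverage segments.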

Conclusions

The normalization of deep coverage spikes, which would otherwise inhibit consensus resolution, enables High Throughput Sequencing (HTS) assembly projects to consistently run to completion with existing assembly software. The NeatFreq software package is free, open source and available at https://github.com/bioh4x/NeatFreq.

【 License 】

   
2014 McCorrison et al.; licensee BioMed Central Ltd.

【 Figures 】

Figure 1.

Figure 2.

Figure 3.
