期刊论文详细信息
BMC Research Notes
Suffix tree searcher: exploration of common substrings in large DNA sequence sets
Chris Upton1  Chris Kelly1  Marina G Barsky2  Song-Han Lin1  Michael J Whitney1  David Minkley1 
[1] Department of Biochemistry and Microbiology, University of Victoria, Ring Road, Victoria, BC V8W 3P6, Canada;Current address: Ontario Institute for Cancer Research, MaRS Centre, Toronto, ON M5G 0A3, Canada
关键词: STS;    DNA sequence;    Substring;    Genome;    Suffix tree;   
Others  :  1131833
DOI  :  10.1186/1756-0500-7-466
 received in 2014-02-12, accepted in 2014-07-18,  发布年份 2014
PDF
【 摘 要 】

Background

Large DNA sequence data sets require special bioinformatics tools to search and compare them. Such tools should be easy to use so that the data can be easily accessed by a wide array of researchers. In the past, the use of suffix trees for searching DNA sequences has been limited by a practical need to keep the trees in RAM. Newer algorithms solve this problem by using disk-based approaches. However, none of the fastest suffix tree algorithms have been implemented with a graphical user interface, preventing their incorporation into a feasible laboratory workflow.

Results

Suffix Tree Searcher (STS) is designed as an easy-to-use tool to index, search, and analyze very large DNA sequence datasets. The program accommodates very large numbers of very large sequences, with aggregate size reaching tens of billions of nucleotides. The program makes use of pre-sorted persistent "building blocks" to reduce the time required to construct new trees. STS is comprised of a graphical user interface written in Java, and four C modules. All components are automatically downloaded when a web link is clicked. The underlying suffix tree data structure permits extremely fast searching for specific nucleotide strings, with wild cards or mismatches allowed. Complete tree traversals for detecting common substrings are also very fast. The graphical user interface allows the user to transition seamlessly between building, traversing, and searching the dataset.

Conclusions

Thus, STS provides a new resource for the detection of substrings common to multiple DNA sequences or within a single sequence, for truly huge data sets. The re-searching of sequence hits, allowing wild card positions or mismatched nucleotides, together with the ability to rapidly retrieve large numbers of sequence hits from the DNA sequence files, provides the user with an efficient method of evaluating the similarity between nucleotide sequences by multiple alignment or use of Logos. The ability to re-use existing suffix tree pieces considerably shortens index generation time. The graphical user interface enables quick mastery of the analysis functions, easy access to the generated data, and seamless workflow integration.

【 授权许可】

   
2014 Minkley et al.; licensee BioMed Central Ltd.

【 预 览 】
附件列表
Files Size Format View
20150303084004215.pdf 807KB PDF download
Figure 1. 106KB Image download
【 图 表 】

Figure 1.

【 参考文献 】
  • [1]Bentley S: Taming the next-gen beast. Nat Rev Microbiol 2010, 8(3):161.
  • [2]Metzker ML: Sequencing technologies - the next generation. Nat Rev Genet 2010, 11(1):31-46.
  • [3]van Vliet AHM: Next generation sequencing of microbial transcriptomes: challenges and opportunities. FEMS Microbiol Lett 2010, 302(1):1-7.
  • [4]Sadeque A, Barsky M, Marass F, Kruczkiewicz P, Upton C: JaPaFi: A Novel Program for the Identification of Highly Conserved DNA Sequences. Viruses 2010, 2(9):1867-1885.
  • [5]Gusfield D: Introduction to Suffix Trees. In Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press; 1997:89-93.
  • [6]Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004, 5(2):R12. BioMed Central Full Text
  • [7]Nicolas J, Durand P, Ranchy G, Tempel S, Valin A-S: Suffix-tree analyser (STAN): looking for nucleotidic and peptidic patterns in chromosomes. Bioinformatics 2005, 21(24):4408-4410.
  • [8]The Vmatch large scale sequence analysis software. [ http://vmatch.de/ webcite]
  • [9]Phoophakdee B, Zaki MJ: TRELLIS+: an effective approach for indexing genome-scale sequences using suffix trees. Pac Symp Biocomput 2008, 13:90-101.
  • [10]Barsky M, Stege U, Thomo A, Upton C: A new method for indexing genomes using on-disk suffix trees. In CIKM ’08. New York, New York, USA: ACM Press; 2008:649.
  • [11]Mansour E, Allam A, Skiadopoulos S, Kalnis P: ERA: Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings. Proceedings of the VLDB Endowment 2011, 5(1):49-60.
  • [12]Libdivsufsort. [ http://code.google.com/p/libdivsufsort/ webcite]
  • [13]Manzini G: Two Space Saving Tricks for Linear Time LCP Array Computation. In Algorithm Theory - SWAT 2004. Berlin, Heidelberg: Springer-Verlag; 2004:372-383.
  • [14]Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res 2013, 41(D1):D36-D42.
  • [15]Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403-410.
  • [16]Greengenes [ http://greengene.uml.edu/ webcite]
  • [17]Fitzgerald LA, Boucher PT, Yanai-Balser GM, Suhre K, Graves MV, Van Etten JL: Putative gene promoter sequences in the chlorella viruses. Virology 2008, 380(2):388-393.
  • [18]Upton C, Hogg D, Perrin D, Boone M, Harris NL: Viral genome organizer: a system for analyzing complete viral genomes. Virus Res 2000, 70(1–2):55-64.
  • [19]Onimatsu H, Suganuma K, Uenoyama S, Yamada T: C-terminal repetitive motifs in Vp130 present at the unique vertex of the Chlorovirus capsid are essential for binding to the host Chlorella cell wall. Virology 2006, 353(2):433-442.
  • [20]Barsky M, Stege U, Thomo A, Upton C: Suffix Trees for Very Large Genomic Sequences. In CIKM’09, November 2–6, 2009, Hong Kong, China. New York, New York, USA: ACM Press; 2009:1-4.
  文献评价指标  
  下载次数:17次 浏览次数:11次