Algorithms for Molecular Biology | |
Estimating evolutionary distances between genomic sequences from spaced-word matches | |
Burkhard Morgenstern3  Bingyao Zhu1  Sebastian Horwege2  Chris André Leimeister2  | |
[1] University of Göttingen, Department of General Microbiology, Grisebachstr. 8, Göttingen 37073, Germany | |
[2] University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, Göttingen 37073, Germany | |
[3] Université d’Evry Val d’Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA 23 Boulevard de France, Evry 91037, France | |
关键词: Genome comparison; Variance; Distance estimation; Word frequency; Phylogeny; Alignment-free; Spaced words; k-mers; | |
Others : 1126855 DOI : 10.1186/s13015-015-0032-x |
|
received in 2014-11-19, accepted in 2015-01-06, 发布年份 2015 | |
【 摘 要 】
Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator dN of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of ‘match positions’ and ‘don’t care positions’. Our software is available online and as downloadable source code at: http://spaced.gobics.de/ webcite.
【 授权许可】
2015 Morgenstern et al.; licensee BioMed Central.
【 预 览 】
Files | Size | Format | View |
---|---|---|---|
20150219010705258.pdf | 1733KB | download | |
Figure 6. | 18KB | Image | download |
Figure 5. | 46KB | Image | download |
Figure 4. | 41KB | Image | download |
Figure 3. | 28KB | Image | download |
Figure 2. | 25KB | Image | download |
Figure 1. | 74KB | Image | download |
【 图 表 】
Figure 1.
Figure 2.
Figure 3.
Figure 4.
Figure 5.
Figure 6.
【 参考文献 】
- [1]Vinga S: Editorial: Alignment-free methods in computational biology. Briefings Bioinf. 2014, 15:341-2.
- [2]Leslie C, Eskin E, Noble WSS: The spectrum kernel: a string kernel for SVM protein classification. In Pacific Symposium on Biocomputing. World Scientific Publishing, Singapore; 2002.
- [3]Lingner T, Meinicke P: Remote homology detection based on oligomer distances. Bioinformatics. 2006, 22:2224-31.
- [4]Lingner T, Meinicke P: Word correlation matrices for protein sequence analysis and remote homology detection. BMC Bioinf. 2008, 9:259. BioMed Central Full Text
- [5]Comin M, Verzotto D: The irredundant class method for remote homology detection of protein sequences. J Comput Biol. 2011, 18:1819-29.
- [6]Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics. 2008, 24:713-4.
- [7]Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biol. 2009, 10:25. BioMed Central Full Text
- [8]Ahmadi A, Behm A, Honnalli N, Li C, Weng L, Xie X: Hobbes: optimized gram-based methods for efficient read alignment. Nucleic Acids Res. 2011, 40:1.
- [9]Patro R, Mount SM, Kingsford C: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol. 2014, 32:462-4.
- [10]Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18:821-9.
- [11]Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner F: Tetra: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in dna sequences. BMC Bioinf. 2004, 5:163. BioMed Central Full Text
- [12]Chatterji S, Yamazaki I, Bai Z, Eisen JA: Compostbin: A DNA composition-based algorithm for binning environmental shotgun reads. In Research in Computational Molecular Biology, 12th Annual International Conference, RECOMB 2008, Singapore, March 30 - April 2, 2008. Proceedings. Springer, Berlin, Heidelberg; 2008.
- [13]Wu Y-W, Ye Y: A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol. 2011, 18:523-34.
- [14]Tanaseichuk O, Borneman J, Jiang T: Separating metagenomic short reads into genomes via clustering. Algorithms Mol Biol. 2012, 7:27. BioMed Central Full Text
- [15]Leung HCM, Yiu SM, Yang B, Peng Y, Wang Y, Liu Z, et al.: A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics. 2011, 27:1489-95.
- [16]Wang Y, Leung HCM, Yiu SM, Chin FYL: Metacluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics. 2012, 28:356-62.
- [17]Meinicke P, Tech M, Morgenstern B, Merkl R: Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinf. 2004, 5:169. BioMed Central Full Text
- [18]Kantorovitz M, Robinson G, Sinha S: A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics. 2007, 23:249-55.
- [19]Leung G, Eisen MB: Identifying cis-regulatory sequences by word profile similarity. PloS one. 2009, 4(9):6901.
- [20]Federico M, Leoncini M, Montangero M, Valente P: Direct vs 2-stage approaches to structured motif finding. Algorithms Mol Biol. 2012, 7:20. BioMed Central Full Text
- [21]Blaisdell BE: A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Nat Acad Sci USA. 1986, 83:5155-9.
- [22]Lin J: Divergence measures based on the shannon entropy. IEEE Trans Inf theory. 1991, 37:145-51.
- [23]Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. Bioinformatics. 2002, 18:440-5.
- [24]Boden M, Schöneich M, Horwege S, Lindner S, Leimeister C-A, Morgenstern B: German Conference on Bioinformatics 2013. [http://drops.dagstuhl.de/opus/volltexte/2013/4233] webciteIn OpenAccess Series in Informatics (OASIcs) Edited by Beißbarth T, Kollmar M, Leha A, Morgenstern B, Schultz A-K, Waack S, Wingender E. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany; 2013. http://drops.dagstuhl.de/opus/volltexte/2013/4233
- [25]Leimeister C-A, Boden M, Horwege S, Lindner S, Morgenstern B: Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics. 2014, 30:1991-9.
- [26]Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, et al.: Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res. 2014, 42:W7-W11.
- [27]Onodera T, Shibuya T: The gapped spectrum kernel for support vector machines. In Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science. Edited by Perner P. Springer, Berlin,Heidelberg; 2013.
- [28]Ghandi M, Mohammad-Noori M, Beer MA: Robust k-mer frequency estimation using gapped k-mers. J Math Biol. 2014, 69:469-500.
- [29]Ghandi M, Lee D, Mohammad-Noori M, Beer MA: Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput Biol. 2014, 10(7):1003711.
- [30]Ulitsky I, Burstein D, Tuller T, Chor B: The average common substring approach to phylogenomic reconstruction. J Comput Biol. 2006, 13:336-50.
- [31]Didier G, Debomy L, Pupin M, Zhang M, Grossmann A, Devauchelle C, et al.: Comparing sequences without using alignments: application to HIV/SIV subtyping. BMC Bioinf. 2007, 8:1. BioMed Central Full Text
- [32]Sims GE, Jun S-R, Wu GA, Kim S-H: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Nat Acad Sci. 2009, 106:2677-82.
- [33]Domazet-Loso M, Haubold B: Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics. 2011, 27(11):1466-72.
- [34]Haubold B, Reed FA, Pfaffelhuber P: Alignment-free estimation of nucleotide diversity. Bioinformatics. 2011, 27:449-55.
- [35]Comin M, Verzotto D: Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol Biol. 2012, 7:34. BioMed Central Full Text
- [36]Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4:406-25.
- [37]Haubold B, Pierstorff N, Möller F, Wiehe T: Genome comparison without alignment using shortest unique substrings. BMC Bioinf. 2005, 6:123. BioMed Central Full Text
- [38]Yi H, Jin L: Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 2013, 41:75.
- [39]Haubold B, Klötzl F, Pfaffelhuber P. andi: Fast and accurate estimation of evolutionary distances between closely related genomes. Bioinformatics.doi:10.1093/bioinformatics/btu815.
- [40]Noé L, Martin DEK: A coverage criterion for spaced seeds and its applications to SVM string-kernels and k-mer distances. J Comput Biol. 2014, 12:947-63.
- [41]Morgenstern B, Zhu B, Horwege S, Leimeister C: Estimating evolutionary distances from spaced-word matches. In Proc. Workshop on Algorithms in Bioinformatics (WABI’14). Lecture Notes in Bioinformatics. Springer, Berlin Heidelberg.; 2014.
- [42]Lippert RA, Huang H, Waterman MS: Distributional regimes for the number of k-word matches between two random sequences. Proc Nat Acad Sci. 2002, 99:13980-9.
- [43]Reinert G, Chew D, Sun F, Waterman MS: Alignment-free sequence comparison (i): Statistics and power. J Comput Biol. 2009, 16:1615-34.
- [44]Jukes TH, Cantor CR. Evolution of Protein Molecules: Academy Press, NY; 1969.
- [45]Robin S, Rodolphe F, Schbath S: DNA, Words and Models: Statistics of Exceptional Words. Cambridge University Press, Cambridge; 2005.
- [46]Haubold B, Pfaffelhuber P, Domazet-Loso M, Wiehe T: Estimating mutation distances from unaligned genomes. J Comput Biol. 2009, 16:1487-500.
- [47]Leimeister C-A, Morgenstern B: kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics. 2014, 30:2000-8.
- [48]Qi J, Luo H, Hao B: CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res. 2004, 32(suppl 2):45-7.
- [49]Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989, 5:164-6.
- [50]Bonnet E, de Peer YV: zt: A sofware tool for simple and partial mantel tests. J Stat Software. 2002, 7:1-12.
- [51]Didier G, Laprevotte I, Pupin M, Hénaut A: Local decoding of sequences and alignment-free comparison. J Comput Biol. 2006, 13:1465-76.
- [52]Robinson D, Foulds L: Comparison of phylogenetic trees. Mathematical Biosciences. 1981, 53:131-47.
- [53]Zhou Z, Li X, Liu B, Beutin L, Xu J, Ren Y, et al.: Derivation of Escherichia coli O157:H7 from Its O55:H7 Precursor. PLOS One. 2010, 5:8700.
- [54]Newton RJ, Griffin LE, Bowles KM, Meile C, Gifford S, Givens CE, et al.: Genome characteristics of a generalist marine bacterial lineage. ISME J. 2010, 4:784-98.