期刊论文

【摘要】

Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d_N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of ‘match positions’ and ‘don’t care positions’. Our software is available online and as downloadable source code at: http://spaced.gobics.de/ webcite.

【授权许可】

2015 Morgenstern et al.; licensee BioMed Central.

【预览】

附件列表
Files	Size	Format	View
20150219010705258.pdf	1733KB	PDF	download
Figure 6.	18KB	Image	download
Figure 5.	46KB	Image	download
Figure 4.	41KB	Image	download
Figure 3.	28KB	Image	download
Figure 2.	25KB	Image	download
Figure 1.	74KB	Image	download

【图表】

Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.

【参考文献】

[1]Vinga S: Editorial: Alignment-free methods in computational biology. Briefings Bioinf. 2014, 15:341-2.
[2]Leslie C, Eskin E, Noble WSS: The spectrum kernel: a string kernel for SVM protein classification. In Pacific Symposium on Biocomputing. World Scientific Publishing, Singapore; 2002.
[3]Lingner T, Meinicke P: Remote homology detection based on oligomer distances. Bioinformatics. 2006, 22:2224-31.
[4]Lingner T, Meinicke P: Word correlation matrices for protein sequence analysis and remote homology detection. BMC Bioinf. 2008, 9:259. BioMed Central Full Text
[5]Comin M, Verzotto D: The irredundant class method for remote homology detection of protein sequences. J Comput Biol. 2011, 18:1819-29.
[6]Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics. 2008, 24:713-4.
[7]Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biol. 2009, 10:25. BioMed Central Full Text
[8]Ahmadi A, Behm A, Honnalli N, Li C, Weng L, Xie X: Hobbes: optimized gram-based methods for efficient read alignment. Nucleic Acids Res. 2011, 40:1.
[9]Patro R, Mount SM, Kingsford C: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol. 2014, 32:462-4.
[10]Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18:821-9.
[11]Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner F: Tetra: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in dna sequences. BMC Bioinf. 2004, 5:163. BioMed Central Full Text
[12]Chatterji S, Yamazaki I, Bai Z, Eisen JA: Compostbin: A DNA composition-based algorithm for binning environmental shotgun reads. In Research in Computational Molecular Biology, 12th Annual International Conference, RECOMB 2008, Singapore, March 30 - April 2, 2008. Proceedings. Springer, Berlin, Heidelberg; 2008.
[13]Wu Y-W, Ye Y: A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol. 2011, 18:523-34.
[14]Tanaseichuk O, Borneman J, Jiang T: Separating metagenomic short reads into genomes via clustering. Algorithms Mol Biol. 2012, 7:27. BioMed Central Full Text
[15]Leung HCM, Yiu SM, Yang B, Peng Y, Wang Y, Liu Z, et al.: A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics. 2011, 27:1489-95.
[16]Wang Y, Leung HCM, Yiu SM, Chin FYL: Metacluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics. 2012, 28:356-62.
[17]Meinicke P, Tech M, Morgenstern B, Merkl R: Oligo kernels for datamining on biological sequences: a case study on prokaryotic translation initiation sites. BMC Bioinf. 2004, 5:169. BioMed Central Full Text
[18]Kantorovitz M, Robinson G, Sinha S: A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics. 2007, 23:249-55.
[19]Leung G, Eisen MB: Identifying cis-regulatory sequences by word profile similarity. PloS one. 2009, 4(9):6901.
[20]Federico M, Leoncini M, Montangero M, Valente P: Direct vs 2-stage approaches to structured motif finding. Algorithms Mol Biol. 2012, 7:20. BioMed Central Full Text
[21]Blaisdell BE: A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Nat Acad Sci USA. 1986, 83:5155-9.
[22]Lin J: Divergence measures based on the shannon entropy. IEEE Trans Inf theory. 1991, 37:145-51.
[23]Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. Bioinformatics. 2002, 18:440-5.
[24]Boden M, Schöneich M, Horwege S, Lindner S, Leimeister C-A, Morgenstern B: German Conference on Bioinformatics 2013. [http://drops.dagstuhl.de/opus/volltexte/2013/4233] webciteIn OpenAccess Series in Informatics (OASIcs) Edited by Beißbarth T, Kollmar M, Leha A, Morgenstern B, Schultz A-K, Waack S, Wingender E. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany; 2013. http://drops.dagstuhl.de/opus/volltexte/2013/4233
[25]Leimeister C-A, Boden M, Horwege S, Lindner S, Morgenstern B: Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics. 2014, 30:1991-9.
[26]Horwege S, Lindner S, Boden M, Hatje K, Kollmar M, et al.: Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res. 2014, 42:W7-W11.
[27]Onodera T, Shibuya T: The gapped spectrum kernel for support vector machines. In Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science. Edited by Perner P. Springer, Berlin,Heidelberg; 2013.
[28]Ghandi M, Mohammad-Noori M, Beer MA: Robust k-mer frequency estimation using gapped k-mers. J Math Biol. 2014, 69:469-500.
[29]Ghandi M, Lee D, Mohammad-Noori M, Beer MA: Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput Biol. 2014, 10(7):1003711.
[30]Ulitsky I, Burstein D, Tuller T, Chor B: The average common substring approach to phylogenomic reconstruction. J Comput Biol. 2006, 13:336-50.
[31]Didier G, Debomy L, Pupin M, Zhang M, Grossmann A, Devauchelle C, et al.: Comparing sequences without using alignments: application to HIV/SIV subtyping. BMC Bioinf. 2007, 8:1. BioMed Central Full Text
[32]Sims GE, Jun S-R, Wu GA, Kim S-H: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Nat Acad Sci. 2009, 106:2677-82.
[33]Domazet-Loso M, Haubold B: Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics. 2011, 27(11):1466-72.
[34]Haubold B, Reed FA, Pfaffelhuber P: Alignment-free estimation of nucleotide diversity. Bioinformatics. 2011, 27:449-55.
[35]Comin M, Verzotto D: Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol Biol. 2012, 7:34. BioMed Central Full Text
[36]Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4:406-25.
[37]Haubold B, Pierstorff N, Möller F, Wiehe T: Genome comparison without alignment using shortest unique substrings. BMC Bioinf. 2005, 6:123. BioMed Central Full Text
[38]Yi H, Jin L: Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 2013, 41:75.
[39]Haubold B, Klötzl F, Pfaffelhuber P. andi: Fast and accurate estimation of evolutionary distances between closely related genomes. Bioinformatics.doi:10.1093/bioinformatics/btu815.
[40]Noé L, Martin DEK: A coverage criterion for spaced seeds and its applications to SVM string-kernels and k-mer distances. J Comput Biol. 2014, 12:947-63.
[41]Morgenstern B, Zhu B, Horwege S, Leimeister C: Estimating evolutionary distances from spaced-word matches. In Proc. Workshop on Algorithms in Bioinformatics (WABI’14). Lecture Notes in Bioinformatics. Springer, Berlin Heidelberg.; 2014.
[42]Lippert RA, Huang H, Waterman MS: Distributional regimes for the number of k-word matches between two random sequences. Proc Nat Acad Sci. 2002, 99:13980-9.
[43]Reinert G, Chew D, Sun F, Waterman MS: Alignment-free sequence comparison (i): Statistics and power. J Comput Biol. 2009, 16:1615-34.
[44]Jukes TH, Cantor CR. Evolution of Protein Molecules: Academy Press, NY; 1969.
[45]Robin S, Rodolphe F, Schbath S: DNA, Words and Models: Statistics of Exceptional Words. Cambridge University Press, Cambridge; 2005.
[46]Haubold B, Pfaffelhuber P, Domazet-Loso M, Wiehe T: Estimating mutation distances from unaligned genomes. J Comput Biol. 2009, 16:1487-500.
[47]Leimeister C-A, Morgenstern B: kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics. 2014, 30:2000-8.
[48]Qi J, Luo H, Hao B: CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res. 2004, 32(suppl 2):45-7.
[49]Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989, 5:164-6.
[50]Bonnet E, de Peer YV: zt: A sofware tool for simple and partial mantel tests. J Stat Software. 2002, 7:1-12.
[51]Didier G, Laprevotte I, Pupin M, Hénaut A: Local decoding of sequences and alignment-free comparison. J Comput Biol. 2006, 13:1465-76.
[52]Robinson D, Foulds L: Comparison of phylogenetic trees. Mathematical Biosciences. 1981, 53:131-47.
[53]Zhou Z, Li X, Liu B, Beutin L, Xu J, Ren Y, et al.: Derivation of Escherichia coli O157:H7 from Its O55:H7 Precursor. PLOS One. 2010, 5:8700.
[54]Newton RJ, Griffin LE, Bowles KM, Meile C, Gifford S, Givens CE, et al.: Genome characteristics of a generalist marine bacterial lineage. ISME J. 2010, 4:784-98.

Algorithms for Molecular Biology
Estimating evolutionary distances between genomic sequences from spaced-word matches

Burkhard Morgenstern³ Bingyao Zhu¹ Sebastian Horwege² Chris André Leimeister²
[1] University of Göttingen, Department of General Microbiology, Grisebachstr. 8, Göttingen 37073, Germany
[2] University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, Göttingen 37073, Germany
[3] Université d’Evry Val d’Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA 23 Boulevard de France, Evry 91037, France
关键词: Genome comparison; Variance; Distance estimation; Word frequency; Phylogeny; Alignment-free; Spaced words; k-mers;
Others : 1126855 DOI : 10.1186/s13015-015-0032-x

received in 2014-11-19, accepted in 2015-01-06, 发布年份 2015
PDF


	文献评价指标
	下载次数：152次	浏览次数：18次

【 摘 要 】

【 授权许可】

【 预 览 】

【 图 表 】

【 参考文献 】

【摘要】

【授权许可】

【预览】

【图表】

【参考文献】