期刊论文详细信息
BMC Bioinformatics
The tree alignment problem
Andrés Varón1  Ward C Wheeler1 
[1] Division of Invertebrate Zoology, American Museum of Natural History, New York, NY - 10024, USA
关键词: Direct optimization;    Sequence alignment;    Phylogeny;    Tree search;    Tree alignment;   
Others  :  1088077
DOI  :  10.1186/1471-2105-13-293
 received in 2012-05-17, accepted in 2012-10-22,  发布年份 2012
PDF
【 摘 要 】

Background

The inference of homologies among DNA sequences, that is, positions in multiple genomes that share a common evolutionary origin, is a crucial, yet difficult task facing biologists. Its computational counterpart is known as the multiple sequence alignment problem. There are various criteria and methods available to perform multiple sequence alignments, and among these, the minimization of the overall cost of the alignment on a phylogenetic tree is known in combinatorial optimization as the Tree Alignment Problem. This problem typically occurs as a subproblem of the Generalized Tree Alignment Problem, which looks for the tree with the lowest alignment cost among all possible trees. This is equivalent to the Maximum Parsimony problem when the input sequences are not aligned, that is, when phylogeny and alignments are simultaneously inferred.

Results

For large data sets, a popular heuristic is Direct Optimization (DO). DO provides a good tradeoff between speed, scalability, and competitive scores, and is implemented in the computer program POY. All other (competitive) algorithms have greater time complexities compared to DO. Here, we introduce and present experiments a new algorithm Affine-DO to accommodate the indel (alignment gap) models commonly used in phylogenetic analysis of molecular sequence data. Affine-DO has the same time complexity as DO, but is correctly suited for the affine gap edit distance. We demonstrate its performance with more than 330,000 experimental tests. These experiments show that the solutions of Affine-DO are close to the lower bound inferred from a linear programming solution. Moreover, iterating over a solution produced using Affine-DO shows little improvement.

Conclusions

Our results show that Affine-DO is likely producing near-optimal solutions, with approximations within 10% for sequences with small divergence, and within 30% for random sequences, for which Affine-DO produced the worst solutions. The Affine-DO algorithm has the necessary scalability and optimality to be a significant improvement in the real-world phylogenetic analysis of sequence data.

【 授权许可】

   
2012 Varón and Wheeler; licensee BioMed Central Ltd.

【 预 览 】
附件列表
Files Size Format View
20150117072749365.pdf 974KB PDF download
Figure 9. 106KB Image download
Figure 8. 29KB Image download
Figure 7. 110KB Image download
Figure 6. 82KB Image download
Figure 5. 11KB Image download
Figure 4. 38KB Image download
Figure 3. 26KB Image download
Figure 2. 12KB Image download
Figure 1. 40KB Image download
【 图 表 】

Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.

Figure 7.

Figure 8.

Figure 9.

【 参考文献 】
  • [1]Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions–specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22:4673-4680.
  • [2]Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 1999, 15:211-218.
  • [3]Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucl Acids Res 2002, 30:3059-3066.
  • [4]Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5:113. [http://www.biomedcentral.com/1471-2105/6/113 webcite] BioMed Central Full Text
  • [5]Fleissner R, Metzler D, von Haeseler A: Simultaneous Statistical Multiple Alignment and Phylogeny Reconstruction. Syst Biol 2005, 54(4):548-561.
  • [6]Redelings BD, Suchard MA: Joint Bayesian Estimation of Alignment and Phylogeny. Syst Biol 2005, 54:401-418.
  • [7]Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 2005, 15:330-340.
  • [8]Wheeler WC: Dynamic Homology and the Likelihood Criterion. Cladistics 2006, 22:157-170.
  • [9]Nelesen S, Liu K, Zhao D, Linder CR, Warnow T: The effect of the guide tree on multiple sequence alignments and subsequenct phylogenetic analyses. Pac Symp Biocomputing 2008, 13:25-36.
  • [10]Sankoff D: Minimal Mutation Trees of Sequences. SIAM J Appl Mathematics 1975, 28:35-42.
  • [11]Sankoff D, Cedergren RJ, Lapalme G: Frequency of Insertion-Deletion, Transversion, and Transition in the Evolution of 5S Ribosomal RNA. J Mol Evol 1976, 7:133-149.
  • [12]Sankoff D, Cedergren RJ: Simultaneous Comparison of Three or more Sequences Related by a Tree. Addison-Wesley: Reading, MA; 1983:. 253–263
  • [13]Hein J: A New Method That Simultaneously Aligns and Reconstructs Ancestral Sequences for Any Number of Homologous Sequences, When The Phylogeny is Given. Mol Biol Evol 1989, 6(6):649-668.
  • [14]Hein J: Unified approach to alignment and phylogenies. Methods in Enzymology 1990, 183:626-645.
  • [15]Wheeler WC: Optimization Alignment: The End of Multiple Sequence Alignment in Phylogenetics? Cladistics 1996, 12:1-9.
  • [16]Cartwright RA: Logarithmic gap costs decrease alignment accuracy. BMC Bioinformatics 2006, 7:527-539. BioMed Central Full Text
  • [17]Liu K, Nelesen S, Raghavan S, Linder CR, Warnow T: Barking up the wrong treelength: the impact of gap penalty on alignment and tree accuracy. IEEE Trans Comput Biol Bioinf 2009, 6:7-21.
  • [18]Waterman MS, Smith TF, Beyer WA: Some biological sequence metrics. Advances in Mathematics 1976, 20(3):367-387. [http://www.sciencedirect.com/science/article/B6W9F-4CRY72S-1TG/1/ad09f046408307294171dca4c664d801 webcite]
  • [19]Benner SA, Cohen MA: Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J Mol Evol 1993, 229:1065-1082.
  • [20]Gu X, Li WH: The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J Mol Evol 1995, 40(4):464-473. [http://dx.doi.org/10.1007/BF00164032 webcite]
  • [21]Zhang Z, Gerstein M: Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucl Acids Res 2003, 31(18):5338-5348. [http://nar.oxfordjournals.org/cgi/content/abstract/31/18/5338 webcite]
  • [22]Chang MSS, Benner SA: Empirical Analysis of Protein Insertions and Deletions Determining Parameters for the Correct Placement of Gaps in Protein Sequence Alignments. J Mol Biol 2004, 341(2):617-631. [http://www.sciencedirect.com/science/article/B6WK7-4CMHDHJ-6/2/9cbe746387e0610d53e294114342f02c webcite]
  • [23]Wheeler WC, Gladstein D De Laet: POY, Phylogeny Reconstruction via Optimization of DNA and other Data version 3.0.11 (May 6 of 2003). American Museum of Natural History; 2003. [ftp://ftp.amnh.org webcite]
  • [24]Varón A, Vinh LS, Wheeler WC: POY version 4: phylogenetic analysis using dynamic homologies. Cladistics 2009, 26:72-85.
  • [25]Lancia G, Ravi R: GESTALT: Genomic steiner alignments. Lecture Notes in Computer Science 1999, 1645:101.
  • [26]Lancia G, Ravi R: SALSA: Sequence alignment via Steiner Ancestors. 2008. [http://citeseer.ist.psu.edu/356333.html webcite]
  • [27]Schwikowski B, Vingron M: Weighted sequence graphs: boosting iterated dynamic programming using locally suboptimal solutions. Discrete Appl Math 2003, 127:95-117.
  • [28]Ogden TH, Rosenberg MS: Alignment and Topological Accuracy of the Direct Optimization approach via POY and Traditional Phylogenetics via ClustalW + PAUP*. Syst Biol 2007, 56(2):182-193.
  • [29]Lehtonen S: Phylogeny Estimation and Alignment via POY versus Clustal + PAUP*: A Response to Ogden and Rosenberg (2007). Syst Biol 2008, 57(4):653-657.
  • [30]Wheeler WC: Sequence Alignment, edited by M. S. Rosenberg. Berkeley, CA, USA: University of California Press; 2009. chap. Simulation Approaches to Evaluating Alignment Error and Methods for Comparing Alternate Alignments: 179–208
  • [31]Wang L, Jiang T: On the Complexity of Multiple Sequence Alignment. J Comput Biol 1994, 1:337-348.
  • [32]Yue F, Shi J, Tang J: Simultaneous phylogeny reconstruction and multiple sequence alignment. BMC Bioinf 2009, 10(Suppl 1):S11. BioMed Central Full Text
  • [33]Schwikowski B, Vingron M: The deferred path heuristic for the generalized tree alignment problem. In RECOMB ’97: Proceedings of the first annual international conference on Computational molecular biology. New York, NY, USA: ACM Press; 1997:257-266. [http://doi.acm.org/10.1145/267521.267884 webcite]
  • [34]Wang L, Jiang T, Lawler EL: Approximation Algorithms for Tree Alignment with a Given Phylogeny. Algorithmica 1996, 16:302-315.
  • [35]Wang L, Gusfield D: Impoved Approximation Algorithms for Tree Alignment. J Algorithms 1997, 25(2):255-273.
  • [36]Ravi R, Kececioglu JD: Approximation algorithms for multiple sequence alignment under a fixed evolutionary tree. Discret Appl Math 1998, 88:355-366.
  • [37]Wang L, Jiang T, Gusfield D: A More Efficient Approximation Scheme for Tree Alignment. SIAM J Comput 2000, 30:283-299.
  • [38]Wheeler WC, Aagesen L, Arango CP, Faivovich J, Grant T, D’Haese C, Janies D, Smith WL, Varón A, Giribet G: Dynamic Homology and Phylogenetic Systematics: A Unified Approach using POY.. American Museum of Natural History; 2006 pp. 365.
  • [39]Needleman SB, Wunsch CD: A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J Mol Biol 1970, 48:443-453.
  • [40]Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol 1982, 162:705-708.
  • [41]Ukkonen E: Algorithms for approximate string matching. Inf Control 1985, 64(1-3):100-118.
  • [42]Cartwright R A: DNA Assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics 2005, 21(Suppl. 3):iii31-iii38.
  • [43]Wheeler WC: Fixed Character States and the Optimization of Molecular Sequence Data. Cladistics 1999, 15:379-385.
  • [44]Powell DR, Allison L, Dix TI: Fast optimal alignment of three sequences using linear gap costs. J Theor Biol 2000, 207:325-336.
  • [45]Yue F, Tang J: A divide-and-conquer implementation of three sequence alignment and ancestor inference with affine gap costs. The IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2007),143-150.
  • [46]Varón A, Wheeler WC: Application note: on extension gap in POY version 3. Cladistics 2008, 24(6):1070-1070.
  文献评价指标  
  下载次数:136次 浏览次数:19次