BMC Bioinformatics | |
FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets | |
Cuong Cao Dang1  Vinh Sy Le1  Olivier Gascuel4  Bart Hazes3  Quang Si Le2  | |
[1] VNU University of Engineering and Technology, Hanoi, Vietnam | |
[2] The Wellcome Trust Center for Human Genetics, Oxford University, Oxford, UK | |
[3] Department of Medical Microbiology & Immunology, University of Alberta, Alberta, Canada | |
[4] Institut de Biologie Computationnelle, LIRMM, CNRS – Université Montpellier 2, Montpellier, France | |
关键词: Large data sets; Protein alignments; Phylogenetic trees; Maximum likelihood methods; Amino acid replacement rate matrices; | |
Others : 1085423 DOI : 10.1186/1471-2105-15-341 |
|
received in 2014-05-13, accepted in 2014-09-29, 发布年份 2014 | |
【 摘 要 】
Background
Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating amino acid replacement rate matrices requires large protein alignments and is computationally expensive and complex. As a compromise, sub-optimal pre-calculated generic matrices are typically used for protein-based phylogeny. Sequence availability has now grown to a point where problem-specific rate matrices can often be calculated if the computational cost can be controlled.
Results
The most time consuming step in estimating rate matrices by maximum likelihood is building maximum likelihood phylogenetic trees from protein alignments. We propose a new procedure, called FastMG, to overcome this obstacle. The key innovation is the alignment-splitting algorithm that splits alignments with many sequences into non-overlapping sub-alignments prior to estimating amino acid replacement rates. Experiments with different large data sets showed that the FastMG procedure was an order of magnitude faster than without splitting. Importantly, there was no apparent loss in matrix quality if an appropriate splitting procedure is used.
Conclusions
FastMG is a simple, fast and accurate procedure to estimate amino acid replacement rate matrices from large data sets. It enables researchers to study the evolutionary relationships for specific groups of proteins or taxa with optimized, data-specific amino acid replacement rate matrices. The programs, data sets, and the new mammalian mitochondrial protein rate matrix are available at http://fastmg.codeplex.com webcite.
【 授权许可】
2014 Dang et al.; licensee BioMed Central Ltd.
【 预 览 】
Files | Size | Format | View |
---|---|---|---|
20150113173203956.pdf | 921KB | download | |
Figure 1. | 43KB | Image | download |
【 图 表 】
Figure 1.
【 参考文献 】
- [1]Felsenstein J: Inferring Phylogenies. Sunderland, MA, USA: Sinauer Associates; 2004.
- [2]Yang Z: Computational Molecular Evolution. Oxford, UK: Oxford University Press; 2006.
- [3]Thorne JL: Models of protein sequence evolution and their applications. Curr Opin Genet Dev 2000, 10:602-605.
- [4]Dayhoff M, Schwartz R, Orcutt B: A model of evolutionary change in proteins. Atlas Protein Seq Struct 1978, 5:345-351.
- [5]Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci CABIOS 1992, 8:275-282.
- [6]Adachi J, Hasegawa M: Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 1996, 42:459-468.
- [7]Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 2001, 18:691-699.
- [8]Le QS, Gascuel O: An improved general amino acid replacement matrix. Mol Biol Evol 2008, 25:1307-1320.
- [9]Le DV, Dang CC, Le QS, Le VS: A Fast and Efficient Method for Estimating Amino Acid Substitution Models. In Proceedings of The Third International Conference on Knowledge and Systems Engineering. Edited by Ho TB, McKay RI, Nguyen XH, Bui TD. New York, NY, USA: IEEE Publishing; 2011:85-91.
- [10]Dang CC, Lefort V, Le VS, Le QS, Gascuel O: ReplacementMatrix: a web server for maximum-likelihood estimation of amino acid replacement rate matrices. Bioinformatics 2011, 27:2758-2760.
- [11]Dang CC, Le QS, Gascuel O, Le VS: FLU, an amino acid substitution model for influenza proteins. BMC Evol Biol 2010, 10:99-110. BioMed Central Full Text
- [12]Chor B, Tuller T: Maximum likelihood of evolutionary trees: hardness and approximation. Bioinformatics 2005, 21:97-106.
- [13]Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52:696-704.
- [14]Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 2010, 59:307-321.
- [15]Le VS, von Haeseler A: IQPNNI: moving fast through tree space and stopping in time. Mol Biol Evol 2004, 21:1565-1571.
- [16]Stamatakis A, Ludwig T, Meier H: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 2005, 21:456-463.
- [17]Schneider R, de Daruvar A, Sander C: The HSSP database of protein structure-sequence alignments. Nucleic Acids Res 1997, 25:226-230.
- [18]Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer ELL: The Pfam protein families database. Nucleic Acids Res 2002, 30:276-280.
- [19]Klosterman PS, Uzilov AV, Bendaña YR, Bradley RK, Chao S, Kosiol C, Goldman N, Holmes I: XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinformatics 2006, 7:428-453. BioMed Central Full Text
- [20]Kishino H, Hasegawa M: Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. J Mol Evol 1989, 29:170-179.
- [21]Kohavi R: A Study of Cross-validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conferences on Artificial Intelligence. Montreal: Morgan Kaufmann Publishers Inc; 1995:1137-1143.
- [22]Yang Z, Nielsen R, Hasegawa M: Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol 1998, 15:1600-1611.
- [23]Blackshields G, Larkin M, Wallace IM, Wilm A, Higgins DG: Fast embedding methods for clustering tens of thousands of sequences. Comput Biol Chem 2008, 32(4):282-286.
- [24]Price MN, Dehal PS, Arkin AP: FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 2010, 5:e9490.
- [25]Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, Dufayard J-F, Guindon S, Lefort V, Lescot M, Claverie J-M, Gascuel O: Phylogeny.fr: robust phylogenetic analysis for the non-specialist. Nucleic Acids Res 2008, 36(suppl 2):W465-W469.
- [26]Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4:406-425.
- [27]Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol 1997, 14:685-695.