期刊论文

【摘要】

Background

Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating amino acid replacement rate matrices requires large protein alignments and is computationally expensive and complex. As a compromise, sub-optimal pre-calculated generic matrices are typically used for protein-based phylogeny. Sequence availability has now grown to a point where problem-specific rate matrices can often be calculated if the computational cost can be controlled.

Results

The most time consuming step in estimating rate matrices by maximum likelihood is building maximum likelihood phylogenetic trees from protein alignments. We propose a new procedure, called FastMG, to overcome this obstacle. The key innovation is the alignment-splitting algorithm that splits alignments with many sequences into non-overlapping sub-alignments prior to estimating amino acid replacement rates. Experiments with different large data sets showed that the FastMG procedure was an order of magnitude faster than without splitting. Importantly, there was no apparent loss in matrix quality if an appropriate splitting procedure is used.

Conclusions

FastMG is a simple, fast and accurate procedure to estimate amino acid replacement rate matrices from large data sets. It enables researchers to study the evolutionary relationships for specific groups of proteins or taxa with optimized, data-specific amino acid replacement rate matrices. The programs, data sets, and the new mammalian mitochondrial protein rate matrix are available at http://fastmg.codeplex.com webcite.

【授权许可】

2014 Dang et al.; licensee BioMed Central Ltd.

【预览】

附件列表
Files	Size	Format	View
20150113173203956.pdf	921KB	PDF	download
Figure 1.	43KB	Image	download

【图表】

Figure 1.

【参考文献】

[1]Felsenstein J: Inferring Phylogenies. Sunderland, MA, USA: Sinauer Associates; 2004.
[2]Yang Z: Computational Molecular Evolution. Oxford, UK: Oxford University Press; 2006.
[3]Thorne JL: Models of protein sequence evolution and their applications. Curr Opin Genet Dev 2000, 10:602-605.
[4]Dayhoff M, Schwartz R, Orcutt B: A model of evolutionary change in proteins. Atlas Protein Seq Struct 1978, 5:345-351.
[5]Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci CABIOS 1992, 8:275-282.
[6]Adachi J, Hasegawa M: Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 1996, 42:459-468.
[7]Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 2001, 18:691-699.
[8]Le QS, Gascuel O: An improved general amino acid replacement matrix. Mol Biol Evol 2008, 25:1307-1320.
[9]Le DV, Dang CC, Le QS, Le VS: A Fast and Efficient Method for Estimating Amino Acid Substitution Models. In Proceedings of The Third International Conference on Knowledge and Systems Engineering. Edited by Ho TB, McKay RI, Nguyen XH, Bui TD. New York, NY, USA: IEEE Publishing; 2011:85-91.
[10]Dang CC, Lefort V, Le VS, Le QS, Gascuel O: ReplacementMatrix: a web server for maximum-likelihood estimation of amino acid replacement rate matrices. Bioinformatics 2011, 27:2758-2760.
[11]Dang CC, Le QS, Gascuel O, Le VS: FLU, an amino acid substitution model for influenza proteins. BMC Evol Biol 2010, 10:99-110. BioMed Central Full Text
[12]Chor B, Tuller T: Maximum likelihood of evolutionary trees: hardness and approximation. Bioinformatics 2005, 21:97-106.
[13]Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52:696-704.
[14]Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 2010, 59:307-321.
[15]Le VS, von Haeseler A: IQPNNI: moving fast through tree space and stopping in time. Mol Biol Evol 2004, 21:1565-1571.
[16]Stamatakis A, Ludwig T, Meier H: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 2005, 21:456-463.
[17]Schneider R, de Daruvar A, Sander C: The HSSP database of protein structure-sequence alignments. Nucleic Acids Res 1997, 25:226-230.
[18]Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer ELL: The Pfam protein families database. Nucleic Acids Res 2002, 30:276-280.
[19]Klosterman PS, Uzilov AV, Bendaña YR, Bradley RK, Chao S, Kosiol C, Goldman N, Holmes I: XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinformatics 2006, 7:428-453. BioMed Central Full Text
[20]Kishino H, Hasegawa M: Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. J Mol Evol 1989, 29:170-179.
[21]Kohavi R: A Study of Cross-validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conferences on Artificial Intelligence. Montreal: Morgan Kaufmann Publishers Inc; 1995:1137-1143.
[22]Yang Z, Nielsen R, Hasegawa M: Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol 1998, 15:1600-1611.
[23]Blackshields G, Larkin M, Wallace IM, Wilm A, Higgins DG: Fast embedding methods for clustering tens of thousands of sequences. Comput Biol Chem 2008, 32(4):282-286.
[24]Price MN, Dehal PS, Arkin AP: FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 2010, 5:e9490.
[25]Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, Dufayard J-F, Guindon S, Lefort V, Lescot M, Claverie J-M, Gascuel O: Phylogeny.fr: robust phylogenetic analysis for the non-specialist. Nucleic Acids Res 2008, 36(suppl 2):W465-W469.
[26]Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4:406-425.
[27]Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol 1997, 14:685-695.

BMC Bioinformatics
FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets

Cuong Cao Dang¹ Vinh Sy Le¹ Olivier Gascuel⁴ Bart Hazes³ Quang Si Le²
[1] VNU University of Engineering and Technology, Hanoi, Vietnam
[2] The Wellcome Trust Center for Human Genetics, Oxford University, Oxford, UK
[3] Department of Medical Microbiology & Immunology, University of Alberta, Alberta, Canada
[4] Institut de Biologie Computationnelle, LIRMM, CNRS – Université Montpellier 2, Montpellier, France
关键词: Large data sets; Protein alignments; Phylogenetic trees; Maximum likelihood methods; Amino acid replacement rate matrices;
Others : 1085423 DOI : 10.1186/1471-2105-15-341

received in 2014-05-13, accepted in 2014-09-29, 发布年份 2014
PDF


	文献评价指标
	下载次数：36次	浏览次数：22次

【 摘 要 】

Background

Results

Conclusions

【 授权许可】

【 预 览 】

【 图 表 】

【 参考文献 】

【摘要】

【授权许可】

【预览】

【图表】

【参考文献】