Algorithms for Molecular Biology | |
Assessing the efficiency of multiple sequence alignment programs | |
Fabiano Sviatopolk-Mirsky Pais3  Patrícia de Cássia Ruy1  Guilherme Oliveira3  Roney Santos Coimbra2  | |
[1] Center for Excellence in Bioinformatics, Centro de Pesquisas René Rachou (CPqRR), Fundação Oswaldo Cruz (FIOCRUZ/Minas), Belo Horizonte, MG Brazil | |
[2] Biosystems Informatics, CPqRR, FIOCRUZ/Minas, Avenida Augusto de Lima 1715. Barro Preto, Belo Horizonte, MG Brazil | |
[3] Genomics and Computational Biology Group, CPqRR, Instituto Nacional de Ciência e Tecnologia, FIOCRUZ/Minas, Belo Horizonte, MG Brazil | |
关键词: Performance; Accuracy; Computer programs; Multiple sequence alignment; | |
Others : 793027 DOI : 10.1186/1748-7188-9-4 |
|
received in 2013-07-08, accepted in 2014-02-22, 发布年份 2014 | |
【 摘 要 】
Background
Multiple sequence alignment (MSA) is an extremely useful tool for molecular and evolutionary biology and there are several programs and algorithms available for this purpose. Although previous studies have compared the alignment accuracy of different MSA programs, their computational time and memory usage have not been systematically evaluated. Given the unprecedented amount of data produced by next generation deep sequencing platforms, and increasing demand for large-scale data analysis, it is imperative to optimize the application of software. Therefore, a balance between alignment accuracy and computational cost has become a critical indicator of the most suitable MSA program. We compared both accuracy and cost of nine popular MSA programs, namely CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, Probalign, Probcons and T-Coffee, against the benchmark alignment dataset BAliBASE and discuss the relevance of some implementations embedded in each program’s algorithm. Accuracy of alignment was calculated with the two standard scoring functions provided by BAliBASE, the sum-of-pairs and total-column scores, and computational costs were determined by collecting peak memory usage and time of execution.
Results
Our results indicate that mostly the consistency-based programs Probcons, T-Coffee, Probalign and MAFFT outperformed the other programs in accuracy. Whenever sequences with large N/C terminal extensions were present in the BAliBASE suite, Probalign, MAFFT and also CLUSTAL OMEGA outperformed Probcons and T-Coffee. The drawback of these programs is that they are more memory-greedy and slower than POA, CLUSTALW, DIALIGN-TX, and MUSCLE. CLUSTALW and MUSCLE were the fastest programs, being CLUSTALW the least RAM memory demanding program.
Conclusions
Based on the results presented herein, all four programs Probcons, T-Coffee, Probalign and MAFFT are well recommended for better accuracy of multiple sequence alignments. T-Coffee and recent versions of MAFFT can deliver faster and reliable alignments, which are specially suited for larger datasets than those encountered in the BAliBASE suite, if multi-core computers are available. In fact, parallelization of alignments for multi-core computers should probably be addressed by more programs in a near future, which will certainly improve performance significantly.
【 授权许可】
2014 Pais et al.; licensee BioMed Central Ltd.
【 预 览 】
Files | Size | Format | View |
---|---|---|---|
20140705042623249.pdf | 504KB | download |
【 参考文献 】
- [1]Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443-453.
- [2]Smith TF, Waterman MS, Fitch WM: Comparative biosequence metrics. J Mol Evol 1981, 18(1):38-46.
- [3]Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987, 25(4):351-360.
- [4]Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673-4680.
- [5]Subramanian AR, Kaufmann M, Morgenstern B: DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol Biol 2008, 3:6. BioMed Central Full Text
- [6]Notredame C, Higgins DG, Heringa J: T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302(1):205-217.
- [7]Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res 2005, 15(2):330-340.
- [8]Roshan U, Livesay DR: Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 2006, 22(22):2715-2721.
- [9]Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG: Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega. Mol Syst Biol 2011, 7:539.
- [10]Lee C, Grasso C, Sharlow MF: Multiple sequence alignment using partial order graphs. Bioinformatics 2002, 18(3):452-464.
- [11]Gotoh O: Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 1996, 264(4):823-838.
- [12]Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinforma 2004, 5:113. BioMed Central Full Text
- [13]Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30(14):3059-3066.
- [14]Hirosawa M, Totoki Y, Hoshida M, Ishikawa M: Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 1995, 11(1):13-18.
- [15]Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 2005, 33(2):511-518.
- [16]Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 2005, 61(1):127-136.
- [17]Bahr A, Thompson JD, Thierry JC, Poch O: BAliBASE (benchmark alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res 2001, 29(1):323-326.
- [18]Perrodou E, Chica C, Poch O, Gibson TJ, Thompson JD: A new protein linear motif benchmark for multiple sequence alignment software. BMC Bioinforma 2008, 9:213. BioMed Central Full Text
- [19]Lassmann T, Sonnhammer EL: Quality assessment of multiple alignment programs. FEBS Lett 2002, 529(1):126-130.
- [20]Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 1999, 27(13):2682-2690.
- [21]Blackshields G, Wallace IM, Larkin M, Higgins DG: Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol 2006, 6(4):321-339.
- [22]Nuin PA, Wang Z, Tillier ER: The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinforma 2006, 7:471. BioMed Central Full Text
- [23]Myers EW, Miller W: Optimal alignments in linear space. Comput Appl Biosci 1988, 4(1):11-17.
- [24]Edgar RC: Optimizing substitution matrix choice and gap parameters for sequence alignment. BMC Bioinforma 2009, 10:396. BioMed Central Full Text
- [25]Katoh K, Toh H: Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform 2008, 9(4):286-298.
- [26]Katoh K, Toh H: Parallelization of the MAFFT multiple sequence alignment program. Bioinformatics 2010, 26(15):1899-1900.
- [27]Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG: Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol 2010, 5:21. BioMed Central Full Text