Journal article details
BMC Bioinformatics
OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy
Geoffrey J Barton [1], Jonathan D Barber [2], Patrick C Audley [2], Stephen MJ Searle [3], GPS Raghava [4]
[1]University of Oxford, Laboratory of Molecular Biophysics, Rex Richards Building, South Parks Road, Oxford, OX1 3QU, UK
[2]School of Life Sciences, University of Dundee, Dow St., Dundee, DD1 5EH, Scotland, UK
[3]Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
[4]Bioinformatics Centre, Institute of Microbial Technology, Sector 39A, Chandigarh, India
Keywords: structural alignment; benchmark; multiple sequence alignment; protein
DOI  :  10.1186/1471-2105-4-47
Received 2003-05-06, accepted 2003-10-10, published 2003
【 Abstract 】

Background

The alignment of two or more protein sequences provides a powerful guide in predicting protein structure and in identifying key functional residues. However, the utility of any such prediction depends completely on the accuracy of the alignment. In this paper we describe a suite of reference alignments derived from the comparison of protein three-dimensional structures, together with evaluation measures and software that allow automatically generated alignments to be benchmarked. We test the OXBench benchmark suite on alignments generated by the AMPS multiple alignment method, then apply the suite to compare eight different multiple alignment algorithms. The benchmark shows the current state of the art for alignment accuracy and provides a baseline against which new alignment algorithms may be judged.

Results

The simple hierarchical multiple alignment algorithm, AMPS, performed as well as or better than more modern methods such as CLUSTALW once the PAM250 pair-score matrix was replaced by a BLOSUM series matrix. AMPS gave an accuracy in Structurally Conserved Regions (SCRs) of 89.9% over a set of 672 alignments. On a data set of families with <8 sequences, the T-COFFEE method gave 91.4% accuracy, significantly better than CLUSTALW (88.9%) and all other methods considered here. The complete suite is available from http://www.compbio.dundee.ac.uk.
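
To make the idea of accuracy within SCRs concrete, the following is a minimal sketch of one way such a figure could be computed. It assumes a column-based score: a reference column counts as correct only if the test alignment contains a column pairing exactly the same residues. The function names, the gapped-string representation and the `scr_mask` parameter are illustrative assumptions, not the OXBench software or its exact measure.

```python
def columns(alignment):
    """Turn each alignment column into a tuple of per-sequence residue
    indices, using None where a sequence has a gap ('-')."""
    counters = [0] * len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, seq in enumerate(alignment):
            if seq[j] == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def column_accuracy(reference, test, scr_mask=None):
    """Percentage of scored reference columns reproduced exactly in the test
    alignment. scr_mask (hypothetical) flags which reference columns lie in
    structurally conserved regions; by default every column is scored."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    if scr_mask is None:
        scr_mask = [True] * len(ref_cols)
    scored = [c for c, keep in zip(ref_cols, scr_mask) if keep]
    if not scored:
        return 0.0
    correct = sum(1 for c in scored if c in test_cols)
    return 100.0 * correct / len(scored)


# Toy example: the two alignments disagree only in the first two columns.
reference = ["ACDE", "A-DE"]
test = ["ACDE", "-ADE"]
all_columns = column_accuracy(reference, test)
scr_only = column_accuracy(reference, test, scr_mask=[False, False, True, True])
```

Restricting the score to a mask of conserved columns is what lets the same pair of alignments score differently overall and within SCRs.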

Conclusions

The OXBench suite of reference alignments, evaluation software and results database provides a convenient method to assess progress in sequence alignment techniques. Evaluation measures that depend on comparison to a reference alignment were found to give good discrimination between methods. The STAMP Sc score, which is independent of a reference alignment, also gave good discrimination. Application of OXBench in this paper shows that, with the exception of T-COFFEE, most of the improvement in alignment accuracy seen since 1985 stems from improved pair-score matrices rather than algorithmic refinements. The maximum theoretical alignment accuracy obtained by pooling results over all methods was 94.5%, with 52.5% accuracy for alignments in the 0–10 percentage identity range. This suggests that further improvements in accuracy are possible.
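
The pooled maximum mentioned above can be illustrated with a short sketch. It assumes that pooling means taking, for each family, the best accuracy achieved by any method and then averaging those best values; the abstract does not spell out the procedure, so this interpretation, the function name and the toy numbers are all assumptions.

```python
def pooled_best_accuracy(results):
    """results maps method name -> {family id: accuracy in percent}.
    Returns the mean over families of the best accuracy any method achieved.
    This is an assumed reading of 'pooling results over all methods'."""
    families = set()
    for per_family in results.values():
        families.update(per_family)
    if not families:
        return 0.0
    best = [
        max(per_family[f] for per_family in results.values() if f in per_family)
        for f in families
    ]
    return sum(best) / len(best)


# Made-up numbers for two hypothetical methods on two hypothetical families.
toy_results = {
    "method_a": {"family_1": 90.0, "family_2": 60.0},
    "method_b": {"family_1": 80.0, "family_2": 70.0},
}
pooled = pooled_best_accuracy(toy_results)
```

Under this reading the pooled value can exceed every single method's average, because different methods win on different families.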

【 License 】

   
2003 Raghava et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
