Journal article details
BMC Bioinformatics
Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor
Calem J Bendell [3], Shalon Liu [6], Tristan Aumentado-Armstrong [5], Bogdan Istrate [5], Paul T Cernek [2], Samuel Khan [2], Sergiu Picioreanu [7], Michael Zhao [1], Robert A Murgita [4]
[1] Department of Physiology, McGill, Montreal, CA
[2] Department of Biology, McGill, Montreal, CA
[3] School of Computer Science, McGill, Montreal, CA
[4] Department of Microbiology and Immunology, McGill, Montreal, CA
[5] Department of Anatomy and Cell Biology, McGill, Montreal, CA
[6] Department of Mathematics and Statistics, McGill, Montreal, CA
[7] Department of Biology and Computer Science, McGill, Montreal, CA
Keywords: Protein prediction scoring; Protein interface identification; Protein datasets; Feature selection; Protein-protein interface; Protein-protein interaction; Machine learning
Others  :  1087587
DOI  :  10.1186/1471-2105-15-82
Received 2013-09-30; accepted 2014-02-14; published 2014
【 Abstract 】

Background

Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high-resolution structures of both binding partners, and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of these restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine-learning-based PPI predictors is due in order to highlight potential areas for improvement.

Results

The presence of unknown interaction sites in the testing set, a result of limited knowledge about protein interactions, dramatically reduces prediction accuracy. Labelling the data more accurately by enforcing higher interface site rates per domain resulted in an average 44% improvement across multiple machine learning algorithms. A set of 10 biologically unrelated proteins that were consistently predicted with high accuracy emerged through our analysis. We identify seven features with the most predictive power over multiple datasets and machine learning algorithms. Through our analysis, we created a new predictor, RAD-T, that outperforms existing non-structurally specializing machine learning protein interface predictors, with an average 59% increase in MCC score on a dataset with a high number of interactions.
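
As an illustration of why unlabelled interaction sites depress measured performance, the sketch below computes the Matthews correlation coefficient (MCC) for one fixed set of predictions against two label sets: a complete one and one in which half of the true interface residues are unannotated. The residue counts and the helper names `mcc` and `confusion` are hypothetical and chosen only for this example; they are not taken from the paper or from RAD-T.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def confusion(labels, preds):
    """Count (tp, fp, tn, fn) for binary labels and predictions."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fn = sum(1 for y, p in pairs if y and not p)
    return tp, fp, tn, fn

# Hypothetical toy protein: 20 surface residues, 6 truly at an interface.
true_labels = [1] * 6 + [0] * 14
# One fixed prediction: 5 true interface residues found, 1 missed, 1 false alarm.
preds = [1] * 5 + [0] * 1 + [1] * 1 + [0] * 13
# Incomplete annotation: only 3 of the 6 interface residues are known.
known_labels = [1] * 3 + [0] * 17

score_full = mcc(*confusion(true_labels, preds))
score_partial = mcc(*confusion(known_labels, preds))
print(f"MCC vs complete labels:   {score_full:.2f}")
print(f"MCC vs incomplete labels: {score_partial:.2f}")
```

With these made-up counts the same predictions score about 0.76 against the complete labels and about 0.64 against the incomplete ones, because correctly predicted but unannotated interface residues are counted as false positives.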

Conclusion

Current methods of evaluating machine-learning-based PPI predictors tend to undervalue their performance, which may be artificially decreased by the presence of unidentified interaction sites. Changes to predictors' training sets will be integral to the future progress of interface prediction by machine learning methods. We reveal the need for a larger test set of well-studied proteins or domain-specific scoring algorithms to compensate for poor interaction site identification on proteins in general.

【 License 】

   
2014 Bendell et al.; licensee BioMed Central Ltd.
