Journal Article Details
BMC Bioinformatics
Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study
Pedro J Ballester [2], Man-Hon Wong [1], Kwong-Sak Leung [1], Hongjian Li [1]
[1] Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong, China
[2] Cancer Research Center of Marseille (Inserm U1068, UM105, IPC), 27 Boulevard Lei Roure, 13009 Marseille, France
Keywords: Machine learning; Drug discovery; Binding affinity; Molecular docking
DOI: 10.1186/1471-2105-15-291
Received 2014-05-13, accepted 2014-08-18, published 2014
【 Abstract 】

Background

State-of-the-art protein-ligand docking methods are generally limited by the traditionally low accuracy of their scoring functions, which are used to predict binding affinity and are thus vital for discriminating between active and inactive compounds. Despite intensive research over the years, classical scoring functions have reached a plateau in their predictive performance. These functions assume a predetermined additive functional form over a set of sophisticated numerical features, and use standard multivariate linear regression (MLR) on experimental data to derive the coefficients.
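
As an illustration of the additive form just described, a classical scoring function can be written as a weighted sum of structural features whose weights are fitted by least squares. The symbols below are assumptions introduced for exposition, not notation taken from the article: $x_{ji}$ is feature $i$ of complex $j$, $y_j$ its measured binding affinity, $m$ the number of features and $N$ the number of training complexes.

```latex
\hat{y}_j = w_0 + \sum_{i=1}^{m} w_i \, x_{ji},
\qquad
(w_0, \dots, w_m) = \arg\min_{w} \sum_{j=1}^{N}
  \Bigl( y_j - w_0 - \sum_{i=1}^{m} w_i \, x_{ji} \Bigr)^{2}
```

Each feature contributes independently and linearly in this form, so any interaction between features or curvature in their effect cannot be represented, whatever the size of the training set.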

Results

In this study we show that such a simple functional form is detrimental to the prediction performance of a scoring function, and that replacing linear regression with machine learning techniques such as random forest (RF) can improve prediction performance. We investigate the conditions under which RF can be applied in various contexts and find that, given sufficient training samples, RF manages to comprehensively capture the non-linearity between structural features and measured binding affinities. Incorporating more structural features and training with more samples can both boost RF performance. In addition, we analyze the importance of structural features to binding affinity prediction using the RF variable importance tool. Lastly, we use Cyscore, a top-performing empirical scoring function, as a baseline for the comparison study.
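
The contrast between the two regression models can be sketched on toy data. The following is a minimal, standard-library-only illustration and not the authors' implementation: the two synthetic features, the target function, and every parameter (`n_trees`, `mtry`, `min_leaf`, the seeds) are assumptions chosen for the example, and no Cyscore features or PDBbind data are involved. It fits a linear model and a small random forest (bootstrap samples plus a random feature subset at each split) to a non-additive target and compares the test RMSE of each.

```python
import math
import random

def fit_mlr(X, y):
    """Least-squares linear fit via the normal equations (Gauss-Jordan)."""
    n, p = len(X), len(X[0]) + 1
    A = [[1.0] + list(row) for row in X]           # prepend intercept column
    M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(p)]
         + [sum(A[k][i] * y[k] for k in range(n))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]                # partial pivoting
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    w = [M[i][p] / M[i][i] for i in range(p)]
    return lambda x: w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def build_tree(X, y, idx, rng, mtry, min_leaf=5):
    """Regression tree; each split considers `mtry` randomly chosen features."""
    mean = sum(y[i] for i in idx) / len(idx)
    if len(idx) < 2 * min_leaf:
        return mean
    best = None
    for f in rng.sample(range(len(X[0])), mtry):
        order = sorted(idx, key=lambda i: X[i][f])
        total, left_sum, m = sum(y[i] for i in order), 0.0, len(order)
        for k in range(1, m):
            left_sum += y[order[k - 1]]
            if k < min_leaf or m - k < min_leaf:
                continue
            if X[order[k - 1]][f] == X[order[k]][f]:
                continue
            # Maximising this quantity minimises the summed squared error.
            score = left_sum ** 2 / k + (total - left_sum) ** 2 / (m - k)
            if best is None or score > best[0]:
                thr = (X[order[k - 1]][f] + X[order[k]][f]) / 2
                best = (score, f, thr, order[:k], order[k:])
    if best is None:
        return mean
    _, f, thr, left, right = best
    return (f, thr, build_tree(X, y, left, rng, mtry, min_leaf),
            build_tree(X, y, right, rng, mtry, min_leaf))

def predict_tree(node, x):
    while isinstance(node, tuple):                 # leaves are plain floats
        f, thr, left, right = node
        node = left if x[f] <= thr else right
    return node

def fit_rf(X, y, n_trees=30, mtry=1, seed=0):
    """Average of trees, each grown on a bootstrap sample of the data."""
    rng, n = random.Random(seed), len(X)
    trees = [build_tree(X, y, [rng.randrange(n) for _ in range(n)], rng, mtry)
             for _ in range(n_trees)]
    return lambda x: sum(predict_tree(t, x) for t in trees) / len(trees)

def rmse(model, X, y):
    return math.sqrt(sum((model(x) - t) ** 2 for x, t in zip(X, y)) / len(y))

def make_data(n, rng):
    # Hypothetical target with curvature and a feature interaction.
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    y = [a * a + 0.5 * a * b + rng.gauss(0, 0.05) for a, b in X]
    return X, y

rng = random.Random(42)
X_train, y_train = make_data(300, rng)
X_test, y_test = make_data(100, rng)
mlr_rmse = rmse(fit_mlr(X_train, y_train), X_test, y_test)
rf_rmse = rmse(fit_rf(X_train, y_train), X_test, y_test)
print(f"MLR RMSE: {mlr_rmse:.3f}  RF RMSE: {rf_rmse:.3f}")
```

Because the toy target contains a squared term and a product of features, the linear model has little to fit, while the forest can approximate the surface piecewise. In practice a library implementation would replace the hand-written forest, and `mtry` would be tuned rather than fixed.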

Conclusions

Machine-learning scoring functions are fundamentally different from classical scoring functions because the former circumvent the fixed functional form relating structural features to binding affinities. RF, but not MLR, can effectively exploit more structural features and more training samples, leading to higher prediction performance. The future availability of more X-ray crystal structures will further widen the performance gap between RF-based and MLR-based scoring functions. This further stresses the importance of substituting RF for MLR in scoring function development.

【 License 】

   
2014 Li et al.; licensee BioMed Central Ltd.

