Journal article details
BMC Bioinformatics
Sigma-RF: prediction of the variability of spatial restraints in template-based modeling by random forest
Juyong Lee [1], Kiho Lee [1], InSuk Joung [4], Keehyoung Joo [3], Bernard R Brooks [2], Jooyoung Lee [4]
[1] Center for In Silico Protein Science, Korea Institute for Advanced Study, Seoul, Korea
[2] Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, 5635 Fishers Ln, Bethesda 20852, USA
[3] Center for Advanced Computation, Korea Institute for Advanced Study, Seoul, Korea
[4] School of Computational Sciences, Korea Institute for Advanced Study, Seoul, Korea
Keywords: Statistics; Bioinformatics; Protein sequence; Protein structure prediction; Protein structure; Machine learning; Random forest; Homology modeling; Template-based modeling
DOI  :  10.1186/s12859-015-0526-z
Received 2014-09-17, accepted 2015-03-04, published 2015
【 Abstract 】

Background

In template-based modeling with a single template, the inter-atomic distances of an unknown protein structure are assumed to follow Gaussian probability density functions centered at the distances between the corresponding atoms in the template structure. The width of each Gaussian, that is, the variability of the spatial restraint, reflects the reliability of the restraint information extracted from the template, and it must be estimated accurately for successful template-based protein structure modeling.
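
The Gaussian restraint described above can be illustrated with a minimal Python sketch. The function names (`gaussian_restraint_pdf`, `restraint_penalty`) and the choice of a negative log-density penalty are assumptions made for illustration; they are not taken from the paper or from Modeller.

```python
import math

def gaussian_restraint_pdf(d, d_template, sigma):
    """Gaussian density of a model distance d, centered on the template distance.

    sigma is the restraint variability (the width of the Gaussian).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (d - d_template) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def restraint_penalty(d, d_template, sigma):
    """Negative log-density, dropping the term that does not depend on d.

    Hypothetical helper: a small sigma penalizes deviations from the
    template distance strongly, a large sigma penalizes them weakly.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (d - d_template) / sigma
    return 0.5 * z * z
```

With the same 2 Å deviation from the template distance, doubling sigma from 1 to 2 reduces the penalty by a factor of four, which is why an inaccurate sigma can either over-constrain or under-constrain the model.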

Results

To predict the variability of the spatial restraints in template-based modeling, we have devised a prediction model, Sigma-RF, based on the random forest (RF) algorithm. Benchmark results on 22 CASP9 targets show that the variability values from Sigma-RF correlate more strongly with the true distance deviations than those from Modeller. We assessed the effect of the new sigma values by performing single-domain homology modeling of 22 CASP9 targets and 24 CASP10 targets. For most of the targets tested, the Sigma-RF values yielded more accurate 3D models than the Modeller values when both were applied to identical alignments.
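
The benchmark comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which correlation coefficient was used, so Pearson correlation is assumed here, and the "true distance deviation" is assumed to be the absolute difference between the template distance and the native distance. The function names are hypothetical.

```python
import math

def distance_deviation(d_template, d_native):
    """Absolute template-vs-native distance difference for each atom pair (assumed definition)."""
    if len(d_template) != len(d_native):
        raise ValueError("distance lists must have the same length")
    return [abs(t - n) for t, n in zip(d_template, d_native)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Given predicted sigma values from two methods for the same atom pairs, calling `pearson(sigmas, distance_deviation(d_template, d_native))` for each method would give the kind of comparison the benchmark reports.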

Conclusions

By investigating the importance of the input features used in the RF machine learning, we find that the most important factor is quasi-local information: the average alignment quality of the residues located at and between the two aligned residues. This average alignment quality is shown to be more important than the previously identified local quantity, the product of the alignment qualities at the two aligned residues.
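
The difference between the two features can be made concrete with a short Python sketch. It assumes a per-residue alignment quality list indexed by alignment position; that representation and the function names are hypothetical, since the abstract does not define how alignment quality is computed.

```python
def local_product(quality, i, j):
    """Local feature: product of the alignment qualities at the two aligned residues."""
    return quality[i] * quality[j]

def quasi_local_average(quality, i, j):
    """Quasi-local feature: mean alignment quality over residues i..j, endpoints included."""
    lo, hi = min(i, j), max(i, j)
    window = quality[lo:hi + 1]
    return sum(window) / len(window)
```

For a hypothetical quality profile `[1.0, 0.0, 0.0, 1.0]`, the local product for positions 0 and 3 is 1.0, while the quasi-local average is 0.5: the average reflects the poorly aligned segment between the two residues, which the product ignores.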

【 License 】

   
2015 Lee et al.; licensee BioMed Central.
