Journal Article Details
BMC Bioinformatics
A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli
Narjeskhatoon Habibi [1], Siti Z Mohd Hashim [1], Alireza Norouzi [1], Mohammed Razip Samian [2]
[1] Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
[2] Centre for Chemical Biology, Universiti Sains Malaysia, Penang, Malaysia
Keywords: Computational biology; Bioinformatics; Machine learning; Escherichia coli; Recombinant protein expression; In silico prediction; Protein solubility prediction; Protein solubility
DOI: 10.1186/1471-2105-15-134
Received 2013-09-04, accepted 2014-03-25, published 2014
【 Abstract 】

Background

Over the last 20 years in biotechnology, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low production levels and frequent failures for opaque reasons. The problem is accentuated because no generic solution is available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. Under given experimental conditions, including temperature, expression host, etc., protein solubility is a feature ultimately determined by its sequence. To date, numerous machine learning based methods have been proposed to predict the solubility of a protein merely from its amino acid sequence. Despite 20 years of research on the matter, no comprehensive review of the published methods is available.

Results

This paper presents an extensive review of the existing models to predict protein solubility in the Escherichia coli recombinant protein overexpression system. The models are investigated and compared with respect to the datasets used, features, feature selection methods, machine learning techniques and accuracy of prediction. A discussion of the models is provided at the end.

Conclusions

This study aims to investigate extensively the machine learning based methods to predict recombinant protein solubility, so as to offer a general as well as a detailed understanding for researchers in the field. Some of the models present acceptable prediction performance and convenient user interfaces. These models can be considered valuable tools to predict recombinant protein overexpression results before performing real laboratory experiments, thus saving labour, time and cost.

【 License 】

   
2014 Habibi et al.; licensee BioMed Central Ltd.
