BMC Bioinformatics
Minimalist ensemble algorithms for genome-wide protein localization prediction
Jhih-Rong Lin [2], Ananda Mohan Mondal [1], Rong Liu [2], Jianjun Hu [2]
[1] Department of Mathematics and Computer Science, Claflin University, Columbia, SC, 29115, USA
[2] Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, 29208, USA
Keywords: Logistic regression; Classifiers; Ensemble algorithms; Protein subcellular localization
DOI: 10.1186/1471-2105-13-157
Received 2011-12-26, accepted 2012-07-03, published 2012
【 Abstract 】
Background
Computational prediction of protein subcellular localization can greatly help to elucidate protein functions. Despite the existence of dozens of protein localization prediction algorithms, prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve prediction performance; these usually include 10 or more individual localization algorithms. However, their performance is still limited by running complexity and by redundancy among the individual prediction algorithms.
Results
We propose a method for the rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm combines a feature-selection-based filter with a logistic regression classifier. Using a novel concept of contribution scores, we analyzed algorithm redundancy, consensus mistakes, and algorithm complementarity in the design of ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets, Yeast and Human, and compared its performance with that of current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of the individual predictors used by current ensemble algorithms, which greatly reduces computational complexity and running time. High-performance ensemble algorithms were usually composed of predictors that together cover most of the available features. Compared to the best individual predictor, our ensemble algorithm improved the AUC score from 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. Compared with popular weighted-voting-based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance and did not suffer from the inclusion of too many individual predictors.
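The idea of filtering individual predictors and then combining the survivors with logistic regression can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four synthetic predictors (one informative, one near-duplicate of it, one complementary, one pure noise), the single-location binary setting, the function names, and the thresholds. The greedy correlation filter is a simple stand-in and is not claimed to be the filter used in the paper, and the data is synthetic, not the Yeast or Human benchmark.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def select_predictors(scores, labels, min_relevance=0.2, max_redundancy=0.9):
    """Hypothetical greedy filter: visit predictors by decreasing relevance to the
    label, drop irrelevant ones and ones redundant with an already chosen one."""
    cols = list(zip(*scores))
    relevance = [abs(pearson(c, labels)) for c in cols]
    chosen = []
    for j in sorted(range(len(cols)), key=lambda k: -relevance[k]):
        if relevance[j] < min_relevance:
            continue
        if all(abs(pearson(cols[j], cols[k])) < max_redundancy for k in chosen):
            chosen.append(j)
    return chosen

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, xi):
    """Probability of the positive class; the last weight is the bias."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1])

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Logistic regression fitted by plain batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Synthetic confidence scores from four hypothetical predictors for one location.
rng = random.Random(0)
rows, labels = [], []
for _ in range(600):
    f1, f2 = rng.gauss(0, 1), rng.gauss(0, 1)   # two complementary feature signals
    a = f1 + rng.gauss(0, 0.3)                   # predictor 0: uses f1
    rows.append([a,
                 a + rng.gauss(0, 0.1),          # predictor 1: near-duplicate of 0
                 f2 + rng.gauss(0, 0.3),         # predictor 2: uses f2
                 rng.gauss(0, 1)])               # predictor 3: noise
    labels.append(1 if f1 + f2 > 0 else 0)

train_X, test_X = rows[:400], rows[400:]
train_y, test_y = labels[:400], labels[400:]

chosen = select_predictors(train_X, train_y)
w = fit_logistic([[r[j] for j in chosen] for r in train_X], train_y)
ens_auc = auc([predict(w, [r[j] for j in chosen]) for r in test_X], test_y)
best_single = max(auc([r[j] for r in test_X], test_y) for j in range(4))
print("selected predictors:", sorted(chosen))
print("ensemble AUC: %.3f, best single AUC: %.3f" % (ens_auc, best_single))
```

In this toy setting the filter keeps one of the two near-duplicate predictors plus the complementary one, so the ensemble uses half of the available predictors while still combining both feature signals.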
Conclusions
We proposed a method for the rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance than other ensemble algorithms while using only one-third to one-half as many individual predictors. The results also suggest that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgi-bin/predict.cgi.
【 License 】
2012 Lin et al.; licensee BioMed Central Ltd.