BMC Bioinformatics | |
Predicting antifreeze proteins with weighted generalized dipeptide composition and multi-regression feature selection ensemble | |
Lin Deng1  Shunfang Wang1  Xinnan Xia1  Zicheng Cao2  Yu Fei3  | |
[1] Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, 650504, Kunming, China; [2] School of Public Health (Shenzhen), Sun Yat-Sen University, 510006, Guangzhou, China; [3] School of Statistics and Mathematics, Yunnan University of Finance and Economics, 650221, Kunming, China | |
Keywords: Antifreeze proteins prediction; Weighted general dipeptide composition; Lasso regression; Ridge regression; Ensemble feature selection; Two-stage multiple regressions | |
DOI : 10.1186/s12859-021-04251-z | |
Source: Springer | |
【 Abstract 】
Background: Antifreeze proteins (AFPs) are a group of proteins that inhibit the growth of ice crystals in body fluids and thus improve an organism's resistance to freezing. They are vital to the survival of living organisms in extremely cold environments. However, little research has been done on sequence feature extraction and selection for antifreeze protein classification, although this is of great significance for structure and function prediction.
Results: In this paper, we propose a feature representation, weighted generalized dipeptide composition (W-GDipC), and an ensemble feature selection method based on two-stage multiple regression (LRMR-Ri) to predict antifreeze proteins. In the first stage of LRMR-Ri, four feature selection algorithms (Lasso regression, Ridge regression, maximal information coefficient (MIC) and Relief) each select a feature set. If the four sets share a common feature subset, that subset is taken as the optimal subset; otherwise, in the second stage, Ridge regression selects the optimal subset from the set pooled from the four sets. LRMR-Ri combined with W-GDipC was evaluated on an antifreeze protein dataset (binary classification) and on a membrane protein dataset (multi-class classification). Experimental results show that the method performs well with support vector machine (SVM), decision tree (DT) and stochastic gradient descent (SGD) classifiers. On the antifreeze protein dataset with the SVM classifier, LRMR-Ri with W-GDipC reached ACC, RE and MCC values of 95.56%, 97.06% and 0.9105, respectively, much higher than those of each single method (Lasso, Ridge, MIC and Relief); its ACC is nearly 13% higher than that of Lasso alone.
Conclusion: The experimental results show that the proposed LRMR-Ri and W-GDipC method can significantly improve the accuracy of antifreeze protein prediction compared with similar single feature selection methods. In addition, the method also achieved good results in the classification and prediction of membrane proteins, which supports its broad reliability to a certain extent.
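The two-stage logic of LRMR-Ri described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the four first-stage feature sets (from Lasso, Ridge, MIC and Relief) have already been computed and are passed in as sets of column indices. The abstract does not say how Ridge regression picks features from the pooled set, so the sketch assumes a ranking by absolute ridge coefficient and keeps the top `k`. The function names, `k`, and `alpha` are hypothetical, and the small closed-form ridge solver (no intercept, no feature scaling) stands in for whatever regression library the authors used.

```python
from typing import List, Sequence, Set


def ridge_coefficients(X: Sequence[Sequence[float]],
                       y: Sequence[float],
                       alpha: float = 1.0) -> List[float]:
    """Closed-form ridge fit: solve (X^T X + alpha*I) w = X^T y.

    Assumption: no intercept and no feature scaling, to keep the sketch short.
    """
    n = len(X[0])
    # Augmented matrix [X^T X + alpha*I | X^T y].
    A = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
          for j in range(n)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def lrmr_ri_select(X: Sequence[Sequence[float]],
                   y: Sequence[float],
                   stage_one_sets: Sequence[Set[int]],
                   k: int,
                   alpha: float = 1.0) -> List[int]:
    """Two-stage selection over precomputed first-stage feature sets."""
    # Stage 1 outcome: if all selectors agree on some features, keep those.
    common = set.intersection(*map(set, stage_one_sets))
    if common:
        return sorted(common)
    # Stage 2: pool all selected features and let ridge regression choose.
    pooled = sorted(set().union(*stage_one_sets))
    X_pooled = [[row[j] for j in pooled] for row in X]
    coefs = ridge_coefficients(X_pooled, y, alpha)
    # Assumption: rank pooled features by |coefficient| and keep the top k.
    ranked = sorted(zip(pooled, coefs), key=lambda p: abs(p[1]), reverse=True)
    return sorted(j for j, _ in ranked[:k])


if __name__ == "__main__":
    # Toy data: the target depends only on feature 0.
    X_demo = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]]
    y_demo = [3.0, 0.0, 0.0, 0.0]
    print(lrmr_ri_select(X_demo, y_demo, [{0}, {1}, {2}, {3}], k=1))
```

In practice the first-stage sets would come from fitted selectors, and `k` and `alpha` would be tuned, for example by cross-validation. The abstract does not report these choices.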
【 License 】
CC BY