| IEEE Access | |
| EnsemPseU: Identifying Pseudouridine Sites With an Ensemble Approach | |
| Dong Jin [1], Yue Bi [1], Cangzhi Jia [1] | |
| [1] School of Science, Dalian Maritime University, Dalian, China | |
| Keywords: Machine learning; ensemble learning; pseudouridine site prediction; feature selection | |
| DOI : 10.1109/ACCESS.2020.2989469 | |
| Source: DOAJ | |
【 Abstract 】
Pseudouridine (Ψ), the most prevalent RNA modification, is formed from uridine through an isomerization reaction. With the increasing availability of genomic and proteomic samples, computer-aided recognition of pseudouridine-synthase-specific Ψ sites is becoming possible. In this paper, we propose EnsemPseU, an ensemble approach for identifying pseudouridine sites. First, five sequence-encoding strategies, namely kmer, binary encoding, enhanced nucleic acid composition (ENAC), nucleotide chemical property (NCP), and nucleotide density (ND), were applied to extract sequence information. Then, chi-square feature selection was used to reduce the feature dimensionality and remove redundant information. Finally, an ensemble algorithm integrating support vector machine (SVM), extreme gradient boosting (XGBoost), naïve Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) classifiers was used to build the prediction model. After chi-square feature selection, accuracy improved by 5.3% for H. sapiens, 6.09% for S. cerevisiae, and 5.55% for M. musculus. Moreover, in both 10-fold cross-validation and an independent test, EnsemPseU outperformed the best existing model. The source code and data sets are available at https://github.com/biyue1026/EnsemPseU.
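The five sequence encodings named in the abstract can be illustrated with a minimal sketch. It uses the commonly cited definitions of these encodings and is not taken from the EnsemPseU repository. The function names, the k-mer size `k=2`, and the ENAC window size `window=5` are assumptions chosen for illustration; the abstract does not state the parameters the authors used.

```python
from itertools import product

NUCLEOTIDES = "ACGU"

# NCP: (ring structure, functional group, hydrogen bonding), as commonly defined.
# Purines (A, G) = 1; amino group (A, C) = 1; weak hydrogen bonding (A, U) = 1.
NCP_TABLE = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}


def binary_encoding(seq):
    """One-hot encode each nucleotide as a 4-bit vector in A, C, G, U order."""
    out = []
    for nt in seq:
        out.extend(1 if nt == ref else 0 for ref in NUCLEOTIDES)
    return out


def kmer_frequencies(seq, k=2):
    """Frequency of every possible k-mer over the overlapping k-mers in seq."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]


def enac(seq, window=5):
    """Nucleotide composition inside each sliding window along the sequence."""
    out = []
    for i in range(len(seq) - window + 1):
        chunk = seq[i:i + window]
        out.extend(chunk.count(nt) / window for nt in NUCLEOTIDES)
    return out


def ncp(seq):
    """Three chemical-property bits per nucleotide."""
    out = []
    for nt in seq:
        out.extend(NCP_TABLE[nt])
    return out


def nd(seq):
    """Density of the nucleotide at position i within the prefix seq[:i]."""
    seen = dict.fromkeys(NUCLEOTIDES, 0)
    out = []
    for i, nt in enumerate(seq, start=1):
        seen[nt] += 1
        out.append(seen[nt] / i)
    return out


def encode(seq, k=2, window=5):
    """Concatenate all five encodings into one feature vector.

    Assumes seq contains only A, C, G, U (T is mapped to U); any other
    character raises KeyError.
    """
    seq = seq.upper().replace("T", "U")
    return (kmer_frequencies(seq, k) + binary_encoding(seq)
            + enac(seq, window) + ncp(seq) + nd(seq))


if __name__ == "__main__":
    features = encode("ACGUU")
    print(len(features))
```

In a pipeline like the one the abstract describes, vectors of this kind would be stacked into a feature matrix, reduced by chi-square feature selection, and passed to the ensemble of classifiers. Those steps are left out here because the abstract gives no details about them.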
【 License 】
Unknown