Journal Article Details
BMC Bioinformatics
nDNA-prot: identification of DNA-binding proteins based on unbalanced classification
Quan Zou [3], Li Guo [2], Yunfeng Wu [3], Xiangxiang Zeng [3], Dapeng Li [1], Li Song [3]
[1]Department of Internal Medicine-Oncology, The Fourth Hospital in Qinhuangdao, Qinhuangdao, Hebei 066000, China
[2]Department of Epidemiology and Biostatistics and Ministry of Education Key Lab for Modern Toxicology, School of Public Health, Nanjing Medical University, Nanjing, Jiangsu 210029, China
[3]School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
Keywords: Bioinformatics; Unbalanced dataset; Ensemble classifier; DNA-binding protein
DOI: 10.1186/1471-2105-15-298
Received 2014-06-01; accepted 2014-09-03; published 2014
【 Abstract 】

Background

DNA-binding proteins are vital for the study of cellular processes. In recent genome engineering studies, the identification of proteins with certain functions has become increasingly important and needs to be performed rapidly and efficiently. Several approaches have been developed in previous years to improve the identification of DNA-binding proteins, but the currently available resources are insufficient to identify these proteins accurately. As a result, previous research has been limited by the relatively unbalanced accuracy rates and the low identification success of current methods.

Results

In this paper, we explored the practicality of modelling DNA-binding identification with an ensemble classifier and designed a new predictor, nDNA-Prot. The framework comprises two stages: a 188-dimensional feature extraction method to characterize the protein structure, and an ensemble classifier designated imDC. Experiments on different datasets showed that our method identifies DNA-binding proteins more successfully than traditional methods. Features were selected using minimum Redundancy Maximum Relevance (mRMR). In cross-validation, the method obtained an accuracy of 95.80% and an Area Under the Curve (AUC) of 0.986. On a test dataset, our method reached 86% accuracy, versus 76% for iDNA-Prot and 68% for DNA-Prot.
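The ideas of an ensemble trained on an unbalanced dataset and of reporting both accuracy and AUC can be illustrated with a minimal sketch. This is not the imDC classifier, the 188-dimensional features, or the mRMR step described above. It assumes a generic balanced under-sampling ensemble with a nearest-centroid base learner as a stand-in, and all function names are hypothetical.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count correctly ordered positive/negative pairs; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_balanced_ensemble(X, y, n_members=11, seed=0):
    """Assumed stand-in for an unbalanced-data ensemble: each member sees
    every minority sample plus an equally sized random draw from the
    majority class, and stores one centroid per class."""
    rng = random.Random(seed)
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    members = []
    for _ in range(n_members):
        if len(pos) <= len(neg):
            pos_part, neg_part = pos, rng.sample(neg, len(pos))
        else:
            pos_part, neg_part = rng.sample(pos, len(neg)), neg
        members.append((centroid(pos_part), centroid(neg_part)))
    return members

def predict_score(members, x):
    """Fraction of members whose positive centroid is nearer to x."""
    votes = sum(sq_dist(x, pc) < sq_dist(x, nc) for pc, nc in members)
    return votes / len(members)

if __name__ == "__main__":
    # Toy unbalanced data: 3 positives far from 30 negatives.
    X = [[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
    X += [[0.1 * (i % 5), 0.1 * (i // 5)] for i in range(30)]
    y = [1] * 3 + [0] * 30
    model = train_balanced_ensemble(X, y)
    scores = [predict_score(model, x) for x in X]
    print("accuracy:", accuracy(scores, y), "AUC:", auc(scores, y))
```

On an unbalanced dataset, accuracy alone can mislead: a predictor that scores every sample as negative on one positive and nine negatives reaches 0.9 accuracy under these definitions, while its AUC is only 0.5. Reporting both, as the abstract does, guards against that.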

Conclusions

Our method can help to identify DNA-binding proteins accurately, and the web server is accessible at http://datamining.xmu.edu.cn/~songli/nDNA. We also predicted possible DNA-binding protein sequences among all of the sequences in the UniProtKB/Swiss-Prot database.

【 License 】

   
2014 Song et al.; licensee BioMed Central Ltd.

【 Figures 】

Figures 1-8.