BMC Bioinformatics
HuntMi: an efficient and taxon-specific approach in pre-miRNA identification
Adam Gudyś [2], Michał Wojciech Szcześniak [1], Marek Sikora [3], Izabela Makałowska [1]
[1] Laboratory of Bioinformatics, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland
[2] Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
[3] Institute of Innovative Technologies EMAG, Katowice, Poland
Keywords: Genome analysis; Imbalanced learning; Random forest; MicroRNA
Others: 1087952; DOI: 10.1186/1471-2105-14-83
Received 2012-07-02, accepted 2013-02-21, published 2013
【 Abstract 】
Background
Machine learning techniques are a powerful way of distinguishing microRNA hairpins from pseudo hairpins and have been applied in a number of recognised miRNA search tools. However, many current machine learning methods have drawbacks, including a failure to address the class imbalance problem properly. This may lead to overlearning of the majority class and/or an incorrect assessment of classification performance. Moreover, these tools are effective for only a narrow range of species, usually model organisms. This study aims to improve the performance of the miRNA classification procedure, extend its usability, and reduce computational time.
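To illustrate why class imbalance can distort the assessment of a classifier, here is a minimal sketch in Python. The data are hypothetical (10 real pre-miRNAs among 1000 hairpins), and the geometric mean of sensitivity and specificity is used as one common imbalance-aware measure; it is an assumption made for illustration, not a metric stated in this abstract.

```python
import math

def confusion_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and their geometric mean.

    Labels: 1 = real pre-miRNA (minority), 0 = pseudo hairpin (majority).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical dataset: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 1000

metrics = confusion_metrics(y_true, y_pred)
print(metrics)
```

The degenerate classifier reaches 99% accuracy while detecting no real pre-miRNA at all, which is the kind of misleading assessment that imbalance-aware measures are meant to expose.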
Results
We present HuntMi, a stand-alone machine learning tool for miRNA classification. We developed a novel method of dealing with the class imbalance problem, called ROC-select, which is based on thresholding the score function produced by traditional classifiers. We also introduced new features to the data representation. Several classification algorithms were tested in combination with ROC-select, and random forest was selected for the best balance between sensitivity and specificity. Reliable assessment of classification performance is ensured by using large, strongly imbalanced, taxon-specific datasets in a 10-fold cross-validation procedure. As a result, HuntMi achieves considerably better performance than any other miRNA classification tool and can be applied in miRNA search experiments in a wide range of species.
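The idea of thresholding a classifier's score function can be sketched as follows. This is only an illustration under stated assumptions, not the HuntMi implementation: the abstract does not say which point on the ROC curve ROC-select picks, so the sketch assumes the cut-off that maximises the geometric mean of sensitivity and specificity. The function name `select_threshold` and the toy scores are hypothetical.

```python
import math

def select_threshold(scores, labels):
    """Return (threshold, g_mean) maximising sqrt(sensitivity * specificity).

    scores: classifier scores, higher means more likely positive.
    labels: 1 = real pre-miRNA (minority), 0 = pseudo hairpin (majority).
    A sample is predicted positive when score >= threshold.
    The G-mean criterion is an assumption made for this sketch.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    best_t, best_g = None, -1.0
    # Each distinct score is a candidate cut-off, i.e. one ROC point.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        g = math.sqrt((tp / n_pos) * (tn / n_neg))
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Hypothetical scores from some classifier: 2 positives, 8 negatives.
scores = [0.9, 0.4, 0.45, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]
labels = [1,   1,   0,    0,   0,   0,   0,   0,   0,    0]

threshold, g_mean = select_threshold(scores, labels)
print(threshold, g_mean)
```

With a fixed cut-off of 0.5, only one of the two positives in this toy data would be detected. Lowering the threshold to the selected value recovers both at the cost of one false positive, which is the sensitivity/specificity trade-off that score thresholding is meant to tune. In practice the threshold would be chosen on training folds only and then applied to held-out data.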
Conclusions
Our results indicate that HuntMi is an effective and flexible tool for the identification of new microRNAs in animals, plants and viruses. The ROC-select strategy proves superior to other methods of dealing with the class imbalance problem and could possibly be used in other machine learning classification tasks. The HuntMi software and the datasets used in the research are freely available at http://lemur.amu.edu.pl/share/HuntMi/.
【 License 】
2013 Gudyś et al.; licensee BioMed Central Ltd.