Journal Article Details
BioData Mining
LVQ-SMOTE – Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data
Munehiro Nakamura [1], Yusuke Kajiwara [2], Atsushi Otsuka [1], Haruhiko Kimura [1]
[1] Department of Natural Science and Engineering, Kanazawa University, Ishikawa 9200941, Japan
[2] College of Information Science and Engineering, Ritsumeikan University, Shiga 5258577, Japan
Keywords: Synthetic Minority Over-sampling Technique; Learning Vector Quantization; Over-sampling; Biomedical data
DOI  :  10.1186/1756-0381-6-16
Received 2013-03-18; accepted 2013-09-24; published 2013.
【 Abstract 】

Background

Over-sampling methods based on the Synthetic Minority Over-sampling Technique (SMOTE) have been proposed for classification problems on imbalanced biomedical data. However, the existing over-sampling methods achieve only slightly better, and sometimes worse, results than the original SMOTE. To improve the effectiveness of SMOTE, this paper presents a novel over-sampling method that uses codebooks obtained by learning vector quantization. In general, even after an existing SMOTE variant is applied to a biomedical dataset, the empty regions of its feature space remain so large that most classification algorithms cannot reliably estimate the borderlines between classes. To tackle this problem, our over-sampling method generates synthetic samples that occupy more of the feature space than those of other SMOTE algorithms. In short, our method generates useful synthetic samples by referring to actual samples taken from real-world datasets.
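The two building blocks named above can be sketched as follows. This is a generic illustration of plain SMOTE interpolation and a single LVQ1 codebook update, not the authors' LVQ-SMOTE algorithm itself; the function names and the learning rate are illustrative assumptions.

```python
import random

def smote_sample(x, neighbor):
    # Classic SMOTE interpolation: place a synthetic minority sample
    # at a random point on the line segment between a minority sample
    # and one of its minority-class nearest neighbors.
    gap = random.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

def lvq1_update(codebook, x, same_class, lr=0.1):
    # One LVQ1 step: pull the winning codebook vector toward x when
    # their classes match, push it away otherwise.
    sign = 1.0 if same_class else -1.0
    return [c + sign * lr * (xi - c) for c, xi in zip(codebook, x)]

# A synthetic sample always lies between the two real samples,
# componentwise, because the gap factor is drawn from [0, 1).
random.seed(42)
a, b = [0.0, 0.0], [1.0, 2.0]
s = smote_sample(a, b)
```

Because synthetic points stay inside segments joining minority samples, plain SMOTE leaves much of the feature space empty; referring to codebook vectors learned from real data is what lets the proposed method spread samples more widely.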

Results

Experiments on eight real-world imbalanced datasets demonstrate that the proposed over-sampling method outperforms the original SMOTE on four of five standard classification algorithms. Moreover, its performance improves further when a recent SMOTE variant, MWMOTE, is used within our algorithm. Experiments on datasets for β-turn type prediction reveal several important patterns that were not observed in previous analyses.

Conclusions

The proposed over-sampling method generates useful synthetic samples for the classification of imbalanced biomedical data. Moreover, it is compatible with standard classification algorithms and with existing over-sampling methods.

【 License 】

   
© 2013 Nakamura et al.; licensee BioMed Central Ltd.

【 Figures 】

Figures 1–4 (captions not included in this record).

【 References 】
  • [1]Batuwita R, Palade V: MicroPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 2009, 25(8):989-995.
  • [2]Yu C, Chou L, Chang D: Predicting protein-protein interactions in unbalanced data using the primary structure of proteins. BMC Bioinformatics 2010, 11(167):1-10.
  • [3]He H, Garcia EA: Learning from imbalanced data. IEEE Trans Knowledge Data Eng 2009, 21(9):1263-1284.
  • [4]Freund Y: Boosting a weak learning algorithm by majority. Inform Comput 1995, 121(2):256-285.
  • [5]Quinlan R: C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann Publishers; 1993.
  • [6]Breiman L: Random forests. Mach Learn 2001, 45:5-32.
  • [7]Chawla N, Bowyer K, Hall L, Kegelmeyer W: SMOTE: synthetic minority over-sampling technique. J Art Intell Res 2002, 16:321-357.
  • [8]Han H, Wang WY, Mao BH: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In Proc of the 2005 International Conference on Advances in Intelligent Computing. Hefei: Springer; 2005:878-887.
  • [9]Shen S, He H, Garcia E: RAMOBoost: ranked minority oversampling in boosting. IEEE Trans Neural Netw 2010, 21(10):1624-1642.
  • [10]Barua S, Islam M, Yao X, Murase K: MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowledge Data Eng 2012 (PrePrint). doi:10.1109/TKDE.2012.232
  • [11]Kohonen T: Learning vector quantization. In The Handbook of Brain Theory and Neural Networks. Cambridge: MIT Press; 1995:537-540.
  • [12]Frank A, Asuncion A: UCI Machine Learning Repository. Irvine; 2010. http://archive.ics.uci.edu/ml/
  • [13]Kohonen T: LVQ PAK: The Learning Vector Quantization Program Package. 1996. http://www.cis.hut.fi/research/lvq_pak/
  • [14]Alon U, Barkai N, Notterman D, Gish K, Barra S, Mack D, Levine A: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999, 96:6745-6750.
  • [15]Golub T: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531-537.
  • [16]Fuchs P, Alix A: High accuracy prediction of beta-turns and their types using propensities and multiple alignments. Proteins 2005, 59(4):828-839.
  • [17]Hutchinson E, Thornton J: A revised set of potentials for beta-turn formation in proteins. Protein Sci 1994, 3(12):2207-2216.
  • [18]Kountouris P, Hirst J: Predicting β-turns and their types using predicted backbone dihedral angles and secondary structures. BMC Bioinformatics 2010, 11(407):1-11.
  • [19]Cortes C, Vapnik V: Support-vector networks. Mach Learn 1995, 20(3):273-297.
  • [20]Sumner M, Frank E, Hall M: Speeding up logistic model tree induction. In Proc of 9th European Conference on Principles and Practice of Knowledge Discovery in Databases. Porto: Springer; 2005:675-683.
  • [21]Rumelhart D, Hinton G, Williams R: Learning Internal Representations by Error Propagation, Volume 1. Cambridge: MIT Press; 1986.
  • [22]John GH, Langley P: Estimating continuous distributions in Bayesian classifiers. In Proc of the Eleventh Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann Publishers Inc.; 1995:338-345.
  • [23]Chang CC, Lin CJ: LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2011, 2(3):27:1-27:27.
  • [24]Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: Weka 3: data mining software in Java. ACM SIGKDD Explorations Newsletter; 2009. Machine Learning Group at the University of Waikato. http://www.cs.waikato.ac.nz/ml/weka/
  • [25]Freund Y, Schapire RE: A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 1997, 55(1):119-139.
  • [26]Shi X, Hu X, Li S, Liu X: Prediction of β-turn types in protein by using composite vector. J Theor Biol 2011, 286(1):24-30.