Chem-Bio Informatics Journal
A Novel Over-Sampling Method and its Application to Cancer Classification from Gene Expression Data
Lan Anh T. Nguyen [2], Xuan Tho Dang [2], Kenji Satou [3], Thammakorn Saethang [2], Mamoru Kubo [3], Yoichi Yamada [3], Duong Hung Bui [1], Tu Kien T. Le [2], Osamu Hirose [3], Vu Anh Tran [2]
[1] Faculty of Information Technology, Vietnam Trade Union University; [2] Graduate School of Natural Science and Technology, Kanazawa University; [3] Institute of Science and Engineering, Kanazawa University
Keywords: Imbalanced dataset; Class imbalance; SMOTE; Over-sampling; Cancer classification
DOI: 10.1273/cbij.13.19
Subject classification: Biochemistry/Biophysics
Source: Chem-Bio Informatics Society
【 Abstract 】
One of the most critical and frequent problems in biomedical data classification is imbalanced class distribution, where samples from the majority class significantly outnumber those from the minority class. SMOTE is a well-known general over-sampling method used to address this problem; however, in some cases it fails to improve classification performance, or even reduces it. To address these issues, we have developed a novel minority over-sampling method named safe-SMOTE. Experimental results on two gene expression datasets for cancer classification (colon-cancer and leukemia) and six imbalanced benchmark datasets from the UCI Machine Learning Repository showed that our method achieved better sensitivity and G-mean values than both the control method (no over-sampling) and SMOTE. For example, in the colon-cancer dataset, the sensitivity and specificity achieved by SMOTE (81.36% and 88.63%) were lower than those of the control method (81.59% and 89.50%), whereas safe-SMOTE increased both values (81.82% and 90.50%). Similarly, the G-mean value of the control (85.45%) decreased to 84.91% when SMOTE was employed, but increased to 86.04% with safe-SMOTE. In the leukemia dataset, SMOTE improved the sensitivity and G-mean values with respect to the control; however, safe-SMOTE achieved even greater improvements on both criteria.
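The abstract does not define G-mean. Assuming it is the usual geometric mean of sensitivity and specificity, sqrt(sensitivity × specificity), the colon-cancer figures above can be cross-checked with the short sketch below. The function name `g_mean` and the dictionary layout are illustrative choices, not taken from the paper. The reported G-mean values may be averages over repeated runs, so the recomputed numbers are only expected to agree approximately.

```python
import math


def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of sensitivity and specificity, both given as fractions in [0, 1].

    Assumption: this is the standard G-mean definition; the abstract itself
    does not spell out the formula.
    """
    return math.sqrt(sensitivity * specificity)


# (sensitivity %, specificity %) as reported in the abstract for the colon-cancer dataset
reported = {
    "control": (81.59, 89.50),
    "SMOTE": (81.36, 88.63),
    "safe-SMOTE": (81.82, 90.50),
}

# Recompute G-mean in percent from the reported sensitivity/specificity pairs
recomputed = {
    name: 100 * g_mean(sens / 100, spec / 100)
    for name, (sens, spec) in reported.items()
}

for name, value in recomputed.items():
    print(f"{name}: recomputed G-mean is about {value:.2f}%")
```

Under this assumption the recomputed values fall within a few hundredths of a percentage point of the reported 85.45%, 84.91% and 86.04%, and preserve the same ordering: SMOTE below the control, safe-SMOTE above it.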
【 License 】
Unknown