A Novel Over-Sampling Method and its Application to Cancer Classification from Gene Expression Data

Impact factor: 0.4; JCR quartile: Q4; category: Biochemistry & Molecular Biology
Xuan Tho Dang, Osamu Hirose, Duong Hung Bui, Thammakorn Saethang, Vu Anh Tran, L. A. T. Nguyen, T. K. T. Le, Mamoru Kubo, Yoichi Yamada, K. Satou
Chem-Bio Informatics Journal, journal article, published 2013-01-01
DOI: 10.1273/CBIJ.13.19 (https://doi.org/10.1273/CBIJ.13.19)
Citations: 8

Abstract

One of the most critical and frequent problems in biomedical data classification is imbalanced class distribution, in which samples from the majority class significantly outnumber those from the minority class. SMOTE is a well-known general over-sampling method for this problem; in some cases, however, it fails to improve classification performance or even reduces it. To address this, we have developed a novel minority over-sampling method named safe-SMOTE. Experimental results on two gene expression datasets for cancer classification (colon-cancer and leukemia) and six imbalanced benchmark datasets from the UCI Machine Learning Repository showed that our method achieved better sensitivity and G-mean values than both the control method (no over-sampling) and SMOTE. For example, in the colon-cancer dataset, the sensitivity and specificity achieved by SMOTE (81.36% and 88.63%) were lower than those of the control method (81.59% and 89.50%), whereas safe-SMOTE increased both values (81.82% and 90.50%). Similarly, the G-mean of the control (85.45%) decreased to 84.91% with SMOTE but increased to 86.04% with safe-SMOTE. In the leukemia dataset, SMOTE improved sensitivity and G-mean relative to the control, but safe-SMOTE achieved even greater improvements in both criteria.
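The abstract reports G-mean without defining it. A minimal sketch, assuming the standard definition (the square root of sensitivity times specificity), shows how the colon-cancer figures relate to one another. The function name `g_mean` and the dictionary layout are illustrative choices, not taken from the paper. The recomputed values may differ from the reported ones in the last digit, presumably because the paper averages over repeated runs.

```python
import math


def g_mean(sensitivity: float, specificity: float) -> float:
    """Geometric mean of sensitivity and specificity (both given in percent).

    Assumes the standard definition; the abstract itself does not state it.
    """
    return math.sqrt(sensitivity * specificity)


# Sensitivity and specificity (%) reported in the abstract
# for the colon-cancer dataset.
reported = {
    "control": (81.59, 89.50),
    "SMOTE": (81.36, 88.63),
    "safe-SMOTE": (81.82, 90.50),
}

g_means = {name: g_mean(sens, spec) for name, (sens, spec) in reported.items()}

for name, value in g_means.items():
    print(f"{name}: G-mean is approximately {value:.2f}%")
```

Because G-mean multiplies the two rates, it drops whenever either class is classified poorly, which is why it is a common summary metric for imbalanced data where plain accuracy can be dominated by the majority class.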
Source journal: Chem-Bio Informatics Journal (Biochemistry & Molecular Biology)
CiteScore: 0.60; self-citation rate: 0.00%; articles published: 8