A Novel Over-Sampling Method and its Application to Cancer Classification from Gene Expression Data

IF 0.4 Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Chem-Bio Informatics Journal Pub Date : 2013-01-01 DOI:10.1273/CBIJ.13.19

Xuan Tho Dang, Osamu Hirose, Duong Hung Bui, Thammakorn Saethang, Vu Anh Tran, L. A. T. Nguyen, T. K. T. Le, Mamoru Kubo, Yoichi Yamada, K. Satou

Pages: 19-29

Citations: 8

Abstract

One of the most critical and frequent problems in biomedical data classification is imbalanced class distribution, where samples from the majority class significantly outnumber those from the minority class. SMOTE is a well-known general over-sampling method used to address this problem; however, in some cases it fails to improve classification performance or even reduces it. To address this issue, we have developed a novel minority over-sampling method named safe-SMOTE. Experimental results on two gene expression datasets for cancer classification (colon-cancer and leukemia) and six imbalanced benchmark datasets from the UCI Machine Learning Repository showed that our method achieved better sensitivity and G-mean values than both the control method (i.e., no over-sampling) and SMOTE. For example, on the colon-cancer dataset, the sensitivity and specificity achieved by SMOTE (81.36% and 88.63%) were lower than those of the control method (81.59% and 89.50%), whereas safe-SMOTE increased both (81.82% and 90.50%). Similarly, the G-mean of the control (85.45%) decreased to 84.91% with SMOTE but increased to 86.04% with safe-SMOTE. On the leukemia dataset, SMOTE improved the sensitivity and G-mean values relative to the control; however, safe-SMOTE achieved even greater improvements on both criteria.
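The evaluation criteria named in the abstract can be illustrated with a short sketch. It assumes the usual definitions, which the abstract itself does not spell out: sensitivity and specificity are the true-positive and true-negative rates, and G-mean is the geometric mean of the two. The function names are hypothetical, and the confusion-matrix counts in the test are made up. Recomputing G-mean from the reported colon-cancer percentages agrees with the reported G-mean values only to within about 0.01 percentage points, possibly because the published figures are averages over repeated runs.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """True-positive rate: share of positive samples classified correctly."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True-negative rate: share of negative samples classified correctly."""
    return tn / (tn + fp)


def g_mean(sens: float, spec: float) -> float:
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sens * spec)


# Sensitivity and specificity for the colon-cancer dataset, copied from the
# abstract and written as fractions rather than percentages.
reported = {
    "control": (0.8159, 0.8950),
    "SMOTE": (0.8136, 0.8863),
    "safe-SMOTE": (0.8182, 0.9050),
}

g_means = {name: g_mean(se, sp) for name, (se, sp) in reported.items()}

for name, value in g_means.items():
    print(f"{name}: G-mean is roughly {100 * value:.2f}%")
```

Because G-mean multiplies the two rates, a classifier that ignores the minority class scores near zero even when its overall accuracy is high, which is why the measure is commonly reported for imbalanced data.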


Source journal

Chem-Bio Informatics Journal (BIOCHEMISTRY & MOLECULAR BIOLOGY)

CiteScore

0.60

Self-citation rate

0.00%

Articles published