Classification of Breast Cancer Risk Factors Using Several Resampling Approaches

2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA) | Pub Date: 2018-12-01 | DOI: 10.1109/ICMLA.2018.00202

Md Faisal Kabir, Simone A. Ludwig

Pages: 1243-1248

Citations: 9

Abstract

Breast cancer is the most common cancer in women worldwide and the second most common cancer overall. Predicting the risk of breast cancer occurrence is an important challenge for clinical oncologists because it directly influences daily practice and clinical service. Classification is a supervised learning task that is widely applied in medical domains. Achieving good performance on real data with imbalanced classes is very challenging. Machine learning researchers have used various techniques to obtain higher accuracy, but that accuracy generally comes from correctly identifying majority class samples while ignoring minority class instances. In most cases, however, the minority class is of greater interest than the majority class. In this research, we applied three different classification techniques to a real-world breast cancer risk factors data set. First, we applied the classifiers to the breast cancer data without any resampling. Second, because the data is imbalanced, meaning that the classes are unequally distributed, we applied several resampling methods before applying the classifiers in order to improve performance. The experimental results show a significant improvement when a resampling method is used compared with no resampling, particularly for the minority class.
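To make the imbalance problem concrete, the following is a minimal sketch and not the authors' pipeline. The abstract does not name the three classifiers or the resampling methods, so random oversampling is assumed here purely for illustration, and the data, the 95/5 class split, and the helper names (`random_oversample`, `accuracy`, `recall`) are all hypothetical. It shows how a model that ignores the minority class can still reach high accuracy, and how oversampling equalizes the class counts.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total if total else 0.0


# Synthetic, made-up data: 95 majority (0) and 5 minority (1) samples.
labels = [0] * 95 + [1] * 5
rows = [[i] for i in range(100)]

# A trivial baseline that always predicts the majority class.
always_majority = [0] * len(labels)
baseline_accuracy = accuracy(labels, always_majority)
baseline_minority_recall = recall(labels, always_majority, positive=1)

# After oversampling, both classes have the same number of samples.
balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)

print(baseline_accuracy, baseline_minority_recall, dict(balanced_counts))
```

In practice, resampling would normally be applied to the training split only, so that evaluation still reflects the original class distribution, and minority-class recall would be reported alongside accuracy.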

