Classification of Breast Cancer Risk Factors Using Several Resampling Approaches

Md Faisal Kabir, Simone A. Ludwig
2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1243-1248
Published: 2018-12-01
DOI: 10.1109/ICMLA.2018.00202 (https://doi.org/10.1109/ICMLA.2018.00202)
Citations: 9

Abstract

Breast cancer is the most common cancer in women worldwide and the second most common cancer overall. Predicting the risk of breast cancer occurrence is an important challenge for clinical oncologists because it directly affects daily practice and clinical service. Classification is a supervised learning technique that is applied in medical domains. Achieving good performance on real data with an imbalanced class distribution is very challenging. Machine learning researchers have used various techniques to obtain higher accuracy, but that accuracy is generally reached by correctly identifying majority class samples while ignoring instances of the minority class. In most cases, however, the minority class is of greater interest than the majority class. In this research, we applied three different classification techniques to a real-world breast cancer risk factors data set. First, we applied the classifiers to the breast cancer data without any resampling. Second, because the data is imbalanced, meaning the classes are unequally distributed, we applied several resampling methods before applying the classifiers in order to improve performance. The experimental results show a significant improvement when a resampling method is used compared with no resampling, particularly for the minority class.
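The abstract does not name the specific resampling methods or classifiers, so the following is only a minimal sketch of the general idea, using two common baseline methods (random oversampling and random undersampling) as stand-ins. The toy data, the function names, and the 95/5 class split are assumptions made for illustration and are not taken from the paper. The sketch also shows why accuracy alone can mislead on imbalanced data: a predictor that always outputs the majority class scores high accuracy while finding no minority instances.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def recall(y_true, y_pred, label):
    """Fraction of true `label` instances that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total if total else 0.0


# Toy data (made up): 95 majority rows (label 0), 5 minority rows (label 1).
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5

# A predictor that always outputs the majority class looks accurate
# but never identifies a minority instance.
always_majority = [0] * len(y)
accuracy = sum(t == p for t, p in zip(y, always_majority)) / len(y)
minority_recall = recall(y, always_majority, 1)

# Balance the classes in both directions.
X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
over_counts = Counter(y_over)
under_counts = Counter(y_under)
```

In practice, resampling is normally applied only to the training split, so that the evaluation data keeps its original class distribution. Reporting per-class recall alongside accuracy makes any gain on the minority class visible.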