Machine learning prediction of histological types of breast cancer: a case study in Morocco

Fatima Ezahra Mouas, Latifa Doudach, Achraf Benba, Youssef Bakri, Nasri Issad, Abderrahim Ammar, Hanae Terchoune, Yahia Cherrah, Khan Wen Goh, Abdelhakim Bouyahya, Taoufiq Fechtali
Journal: Intelligence-based medicine, volume 12, Article 100275
Publication type: Journal Article
Publication date: 2025-01-01
DOI: 10.1016/j.ibmed.2025.100275
URL: https://www.sciencedirect.com/science/article/pii/S2666521225000791

Abstract

Breast cancer remains a major global public health issue and is the leading cause of cancer-related death among women. This study evaluated nine machine learning algorithms for predicting histological types of breast cancer: Random Forest; support vector machines (SVM) with RBF, linear, and polynomial kernels; K-Nearest Neighbors; logistic regression; AdaBoost; XGBoost; and a stacking classifier. The stacking classifier achieved the highest accuracy, 99.1 percent, followed by Random Forest at 98.3 percent and SVM with an RBF kernel at 97.68 percent. XGBoost reached 97.4 percent, while K-Nearest Neighbors and SVM with a polynomial kernel reached 90.7 and 88.1 percent, respectively. AdaBoost obtained 83.6 percent, and the linear-kernel SVM and logistic regression performed worst, at 56.8 and 53.9 percent, respectively. Hyperparameter optimization with Optuna improved Random Forest accuracy from 96.94 percent to 98.3 percent. Balancing the classes with RandomOverSampler increased recall for the minority class from 92 percent to 98 percent, improving sensitivity to rare cases.
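
The class-balancing step reported above can be illustrated with a minimal sketch. The study names RandomOverSampler (a class of that name is provided by the imbalanced-learn library). The pure-Python function below is only a stand-in that mimics the basic idea of duplicating minority-class rows; it is not the library's implementation. The toy data, class labels, and function names are made up for illustration and do not come from the study.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))  # sample with replacement
            y_out.append(label)
    return X_out, y_out


def recall(y_true, y_pred, label):
    """Recall for one class: correctly predicted members / actual members."""
    actual = sum(1 for t in y_true if t == label)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / actual if actual else 0.0


# Hypothetical toy data: [age, postmenopausal flag]; 6 majority rows, 2 minority rows.
X = [[51, 1], [47, 0], [63, 1], [58, 1], [39, 0], [55, 1], [44, 0], [36, 0]]
y = ["IDC", "IDC", "IDC", "IDC", "IDC", "IDC", "other", "other"]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have the same number of rows
```

In practice, oversampling is usually applied only to the training split, so that duplicated rows cannot leak into the data used to measure recall.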
The study cohort had a mean age of 51 years, and 71.6 percent of patients were diagnosed with invasive ductal carcinoma. The average tumor size was 3.3 cm, and 11.81 percent of cases were triple-negative breast cancer. Postmenopausal women represented 46.24 percent of the sample. Spearman correlation analysis showed positive correlations between age, menopause, and the presence of invasive ductal carcinoma. Feature importance analysis using Random Forest identified age, menopause, city, and marital status as the main predictive factors.
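
As an illustration of the Spearman analysis mentioned above, the following sketch computes a rank correlation in plain Python: each variable is converted to ranks (ties share their mean rank), and the Pearson correlation of the ranks is taken. The function names and the small age and menopause arrays are hypothetical and are not taken from the study's data.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical toy data: age in years and a 0/1 postmenopausal flag.
ages = [34, 41, 47, 52, 58, 63, 69]
postmenopausal = [0, 0, 0, 1, 1, 1, 1]
rho = spearman(ages, postmenopausal)
print(round(rho, 3))
```

A positive value indicates that higher ranks of one variable tend to accompany higher ranks of the other, which is the kind of association the abstract reports between age and menopause.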
To facilitate clinical application, integration of the model into electronic health records is proposed, with automated data entry, real-time predictions with confidence levels, and a clinician validation interface that supports continuous model improvement and secure diagnostic support.
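
One way the proposed "prediction with a confidence level, followed by clinician validation" could look is sketched below. This is an assumption-laden illustration, not the authors' design: the `Prediction` record, the `triage` function, the 0.90 review threshold, and the example probabilities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str          # predicted histological type
    confidence: float   # probability assigned to that type
    needs_review: bool  # True when confidence is below the threshold


def triage(probabilities, review_threshold=0.90):
    """Pick the most probable class and flag low-confidence cases.

    `probabilities` maps class label -> predicted probability.
    The threshold is an arbitrary illustrative value, not one from the study.
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    return Prediction(label, confidence, confidence < review_threshold)


# Hypothetical model outputs for two patients.
confident = triage({"invasive ductal carcinoma": 0.97, "other": 0.03})
uncertain = triage({"invasive ductal carcinoma": 0.62, "other": 0.38})
print(confident)
print(uncertain)
```

In this sketch the flag only prioritises cases for the clinician; a real deployment would still need the validation interface described in the abstract to record the clinician's final decision.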

Source journal
Intelligence-based medicine (Health Informatics)
CiteScore: 5.00
Self-citation rate: 0.00%
Publication count: 0
Review time: 187 days