Comparative analysis of resampling algorithms in the prediction of stroke diseases

Dauda Sani Abdullahi, Dr. Muhammad Sirajo Aliyu, Usman Musa Abdullahi
Journal: UMYU Scientifica. Published: 2023-03-30. DOI: https://doi.org/10.56919/usci.2123.011

Abstract

Stroke is a serious cause of death globally. Early prediction of the disease can save many lives, but most clinical datasets, including the stroke dataset, are imbalanced, which biases predictive algorithms towards the majority class. The objective of this research is to compare different data resampling algorithms on the stroke dataset in order to improve the predictive performance of machine learning models. This paper considers five (5) resampling algorithms, namely Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic sampling (ADASYN), and the hybrid techniques SMOTE with Edited Nearest Neighbor (SMOTE-ENN) and SMOTE with Tomek Links (SMOTE-TOMEK), and trains six (6) machine learning classifiers on the resampled data, namely Logistic Regression (LR), Decision Tree (DT), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), and XGBoost (XGB). The hybrid technique SMOTE-ENN improved the machine learning classifiers the most, followed by SMOTE, while the combination of SMOTE and XGB performed best, with an accuracy of 97.99%, a G-mean score of 0.99, and an AUC-ROC score of 0.99. Resampling balanced the dataset and enhanced the predictive power of the machine learning algorithms. Therefore, we recommend resampling the stroke dataset when predicting stroke disease rather than modeling on the imbalanced dataset.
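The ideas named in the abstract can be illustrated with a minimal sketch. It assumes a tiny made-up two-feature dataset (not the stroke dataset), and the function names `random_oversample`, `smote_sample`, and `g_mean` are hypothetical helpers written for illustration, not the authors' code. It shows random oversampling, one SMOTE-style interpolation step, and why the G-mean (the square root of sensitivity times specificity) exposes a majority-biased model that plain accuracy hides.

```python
import math
import random


def random_oversample(X, y, minority_label, seed=0):
    """ROS: duplicate randomly chosen minority rows until the classes are equal in size."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_majority - len(minority))]
    return X + extra, y + [minority_label] * len(extra)


def smote_sample(minority, k=1, seed=0):
    """SMOTE-style step: build one synthetic row on the line segment between a
    minority row and one of its k nearest minority neighbours.
    Assumes at least two minority rows."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
    other = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [b + gap * (o - b) for b, o in zip(base, other)]


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (positive recall) and specificity (negative recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy imbalanced data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.1],
     [0.1, 0.4], [0.3, 0.2], [0.4, 0.4], [2.0, 2.1], [2.2, 1.9]]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y, minority_label=1)
synthetic = smote_sample([x for x, label in zip(X, y) if label == 1])

# A model that always predicts the majority class looks accurate but has G-mean 0.
y_pred = [0] * len(y)
accuracy = sum(t == p for t, p in zip(y, y_pred)) / len(y)
score = g_mean(y, y_pred)

print("class counts after ROS:", y_bal.count(0), y_bal.count(1))
print("synthetic minority row:", synthetic)
print("accuracy:", accuracy, "G-mean:", score)
```

In practice, a library such as imbalanced-learn (for example its `RandomOverSampler` and `SMOTE` classes) would normally replace these hand-written helpers; the sketch only makes the mechanics visible.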