Comparative analysis of resampling algorithms in the prediction of stroke diseases

UMYU Scientifica, Pub Date: 2023-03-30, DOI: 10.56919/usci.2123.011

Dauda Sani Abdullahi, Dr. Muhammad Sirajo Aliyu, Usman Musa Abdullahi


Abstract

Stroke is a serious cause of death globally. Early prediction of the disease can save many lives, but most clinical datasets, including the stroke dataset, are imbalanced, which biases predictive algorithms towards the majority class. The objective of this research is to compare different data resampling algorithms on the stroke dataset in order to improve the predictive performance of machine learning models. This paper considers five resampling algorithms: Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), and two hybrid techniques, SMOTE with Edited Nearest Neighbor (SMOTE-ENN) and SMOTE with Tomek Links (SMOTE-TOMEK). The resampled data were used to train six machine learning classifiers: Logistic Regression (LR), Decision Tree (DT), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), and XGBoost (XGB). The hybrid technique SMOTE-ENN improved the classifiers the most, followed by SMOTE, while the combination of SMOTE and XGB performed best, with an accuracy of 97.99%, a G-mean score of 0.99, and an AUC-ROC score of 0.99. Resampling algorithms balanced the dataset and enhanced the predictive power of the machine learning algorithms. Therefore, we recommend resampling the stroke dataset when predicting stroke disease rather than modeling on the imbalanced dataset.
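To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes toy two-dimensional data and uses only the Python standard library. The function names `smote_like_oversample` and `g_mean` are hypothetical, chosen for illustration. The first function shows the interpolation step at the core of SMOTE (a synthetic point placed on the segment between a minority sample and one of its nearest minority neighbours), and the second computes the G-mean as the geometric mean of sensitivity and specificity.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea).
    Assumes `minority` holds at least two points given as numeric tuples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy data (an assumption for illustration, not the stroke dataset)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_count = 10
new_points = smote_like_oversample(minority, majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority))

# A classifier that always predicts the majority class scores 90% accuracy
# on this toy split, yet its G-mean is zero because sensitivity is zero.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(g_mean(y_true, y_pred))
```

The last example illustrates why a metric such as the G-mean is reported alongside accuracy on imbalanced data: a model biased towards the majority class can look accurate while missing every minority case. In practice, a maintained library implementation of SMOTE and its hybrid variants would normally be preferred over a hand-written sketch like this one.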
