使用 SMOTE 提升取样法预测中风疾病的数据挖掘分类算法比较

JUITA : Jurnal Informatika Pub Date : 2023-11-17 DOI:10.30595/juita.v11i2.17348

Ronald Sebastian, Christina Juliane

{"title":"使用 SMOTE 提升取样法预测中风疾病的数据挖掘分类算法比较","authors":"Ronald Sebastian, Christina Juliane","doi":"10.30595/juita.v11i2.17348","DOIUrl":null,"url":null,"abstract":"Stroke is a circulation disorder in the brain that can cause symptoms and signs related to the affected part of the brain and is the leading cause of death and disability in Indonesia. Everyone is at risk of experiencing a stroke, and it is important to recognize and manage risk factors. Data Mining techniques can help in the extraction and prediction of information, as well as finding hidden patterns in stroke medical data. The dataset used in this research comes from Kaggle and is imbalanced, so the SMOTE Upsampling technique is used to address this imbalance issue. The results of the study conclude that the use of SMOTE technique in the C4.5, NB, and KNN algorithms can increase precision, recall, and AUC. The C4.5 algorithm and SMOTE technique as the best performing algorithm were selected for testing new data, and the results show that the model created can predict stroke risk more accurately than the C4.5 model without SMOTE. However, it should be noted that based on the author's interview with one of the medical practitioners, the model cannot be directly used in medical practice because the observations in the medical field to determine factors related to stroke are highly complex. Thus, a new understanding revealed that predicting stroke in a practical setting is highly complex. While data mining can be used as a predictive tool in the initial stage for predictions in the general population, it is strongly recommended to undergo direct examination by doctors in a hospital to obtain more accurate and comprehensive medical evaluations.","PeriodicalId":151254,"journal":{"name":"JUITA : Jurnal Informatika","volume":"12 6","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comparison of Data Mining Classification Algorithms for Stroke Disease Prediction Using the SMOTE Upsampling Method\",\"authors\":\"Ronald Sebastian, Christina Juliane\",\"doi\":\"10.30595/juita.v11i2.17348\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Stroke is a circulation disorder in the brain that can cause symptoms and signs related to the affected part of the brain and is the leading cause of death and disability in Indonesia. Everyone is at risk of experiencing a stroke, and it is important to recognize and manage risk factors. Data Mining techniques can help in the extraction and prediction of information, as well as finding hidden patterns in stroke medical data. The dataset used in this research comes from Kaggle and is imbalanced, so the SMOTE Upsampling technique is used to address this imbalance issue. The results of the study conclude that the use of SMOTE technique in the C4.5, NB, and KNN algorithms can increase precision, recall, and AUC. The C4.5 algorithm and SMOTE technique as the best performing algorithm were selected for testing new data, and the results show that the model created can predict stroke risk more accurately than the C4.5 model without SMOTE. However, it should be noted that based on the author's interview with one of the medical practitioners, the model cannot be directly used in medical practice because the observations in the medical field to determine factors related to stroke are highly complex. Thus, a new understanding revealed that predicting stroke in a practical setting is highly complex. While data mining can be used as a predictive tool in the initial stage for predictions in the general population, it is strongly recommended to undergo direct examination by doctors in a hospital to obtain more accurate and comprehensive medical evaluations.\",\"PeriodicalId\":151254,\"journal\":{\"name\":\"JUITA : Jurnal Informatika\",\"volume\":\"12 6\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-11-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JUITA : Jurnal Informatika\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.30595/juita.v11i2.17348\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JUITA : Jurnal Informatika","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.30595/juita.v11i2.17348","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

中风是一种脑循环障碍，可引起与受影响的脑部相关的症状和体征，是导致印度尼西亚人死亡和残疾的主要原因。每个人都有中风的风险，因此识别和管理风险因素非常重要。数据挖掘技术可以帮助提取和预测信息，并发现中风医疗数据中隐藏的模式。本研究使用的数据集来自 Kaggle，具有不平衡性，因此使用了 SMOTE 升采样技术来解决这一不平衡性问题。研究结果表明，在 C4.5、NB 和 KNN 算法中使用 SMOTE 技术可以提高精确度、召回率和 AUC。在测试新数据时，选择了 C4.5 算法和 SMOTE 技术作为性能最好的算法，结果表明所创建的模型比不使用 SMOTE 的 C4.5 模型能更准确地预测中风风险。但需要注意的是，根据笔者对其中一位医学从业者的访谈，该模型不能直接用于医疗实践，因为医学领域判断中风相关因素的观察非常复杂。因此，新的认识表明，在实际环境中预测中风是非常复杂的。虽然数据挖掘在初始阶段可作为预测工具用于普通人群的预测，但强烈建议在医院接受医生的直接检查，以获得更准确、更全面的医疗评估。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Comparison of Data Mining Classification Algorithms for Stroke Disease Prediction Using the SMOTE Upsampling Method

Stroke is a circulation disorder in the brain that can cause symptoms and signs related to the affected part of the brain and is the leading cause of death and disability in Indonesia. Everyone is at risk of experiencing a stroke, and it is important to recognize and manage risk factors. Data Mining techniques can help in the extraction and prediction of information, as well as finding hidden patterns in stroke medical data. The dataset used in this research comes from Kaggle and is imbalanced, so the SMOTE Upsampling technique is used to address this imbalance issue. The results of the study conclude that the use of SMOTE technique in the C4.5, NB, and KNN algorithms can increase precision, recall, and AUC. The C4.5 algorithm and SMOTE technique as the best performing algorithm were selected for testing new data, and the results show that the model created can predict stroke risk more accurately than the C4.5 model without SMOTE. However, it should be noted that based on the author's interview with one of the medical practitioners, the model cannot be directly used in medical practice because the observations in the medical field to determine factors related to stroke are highly complex. Thus, a new understanding revealed that predicting stroke in a practical setting is highly complex. While data mining can be used as a predictive tool in the initial stage for predictions in the general population, it is strongly recommended to undergo direct examination by doctors in a hospital to obtain more accurate and comprehensive medical evaluations.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

JUITA : Jurnal Informatika

自引率

0.00%

发文量