Enhancing stroke prediction models: A mixing of data augmentation and transfer learning for small-scale dataset in machine learning

Authors: Imam Tahyudin, Ade Nurhopipah, Ades Tikaningsih, Puji Lestari, Yaya Suryana, Edi Winarko, Eko Winarto, Nazwan Haza, Hidetaka Nambo
Journal: Computer methods and programs in biomedicine update, volume 8, Article 100198
Published: 2025-01-01
DOI: 10.1016/j.cmpbup.2025.100198
URL: https://www.sciencedirect.com/science/article/pii/S2666990025000229
Machine learning is a powerful technique for analysing datasets and making data-driven recommendations. In general, however, its ability to recognise patterns improves with the size of the dataset. In some fields, such as medicine, collecting each data instance takes considerable work and budget. Techniques that supplement the available data are therefore needed to increase the dataset size and improve model quality.
This study applied Data Augmentation and Transfer Learning to address the small-scale dataset problem in analysing stroke patient information from the Banyumas Regional General Hospital (RSUD Banyumas). The information is used to predict the patient's status at discharge from the hospital. The research compared the prediction accuracy of three solutions: Data Augmentation, Transfer Learning, and a mix of both methods. Four classification algorithms were employed: Random Forest, Support Vector Machine (SVM), Gradient Boosting, and Extreme Gradient Boosting (XGBoost). We used the Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC) to generate the artificial dataset. In the Transfer Learning process, we used a benchmark stroke dataset whose target differs from ours, so we labelled it based on the nearest neighbours in the original dataset.

Data Augmentation led to better performance than using only the original dataset. Transfer Learning on its own, however, did not give satisfactory results for XGBoost and SVM. Mixing Data Augmentation and Transfer Learning provided the best performance, achieved by the Random Forest model: accuracy and recall of 0.813 each, precision of 0.853497, and an F1 score of 0.826628. The research can contribute significantly to developing better classification models so that physicians can obtain more accurate information and treat stroke cases more effectively and efficiently.
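
The nearest-neighbour labelling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the distance measure, the number of neighbours, or the features used, so everything below is an assumption made for illustration: the helper names `mixed_distance` and `knn_label`, the choice of k = 3, the 0/1 mismatch penalty for nominal features, the toy rows, and the placeholder class names `class_0` and `class_1`. It shows only the general idea of assigning each benchmark row the majority label of its closest rows in the original dataset.

```python
from collections import Counter
from math import sqrt


def mixed_distance(a, b, nominal_idx):
    """Distance for rows mixing continuous and nominal features.

    Continuous features contribute a squared difference; nominal features
    contribute 1 on mismatch and 0 on match (an assumed, simple choice).
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in nominal_idx:
            total += 0.0 if x == y else 1.0
        else:
            total += (x - y) ** 2
    return sqrt(total)


def knn_label(query, labelled_rows, labels, nominal_idx, k=3):
    """Return the majority label among the k labelled rows nearest to query."""
    ranked = sorted(
        range(len(labelled_rows)),
        key=lambda i: mixed_distance(query, labelled_rows[i], nominal_idx),
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already-scaled toy rows: (age, glucose, sex).
# These are NOT values from the hospital or benchmark datasets.
original_rows = [
    (0.80, 0.90, "F"),
    (0.75, 0.85, "F"),
    (0.70, 0.95, "M"),
    (0.30, 0.20, "M"),
    (0.25, 0.30, "M"),
    (0.35, 0.25, "F"),
]
original_labels = ["class_0", "class_0", "class_0",
                   "class_1", "class_1", "class_1"]

# Benchmark rows arrive without the target we need, so they borrow one
# from their nearest neighbours in the original (labelled) data.
benchmark_rows = [(0.78, 0.88, "F"), (0.28, 0.22, "M")]
pseudo_labels = [
    knn_label(row, original_rows, original_labels, nominal_idx={2})
    for row in benchmark_rows
]
print(pseudo_labels)  # ['class_0', 'class_1']
```

In this toy setting, the pseudo-labelled benchmark rows could then be appended to the training data. With unscaled continuous features the mismatch penalty would need rescaling, which is one reason the distance choice matters in practice.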