Performance Enhancement of Machine Learning Algorithms on Heart Stroke Prediction Application using Sampling and Feature Selection Techniques
Authors: Naga Sreeharsha Reddy Ambati, Sree Harrsha Singara, Syam Sukesh Konjeti, Selvi C
Published in: 2022 International Conference on Augmented Intelligence and Sustainable Systems (ICAISS)
Publication date: 2022-11-24
DOI: https://doi.org/10.1109/ICAISS55157.2022.10011040
Citations: 0
Abstract
A heart stroke occurs when blood flow to a certain area of the heart is restricted, most often by a blood clot. Strokes are a significant contributor to serious impairment in adults and a leading cause of death: many patients die, and some become permanently disabled. A stroke must therefore be predicted accurately so that treatment can begin as soon as possible. This project uses Kaggle's Stroke Prediction dataset, in which the classes are not balanced, to predict heart stroke. Existing stroke prediction models, which used a downsampling technique to balance the data, reached an accuracy of 75%. However, those models did not employ any resampling and feature selection (FS) techniques to improve their accuracy. To achieve the highest accuracy for stroke prediction, this work carries out a comparative analysis of several resampling approaches and FS methods across various machine learning (ML) algorithms on the stroke dataset. Classifiers are trained with K-fold cross-validation to obtain better accuracy. Pre-processing fills in missing values and converts categorical data into numerical data. Resampling strategies balance the dataset so that the trained model produces accurate results for all classes of the target variable. Similarly, FS methods extract the best features from the dataset to help improve accuracy. The experimental results show that the Instance Hardness Threshold resampling technique, combined with the Exhaustive feature selection method and the Random Forest classifier, yields a higher accuracy of 97.9%.
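To make the feature selection step concrete, the following is a minimal sketch of exhaustive feature selection scored by K-fold cross-validated accuracy. It is not the authors' code. It assumes a nearest-centroid classifier as a dependency-free stand-in for the Random Forest named in the abstract, uses a tiny made-up dataset instead of the Kaggle data, and omits the resampling step. All function names are hypothetical.

```python
from itertools import combinations


def kfold_indices(n, k):
    """Split range(n) into k contiguous, nearly equal folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def centroid_predict(train_X, train_y, x):
    """Nearest-centroid classifier (a stand-in, NOT the paper's Random Forest)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_accuracy(X, y, features, k=3):
    """Mean K-fold accuracy using only the chosen feature columns."""
    Xs = [[row[f] for f in features] for row in X]
    scores = []
    for test_idx in kfold_indices(len(Xs), k):
        held_out = set(test_idx)
        train_X = [Xs[i] for i in range(len(Xs)) if i not in held_out]
        train_y = [y[i] for i in range(len(Xs)) if i not in held_out]
        hits = sum(centroid_predict(train_X, train_y, Xs[i]) == y[i]
                   for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k


def exhaustive_feature_selection(X, y, k=3):
    """Score every non-empty feature subset; ties keep the earliest, smallest one."""
    n_features = len(X[0])
    best_subset, best_score = None, -1.0
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            score = cv_accuracy(X, y, subset, k)
            if score > best_score:  # strict '>' prefers smaller subsets on ties
                best_subset, best_score = subset, score
    return best_subset, best_score


# Made-up toy data: column 0 separates the classes, columns 1 and 2 are noise.
X_toy = [[1.0, 5.0, 0.2], [9.0, 4.0, 0.9], [2.0, 9.0, 0.8],
         [8.0, 1.0, 0.1], [1.5, 2.0, 0.5], [8.5, 8.0, 0.4]]
y_toy = [0, 1, 0, 1, 0, 1]

best_subset, best_score = exhaustive_feature_selection(X_toy, y_toy, k=3)
print(best_subset, best_score)  # prints: (0,) 1.0
```

With n features the search evaluates 2^n - 1 subsets, each with K model fits, so this approach is only practical for datasets with few columns. When resampling is added, it would normally be applied to the training folds only, so that the held-out fold keeps the original class distribution.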