Performance Enhancement of Machine Learning Algorithms on Heart Stroke Prediction Application using Sampling and Feature Selection Techniques
Naga Sreeharsha Reddy Ambati, Sree Harrsha Singara, Syam Sukesh Konjeti, Selvi C
2022 International Conference on Augmented Intelligence and Sustainable Systems (ICAISS), published 2022-11-24
DOI: 10.1109/ICAISS55157.2022.10011040 (https://doi.org/10.1109/ICAISS55157.2022.10011040)
Citation count: 0
Abstract
A heart stroke occurs when blood flow to a certain area of the heart is restricted, most often by a blood clot. Strokes are a leading cause of death and a significant contributor to serious impairment in the adult population: many individuals die, and some become permanently disabled. A stroke must therefore be predicted accurately so that treatment can begin as soon as possible. This project uses Kaggle's Stroke Prediction dataset, in which the classes are not balanced, to predict heart stroke. Existing stroke prediction models, which used a downsampling technique to balance the data, reached an accuracy of 75%. However, those models did not employ other resampling or Feature Selection (FS) techniques to improve their accuracy. To achieve the highest accuracy for stroke prediction, this work carries out a comparative analysis of several resampling approaches and FS methods across various Machine Learning (ML) algorithms on the stroke dataset. To obtain better accuracy, the classifiers are trained with K-Fold cross-validation. Appropriate pre-processing techniques are applied to fill in missing values and to convert categorical data into numerical data. Resampling strategies are used to balance the dataset so that the trained model produces accurate results for every class of the target variable. Similarly, FS methods are used to select the features from the dataset that help improve accuracy. The experimental results show that the Instance Hardness Threshold resampling technique, combined with the Exhaustive feature selection method and the Random Forest classifier, yields an improved accuracy of 97.9%.
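The combination of exhaustive feature selection and K-Fold cross-validation described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`k_fold_indices`, `centroid_accuracy`, `exhaustive_feature_selection`) are hypothetical, the data is synthetic rather than the Kaggle Stroke Prediction dataset, and a simple nearest-centroid classifier stands in for the Random Forest so that the example runs with the standard library alone. The data is generated already balanced, so the resampling step is assumed to have happened beforehand.

```python
from itertools import combinations
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices once and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def centroid_accuracy(X, y, cols, train, test):
    """Fit a nearest-centroid classifier (a stand-in for Random Forest) on the
    `train` rows using only the columns in `cols`; return accuracy on `test`."""
    sums, counts = {}, {}
    for i in train:
        acc = sums.setdefault(y[i], [0.0] * len(cols))
        for j, c in enumerate(cols):
            acc[j] += X[i][c]
        counts[y[i]] = counts.get(y[i], 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
    correct = 0
    for i in test:
        # Predict the class whose centroid is closest in squared distance.
        pred = min(
            centroids,
            key=lambda lab: sum((X[i][c] - m) ** 2 for c, m in zip(cols, centroids[lab])),
        )
        correct += pred == y[i]
    return correct / len(test)


def exhaustive_feature_selection(X, y, k=5, seed=0):
    """Score every non-empty feature subset by mean k-fold accuracy.
    Ties keep the subset found first, i.e. the smaller one."""
    folds = k_fold_indices(len(X), k, seed)
    n_features = len(X[0])
    best_cols, best_score = None, -1.0
    for size in range(1, n_features + 1):
        for cols in combinations(range(n_features), size):
            scores = []
            for f in range(k):
                train = [i for g in range(k) if g != f for i in folds[g]]
                scores.append(centroid_accuracy(X, y, cols, train, folds[f]))
            mean_score = sum(scores) / k
            if mean_score > best_score:
                best_cols, best_score = cols, mean_score
    return best_cols, best_score


# Synthetic, balanced toy data (an assumption for illustration only):
# feature 0 separates the two classes by construction, features 1 and 2 are noise.
rng = random.Random(42)
X, y = [], []
for label in (0, 1):
    for _ in range(20):
        X.append([2 * label + rng.random(), rng.random(), rng.random()])
        y.append(label)

best_cols, best_score = exhaustive_feature_selection(X, y, k=5)
print("selected feature indices:", best_cols, "mean CV accuracy:", best_score)
```

An exhaustive search evaluates all 2^n - 1 non-empty subsets of n features, each with k model fits, so it is only practical for datasets with a small number of features.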