Chi-Squared Based Feature Selection for Stroke Prediction using AzureML

2020 Intermountain Engineering, Technology and Computing (IETC). Pub Date: 2020-10-02. DOI: 10.1109/IETC47856.2020.9249117

Sujan Ray, Khaldoon Alshouiliy, A. Roy, Ali AlGhamdi, D. Agrawal


Citations: 6

Abstract

In the United States, stroke is the fifth leading cause of death and a major cause of serious disability among the adult population [1]. It is therefore crucial to predict stroke accurately so that patients can be treated at an early stage. Machine Learning (ML) algorithms are now in great demand for predicting a patient's condition in advance and informing medical staff, so that the risk of disease progression can be reduced. The Kaggle Healthcare dataset has been widely used by researchers in this area to develop models for stroke prediction. The dataset has 43,400 instances and 10 features. This paper proposes a method for the analysis and prediction of stroke on the same dataset using Microsoft Azure Machine Learning (AzureML), a cloud-based platform. We apply the Chi-Squared test to the dataset to extract the top features. The experiments are run on AzureML with the top 6 features as well as with all the features, and we compare the accuracy of the models trained on the top 6 features with that of the models trained on all the features. The performance of Two-Class Decision Jungle with the top 6 features is set as the benchmark in our work. Two-Class Boosted Decision Tree, an ensemble learning method, achieves 96.8% accuracy using the top 6 features. Our experimental results show that, with the right features, the accuracy of stroke prediction can be improved significantly and the model takes less time to train.
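The Chi-Squared ranking step described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' AzureML pipeline: the function names `chi2_score` and `top_k_features`, the feature names, and the eight toy records are all made up for illustration. Each categorical feature is scored with the Pearson statistic, the sum over contingency-table cells of (observed - expected)^2 / expected, and the k highest-scoring features are kept.

```python
from collections import Counter


def chi2_score(feature_values, labels):
    """Pearson chi-squared statistic for one categorical feature vs. the class label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # contingency-table cell counts
    feature_totals = Counter(feature_values)         # row totals
    label_totals = Counter(labels)                   # column totals
    score = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            # Expected cell count if feature and label were independent.
            expected = f_count * c_count / n
            score += (observed[(f, c)] - expected) ** 2 / expected
    return score


def top_k_features(rows, labels, k):
    """Score every feature and return the k highest-scoring names plus all scores."""
    names = list(rows[0].keys())
    scores = {name: chi2_score([r[name] for r in rows], labels) for name in names}
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranked[:k], scores


# Hypothetical toy records (NOT taken from the Kaggle Healthcare dataset).
rows = [
    {"hypertension": 1, "smoker": 1, "gender": "F"},
    {"hypertension": 1, "smoker": 1, "gender": "M"},
    {"hypertension": 1, "smoker": 1, "gender": "F"},
    {"hypertension": 1, "smoker": 0, "gender": "M"},
    {"hypertension": 0, "smoker": 1, "gender": "F"},
    {"hypertension": 0, "smoker": 0, "gender": "M"},
    {"hypertension": 0, "smoker": 0, "gender": "F"},
    {"hypertension": 0, "smoker": 0, "gender": "M"},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = stroke, 0 = no stroke (toy labels)

selected, scores = top_k_features(rows, labels, k=2)
print(selected, scores)
```

In this toy data `gender` is distributed identically across both classes, so its score is zero and it is the feature dropped. The same ranking idea, applied with k = 6 to the real dataset, corresponds to the "top 6 features" setting compared against the all-features setting in the abstract.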

