利用各种分类方法挖掘糖尿病的重要特征

2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD) Pub Date : 2021-02-27 DOI:10.1109/ICICT4SD50815.2021.9397006

Nurjahan, Mohammad Abu Tareq Rony, Md. Shahriare Satu, Md. Whaiduzzaman

{"title":"利用各种分类方法挖掘糖尿病的重要特征","authors":"Nurjahan, Mohammad Abu Tareq Rony, Md. Shahriare Satu, Md. Whaiduzzaman","doi":"10.1109/ICICT4SD50815.2021.9397006","DOIUrl":null,"url":null,"abstract":"Diabetes is a chronic disease that occurs when blood glucose becomes very high. It is responsible for a number of serious complications in an affected patients body. However, early detection of this harmful disease can reduce critical situations like death as well as minimize the chance of losing valuable organs due to this disease. The aim of this study is to construct a predictive model through examining several machine learning techniques namely Decision tree, K Nearest Neighbour, Naive Bayes, Support Vector Machine, Logistic Regression, extreme Gradient Boosting, Multi-Layer Perceptron and Random Forest on two different datasets of diabetes patients namely Pima Indian diabetes datasets and Sylhet Diabetes Hospital datasets. Several popular and effective feature subset selection procedures have also been utilized for eliminating unnecessary attributes. After analyzing the outputs of the work, it is seen that Random Forest delivers the highest accuracy (97.5%), F-measure (97.5%), Area under Receiver Operating Characteristic Curve (99.80%) for the Gain Ratio Attribute Evaluation feature subset selection technique in case of Sylhet hospital datasets. On the other hand, in case of Pima Indian datasets, Logistic Regression delivers the highest accuracy (77.7%), F-measure (77%) for Information Gain Attribute Evaluation and Area under Receiver Operating Curve (83%) for both of the techniques namely Correlation-based Feature Selection Subset Evaluation and Correlation Attribute Evaluation. However, In this study, 10 fold cross validation technique has been used for the performance measurement.","PeriodicalId":239251,"journal":{"name":"2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Mining Significant Features of Diabetes through Employing Various Classification Methods\",\"authors\":\"Nurjahan, Mohammad Abu Tareq Rony, Md. Shahriare Satu, Md. Whaiduzzaman\",\"doi\":\"10.1109/ICICT4SD50815.2021.9397006\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Diabetes is a chronic disease that occurs when blood glucose becomes very high. It is responsible for a number of serious complications in an affected patients body. However, early detection of this harmful disease can reduce critical situations like death as well as minimize the chance of losing valuable organs due to this disease. The aim of this study is to construct a predictive model through examining several machine learning techniques namely Decision tree, K Nearest Neighbour, Naive Bayes, Support Vector Machine, Logistic Regression, extreme Gradient Boosting, Multi-Layer Perceptron and Random Forest on two different datasets of diabetes patients namely Pima Indian diabetes datasets and Sylhet Diabetes Hospital datasets. Several popular and effective feature subset selection procedures have also been utilized for eliminating unnecessary attributes. After analyzing the outputs of the work, it is seen that Random Forest delivers the highest accuracy (97.5%), F-measure (97.5%), Area under Receiver Operating Characteristic Curve (99.80%) for the Gain Ratio Attribute Evaluation feature subset selection technique in case of Sylhet hospital datasets. On the other hand, in case of Pima Indian datasets, Logistic Regression delivers the highest accuracy (77.7%), F-measure (77%) for Information Gain Attribute Evaluation and Area under Receiver Operating Curve (83%) for both of the techniques namely Correlation-based Feature Selection Subset Evaluation and Correlation Attribute Evaluation. However, In this study, 10 fold cross validation technique has been used for the performance measurement.\",\"PeriodicalId\":239251,\"journal\":{\"name\":\"2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD)\",\"volume\":\"79 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-02-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICICT4SD50815.2021.9397006\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICICT4SD50815.2021.9397006","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

糖尿病是一种慢性疾病，当血糖变得非常高时就会发生。它会导致患者体内出现许多严重的并发症。然而，这种有害疾病的早期发现可以减少死亡等危急情况，并最大限度地减少因这种疾病而失去宝贵器官的机会。本研究的目的是通过检查几种机器学习技术，即决策树，K近邻，朴素贝叶斯，支持向量机，逻辑回归，极端梯度增强，多层感知器和随机森林，在两个不同的糖尿病患者数据集，即皮马印度糖尿病数据集和Sylhet糖尿病医院数据集上构建预测模型。一些流行和有效的特征子集选择程序也被用于消除不必要的属性。在分析工作输出后，可以看到Random Forest在Sylhet医院数据集的增益比属性评估特征子集选择技术中提供了最高的准确率(97.5%)，F-measure (97.5%)， Receiver Operating Characteristic Curve下面积(99.80%)。另一方面，在皮马印第安人数据集的情况下，对于基于相关性的特征选择子集评估和相关属性评估这两种技术，逻辑回归提供了最高的准确性(77.7%)，信息增益属性评估的F-measure(77%)和接收者工作曲线下的面积(83%)。然而，在本研究中，使用10倍交叉验证技术进行绩效测量。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Mining Significant Features of Diabetes through Employing Various Classification Methods

Diabetes is a chronic disease that occurs when blood glucose becomes very high. It is responsible for a number of serious complications in an affected patients body. However, early detection of this harmful disease can reduce critical situations like death as well as minimize the chance of losing valuable organs due to this disease. The aim of this study is to construct a predictive model through examining several machine learning techniques namely Decision tree, K Nearest Neighbour, Naive Bayes, Support Vector Machine, Logistic Regression, extreme Gradient Boosting, Multi-Layer Perceptron and Random Forest on two different datasets of diabetes patients namely Pima Indian diabetes datasets and Sylhet Diabetes Hospital datasets. Several popular and effective feature subset selection procedures have also been utilized for eliminating unnecessary attributes. After analyzing the outputs of the work, it is seen that Random Forest delivers the highest accuracy (97.5%), F-measure (97.5%), Area under Receiver Operating Characteristic Curve (99.80%) for the Gain Ratio Attribute Evaluation feature subset selection technique in case of Sylhet hospital datasets. On the other hand, in case of Pima Indian datasets, Logistic Regression delivers the highest accuracy (77.7%), F-measure (77%) for Information Gain Attribute Evaluation and Area under Receiver Operating Curve (83%) for both of the techniques namely Correlation-based Feature Selection Subset Evaluation and Correlation Attribute Evaluation. However, In this study, 10 fold cross validation technique has been used for the performance measurement.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD)

自引率

0.00%

发文量