Machine learning models for classification and identification of significant attributes to detect type 2 diabetes.

IF 3.4 | Zone 3 (Medicine) | Q1 Medical Informatics

Health Information Science and Systems Pub Date : 2022-02-09 eCollection Date: 2022-12-01 DOI:10.1007/s13755-021-00168-2

Koushik Chandra Howlader, Md Shahriare Satu, Md Abdul Awal, Md Rabiul Islam, Sheikh Mohammed Shariful Islam, Julian M W Quinn, Mohammad Ali Moni


Citations: 26

Abstract

Type 2 diabetes (T2D) is a chronic disease characterized by abnormally high blood glucose levels due to insulin resistance and reduced pancreatic insulin production. The challenge addressed in this work is to identify T2D-associated features that can distinguish T2D sub-types for prognosis and treatment purposes. We therefore employed machine learning (ML) techniques to categorize T2D patients using the Pima Indian Diabetes Dataset from the Kaggle ML repository. After data preprocessing, several feature selection techniques were used to extract feature subsets, and a range of classification techniques were applied to these subsets. We then compared the classification results to identify the best classifiers in terms of accuracy, kappa statistic, area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and logarithmic loss (logloss). To evaluate the classifiers, we examined summary statistics of their results over a resampling distribution. Generalized Boosted Regression modeling achieved the highest accuracy (90.91%), together with a kappa statistic of 78.77% and a specificity of 85.19%. In addition, Sparse Distance Weighted Discrimination, Generalized Additive Model using LOESS, and Boosted Generalized Additive Models gave the maximum sensitivity (100%), highest AUROC (95.26%), and lowest logarithmic loss (30.98%), respectively. Notably, the Generalized Additive Model using LOESS was the top-ranked algorithm according to non-parametric Friedman testing. Of the features identified by these machine learning models, glucose level, body mass index, diabetes pedigree function, and age were consistently identified as the best and most frequently accurate outcome predictors. These results indicate the utility of ML methods for constructing improved T2D prediction models and for identifying outcome predictors in this Pima Indian population.
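The six evaluation measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `binary_metrics`, the 0.5 decision threshold, and the toy labels and probabilities are all assumptions made for illustration, and the values it produces have no connection to the results reported above.

```python
import math


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, kappa, AUROC, sensitivity, specificity and logloss.

    y_true: list of 0/1 labels (1 = diabetic in this illustration).
    y_prob: list of predicted probabilities for class 1.
    threshold: assumed cut-off for turning probabilities into class labels.
    """
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(y_true)

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)

    # AUROC as the probability that a random positive case is scored
    # higher than a random negative case (ties count as one half).
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    auroc = wins / (len(pos) * len(neg))

    # Logarithmic loss, with probabilities clipped away from 0 and 1.
    eps = 1e-15
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    logloss = -total / n

    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "auroc": auroc,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "logloss": logloss,
    }


# Toy data, invented purely for illustration.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_probs = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
toy_metrics = binary_metrics(toy_labels, toy_probs)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy, kappa, sensitivity, and specificity depend on the chosen threshold, whereas AUROC and logloss are computed from the predicted probabilities directly, which is one reason a comparison across several measures can rank classifiers differently.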

Supplementary information: The online version contains supplementary material available at 10.1007/s13755-021-00168-2.

Source journal: Health Information Science and Systems (Medical Informatics)

CiteScore: 11.30

Self-citation rate: 5.00%

Journal description: Health Information Science and Systems is a multidisciplinary journal that integrates artificial intelligence, computer science, and information technology with health science and services. It embraces information science research coupled with topics related to the modeling, design, development, integration, and management of health information systems, smart health, artificial intelligence in medicine, computer-aided diagnosis, and medical expert systems. The scope includes: (i) smart health, artificial intelligence in medicine, computer-aided diagnosis, medical image processing, and medical expert systems; (ii) medical big data and medical, health, and biomedicine information resources such as patient medical records, devices and equipment, and software and tools to capture, store, retrieve, process, analyze, and optimize the use of information in the health domain; (iii) data management, data mining, and knowledge discovery, all of which play a key role in decision making, management of public health, examination of standards, and privacy and security issues; (iv) development of new architectures and applications for health information systems.