Prediction of Diabetes Using Data Mining and Machine Learning Algorithms: A Cross-Sectional Study

IF 2.3 Q3 MEDICAL INFORMATICS
Healthcare Informatics Research Pub Date : 2024-01-01 Epub Date: 2024-01-31 DOI:10.4258/hir.2024.30.1.73
Hassan Shojaee-Mend, Farnia Velayati, Batool Tayefi, Ebrahim Babaee
Citations: 0

Abstract

Prediction of Diabetes Using Data Mining and Machine Learning Algorithms: A Cross-Sectional Study.

Objectives: This study aimed to develop a model to predict fasting blood glucose status using machine learning and data mining, since the early diagnosis and treatment of diabetes can improve outcomes and quality of life.

Methods: This cross-sectional study analyzed data from 3376 adults over 30 years old who participated in a diabetes screening program at 16 comprehensive health service centers in Tehran, Iran. The dataset was balanced using random sampling and the synthetic minority over-sampling technique (SMOTE), then split into a training set (80%) and a test set (20%). Shapley values were calculated to select the most important features. Noise analysis was performed by adding Gaussian noise to the numerical features to evaluate the robustness of feature importance. Five machine learning algorithms (CatBoost, random forest, XGBoost, logistic regression, and an artificial neural network) were used to model the dataset. Accuracy, sensitivity, specificity, the F1-score, and the area under the curve were used to evaluate the models.
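The balancing step can be illustrated with a minimal sketch of the interpolation idea behind SMOTE. It is not the authors' code: the function name `smote_like_oversample`, the toy feature values, and the choice of `k` are assumptions made here for illustration only. Each synthetic minority sample is placed at a random position on the line segment between an existing minority sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data (not from the study): [age, waist-to-hip ratio].
minority = [[52.0, 0.95], [61.0, 1.01], [58.0, 0.98], [47.0, 0.93]]
new_points = smote_like_oversample(minority, n_new=6, k=2)
```

Because every synthetic point lies between two real minority samples, it stays inside the range already covered by that class. Oversampling of this kind is normally applied to the training portion only, so that the test set still reflects real observations; the abstract does not state how this was handled in the study.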

Results: Age, waist-to-hip ratio, body mass index, and systolic blood pressure were the most important factors for predicting fasting blood glucose status. Although the models achieved similar predictive ability, the CatBoost model performed slightly better overall, with an area under the curve (AUC) of 0.737.
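To make the AUC figure concrete, the sketch below computes AUC from its rank-based definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration; they are not data from the study.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = elevated fasting glucose) and model scores.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.3, 0.4, 0.5, 0.8, 0.2]
auc = roc_auc(y_true, scores)  # 8 of the 9 pairs are ranked correctly
```

Under this reading, an AUC of 0.737 means the model ranks a positive case above a negative case in roughly 74% of such pairs, compared with 50% for random scoring.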

Conclusions: A gradient-boosted decision tree model accurately identified the most important risk factors related to diabetes. Age, waist-to-hip ratio, body mass index, and systolic blood pressure were the most important risk factors for diabetes, in that order. This model can support planning for diabetes management and prevention.

Source journal
Healthcare Informatics Research (Medical Informatics)
CiteScore: 4.90
Self-citation rate: 6.90%
Publications: 44