A comparative study of explainable machine learning models with Shapley values for diabetes prediction

Healthcare analytics (New York, N.Y.). Pub Date: 2025-03-11. DOI: 10.1016/j.health.2025.100390

Keona Pang

Citations: 0

Abstract

Numerous machine learning models have been developed over the years and applied successfully across many fields. This study uses a large type 2 diabetes prediction dataset from the Centers for Disease Control and Prevention (CDC) in the United States. The dataset contains 70,692 samples, each with 21 input features and one binary output (non-diabetes or diabetes). In addition to health indicators such as Body Mass Index (BMI), blood pressure, and cholesterol level, the features include socioeconomic factors (e.g., income, education) and lifestyle factors such as diet and physical activity. The paper studies how these features influence diabetes risk. Of the dataset, 80% is used for training and 20% for testing. Six machine learning models, together with the Multivariate Adaptive Regression Splines (MARS) model, are investigated, and their performance is compared in detail. Shapley values, visualized as color graphs, are used to explain the behavior of the models and to demonstrate their reliability. The paper shows how Shapley values can improve the explainability and interpretability of these models for diabetes prediction. SHapley Additive exPlanations (SHAP) scores provide information about the relative importance of each predictive feature, and these results shed light on the relationship between the features and the risk of developing type 2 diabetes.
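To make the idea of a Shapley attribution concrete, the following is a minimal sketch, not the paper's method or code. It computes exact Shapley values for a hypothetical three-feature logistic risk score. The feature names, coefficients, baseline point, and example person are all invented for illustration, and a single baseline point stands in for the background dataset that SHAP implementations typically average over.

```python
from itertools import combinations
from math import exp, factorial

# Hypothetical toy model: a logistic risk score over three features.
# The coefficients are made up for illustration; they are not from the paper.
FEATURES = ["BMI", "HighBP", "HighChol"]
WEIGHTS = {"BMI": 0.08, "HighBP": 0.9, "HighChol": 0.6}
INTERCEPT = -3.5


def predict(x):
    """Return the toy model's predicted probability for a feature dict x."""
    z = INTERCEPT + sum(WEIGHTS[name] * x[name] for name in FEATURES)
    return 1.0 / (1.0 + exp(-z))


def shapley_values(x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline point.

    A coalition S is evaluated by taking the features in S from x and all
    remaining features from the baseline (a simplifying assumption).
    """
    n = len(FEATURES)

    def value(coalition):
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in FEATURES}
        return predict(mixed)

    phi = {}
    for feature in FEATURES:
        others = [f for f in FEATURES if f != feature]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {feature}) - value(s))
        phi[feature] = total
    return phi


# Hypothetical individual and reference point.
person = {"BMI": 34.0, "HighBP": 1, "HighChol": 1}
baseline = {"BMI": 26.0, "HighBP": 0, "HighChol": 0}
phi = shapley_values(person, baseline)

for name in FEATURES:
    print(f"{name}: {phi[name]:+.4f}")
print("sum of attributions:", sum(phi.values()))
print("prediction gap:     ", predict(person) - predict(baseline))
```

The last two printed numbers agree up to floating-point error: the attributions sum to the difference between the prediction for the individual and the prediction at the baseline. This additivity is what allows per-feature contributions to be read as shares of one prediction. Exact enumeration visits every coalition, so it is only practical for a handful of features. With 21 inputs, as in the dataset described above, an approximate or model-specific method would normally be used instead.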

Source journal

Healthcare analytics (New York, N.Y.) Applied Mathematics, Modelling and Simulation, Nursing and Health Professions (General)

CiteScore

4.40

Self-citation rate

0.00%

Number of articles published

Review time

79 days