Author: Keona Pang
Journal: Healthcare analytics (New York, N.Y.), volume 7, Article 100390
Publication date: 2025-03-11
Publication type: Journal Article
DOI: 10.1016/j.health.2025.100390
URL: https://www.sciencedirect.com/science/article/pii/S2772442525000097
A comparative study of explainable machine learning models with Shapley values for diabetes prediction
Numerous machine learning models have been developed over the years and applied successfully across many fields. This study uses a large type 2 diabetes prediction dataset from the Centers for Disease Control and Prevention (CDC) in the United States. The dataset contains 70,692 samples, each with 21 input features and one binary output (non-diabetes or diabetes). In addition to health indicators such as Body Mass Index (BMI), blood pressure, and cholesterol level, the features include socioeconomic factors (e.g., income, education) and lifestyle factors such as diet and physical activity. This paper studies how these features influence diabetes risk. The dataset is split 80% for training and 20% for testing. Six machine learning models, together with the Multivariate Adaptive Regression Splines (MARS) model, are investigated, and their performance is compared in detail. Shapley values, visualized as color plots, are used to explain the behavior of the models and to assess their reliability. The paper shows how Shapley values can improve the explainability and interpretability of these models for diabetes prediction. We use SHapley Additive exPlanations (SHAP) scores to quantify the relative importance of each predictive feature, and the results shed light on the relationship between the features and the risk of developing type 2 diabetes.
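
To make the idea of a Shapley-value attribution concrete, the following is a minimal sketch that computes exact Shapley values for a toy risk score by enumerating every feature coalition. It is not the paper's pipeline: the three binary features (`high_bmi`, `high_bp`, `high_chol`), the coefficients in `toy_risk_model`, and the all-zero baseline are hypothetical and chosen only for illustration. Practical SHAP tooling approximates these quantities because exact enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature names, loosely inspired by the health indicators
# mentioned in the abstract; they are not the dataset's actual columns.
FEATURES = ["high_bmi", "high_bp", "high_chol"]


def toy_risk_model(x):
    """Made-up risk score with one interaction term (illustration only)."""
    return (
        0.10
        + 0.20 * x["high_bmi"]
        + 0.15 * x["high_bp"]
        + 0.05 * x["high_chol"]
        + 0.10 * x["high_bmi"] * x["high_bp"]
    )


def make_value_fn(model, patient, baseline):
    """Value of a coalition: model output when only the coalition's features
    take the patient's values and all other features stay at the baseline."""
    def value(coalition):
        x = {f: (patient[f] if f in coalition else baseline[f]) for f in patient}
        return model(x)
    return value


def shapley_values(value_fn, features):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over all coalitions of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k: k! (n-k-1)! / n!
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


patient = {f: 1 for f in FEATURES}   # hypothetical patient: all risk flags set
baseline = {f: 0 for f in FEATURES}  # hypothetical reference: no risk flags set

value_fn = make_value_fn(toy_risk_model, patient, baseline)
phi = shapley_values(value_fn, FEATURES)

for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

In this toy setting the interaction between `high_bmi` and `high_bp` is split evenly between the two features, and the attributions sum to the difference between the patient's score and the baseline score. That additivity property is what lets a SHAP-style summary rank features by their contribution to a prediction.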