Explainable machine learning identifies key quality-of-life-related predictors of arthritis status: evidence from the China health and retirement longitudinal study.
Authors: Kaibin Lin, Tingting Jiang, Jiafen Liao, Xianrun Zhou, Zheng Wang, Yiyue Chen, Xi Xu, Bing Zhou
Health and Quality of Life Outcomes, 23(1):80. Published 2025-08-26. DOI: 10.1186/s12955-025-02412-9
Abstract
Background: Arthritis is a prevalent chronic disease substantially impacting patients' quality of life (QoL). While identifying key determinants associated with arthritis is critical for targeted interventions, traditional statistical methods often struggle with complex interactions, and existing machine learning (ML) approaches frequently lack the interpretability needed to guide clinical decisions. This study integrates a comprehensive, explainable machine learning (XAI) workflow to identify and interpret key QoL-related predictors of arthritis status in a large national cohort.
Methods: Data were obtained from 15,011 participants aged >45 years in the 2020 China Health and Retirement Longitudinal Study (CHARLS). We initially selected 55 potential QoL-related predictors spanning demographic, functional, pain, psychosocial, and lifestyle domains. Feature engineering was performed to create aggregate scores, indicators, and binned variables. Missing data were handled using imputation combined with missing indicator variables. A LightGBM-based feature selection process identified 68 key predictors. Nine ML models (Logistic Regression, RandomForest, GradientBoosting, LightGBM, CatBoost, XGBoost, DecisionTree, NaiveBayes, and KNN) were developed using SMOTE-resampled training data, with hyperparameters optimized via Optuna targeting recall. Performance was evaluated on a held-out test set using Area Under the ROC Curve (AUC), Average Precision (AP), Recall, Specificity, Precision, and F1-score. SHapley Additive exPlanations (SHAP) analysis was applied to the best-performing model (GradientBoosting) for interpretation.
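The six evaluation metrics named in the Methods can be written out explicitly. The following is a minimal, dependency-free sketch of their standard definitions, not the authors' code. The function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration; the abstract does not state which threshold was used.

```python
from typing import Dict, Sequence


def threshold_metrics(y_true: Sequence[int], y_score: Sequence[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Recall, specificity, precision and F1 at a fixed threshold (0.5 is an assumption)."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "f1": f1}


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Mean of the precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy example (made-up labels and predicted probabilities, not CHARLS data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print(threshold_metrics(y_true, y_score))
print("AUC:", roc_auc(y_true, y_score))
print("AP:", round(average_precision(y_true, y_score), 4))
```

AUC and AP are computed from the ranking of predicted scores and do not depend on a threshold, whereas recall, specificity, precision, and F1 change with the chosen cut-off. That distinction matters when hyperparameters are tuned to maximize recall, as described above.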
Results: Several models achieved strong predictive performance, with GradientBoosting yielding the highest AUC (0.767, 95% CI: 0.752-0.782) and high AP (0.678, 95% CI: 0.655-0.702). SHAP analysis identified multi-site pain burden (particularly knee/leg pain and pain location count), age, self-rated health, sleep quality, functional limitations (ADL counts/scores), and negative affect as the most influential predictors driving arthritis status prediction.
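To make the SHAP-based ranking more concrete: SHAP attributes a model's prediction to its input features using Shapley values. The sketch below computes exact Shapley values by enumerating feature subsets for a small toy model. It illustrates the underlying idea only and is not the authors' analysis. The scoring function `toy_risk`, its coefficients, the feature names, and the single-reference-point baseline are all hypothetical; the SHAP library uses more efficient algorithms and typically averages over a background dataset.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict


def shapley_values(predict: Callable[[Dict[str, float]], float],
                   x: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley attribution of predict(x) - predict(baseline).

    Features outside a coalition are replaced by their baseline value
    (a simplifying assumption for this sketch).
    """
    names = list(x)
    n = len(names)

    def value(coalition: set) -> float:
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_risk(f: Dict[str, float]) -> float:
    """Hypothetical risk score with a pain-by-sleep interaction (made-up coefficients)."""
    return (0.3 * f["pain_sites"] + 0.02 * f["age"]
            + 0.5 * f["pain_sites"] * f["poor_sleep"])


x = {"pain_sites": 2, "age": 70, "poor_sleep": 1}          # person being explained
baseline = {"pain_sites": 0, "age": 60, "poor_sleep": 0}   # reference person

phi = shapley_values(toy_risk, x, baseline)
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {contribution:+.2f}")
```

In this toy case the attributions sum to the difference between the two predictions, and the interaction term is split equally between the two features involved. Ranking features by the mean absolute value of such attributions across individuals is a common way to obtain a global importance ordering of the kind reported above.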
Conclusions: This study successfully applied an XAI pipeline to identify and rank key QoL-related factors predictive of arthritis status in a large Chinese cohort, achieving robust model performance. Pain burden, age, subjective health, sleep, functional status, and psychological well-being are critical determinants. These interpretable findings can inform risk stratification and guide targeted interventions focusing on these key areas to potentially improve arthritis management.
About the journal:
Health and Quality of Life Outcomes is an open access, peer-reviewed journal offering high-quality articles, rapid publication, and wide diffusion in the public domain.
Health and Quality of Life Outcomes considers original manuscripts on the Health-Related Quality of Life (HRQOL) assessment for evaluation of medical and psychosocial interventions. It also considers approaches and studies on psychometric properties of HRQOL and patient reported outcome measures, including cultural validation of instruments if they provide information about the impact of interventions. The journal publishes study protocols and reviews summarising the present state of knowledge concerning a particular aspect of HRQOL and patient reported outcome measures. Reviews should generally follow systematic review methodology. Comments on articles and letters to the editor are welcome.