Optimizing Alzheimer's disease prediction through ensemble learning and feature interpretability with SHAP-based feature analysis.

Impact factor: 4.4; JCR Q1, Clinical Neurology

Alzheimer's and Dementia: Diagnosis, Assessment and Disease Monitoring. Pub Date: 2025-08-08. eCollection Date: 2025-07-01. DOI: 10.1002/dad2.70162

Md Kamrul Hossain, Afrina Ashraf, Md Mominul Islam, Shoriful Hassan Sourav, Md Monir Hossain Shimul

Volume 17, issue 3, article e70162. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12333869/pdf/

Citations: 0

Abstract

Introduction: Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia. Early diagnosis is vital. We developed an interpretable machine learning (ML) model for early AD prediction using open clinical data.

Methods: Data from 2149 adults (60-90 years) were obtained from Kaggle. After preprocessing and feature engineering, tree-based models were trained. A stacking ensemble model combining Gradient Boosting and XGBoost was trained, with Logistic Regression as the meta-learner. SHapley Additive exPlanations (SHAP) provided interpretability. Performance was measured by accuracy, precision, recall, F1 score, ROC and AUC.

Results: The stacked ensemble achieved 97% accuracy (AUC 0.97), with 0.97 precision, 0.94 recall, and 0.96 F1 score for AD. SHAP identified memory complaints, Mini-Mental State Examination (MMSE), functional assessment, behavioral symptoms, cholesterol, and lifestyle factors (activity, diet, sleep) as top predictors.
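To make concrete what a SHAP attribution means, the sketch below computes exact Shapley values for a single instance by enumerating feature coalitions. Everything in it is hypothetical: the `toy_risk` function, its coefficients, the three feature names, and the baseline values are invented for illustration and are not the paper's model or data. SHAP libraries rely on much faster algorithms for tree ensembles; brute-force enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(features):
    """Hypothetical risk score (not the paper's model)."""
    memory_complaint, mmse, functional = features
    score = 0.4 * memory_complaint + 0.02 * (30 - mmse) + 0.03 * (10 - functional)
    # Interaction term: memory complaints count more when MMSE is low.
    if memory_complaint and mmse < 24:
        score += 0.1
    return score


names = ["memory_complaint", "mmse", "functional_assessment"]
x = [1, 20, 4]          # hypothetical patient
baseline = [0, 28, 8]   # hypothetical reference patient

phi = shapley_values(toy_risk, x, baseline)
for name, contribution in zip(names, phi):
    print(f"{name}: {contribution:+.3f}")
print(f"sum of contributions: {sum(phi):.3f}")
print(f"risk difference:      {toy_risk(x) - toy_risk(baseline):.3f}")
```

The contributions add up to the difference between the patient's score and the baseline score, which is the additivity property that lets SHAP rank predictors such as memory complaints or MMSE by their share of a prediction.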

Conclusion: The ensemble model, enhanced by SHAP analysis, provides accurate and interpretable AD risk predictions with potential applicability in future clinical decision support systems.

Highlights: Developed an ensemble machine learning (ML) model for early Alzheimer's disease (AD) prediction. Achieved 97% accuracy using stacked XGBoost and Gradient Boosting. SHapley Additive exPlanations (SHAP) analysis identified key cognitive and lifestyle-related risk factors. Model interprets AD risk using explainable artificial intelligence (AI) for clinical applicability. Utilized open-access dataset to ensure reproducibility and transparency.

Source journal

Alzheimer's and Dementia: Diagnosis, Assessment and Disease Monitoring (Medicine-Psychiatry and Mental Health)

CiteScore: 7.80

Self-citation rate: 7.50%

Articles published: 101

Review time: 8 weeks

About the journal: Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring (DADM) is an open access, peer-reviewed journal from the Alzheimer's Association® that will publish new research that reports the discovery, development and validation of instruments, technologies, algorithms, and innovative processes. Papers will cover a range of topics related to the early and accurate detection of individuals with memory complaints and/or among asymptomatic individuals at elevated risk for various forms of memory disorders. The expectation for published papers will be to translate fundamental knowledge about the neurobiology of the disease into practical reports that describe both the conceptual and methodological aspects of the submitted scientific inquiry. Published topics will explore the development of biomarkers, surrogate markers, and conceptual/methodological challenges. Publication priority will be given to papers that describe 1) putative surrogate markers that accurately track disease progression, 2) biomarkers that fulfill international regulatory requirements, 3) reports from large, well-characterized population-based cohorts that comprise the heterogeneity and diversity of asymptomatic individuals, and 4) algorithmic development that considers multi-marker arrays (e.g., integrated-omics, genetics, biofluids, imaging, etc.) and advanced computational analytics and technologies.