Improving appendix cancer prediction with SHAP-based feature engineering for machine learning models: a prediction study.

IF 0.2 Q3 MEDICINE, GENERAL & INTERNAL
Ewha Medical Journal Pub Date : 2025-04-01 Epub Date: 2025-04-15 DOI:10.12771/emj.2025.00297
Ji Yoon Kim
Citations: 0

Improving appendix cancer prediction with SHAP-based feature engineering for machine learning models: a prediction study.

Purpose: This study aimed to leverage Shapley additive explanation (SHAP)-based feature engineering to predict appendix cancer. Traditional models often lack transparency, hindering clinical adoption. We propose a framework that integrates SHAP for feature selection, construction, and weighting to enhance accuracy and clinical relevance.

Methods: Data from the Kaggle Appendix Cancer Prediction dataset (260,000 samples, 21 features) were used in this prediction study conducted from January through March 2025, in accordance with TRIPOD-AI guidelines. Preprocessing involved label encoding, SMOTE (synthetic minority over-sampling technique) to address class imbalance, and an 80:20 train-test split. Baseline models (random forest, XGBoost, LightGBM) were compared; LightGBM was selected for its superior performance (accuracy=0.8794). SHAP analysis identified key features and guided 3 engineering steps: selection of the top 15 features, construction of interaction-based features (e.g., chronic severity), and feature weighting based on SHAP values. Performance was evaluated using accuracy, precision, recall, and F1-score.
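
The SHAP-guided selection and weighting steps described above can be illustrated with a minimal sketch. It is not the study's code: the abstract does not specify how the weights were derived or applied, so this sketch assumes features are ranked by mean absolute SHAP value and that each weight is that importance divided by the largest one. The SHAP matrix is made-up toy data, and the feature names `age` and `bmi` are hypothetical (only red blood cell count and chronic severity are named in the abstract).

```python
from typing import Dict, List


def mean_abs_shap(shap_values: List[List[float]], names: List[str]) -> Dict[str, float]:
    """Global importance per feature: mean of |SHAP| over all samples."""
    n_samples = len(shap_values)
    return {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(names)
    }


def select_top_k(importance: Dict[str, float], k: int) -> List[str]:
    """Keep the k features with the largest mean |SHAP| (the study used k=15)."""
    return sorted(importance, key=importance.get, reverse=True)[:k]


def shap_weights(importance: Dict[str, float], selected: List[str]) -> Dict[str, float]:
    """Assumed weighting scheme: importance scaled so the top feature has weight 1."""
    top = max(importance[f] for f in selected)
    return {f: importance[f] / top for f in selected}


# Toy per-sample SHAP values (rows = samples, columns = features); illustrative only.
names = ["red_blood_cell_count", "age", "bmi", "chronic_severity"]
shap_values = [
    [0.40, -0.10, 0.02, 0.30],
    [-0.60, 0.20, -0.01, 0.10],
    [0.50, -0.30, 0.03, -0.50],
]

importance = mean_abs_shap(shap_values, names)
selected = select_top_k(importance, k=3)
weights = shap_weights(importance, selected)

for feature in selected:
    print(f"{feature}: importance={importance[feature]:.3f}, weight={weights[feature]:.3f}")
```

In a real pipeline the SHAP matrix would come from an explainer fitted on the trained LightGBM model; the ranking and normalization logic would stay the same under the assumptions stated here.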

Results: Four LightGBM model configurations were evaluated: baseline (accuracy=0.8794, F1-score=0.8691), feature selection (accuracy=0.8968, F1-score=0.8860), feature construction (accuracy=0.8980, F1-score=0.8872), and feature weighting (accuracy=0.8986, F1-score=0.8877). SHAP-based engineering yielded performance improvements, with feature weighting achieving the highest precision (0.9940). Key features (e.g., red blood cell count and chronic severity) contributed to predictions while maintaining interpretability.
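
The abstract reports precision and F1-score for the feature-weighting configuration but not recall. As a rough worked example, recall can be back-calculated from the definition F1 = 2PR / (P + R). This assumes both reported figures refer to the same class and the same averaging scheme, which the abstract does not state, so the result is an estimate rather than a reported value.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    return f1 * precision / (2 * precision - f1)


# Figures reported for the feature-weighting configuration.
precision = 0.9940
f1 = 0.8877

recall = implied_recall(precision, f1)
print(f"implied recall: {recall:.4f}")

# Sanity check: recomputing F1 from the implied recall returns the reported value.
print(f"recomputed F1: {f1_score(precision, recall):.4f}")
```

Under these assumptions the implied recall is roughly 0.80, so the very high precision is paired with a noticeably lower recall.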

Conclusion: The SHAP-based framework substantially improved the accuracy and transparency of appendix cancer predictions using LightGBM (F1-score=0.8877). This approach bridges the gap between predictive power and clinical interpretability, offering a scalable model for rare disease prediction. Future validation with real-world data is recommended to ensure generalizability.
