Improving appendix cancer prediction with SHAP-based feature engineering for machine learning models: a prediction study.

IF 0.2 Q3 MEDICINE, GENERAL & INTERNAL

Ewha Medical Journal. Pub Date: 2025-04-01; Epub Date: 2025-04-15; DOI: 10.12771/emj.2025.00297

Ji Yoon Kim

Ewha Medical Journal, volume 48, issue 2, e31. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12277501/pdf/

Citations: 0

Abstract

Improving appendix cancer prediction with SHAP-based feature engineering for machine learning models: a prediction study.

Purpose: This study aimed to leverage Shapley additive explanation (SHAP)-based feature engineering to predict appendix cancer. Traditional models often lack transparency, hindering clinical adoption. We propose a framework that integrates SHAP for feature selection, construction, and weighting to enhance accuracy and clinical relevance.

Methods: Data from the Kaggle Appendix Cancer Prediction dataset (260,000 samples, 21 features) were used in this prediction study conducted from January through March 2025, in accordance with TRIPOD-AI guidelines. Preprocessing involved label encoding, SMOTE (synthetic minority over-sampling technique) to address class imbalance, and an 80:20 train-test split. Baseline models (random forest, XGBoost, LightGBM) were compared; LightGBM was selected for its superior performance (accuracy=0.8794). SHAP analysis identified key features and guided 3 engineering steps: selection of the top 15 features, construction of interaction-based features (e.g., chronic severity), and feature weighting based on SHAP values. Performance was evaluated using accuracy, precision, recall, and F1-score.
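The three SHAP-guided steps described in the Methods can be illustrated with a minimal sketch in plain Python. It assumes the per-sample SHAP values have already been computed elsewhere (for example by a tree explainer on the baseline model), ranks features by mean absolute SHAP value, keeps the top k (the study used 15), and derives normalized weights. The abstract does not say how the weights or the constructed features were defined, so the column scaling and the product interaction below, along with all function and feature names, are illustrative assumptions rather than the author's implementation.

```python
from statistics import fmean


def mean_abs_shap(shap_rows, feature_names):
    """Mean absolute SHAP value per feature (a common global-importance score)."""
    return {
        name: fmean(abs(row[j]) for row in shap_rows)
        for j, name in enumerate(feature_names)
    }


def select_top_k(importance, k):
    """Names of the k features with the largest importance, highest first."""
    return sorted(importance, key=importance.get, reverse=True)[:k]


def normalized_weights(importance, selected):
    """Importance of the selected features, rescaled to sum to 1."""
    total = sum(importance[name] for name in selected)
    return {name: importance[name] / total for name in selected}


def apply_weights(rows, feature_names, weights):
    """Keep only the weighted features and scale each column by its weight.

    Column scaling is an assumed weighting rule, not one stated in the abstract.
    Output columns follow the order of the keys in `weights`.
    """
    indexed = [(feature_names.index(name), w) for name, w in weights.items()]
    return [[row[j] * w for j, w in indexed] for row in rows]


def add_interaction(rows, feature_names, left, right, new_name):
    """Append a product feature left * right (hypothetical construction rule)."""
    i, j = feature_names.index(left), feature_names.index(right)
    new_rows = [row + [row[i] * row[j]] for row in rows]
    return new_rows, feature_names + [new_name]


# Toy inputs: three hypothetical features, three rows of precomputed SHAP values.
feature_names = ["feature_a", "feature_b", "feature_c"]
shap_rows = [[0.4, -0.1, 0.0], [-0.2, 0.1, 0.0], [0.6, -0.1, 0.0]]
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

importance = mean_abs_shap(shap_rows, feature_names)
top = select_top_k(importance, k=2)           # the study kept the top 15
weights = normalized_weights(importance, top)
X_weighted = apply_weights(X, feature_names, weights)
X_built, names_built = add_interaction(
    X, feature_names, "feature_a", "feature_b", "a_times_b"
)
```

In a real pipeline the toy `shap_rows` would be replaced by the SHAP matrix computed on the training split only, so that feature selection and weighting do not leak information from the test set.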

Results: Four LightGBM model configurations were evaluated: baseline (accuracy=0.8794, F1-score=0.8691), feature selection (accuracy=0.8968, F1-score=0.8860), feature construction (accuracy=0.8980, F1-score=0.8872), and feature weighting (accuracy=0.8986, F1-score=0.8877). SHAP-based engineering yielded performance improvements, with feature weighting achieving the highest precision (0.9940). Key features (e.g., red blood cell count and chronic severity) contributed to predictions while maintaining interpretability.
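The reported figures can be related to one another with a short calculation. The sketch below tabulates each configuration's gain over the baseline and inverts the F1 definition, F1 = 2PR / (P + R), to obtain the recall implied by the feature-weighting model's precision and F1-score. The recall value is derived here, not reported in the abstract, and the derivation assumes that precision and F1 refer to the same class and the same test split; the variable and function names are illustrative.

```python
def recall_from_precision_f1(precision, f1):
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall exists for this precision/F1 pair")
    return f1 * precision / denom


# (accuracy, F1-score) for each configuration, as reported in the abstract.
reported = {
    "baseline": (0.8794, 0.8691),
    "feature selection": (0.8968, 0.8860),
    "feature construction": (0.8980, 0.8872),
    "feature weighting": (0.8986, 0.8877),
}

base_acc, base_f1 = reported["baseline"]

# Absolute gain over the baseline, rounded to four decimals.
gains = {
    name: (round(acc - base_acc, 4), round(f1 - base_f1, 4))
    for name, (acc, f1) in reported.items()
    if name != "baseline"
}

# Precision 0.9940 is the value reported for the feature-weighting model.
implied_recall = recall_from_precision_f1(0.9940, reported["feature weighting"][1])

for name, (d_acc, d_f1) in gains.items():
    print(f"{name}: accuracy +{d_acc:.4f}, F1 +{d_f1:.4f}")
print(f"implied recall (feature weighting): {implied_recall:.4f}")
```

Under these assumptions the implied recall is about 0.80, which indicates that the feature-weighting model trades some sensitivity for its very high precision.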

Conclusion: The SHAP-based framework substantially improved the accuracy and transparency of appendix cancer predictions using LightGBM (F1-score=0.8877). This approach bridges the gap between predictive power and clinical interpretability, offering a scalable model for rare disease prediction. Future validation with real-world data is recommended to ensure generalizability.

Source journal

Ewha Medical Journal MEDICINE, GENERAL & INTERNAL-

Self-citation rate

33.30%

Publication count