Explainable Machine Learning Models for Colorectal Cancer Prediction Using Clinical Laboratory Data.

Impact factor 2.5 · CAS Zone 4 (Medicine) · JCR Q3 (Oncology)

Cancer Control · Pub Date: 2025-01-01 · Epub Date: 2025-05-07 · DOI: 10.1177/10732748251336417

Rui Li, Xiaoyan Hao, Yanjun Diao, Liu Yang, Jiayun Liu

Cancer Control, volume 32, article 10732748251336417. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12062600/pdf/

Citations: 0

Abstract

Introduction: Early diagnosis of colorectal cancer (CRC) poses a significant clinical challenge. This study aims to develop machine learning (ML) models for CRC risk prediction using clinical laboratory data.

Methods: This retrospective, single-center study analyzed laboratory examination data from healthy controls (HC), polyp patients (Polyp), and CRC patients collected between 2013 and 2023. Five ML algorithms (adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), decision tree (DT), logistic regression (LR), and random forest (RF)) were applied to three classification tasks: HC vs Polyp vs CRC, HC vs CRC, and Polyp vs CRC.

Results: This study included 31 539 subjects: 11 793 HCs, 10 125 polyp patients, and 9621 CRC patients. The XGBoost model achieved the highest AUCs, 0.966 for differentiating HC from CRC and 0.881 for differentiating Polyp from CRC, outperforming carcinoembryonic antigen (CEA) and fecal occult blood testing (FOBT). This model could also identify CEA-negative or FOBT-negative CRC patients. Incorporating stool miR-92a detection into the model further improved diagnostic performance. Shapley additive explanations (SHAP) plots indicated that FOBT, CEA, lymphocyte percentage (LYMPH%), and hematocrit (HCT) were the features contributing most to CRC diagnosis. Additionally, a computational tool for predicting CRC risk based on the optimal model was developed, designed for researchers with programming experience.

Conclusion: Five ML models for CRC diagnosis, based on ten routine laboratory test items, were developed and achieved higher diagnostic accuracy than traditional CRC biomarkers. The diagnostic capabilities of these ML models can be further enhanced by including stool miR-92a levels.
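The abstract compares models and single biomarkers by AUC (area under the ROC curve). As a minimal sketch of how that metric can be computed, the snippet below uses the rank-sum identity: AUC is the probability that a randomly chosen case scores higher than a randomly chosen control. Everything in it is an assumption for illustration: the function names `auc`, `simulate`, and `demo`, the two synthetic markers, and the simple sum score. It uses no study data and is not the authors' XGBoost pipeline.

```python
import random


def auc(scores, labels):
    """AUC via the Mann-Whitney identity.

    Returns the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def simulate(n_per_class=100, seed=0):
    """Generate two hypothetical, independent markers (synthetic data only).

    Cases (label 1) have each marker shifted upward by one standard
    deviation relative to controls (label 0).
    """
    rng = random.Random(seed)
    marker_a, marker_b, labels = [], [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            marker_a.append(rng.gauss(float(label), 1.0))
            marker_b.append(rng.gauss(float(label), 1.0))
            labels.append(label)
    return marker_a, marker_b, labels


def demo():
    """Print the AUC of each marker alone and of their unweighted sum."""
    a, b, y = simulate()
    combined = [x + z for x, z in zip(a, b)]  # stand-in for a fitted model score
    for name, scores in (("marker A", a), ("marker B", b), ("A + B", combined)):
        print(f"{name}: AUC = {auc(scores, y):.3f}")


if __name__ == "__main__":
    demo()
```

A score that pools several weakly informative markers will often separate the groups better than any one marker, which is the general motivation for multi-feature models of the kind described above. The pairwise loop is quadratic in sample size; for a cohort the size of this study, a sort-based rank computation would be the more practical choice.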

Source journal: Cancer Control (Oncology)

CiteScore: 3.80
Self-citation rate: 0.00%
Articles published: 148
Review time: >12 weeks

About the journal: Cancer Control is a JCR-ranked, peer-reviewed open access journal whose mission is to advance the prevention, detection, diagnosis, treatment, and palliative care of cancer by enabling researchers, doctors, policymakers, and other healthcare professionals to freely share research along the cancer control continuum. Our vision is a world where gold-standard cancer care is the norm, not the exception.