Machine learning models for predicting surgical intervention in colorectal cancer

Measurement and Evaluations in Cancer Care. Pub Date: 2025-07-12. DOI: 10.1016/j.ymecc.2025.100018

Felipe Mendes Delpino, Francisco Tustumi, Marina Martins Siqueira, Gabriely Rangel Pereira, Marcelo Passos Teivelis, Vanessa Damazio Teich, Sergio Eduardo Alonso Araujo, Lucas Hernandes Corrêa, Nelson Wolosker

Volume 3, Article 100018. Full text: https://www.sciencedirect.com/science/article/pii/S2949877525000061


Abstract

Aim

We aimed to develop and validate a machine learning (ML) model to predict surgical intervention in colorectal cancer (CRC) patients in the state of São Paulo, Brazil, using clinical and sociodemographic data as predictors.

Methods

We conducted a longitudinal analysis using data from the Fundação Oncocentro de São Paulo (FOSP) database, which included CRC cases diagnosed between 2000 and 2023. We defined the primary outcome as surgical intervention and analyzed 29 predictor variables, including clinical, demographic, and socioeconomic factors. We evaluated six ML algorithms (Random Forest, Gradient Boosting, LightGBM, CatBoost, Logistic Regression, and Decision Trees). The data were divided into training (70%) and test (30%) sets, and preprocessing steps were applied, including normalization, one-hot encoding, and handling of class imbalance. We assessed model performance using AUC-ROC, accuracy, precision, recall, F1-score, and specificity. SHAP was used to interpret variable importance.
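As an illustration of the evaluation metrics named above, the following minimal sketch computes them in plain Python. It is not the authors' code: the function names, the 0/1 outcome coding (1 = surgery), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for the example.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Threshold-based metrics from binary labels (assumed coding: 1 = surgery, 0 = no surgery)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def safe_div(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the FOSP database.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.65, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(metrics, round(auc, 3))
```

The pairwise AUC above is quadratic in the sample size, which is fine for a toy example; on a dataset of the size reported here, a rank-based implementation from an established library would be the practical choice.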

Results

The dataset comprised 72,038 participants, 17,852 in the group that did not undergo surgery and 54,186 in the group that did. The Random Forest model achieved the highest performance, with an AUC of 0.94, accuracy of 0.82, and F1-score of 0.87. Key predictors included treatment-related factors (e.g., time between diagnosis and treatment), tumor stage, age, and socioeconomic indicators (e.g., municipal human development index). Geographic accessibility, such as travel time to healthcare facilities, also significantly influenced predictions.
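For context on the class imbalance, the group sizes reported above imply a surgery prevalence of roughly 75%. The sketch below works through that arithmetic and derives the scores of a hypothetical baseline that predicts surgery for every patient. It assumes the evaluation set has the same class mix as the full dataset, which the abstract does not state, so the baseline figures are indicative only.

```python
# Group sizes as reported in the Results section.
n_no_surgery = 17_852
n_surgery = 54_186
n_total = n_no_surgery + n_surgery  # should match the reported 72,038

# Share of patients who underwent surgery (the majority class).
prevalence = n_surgery / n_total

# Hypothetical baseline (an assumption for illustration, not a model from
# the paper): always predict "surgery".
baseline_accuracy = prevalence      # every surgery case is counted correct
baseline_precision = prevalence     # TP / (TP + FP) equals the prevalence
baseline_recall = 1.0               # no surgery case is missed
baseline_specificity = 0.0          # no non-surgery case is identified
baseline_f1 = (
    2 * baseline_precision * baseline_recall
    / (baseline_precision + baseline_recall)
)

print(f"total = {n_total}")
print(f"prevalence = {prevalence:.3f}")
print(f"baseline accuracy = {baseline_accuracy:.3f}, F1 = {baseline_f1:.3f}")
```

Under that assumption, accuracy and F1 on their own move only modestly above the always-surgery baseline, which is one reason threshold-independent measures such as AUC, together with specificity, are informative for an imbalanced outcome like this one.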

Conclusion

This study demonstrates the potential of ML models, particularly Random Forest, to predict surgical necessity in CRC patients by integrating clinical and sociodemographic data.
