Machine learning models for predicting surgical intervention in colorectal cancer
Felipe Mendes Delpino, Francisco Tustumi, Marina Martins Siqueira, Gabriely Rangel Pereira, Marcelo Passos Teivelis, Vanessa Damazio Teich, Sergio Eduardo Alonso Araujo, Lucas Hernandes Corrêa, Nelson Wolosker
Measurement and Evaluations in Cancer Care, volume 3, Article 100018. Published 2025-07-12. DOI: 10.1016/j.ymecc.2025.100018
https://www.sciencedirect.com/science/article/pii/S2949877525000061
Abstract
Aim
We aimed to develop and validate a machine learning (ML) model to predict surgical intervention in colorectal cancer (CRC) patients in the state of São Paulo, Brazil, using clinical and sociodemographic data as predictors.
Methods
We conducted a longitudinal analysis using data from the Fundação Oncocentro de São Paulo (FOSP) database, which included CRC cases diagnosed between 2000 and 2023. We defined the primary outcome as surgical intervention and analyzed 29 predictor variables, including clinical, demographic, and socioeconomic factors. We evaluated six ML algorithms (Random Forest, Gradient Boosting, LightGBM, CatBoost, Logistic Regression, and Decision Trees). The data were divided into training (70%) and test (30%) sets, and preprocessing included normalization, one-hot encoding, and correction for class imbalance. We assessed model performance using AUC-ROC, accuracy, precision, recall, F1-score, and specificity, and used SHAP (SHapley Additive exPlanations) to interpret variable importance.
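The evaluation metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the helper names (`count_outcomes`, `classification_metrics`, `auc_roc`), the toy labels and scores, and the 0.5 decision threshold are all assumptions introduced here, with surgery coded as the positive class (1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # surgery predicted, surgery observed
    fp: int  # surgery predicted, no surgery observed
    tn: int  # no surgery predicted, no surgery observed
    fn: int  # no surgery predicted, surgery observed


def count_outcomes(y_true, y_pred):
    """Tally the four confusion-matrix cells for binary labels (1 = surgery)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is empty."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(c):
    """Threshold-dependent metrics derived from the confusion counts."""
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    return {
        "accuracy": safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(c.tn, c.tn + c.fp),
    }


def auc_roc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg), which is
    fine for a toy example but too slow for a registry-sized dataset.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): observed outcomes and model-predicted probabilities.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(count_outcomes(y_true, y_pred))
auc = auc_roc(y_true, scores)
```

Accuracy, precision, recall, F1-score, and specificity all depend on the chosen threshold, whereas AUC-ROC summarizes ranking quality across all thresholds, which is one reason the two kinds of metric are usually reported together.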
Results
The dataset comprised 72,038 participants, 17,852 in the group that did not undergo surgery and 54,186 in the group that did. The Random Forest model achieved the highest performance, with an AUC of 0.94, accuracy of 0.82, and F1-score of 0.87. Key predictors included treatment-related factors (e.g., time between diagnosis and treatment), tumor stage, age, and socioeconomic indicators (e.g., municipal human development index). Geographic accessibility, such as travel time to healthcare facilities, also significantly influenced predictions.
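The class counts above allow a short piece of context arithmetic. The sketch below is derived only from the two reported group sizes; the majority-class baseline and the inverse-frequency class weights are illustrative assumptions, not quantities reported by the authors, and the abstract does not state which imbalance correction was used or how the positive class was coded.

```python
# Group sizes as reported in the abstract.
n_no_surgery = 17_852
n_surgery = 54_186
n_total = n_no_surgery + n_surgery  # should equal the reported 72,038

# Share of patients who underwent surgery (treated as the positive class here).
prevalence = n_surgery / n_total

# Hypothetical baseline: a classifier that always predicts "surgery".
baseline_accuracy = prevalence
baseline_precision = prevalence
baseline_recall = 1.0
baseline_f1 = (
    2 * baseline_precision * baseline_recall
    / (baseline_precision + baseline_recall)
)

# One common way to address class imbalance (an assumption, not the paper's
# stated method): weight each class by n_total / (n_classes * n_class).
n_classes = 2
class_weights = {
    "no_surgery": n_total / (n_classes * n_no_surgery),
    "surgery": n_total / (n_classes * n_surgery),
}

print(f"prevalence of surgery: {prevalence:.3f}")
print(f"always-surgery baseline accuracy: {baseline_accuracy:.3f}")
print(f"always-surgery baseline F1: {baseline_f1:.3f}")
print(f"class weights: {class_weights}")
```

About three in four patients in the dataset underwent surgery, so threshold-dependent metrics such as accuracy and F1-score are best read against this kind of baseline, alongside the AUC and specificity.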
Conclusion
This study demonstrates the potential of ML models, particularly Random Forest, to predict surgical necessity in CRC patients by integrating clinical and sociodemographic data.