Khaled Toffaha, Mecit Can Emre Simsekler, Aamna Al Shehhi, Andrei Sleptchenko, Aydah AlAwadhi
Title: Comprehensive survival analysis of breast cancer patients: a Bayesian network approach.
Journal: BMC Medical Informatics and Decision Making, 25(1):349 (impact factor 3.8; JCR Q2, Medical Informatics)
Published: 2025-09-29
DOI: 10.1186/s12911-025-03197-z (https://doi.org/10.1186/s12911-025-03197-z)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12482897/pdf/
Abstract
Background: Breast cancer is recognized as one of the leading causes of cancer-related deaths globally. A deeper understanding of the complex interactions between clinical, pathological, and treatment-related factors is essential for improving patient outcomes.
Methods: Following comprehensive data cleaning and preprocessing, an analysis was performed on a cohort of 1,980 primary breast cancer samples from the METABRIC database. The dataset was divided into a 75/25 training-testing split, and five-fold cross-validation was applied to the training set to mitigate overfitting. Overall and relapse-free survival were then modeled using four fully parametric distributions: Weibull, Exponential, Log-Normal, and Log-Logistic, along with their corresponding Accelerated Failure Time (AFT) forms, to identify significant prognostic features. Competing models were ranked by the Akaike Information Criterion (AIC) and further validated through Quantile-Quantile (QQ) plots. Finally, the probabilistic relationships among the significant factors selected by the optimal AFT models were explored using a Bayesian Belief Network (BBN), whose structure was learned from the training data using multiple score-based algorithms and refined through expert-driven judgment; all conditional probability parameters were estimated using maximum likelihood.
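The AIC-based ranking described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: it fits only intercept-only Exponential and Weibull models (no covariates, no Log-Normal or Log-Logistic forms) to a small made-up right-censored sample, using only the Python standard library. The data, the function names, and the golden-section search over the Weibull shape are all assumptions made for illustration.

```python
import math

# Hypothetical toy data (NOT from METABRIC): follow-up times and
# event indicators (1 = event observed, 0 = right-censored).
times = [5, 8, 12, 15, 20, 22, 25, 30, 34, 40]
events = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

def exponential_loglik(times, events):
    """Maximised log-likelihood of an Exponential model with right-censoring."""
    n_events = sum(events)
    total_time = sum(times)
    rate = n_events / total_time  # closed-form maximum likelihood estimate
    return n_events * math.log(rate) - rate * total_time

def weibull_profile_loglik(times, events, shape):
    """Weibull log-likelihood at a given shape, with the scale maximised out."""
    n_events = sum(events)
    # For a fixed shape k, the MLE of scale**k is sum(t**k) / n_events.
    scale_pow = sum(t ** shape for t in times) / n_events
    log_t_events = sum(math.log(t) for t, e in zip(times, events) if e)
    return (n_events * math.log(shape) - n_events * math.log(scale_pow)
            + (shape - 1) * log_t_events - n_events)

def fit_weibull(times, events, lo=0.05, hi=20.0, iters=200):
    """Golden-section search for the shape that maximises the profile likelihood."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        x1 = b - phi * (b - a)
        x2 = a + phi * (b - a)
        if weibull_profile_loglik(times, events, x1) < weibull_profile_loglik(times, events, x2):
            a = x1  # maximum lies to the right of x1
        else:
            b = x2  # maximum lies to the left of x2
    shape = (a + b) / 2
    return shape, weibull_profile_loglik(times, events, shape)

def aic(loglik, n_params):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * loglik

exp_ll = exponential_loglik(times, events)
wb_shape, wb_ll = fit_weibull(times, events)

# Rank candidate models by AIC (Exponential: 1 parameter, Weibull: 2).
ranking = sorted(
    [("Exponential", aic(exp_ll, 1)), ("Weibull", aic(wb_ll, 2))],
    key=lambda item: item[1],
)
for name, value in ranking:
    print(f"{name}: AIC = {value:.2f}")
```

Because the Exponential model is the Weibull model with shape 1, the Weibull log-likelihood can never be lower; AIC then asks whether the improvement justifies the extra parameter. A full analysis along the lines of the abstract would add covariates and the remaining distributions, typically with a dedicated survival-analysis library.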
Results: The Weibull model provided the best fit for overall survival, whereas the Log-Normal form was optimal for relapse-free survival, each satisfying its respective error-distribution diagnostics. In the hold-out test set, the Bayesian network achieved an Area Under the Curve (AUC) of 0.880 and an F1-score of 0.779. Age at diagnosis, menopausal status, tumor stage, lymph-node burden, and treatment modality were identified as the most influential predictors, and the learned network clarified their direct and mediated effects on both survival endpoints.
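For readers unfamiliar with the two reported metrics, the sketch below shows how AUC and F1-score are computed from predicted probabilities. The labels, scores, and the 0.5 decision threshold are hypothetical; the abstract does not state the threshold the authors used, and this is not their evaluation code.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical test-set labels and predicted event probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

# Assumed decision threshold of 0.5 for turning probabilities into classes.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

auc_value = roc_auc(y_true, scores)
f1_value = f1_score(y_true, y_pred)
print(f"AUC = {auc_value:.3f}, F1 = {f1_value:.3f}")
```

AUC depends only on how the scores rank positives against negatives, whereas F1 depends on the chosen threshold, so the two figures answer different questions about the same classifier.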
Conclusion: Through the integration of validated parametric survival models with a data-driven BBN, this study delivers a comprehensive framework for estimating individualized survival probabilities and visualizing the complex probabilistic relationships that characterize high-risk cancer patient profiles. This approach supports evidence-based, personalized breast cancer management and demonstrates the potential for guiding clinical decision-making and adapting to diverse external patient cohorts.
About the journal:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.