Modeling strategies for a flexible estimation of the crude cumulative incidence in the context of long follow-ups: model choice and predictive ability evaluation.

IF 3.4 · CAS Zone 3 (Medicine) · JCR Q1 HEALTH CARE SCIENCES & SERVICES

BMC Medical Research Methodology · Pub Date: 2025-09-29 · DOI: 10.1186/s12874-025-02650-x

Giacomo Biganzoli, Giuseppe Marano, Patrizia Boracchi

Citation: BMC Medical Research Methodology 25(1):217. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12481988/pdf/

Citations: 0

Abstract

Background: Advances in treatments for chronic diseases, such as breast cancer, have expanded our ability to observe patient outcomes beyond disease-related mortality, including events like distant recurrences. However, competing events can complicate the interpretation of primary outcomes, making the crude cumulative incidence function the most reliable measure for accurate follow-up analysis. Long-term studies require flexible modeling to accommodate intricate, time-dependent effects and interactions among covariates. Traditional models, such as the proportional sub-distribution hazards model, are often insufficient to address these complexities. Although more adaptable methods have been proposed, there is still a need to systematically assess model complexity, particularly for exploratory purposes. This article presents a statistical learning workflow designed to evaluate model complexity in crude cumulative incidence and introduces a time-dependent metric for predictive accuracy. This framework provides researchers with an enhanced toolkit for robustly addressing the complexities of long-term outcome modeling and deriving interpretable prognostic algorithms.
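For reference, the quantities named in the Background are usually written as follows. The notation is assumed here for illustration and is not taken from the article: $T$ is the time to the first event, $D$ its type among $K$ competing types, $\lambda_k$ the cause-specific hazard, and $x$ a covariate vector.

```latex
\begin{align}
F_k(t) &= \Pr(T \le t,\ D = k) = \int_0^t S(u)\,\lambda_k(u)\,du,
  & S(t) &= \exp\Big(-\sum_{j=1}^{K}\int_0^t \lambda_j(u)\,du\Big), \\
\tilde{\lambda}_k(t) &= -\frac{d}{dt}\log\{1 - F_k(t)\},
  & \tilde{\lambda}_k(t \mid x) &= \tilde{\lambda}_{k0}(t)\,\exp(\beta^{\top} x).
\end{align}
```

The first line is the crude cumulative incidence of event type $k$. The second line gives the sub-distribution hazard and its proportional form, in which the covariate effect $\beta$ is constant over time. That constancy is the restriction that flexible, time-dependent specifications are meant to relax.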

Methods: We demonstrate our approach using data on time to distant breast cancer recurrence from the Milan 1 and Milan 3 trials, which have extensive follow-up periods. Two flexible modeling frameworks, pseudo-observations and sub-distribution hazard models, are employed, enhanced with spline functions to capture baseline hazard and risk. Our proposed workflow integrates graphical representations of Aalen-Johansen estimates for crude cumulative incidence, enabling researchers to visually hypothesize and adjust model complexity to match the studied phenomenon. Information criteria guide model selection to approximate the underlying data structure. Using bootstrapped data perturbations and time-dependent predictive accuracy measures, adjusted with Harrell's optimism correction, we identify the optimal model structure, balancing explainability, predictivity, and generalizability.
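The Aalen-Johansen estimate and the pseudo-observations named in the Methods can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the status coding (0 = censored, 1, 2, ... = event type), and the toy data are hypothetical. The pseudo-observations use the usual leave-one-out (jackknife) construction, and the spline-based regression step is omitted.

```python
import numpy as np

def aalen_johansen_cif(time, status, cause, eval_times):
    """Nonparametric crude cumulative incidence of `cause` at each eval time.

    Assumed coding: status 0 = censored, 1, 2, ... = event type.
    """
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    eval_times = np.asarray(eval_times, dtype=float)
    event_times = np.unique(time[status > 0])
    surv = 1.0   # probability of being free of any event just before t
    cif = 0.0
    steps = [0.0]
    for t in event_times:
        at_risk = np.sum(time >= t)
        d_cause = np.sum((time == t) & (status == cause))
        d_any = np.sum((time == t) & (status > 0))
        cif += surv * d_cause / at_risk      # increment for the cause of interest
        surv *= 1.0 - d_any / at_risk        # update all-cause survival
        steps.append(cif)
    # Number of event times <= each eval time indexes the step function.
    idx = np.searchsorted(event_times, eval_times, side="right")
    return np.array(steps)[idx]

def pseudo_observations(time, status, cause, eval_times):
    """Jackknife pseudo-observations: n * F_hat - (n - 1) * F_hat without subject i."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    n = len(time)
    full = aalen_johansen_cif(time, status, cause, eval_times)
    pseudo = np.empty((n, len(eval_times)))
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        keep[i] = False
        loo = aalen_johansen_cif(time[keep], status[keep], cause, eval_times)
        pseudo[i] = n * full - (n - 1) * loo
        keep[i] = True
    return pseudo

# Toy data (hypothetical): five subjects, two event types, two censored.
toy_time = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_status = np.array([1, 0, 2, 1, 0])
print(aalen_johansen_cif(toy_time, toy_status, cause=1, eval_times=[1.0, 4.0]))
print(pseudo_observations(toy_time, toy_status, cause=1, eval_times=[4.0]).ravel())
```

The pseudo-observations can then serve as the response in a regression model with a suitable link function. Without censoring, they reduce to the indicator that subject i had the event of interest by time t.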

Results: Our findings highlight the importance of data perturbation and validation through optimism-corrected predictive measures following the original data analysis. The initial model structure may differ from the most robust model identified through iterative perturbation. The ideal model has high robustness (most frequently selected in perturbations), strong explainability, and predictive capacity. When perturbation results are inconsistent, evaluating various time-dependent predictive measures offers additional insights, particularly regarding the trade-off between model complexity and predictive gains. In cases where predictive improvement is minimal, simpler and more explainable model structures are preferable.
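The perturbation logic described above (repeat model selection on bootstrap resamples, count how often each structure is chosen, and correct the apparent performance for optimism) can be sketched generically. This sketch makes strong simplifying assumptions and is not the authors' workflow: polynomial least-squares fits of different degrees stand in for the competing-risks model structures, AIC stands in for the information criteria, and mean squared error stands in for the time-dependent predictive accuracy measures. All function names and the simulated data are hypothetical.

```python
import numpy as np
from collections import Counter

def fit_poly(x, y, degree):
    return np.polyfit(x, y, degree)

def mse(coef, x, y):
    return float(np.mean((y - np.polyval(coef, x)) ** 2))

def aic(coef, x, y):
    # Gaussian AIC up to a constant; +1 counts the error variance.
    n = len(y)
    return n * np.log(mse(coef, x, y)) + 2 * (len(coef) + 1)

def select_degree(x, y, degrees):
    return min(degrees, key=lambda d: aic(fit_poly(x, y, d), x, y))

def perturb_and_validate(x, y, degrees=(1, 2, 3), n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    chosen = select_degree(x, y, degrees)            # structure on original data
    apparent = mse(fit_poly(x, y, chosen), x, y)     # apparent (optimistic) error
    counts = Counter()
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)             # bootstrap resample
        xb, yb = x[idx], y[idx]
        d_b = select_degree(xb, yb, degrees)         # redo the selection step
        counts[d_b] += 1
        coef_b = fit_poly(xb, yb, d_b)
        # Optimism: error on the original data minus error on the resample.
        optimism.append(mse(coef_b, x, y) - mse(coef_b, xb, yb))
    corrected = apparent + float(np.mean(optimism))  # Harrell-style correction
    return chosen, counts, apparent, corrected

# Simulated data (hypothetical): a quadratic signal with noise.
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, 80)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(0.0, 0.5, 80)
chosen, counts, apparent, corrected = perturb_and_validate(x, y, n_boot=100)
print("selected on original data:", chosen)
print("selection frequency:", dict(counts))
print("apparent error:", apparent, "optimism-corrected error:", corrected)
```

Because the score here is an error (lower is better), the average optimism is added to the apparent value. For a measure where higher is better, it would be subtracted. Comparing `chosen` with the most frequent entry in `counts` mirrors the point that the initial structure may differ from the most robust one.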

Conclusions: The proposed statistical learning workflow, informed by domain expertise, allows for incorporating clinically relevant complexities in the prognostic modeling of distant recurrences in breast cancer. Our results suggest that, in many cases, a nuanced and flexible model structure may better serve future predictions than simpler models. This approach underscores the value of balancing model simplicity and complexity to achieve meaningful, clinically useful insights.




Source journal

BMC Medical Research Methodology (Medicine: Health Care)

CiteScore: 6.50

Self-citation rate: 2.50%

Articles published: 298

Review time: 3-8 weeks

About the journal: BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles in methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.