Modeling strategies for a flexible estimation of the crude cumulative incidence in the context of long follow-ups: model choice and predictive ability evaluation.
Giacomo Biganzoli, Giuseppe Marano, Patrizia Boracchi
{"title":"Modeling strategies for a flexible estimation of the crude cumulative incidence in the context of long follow-ups: model choice and predictive ability evaluation.","authors":"Giacomo Biganzoli, Giuseppe Marano, Patrizia Boracchi","doi":"10.1186/s12874-025-02650-x","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Advancements in treatments for chronic diseases, such as breast cancer, have expanded our ability to observe patient outcomes beyond disease-related mortality, including events like distant recurrences. However, competing events can complicate the interpretation of primary outcomes, making the crude cumulative incidence function the most reliable measure for accurate follow-up analysis. Long-term studies require flexible modeling to accommodate intricate, time-dependent effects and interactions among covariates. Traditional models, such as the proportional sub-distribution hazards model, often insufficient to address these complexities. Although more adaptable methods have been proposed, there is still a need to systematically assess model complexity, particularly for exploratory purposes. This article presents a statistical learning workflow designed to evaluate model complexity in crude cumulative incidence and introduces a time-dependent metric for predictive accuracy. This framework provides researchers with an enhanced toolkit for robustly addressing the complexities of long-term outcome modeling and deriving interpretable prognostic algorithms.</p><p><strong>Methods: </strong>We demonstrate our approach using data on time-to-distant breast cancer recurrences from the Milan 1 and Milan 3 trials, which have extensive follow-up periods. Two flexible modeling frameworks-pseudo-observations and sub-distribution hazard models-are employed, enhanced with spline functions to capture baseline hazard and risk. 
Our proposed workflow integrates graphical representations of Aalen-Johansen estimates for crude cumulative incidence, enabling researchers to visually hypothesize and adjust model complexity to match the studied phenomenon. Information criteria guide model selection to approximate the underlying data structure. Using bootstrapped data perturbations and time-dependent predictive accuracy measures, adjusted with Harrell's optimism correction, we identify the optimal model structure, balancing explainability, predictivity, and generalizability.</p><p><strong>Results: </strong>Our findings highlight the importance of data perturbation and validation through optimism-corrected predictive measures following the original data analysis. The initial model structure may differ from the most robust model identified through iterative perturbation. The ideal model has high robustness (most frequently selected in perturbations), strong explainability, and predictive capacity. When perturbation results are inconsistent, evaluating various time-dependent predictive measures offers additional insights, particularly regarding the trade-off between model complexity and predictive gains. In cases where predictive improvement is minimal, simpler and more explainable model structures are preferable.</p><p><strong>Conclusions: </strong>The proposed statistical learning workflow, informed by domain expertise, allows for incorporating clinically relevant complexities in the prognostic modeling of distant recurrences in breast cancer. Our results suggest that, in many cases, a nuanced and flexible model structure may better serve future predictions than simpler models. 
This approach underscores the value of balancing model simplicity and complexity to achieve meaningful, clinically useful insights.</p>","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":"25 1","pages":"217"},"PeriodicalIF":3.4000,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12481988/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Research Methodology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12874-025-02650-x","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Abstract
Background: Advancements in treatments for chronic diseases, such as breast cancer, have expanded our ability to observe patient outcomes beyond disease-related mortality, including events like distant recurrences. However, competing events can complicate the interpretation of primary outcomes, making the crude cumulative incidence function the most reliable measure for accurate follow-up analysis. Long-term studies require flexible modeling to accommodate intricate, time-dependent effects and interactions among covariates. Traditional models, such as the proportional sub-distribution hazards model, are often insufficient to address these complexities. Although more adaptable methods have been proposed, there is still a need to systematically assess model complexity, particularly for exploratory purposes. This article presents a statistical learning workflow designed to evaluate model complexity in crude cumulative incidence and introduces a time-dependent metric for predictive accuracy. This framework provides researchers with an enhanced toolkit for robustly addressing the complexities of long-term outcome modeling and deriving interpretable prognostic algorithms.
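For orientation, the quantities named in the Background can be written out using the standard competing-risks definitions. The notation is an assumption of this note, not taken from the article: T is the time to the first event, ε its type, S the overall event-free survival, λ_k the cause-specific hazard of event type k, and x a covariate vector.

```latex
\begin{align*}
F_k(t) &= \Pr(T \le t,\ \varepsilon = k) = \int_0^t S(u^-)\,\lambda_k(u)\,du, \\
\tilde{\lambda}_k(t) &= -\frac{d}{dt}\,\log\{1 - F_k(t)\}, \\
\tilde{\lambda}_k(t \mid x) &= \tilde{\lambda}_{k0}(t)\,\exp(\beta^\top x).
\end{align*}
```

The first line is the crude cumulative incidence, the second the sub-distribution hazard, and the third the proportional sub-distribution hazards model. In the third line β does not depend on t, which is the restriction that time-dependent effects relax.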
Methods: We demonstrate our approach using data on time to distant recurrence of breast cancer from the Milan 1 and Milan 3 trials, which have extensive follow-up periods. Two flexible modeling frameworks, pseudo-observations and sub-distribution hazard models, are employed, enhanced with spline functions to capture baseline hazard and risk. Our proposed workflow integrates graphical representations of Aalen-Johansen estimates for crude cumulative incidence, enabling researchers to visually hypothesize and adjust model complexity to match the studied phenomenon. Information criteria guide model selection to approximate the underlying data structure. Using bootstrapped data perturbations and time-dependent predictive accuracy measures, adjusted with Harrell's optimism correction, we identify the optimal model structure, balancing explainability, predictivity, and generalizability.
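As a rough illustration of two building blocks named in the Methods, the sketch below computes the Aalen-Johansen estimate of the crude cumulative incidence and the corresponding jackknife pseudo-observations at a single time point. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the event coding (0 = censored, positive integers = event type) and the toy data are all hypothetical, and the spline-based regression step is not shown.

```python
import numpy as np


def aalen_johansen_cif(time, event, cause=1, eval_times=None):
    """Aalen-Johansen estimate of the crude cumulative incidence of `cause`.

    time  : observed follow-up times
    event : 0 = censored, 1, 2, ... = type of the first observed event
            (hypothetical coding chosen for this sketch)
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    event_times = np.unique(time[event > 0])

    surv = 1.0  # overall event-free survival just before the current time
    cif = 0.0
    steps = []
    for t in event_times:
        at_risk = np.sum(time >= t)
        d_cause = np.sum((time == t) & (event == cause))
        d_any = np.sum((time == t) & (event > 0))
        cif += surv * d_cause / at_risk   # increment: S(t-) * d_k / n at risk
        surv *= 1.0 - d_any / at_risk     # Kaplan-Meier update for any event
        steps.append(cif)
    steps = np.array(steps)

    if eval_times is None:
        return event_times, steps
    eval_times = np.atleast_1d(np.asarray(eval_times, dtype=float))
    # Number of event times <= each evaluation time (right-continuous step).
    idx = np.searchsorted(event_times, eval_times, side="right")
    return eval_times, np.concatenate(([0.0], steps))[idx]


def pseudo_observations(time, event, t0, cause=1):
    """Jackknife pseudo-observations of the cumulative incidence at time t0."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    n = len(time)
    full = aalen_johansen_cif(time, event, cause, t0)[1][0]
    keep = np.ones(n, dtype=bool)
    pseudo = np.empty(n)
    for i in range(n):
        keep[i] = False  # leave subject i out
        loo = aalen_johansen_cif(time[keep], event[keep], cause, t0)[1][0]
        pseudo[i] = n * full - (n - 1) * loo
        keep[i] = True
    return pseudo


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    t = [2, 3, 3, 5, 7, 8]
    e = [1, 0, 2, 1, 0, 1]
    print(aalen_johansen_cif(t, e, cause=1, eval_times=[1, 5, 8]))
    print(pseudo_observations(t, e, t0=5, cause=1))
```

In a pseudo-observation analysis, values like these, computed on a grid of time points, would then serve as the response in a generalized linear model whose linear predictor could include spline terms. That regression step is omitted here.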
Results: Our findings highlight the importance of data perturbation and validation through optimism-corrected predictive measures following the original data analysis. The initial model structure may differ from the most robust model identified through iterative perturbation. The ideal model has high robustness (most frequently selected in perturbations), strong explainability, and predictive capacity. When perturbation results are inconsistent, evaluating various time-dependent predictive measures offers additional insights, particularly regarding the trade-off between model complexity and predictive gains. In cases where predictive improvement is minimal, simpler and more explainable model structures are preferable.
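The perturbation and validation logic described above can be sketched generically. The code below is an assumption-laden illustration rather than the article's procedure: `fit`, `score`, and the per-candidate criterion functions are hypothetical placeholders supplied by the user. `score` stands in for whichever predictive measure is used, with higher values meaning better prediction, and each criterion stands in for an information criterion such as AIC, with lower values meaning a better model.

```python
from collections import Counter

import numpy as np


def optimism_corrected(X, y, fit, score, n_boot=200, seed=0):
    """Harrell-style bootstrap optimism correction of an apparent score.

    fit(X, y) -> model; score(model, X, y) -> float (higher is better).
    Returns (corrected, apparent, mean optimism).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = score(fit(X, y), X, y)
    optimism = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample with replacement
        model_b = fit(X[idx], y[idx])
        # Score on the bootstrap sample minus score on the original data.
        optimism[b] = score(model_b, X[idx], y[idx]) - score(model_b, X, y)
    mean_opt = float(optimism.mean())
    return apparent - mean_opt, apparent, mean_opt


def selection_frequency(X, y, criteria, n_boot=200, seed=0):
    """Share of bootstrap samples in which each candidate has the lowest criterion.

    criteria: dict mapping a candidate name to a function (X, y) -> float.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    wins = Counter()
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        best = min(criteria, key=lambda name: criteria[name](X[idx], y[idx]))
        wins[best] += 1
    return {name: wins[name] / n_boot for name in criteria}


if __name__ == "__main__":
    # Toy stand-in: polynomial regression replaces the survival models.
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 80)
    y = np.sin(3 * x) + rng.normal(0, 0.3, 80)

    def neg_mse(coef, X, Y):
        return -np.mean((Y - np.polyval(coef, X)) ** 2)

    def gaussian_aic(deg):
        def ic(X, Y):
            rss = np.sum((Y - np.polyval(np.polyfit(X, Y, deg), X)) ** 2)
            return len(Y) * np.log(rss / len(Y)) + 2 * (deg + 1)
        return ic

    print(selection_frequency(x, y, {f"degree {d}": gaussian_aic(d) for d in (1, 2, 3)}))
    print(optimism_corrected(x, y, lambda X, Y: np.polyfit(X, Y, 2), neg_mse))
```

In this sketch, a candidate structure that wins in most bootstrap samples corresponds to the "most frequently selected in perturbations" notion of robustness. When two candidates have similar optimism-corrected scores, the Results recommend the simpler, more explainable one.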
Conclusions: The proposed statistical learning workflow, informed by domain expertise, allows for incorporating clinically relevant complexities in the prognostic modeling of distant recurrences in breast cancer. Our results suggest that, in many cases, a nuanced and flexible model structure may better serve future predictions than simpler models. This approach underscores the value of balancing model simplicity and complexity to achieve meaningful, clinically useful insights.
About the journal:
BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles in methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.