Internal validation strategy for high dimensional prognosis model: A simulation study and application to transcriptomic in head and neck tumors.

IF 4.1 | Zone 2, Biology | Q2 BIOCHEMISTRY & MOLECULAR BIOLOGY

Computational and structural biotechnology journal Pub Date : 2025-09-03 eCollection Date: 2025-01-01 DOI:10.1016/j.csbj.2025.08.035

Antoine Dubray-Vautrin, Victor Gravrand, Grégoire Marret, Constance Lamy, Jerzy Klijanienko, Sophie Vacher, Ladidi Ahmanache, Maud Kamal, Olivier Choussy, Nicolas Servant, Célia Dupain, Christophe Le Tourneau, Jimmy Mullaert

Volume 27, pages 3792-3802. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12451366/pdf/


Abstract

Background: Predictive models using high-dimensional data, such as genomics and transcriptomics, are increasingly used in oncology for time-to-event endpoints. Internal validation of these models is crucial to mitigate optimism bias prior to external validation. Common strategies include train-test, bootstrap, and (nested) cross-validation. However, no benchmark exists for these methods in high-dimensional settings. We aimed to compare these strategies and provide recommendations in the field of transcriptomic analysis.

Method: A simulation study was conducted using data from the SCANDARE head and neck cohort (NCT03017573), which included n = 76 patients. Simulated datasets included clinical variables (age, sex, HPV status, TNM staging), transcriptomic data (15,000 transcripts), and disease-free survival, with a realistic cumulative baseline hazard. Sample sizes of 50, 75, 100, 500, and 1000 were simulated, with 100 replicates each. Cox penalized regression was performed for model selection, followed by train-test validation (70% training), bootstrap (100 iterations), 5-fold cross-validation, and nested cross-validation (5 × 5) to assess discriminative performance (time-dependent AUC and C-index) and calibration performance (3-year integrated Brier score).
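The 5 × 5 nested cross-validation named in the Method can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `kfold_indices` and `nested_cv` and the user-supplied callable `fit_score` are hypothetical, and the sketch assumes the inner loop is used to pick a penalty value while the outer loop estimates performance on patients never seen during that selection.

```python
import random
from statistics import mean


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv(n, candidates, fit_score, k_outer=5, k_inner=5, seed=0):
    """Mean outer-fold score of a nested cross-validation.

    fit_score(train_idx, test_idx, penalty) -> float is a hypothetical
    callable that fits a model on train_idx with the given penalty and
    returns a score on test_idx (higher is better, e.g. a C-index).
    """
    rng = random.Random(seed)
    outer_scores = []
    for test in kfold_indices(n, k_outer, rng):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]

        # Inner loop: choose the penalty using the outer training part only.
        inner_folds = kfold_indices(len(train), k_inner, rng)

        def inner_score(penalty):
            scores = []
            for fold in inner_folds:
                val = [train[j] for j in fold]
                val_set = set(val)
                fit = [i for i in train if i not in val_set]
                scores.append(fit_score(fit, val, penalty))
            return mean(scores)

        best = max(candidates, key=inner_score)

        # Outer loop: refit on the whole training part, score on held-out fold.
        outer_scores.append(fit_score(train, test, best))
    return mean(outer_scores)
```

Plain 5-fold cross-validation corresponds to the outer loop alone with a fixed penalty. Keeping the penalty search inside each outer training set is what prevents the held-out fold from influencing tuning.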

Results: Train-test validation showed unstable performance. Conventional bootstrap was over-optimistic, while the 0.632+ bootstrap was overly pessimistic, particularly with small samples (n = 50 to n = 100). The k-fold cross-validation and nested cross-validation improved performance with larger sample sizes, with k-fold cross-validation demonstrating greater stability. Nested cross-validation showed performance fluctuations depending on the regularization method used for model development.
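For readers unfamiliar with the 0.632+ bootstrap mentioned in the Results, the sketch below shows the standard Efron and Tibshirani weighting of the apparent error and the out-of-bag error. It is an illustration under stated assumptions, not the authors' code: the function name `bootstrap_632_plus` is hypothetical, inputs are expressed as error rates (lower is better), and the no-information error must be supplied by the caller. For a concordance measure one might, as an assumption, use error = 1 − C with a no-information value of 0.5.

```python
def bootstrap_632_plus(err_apparent, err_oob, err_noinfo):
    """0.632+ bootstrap estimate of prediction error.

    err_apparent: error of the model on its own training data.
    err_oob:      mean error on out-of-bag samples across bootstrap replicates.
    err_noinfo:   error expected when predictors carry no information.
    """
    # The out-of-bag error is capped at the no-information error.
    err_oob = min(err_oob, err_noinfo)

    # Relative overfitting rate R in [0, 1]; 0 when there is no overfitting.
    if err_oob > err_apparent and err_noinfo > err_apparent:
        r = (err_oob - err_apparent) / (err_noinfo - err_apparent)
    else:
        r = 0.0

    # The weight moves from 0.632 (R = 0) toward 1 (R = 1).
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * err_apparent + w * err_oob
```

When overfitting is severe, R approaches 1 and the estimate collapses onto the out-of-bag error, which is itself computed from models trained on fewer distinct patients. That mechanism is consistent with the pessimism reported for small samples.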

Conclusion: k-fold cross-validation and nested cross-validation are recommended for internal validation of Cox penalized models in high-dimensional time-to-event settings. These methods offer greater stability and reliability than train-test or bootstrap approaches, particularly when sample sizes are sufficient.


Source journal

Computational and structural biotechnology journal (Biochemistry, Genetics and Molecular Biology: Biophysics)

CiteScore: 9.30
Self-citation rate: 3.30%
Articles published: 540
Review time: 6 weeks

Journal description: Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a pre-requisite for publication in the journal. Specific areas of interest include, but are not limited to: structure and function of proteins, nucleic acids and other macromolecules; structure and function of multi-component complexes; protein folding, processing and degradation; enzymology; computational and structural studies of plant systems; microbial informatics; genomics; proteomics; metabolomics; algorithms and hypothesis in bioinformatics; mathematical and theoretical biology; computational chemistry and drug discovery; microscopy and molecular imaging; nanotechnology; systems and synthetic biology.