Internal validation strategy for high dimensional prognosis model: A simulation study and application to transcriptomic in head and neck tumors.

IF 4.1 · CAS Zone 2 (Biology) · JCR Q2 · BIOCHEMISTRY & MOLECULAR BIOLOGY
Computational and structural biotechnology journal Pub Date : 2025-09-03 eCollection Date: 2025-01-01 DOI:10.1016/j.csbj.2025.08.035
Antoine Dubray-Vautrin, Victor Gravrand, Grégoire Marret, Constance Lamy, Jerzy Klijanienko, Sophie Vacher, Ladidi Ahmanache, Maud Kamal, Olivier Choussy, Nicolas Servant, Célia Dupain, Christophe Le Tourneau, Jimmy Mullaert
Volume 27, pages 3792-3802. Publication type: Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12451366/pdf/

Abstract

Background: Predictive models using high-dimensional data, such as genomics and transcriptomics, are increasingly used in oncology for time-to-event endpoints. Internal validation of these models is crucial to mitigate optimism bias prior to external validation. Common strategies include train-test, bootstrap, and (nested) cross-validation. However, no benchmark exists for these methods in high-dimensional settings. We aimed to compare these strategies and provide recommendations in the field of transcriptomic analysis.

Method: A simulation study was conducted using data from the SCANDARE head and neck cohort (NCT03017573), which included n = 76 patients. Simulated datasets included clinical variables (age, sex, HPV status, TNM staging), transcriptomic data (15,000 transcripts), and disease-free survival, with a realistic cumulative baseline hazard. Sample sizes of 50, 75, 100, 500, and 1000 were simulated, with 100 replicates each. Penalized Cox regression was performed for model selection, followed by train-test validation (70% training), bootstrap (100 iterations), 5-fold cross-validation, and nested cross-validation (5 × 5) to assess discriminative performance (time-dependent AUC and C-index) and calibration performance (3-year integrated Brier score).
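The 5-fold cross-validation of a discrimination measure described above can be sketched as follows. This is a minimal illustration in pure Python, not the authors' code: the helper names (`c_index`, `k_fold_indices`, `cross_validated_c_index`, `fit_best_feature`, `predict_risk`) are hypothetical, and the toy "model" simply keeps the single feature with the best training C-index as a stand-in for penalized Cox regression. The point it shows is that any data-driven selection step has to be repeated inside each training fold, so that the held-out fold never influences the model it scores.

```python
import math
import random

def c_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i had an observed event
    and a shorter time than subject j. Ties in risk count as 0.5.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cross_validated_c_index(X, times, events, fit, predict, k=5, seed=0):
    """Average held-out C-index over k folds; the model is refit per fold."""
    scores = []
    for test_idx in k_fold_indices(len(times), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(times)) if i not in held_out]
        model = fit([X[i] for i in train_idx],
                    [times[i] for i in train_idx],
                    [events[i] for i in train_idx])
        risks = predict(model, [X[i] for i in test_idx])
        scores.append(c_index([times[i] for i in test_idx],
                              [events[i] for i in test_idx], risks))
    return sum(scores) / len(scores)

def fit_best_feature(X, times, events):
    """Toy stand-in for model fitting (NOT a penalized Cox model):
    keep the feature and sign with the highest training C-index."""
    best = (0, 1.0, -1.0)  # (feature index, sign, training C-index)
    for j in range(len(X[0])):
        for sign in (1.0, -1.0):
            c = c_index(times, events, [sign * row[j] for row in X])
            if c > best[2]:
                best = (j, sign, c)
    return best

def predict_risk(model, X):
    j, sign, _ = model
    return [sign * row[j] for row in X]

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical toy data: 60 subjects, 20 features, only feature 0 matters.
    X = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(60)]
    times = [rng.expovariate(math.exp(row[0])) for row in X]
    events = [rng.random() < 0.7 for _ in X]
    print(cross_validated_c_index(X, times, events,
                                  fit_best_feature, predict_risk))
```

With small folds a held-out set may contain no comparable pair, in which case `c_index` returns NaN; a real analysis would need to handle that case explicitly.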

Results: Train-test validation showed unstable performance. Conventional bootstrap was over-optimistic, while the 0.632+ bootstrap was overly pessimistic, particularly with small samples (n = 50 to n = 100). K-fold cross-validation and nested cross-validation improved performance with larger sample sizes, with k-fold cross-validation demonstrating greater stability. Nested cross-validation showed performance fluctuations depending on the regularization method used for model development.
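For readers unfamiliar with the 0.632+ bootstrap, the sketch below shows one common formulation of the Efron-Tibshirani estimator for an error-type measure (lower is better). The abstract does not give the formula used in the study, so this is an assumption about the general technique rather than a description of the authors' implementation; the function names and the clipping of the out-of-bag value at the no-information level are choices made here for illustration. For a concordance measure, one would typically work with 1 minus the C-index and a no-information value of 0.5, which is also an assumption.

```python
def bootstrap_632(apparent_err, oob_err):
    """Plain 0.632 estimator: fixed blend of apparent and out-of-bag error."""
    return 0.368 * apparent_err + 0.632 * oob_err

def bootstrap_632_plus(apparent_err, oob_err, no_info_err):
    """0.632+ estimator: the weight on the out-of-bag error grows with
    the relative overfitting rate r, from 0.632 (r = 0) up to 1 (r = 1).

    apparent_err -- error of the model on its own training data
    oob_err      -- mean error on subjects left out of each bootstrap sample
    no_info_err  -- error expected if predictors and outcome were unrelated
    """
    # Assumption: cap the out-of-bag error at the no-information level.
    oob = min(oob_err, no_info_err)
    if no_info_err > apparent_err and oob > apparent_err:
        r = (oob - apparent_err) / (no_info_err - apparent_err)
    else:
        r = 0.0  # no measurable overfitting: fall back to the 0.632 weight
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * apparent_err + w * oob

if __name__ == "__main__":
    # Hypothetical numbers, purely for illustration.
    print(bootstrap_632(0.10, 0.30))
    print(bootstrap_632_plus(0.10, 0.30, 0.50))
```

Because the weight moves toward the out-of-bag error as overfitting increases, the estimate leans on models trained on fewer distinct subjects, which is one plausible reason for pessimism when n is small.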

Conclusion: K-fold cross-validation and nested cross-validation are recommended for internal validation of penalized Cox models in high-dimensional time-to-event settings. These methods offer greater stability and reliability than train-test or bootstrap approaches, particularly when sample sizes are sufficient.
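The structure of nested cross-validation (5 × 5 in the study) can be sketched as below: an inner loop picks the tuning parameter using only the outer training data, and the outer loop scores the refitted model on data that played no part in tuning. This is a generic skeleton under stated assumptions, not the authors' pipeline: `nested_cv`, `k_fold`, and the `fit`/`score` callables are hypothetical, and the demo uses a one-dimensional ridge slope as a stand-in for a penalized Cox model and its penalty strength.

```python
import random

def k_fold(indices, k, seed=0):
    """Shuffle the given indices and deal them into k disjoint folds."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def nested_cv(n, grid, fit, score, outer_k=5, inner_k=5, seed=0):
    """Return one held-out score per outer fold.

    fit(train_idx, param) -> model
    score(model, test_idx) -> float, higher is better
    """
    outer_scores = []
    for outer_test in k_fold(range(n), outer_k, seed):
        held_out = set(outer_test)
        outer_train = [i for i in range(n) if i not in held_out]
        inner_folds = k_fold(outer_train, inner_k, seed + 1)

        def inner_score(param):
            # Mean inner-fold score for one candidate tuning parameter.
            total = 0.0
            for inner_test in inner_folds:
                inner_held = set(inner_test)
                inner_train = [i for i in outer_train if i not in inner_held]
                total += score(fit(inner_train, param), inner_test)
            return total / inner_k

        best_param = max(grid, key=inner_score)  # tuned on outer_train only
        model = fit(outer_train, best_param)     # refit on all of outer_train
        outer_scores.append(score(model, outer_test))
    return outer_scores

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: y = 2x + noise.
    x = [rng.gauss(0, 1) for _ in range(100)]
    y = [2 * v + rng.gauss(0, 1) for v in x]

    def fit(train_idx, lam):
        # Ridge slope through the origin; lam is the penalty strength.
        sxy = sum(x[i] * y[i] for i in train_idx)
        sxx = sum(x[i] * x[i] for i in train_idx)
        return sxy / (sxx + lam)

    def score(beta, test_idx):
        # Negative mean squared error, so that higher is better.
        return -sum((y[i] - beta * x[i]) ** 2 for i in test_idx) / len(test_idx)

    print(nested_cv(100, [0.0, 1.0, 10.0, 100.0], fit, score))
```

Plain k-fold cross-validation corresponds to the outer loop alone with a fixed tuning parameter.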
