Finding the Optimal Number of Splits and Repetitions in Double Cross-Fitting Targeted Maximum Likelihood Estimators.

IF 1.4 · Zone 4, Medicine · Q4 PHARMACOLOGY & PHARMACY

Pharmaceutical Statistics Pub Date : 2025-09-01 DOI:10.1002/pst.70022

Mohammad Ehsanul Karim, Momenul Haque Mondol

Pharmaceutical Statistics, volume 24, issue 5, article e70022. Publication type: Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12425639/pdf/

Citations: 0

Abstract

Flexible machine learning algorithms are increasingly used in real-world data analyses. When integrated within doubly robust methods, such as the Targeted Maximum Likelihood Estimator (TMLE), complex estimators can result in significant undercoverage, an issue that is even more pronounced in singly robust methods. The Double Cross-Fitting (DCF) procedure complements these methods by enabling the use of diverse machine learning estimators, yet optimal guidelines for the number of data splits and repetitions remain unclear. This study explores the effects of varying the number of splits and repetitions in DCF on TMLE estimators through statistical simulations and a data analysis. We discuss two generalizations of DCF beyond the conventional three splits and apply a range of splits to fit the TMLE estimator, incorporating a super learner without transforming covariates. The statistical properties of these configurations are compared across two sample sizes (3000 and 5000) and two DCF generalizations (equal splits and full data use). Additionally, we conduct a real-world analysis using data from the National Health and Nutrition Examination Survey (NHANES) 2017-18 cycle to illustrate the practical implications of varying DCF splits, focusing on the association between obesity and the risk of developing diabetes. Our simulation study reveals that five splits in DCF yield satisfactory bias, variance, and coverage across scenarios. In the real-world application, the DCF TMLE method showed consistent risk difference estimates over a range of splits, though standard errors increased with more splits in one generalization, suggesting potential drawbacks to excessive splitting. This research underscores the importance of judicious selection of the number of splits and repetitions in DCF TMLE methods to balance computational efficiency against accurate statistical inference. Optimal performance seems attainable with three to five splits. Among the generalizations considered, using full data for nuisance estimation offered more consistent variance estimation and is preferable for applied use. Additionally, increasing the repetitions beyond 25 did not enhance performance. These findings provide guidance for researchers employing complex machine learning algorithms in causal studies and argue for cautious split management in DCF procedures.
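The abstract does not spell out the mechanics of DCF, so the following is only a minimal sketch of how splits and repetitions might enter a double cross-fitted TMLE for a risk difference. Everything specific in it is an assumption made for illustration: the function names (`dcf_tmle_once`, `dcf_tmle`), the rotation scheme, the rule that divides the non-estimation splits into two disjoint groups (one for the treatment model, one for the outcome model), plain logistic regressions standing in for the super learner, a pooled influence-curve variance within each repetition, median aggregation across repetitions, and the simulated data. It is not necessarily either of the two generalizations studied in the paper.

```python
import numpy as np


def expit(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))


def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))


def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic regression with intercept (stand-in for a super learner)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = expit(Xd @ beta)
        hess = (Xd * (p * (1 - p))[:, None]).T @ Xd + 1e-8 * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(hess, Xd.T @ (y - p))
    return beta


def predict(beta, X):
    return expit(np.column_stack([np.ones(len(X)), X]) @ beta)


def dcf_tmle_once(W, A, Y, n_splits, rng):
    """One repetition: (risk difference, variance) for one random partition."""
    if n_splits < 3:
        raise ValueError("double cross-fitting needs at least three splits")
    folds = np.array_split(rng.permutation(len(Y)), n_splits)
    psi_parts, ic_parts = [], []
    for k in range(n_splits):
        est = folds[k]  # split used only for estimation in this rotation
        others = [folds[(k + j) % n_splits] for j in range(1, n_splits)]
        half = len(others) // 2
        g_idx = np.concatenate(others[:half])  # treatment-model sample (assumed rule)
        q_idx = np.concatenate(others[half:])  # outcome-model sample, disjoint from g_idx
        g_beta = fit_logistic(W[g_idx], A[g_idx])
        q_beta = fit_logistic(np.column_stack([A[q_idx], W[q_idx]]), Y[q_idx])

        a, y, w = A[est], Y[est], W[est]
        g = np.clip(predict(g_beta, w), 0.01, 0.99)  # bounded propensity score
        q1 = predict(q_beta, np.column_stack([np.ones(len(est)), w]))
        q0 = predict(q_beta, np.column_stack([np.zeros(len(est)), w]))
        qa = np.where(a == 1, q1, q0)

        # TMLE targeting step: one-parameter logistic fluctuation fitted by Newton steps.
        h = a / g - (1 - a) / (1 - g)
        eps = 0.0
        for _ in range(50):
            p = expit(logit(qa) + eps * h)
            eps += np.sum(h * (y - p)) / np.sum(h**2 * p * (1 - p))
        q1s = expit(logit(q1) + eps / g)
        q0s = expit(logit(q0) - eps / (1 - g))
        qas = np.where(a == 1, q1s, q0s)
        psi_k = np.mean(q1s - q0s)
        psi_parts.append(psi_k)
        ic_parts.append(h * (y - qas) + (q1s - q0s) - psi_k)
    ic = np.concatenate(ic_parts)  # pooled influence-curve values (assumed variance rule)
    return float(np.mean(psi_parts)), float(np.var(ic, ddof=1) / len(ic))


def dcf_tmle(W, A, Y, n_splits=3, n_reps=25, seed=0):
    """Repeat the partition n_reps times and aggregate by medians (assumed rule)."""
    rng = np.random.default_rng(seed)
    res = np.array([dcf_tmle_once(W, A, Y, n_splits, rng) for _ in range(n_reps)])
    psi = np.median(res[:, 0])
    var = np.median(res[:, 1] + (res[:, 0] - psi) ** 2)
    return float(psi), float(np.sqrt(var))


# Simulated data for illustration only (not the paper's design, not NHANES).
rng = np.random.default_rng(1)
n = 3000
W = rng.normal(size=(n, 2))
A = rng.binomial(1, expit(0.4 * W[:, 0] - 0.3 * W[:, 1]))
Y = rng.binomial(1, expit(-0.5 + 0.8 * A + 0.5 * W[:, 0]))
for k in (3, 5):
    rd, se = dcf_tmle(W, A, Y, n_splits=k, n_reps=5)
    print(f"splits={k}  risk difference={rd:.3f}  SE={se:.3f}")
```

In a sketch like this, `n_splits` controls how much data each nuisance model and each estimation step sees, while `n_reps` only averages out the randomness of the partition. That is one way to read the abstract's finding that more repetitions stop helping after a point, whereas the number of splits changes the estimator itself.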




Source journal

Pharmaceutical Statistics (Medicine: Statistics & Probability)

CiteScore

2.70

Self-citation rate

6.70%

Articles published

Review time

6-12 weeks

Journal description: Pharmaceutical Statistics is an industry-led initiative, tackling real problems in statistical applications. The Journal publishes papers that share experiences in the practical application of statistics within the pharmaceutical industry. It covers all aspects of pharmaceutical statistical applications from discovery, through pre-clinical development, clinical development, post-marketing surveillance, consumer health, production, epidemiology, and health economics. The Journal is both international and multidisciplinary. It includes high-quality practical papers, case studies, and review papers.