Reply to: Comments on “Fisher–Schultz Lecture: Generic Machine Learning Inference on Heterogeneous Treatment Effects in Randomized Experiments, With an Application to Immunization in India”
Victor Chernozhukov, Mert Demirer, Esther Duflo, Iván Fernández-Val
Econometrica, Vol. 93, No. 4 (July 2025), 1177-1181. DOI: 10.3982/ECTA23706.
Abstract
We warmly thank Kosuke Imai, Michael Lingzhi Li, and Stefan Wager for their gracious and insightful comments. We are particularly encouraged that both pieces recognize the importance of the research agenda the lecture laid out, which we see as critical for applied researchers. It is also great to see that both underscore the potential of the basic approach we propose—targeting summary features of the CATE after proxy estimation with sample splitting.
We are also happy that both papers push us (and the reader) to continue thinking about the inference problem associated with sample splitting. We recognize that our current paper is only scratching the surface of this interesting agenda. Our proposal is certainly not the only option, and it is exciting that both papers provide and assess alternatives. Hopefully, this will generate even more work in this area.
One potential concern with our approach is that it is demanding in terms of data, since it relies on repeated splitting of the data into two parts: one used for CATE signal extraction and another used for post-processing. To examine potential improvements, Wager's discussion focuses on the special problem of testing the null effect—that is, whether the CATE function is zero. This is a special setting, since typical machine learning algorithms can in fact learn the zero function consistently even in high-dimensional settings.1 Nonetheless, the problem of testing the null of a zero CATE remains very important.
Fixing a single split of the data into K folds, Wager (2024) investigates relative gains in power generated by the sequential inference approach of Luedtke and van der Laan (2016). This approach uses progressively more data to estimate the “signal” and then generates a sequence of statistics to test whether the “signal” is zero. The statistics can be aggregated to form a “single-split” p-value using the martingale properties of the construction. Wager shows in Monte Carlo experiments (reproduced below) that this improves power over a method of taking the median p-value over K equal-sized folds (which is not the method we propose, but is a sensible benchmark). It also outperforms the “naïve” approach that relies on cross-fitting à la the debiased machine learning (DML) approach, which is asymptotically valid in this special setting but suffers from size distortions.2
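The martingale aggregation described above can be sketched in a few lines. This is a minimal illustration under our own simplifying assumptions, not Wager's or Luedtke and van der Laan's implementation: `gamma` stands for doubly robust scores whose conditional mean given X is the CATE, the default proxy learner is a simple ridge regression, and the function name is hypothetical. For fold k = 2, …, K, a proxy fitted on the first k − 1 folds is correlated with the scores on fold k; under the null each fold statistic is approximately standard normal given the past, so the scaled sum is approximately N(0, 1) by the martingale central limit theorem.

```python
import math
import numpy as np

def sequential_zero_cate_pvalue(X, gamma, n_folds=3, fit=None, seed=0):
    """Sequential (martingale-aggregated) test of the null of a zero CATE.

    X: (n, p) covariates; gamma: (n,) doubly robust scores whose
    conditional mean given X is the CATE. Returns a one-sided p-value.
    """
    rng = np.random.default_rng(seed)
    n = len(gamma)
    folds = np.array_split(rng.permutation(n), n_folds)
    if fit is None:
        def fit(X_tr, y_tr):
            # stand-in proxy learner: ridge regression of the scores on X
            A = X_tr.T @ X_tr + 1.0 * np.eye(X_tr.shape[1])
            beta = np.linalg.solve(A, X_tr.T @ y_tr)
            return lambda X_te: X_te @ beta
    z_stats = []
    for k in range(1, n_folds):
        train = np.concatenate(folds[:k])   # progressively more training data
        test = folds[k]
        proxy = fit(X[train], gamma[train])(X[test])
        t = proxy * gamma[test]             # E[t] = E[proxy(X) * CATE(X)], zero under the null
        z = math.sqrt(len(test)) * t.mean() / (t.std(ddof=1) + 1e-12)
        z_stats.append(z)
    # fold statistics form a martingale difference sequence under the null,
    # so their scaled sum is approximately N(0, 1)
    Z = sum(z_stats) / math.sqrt(len(z_stats))
    return 0.5 * math.erfc(Z / math.sqrt(2.0))
```

Any consistent proxy learner can be substituted for the ridge stand-in; the aggregation step is unchanged.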
We believe that Wager's proposal is potentially a fruitful complement to what we propose, since we can use the sequential estimation within our “multiple-split” approach. We now show that this combination generates further size and power improvements.
In what follows, we report results from a numerical simulation using the same experiment and implementation details as in Wager (2024).3 As in Wager, we consider the following “single-split” approaches: (a) naïve or DML approach; (b) 2-fold approach with 2/3 of the sample allocated to training and 1/3 to testing data; and (c) the sequential approach (with 3 equal folds). We compare these approaches to “multiple-split” versions of (a), (b), and (c).
The results in Table I, based on 10,000 simulation replications, show that, in line with Wager, the sequential approach increases power relative to the simple approach of using two folds of unequal size. Interestingly, the use of “multiple” splits makes the sequential approach even better: the frequency of false rejection is decreased dramatically while the power of rejecting the false null increases.
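The “multiple-split” versions wrap any single-split test in repeated random splits and aggregate the resulting p-values. A minimal sketch, with hypothetical names: one simple way to make the aggregation rigorous is the standard conservative bound that twice the median of arbitrarily dependent valid p-values is again a valid p-value.

```python
import numpy as np

def multi_split_pvalue(single_split_test, data, n_splits=100, seed=0):
    """Aggregate a single-split test over many random splits.

    `single_split_test(data, rng)` must perform its own random split using
    `rng` and return a p-value. Twice the median of arbitrarily dependent
    valid p-values is itself a valid p-value (a Markov-type bound), giving
    a simple conservative aggregation rule.
    """
    rng = np.random.default_rng(seed)
    pvals = [single_split_test(data, np.random.default_rng(rng.integers(2**32)))
             for _ in range(n_splits)]
    return min(1.0, 2.0 * float(np.median(pvals)))
```

The factor of 2 is the price of aggregating without modeling the dependence across splits; the stability gain from the median typically more than compensates.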
Finally, recall that we are testing here the null of a zero CATE. “Multiple” splitting fixes the size distortions of the “naïve” DML method, and it emerges as a very strong winner among all—it has the highest power and keeps the size well below the nominal level. Of course, we do not expect this superior performance of the naïve method to hold in more general settings of inference on CATE features, whenever the CATE function is not “special” enough to be learned quickly by ML (zero function, flat function, approximately sparse, etc.).
In summary, we believe that the idea of Wager (2024) of using martingale aggregation warrants further investigation, and we very much welcome any further research in this area.4
Imai and Li (2024) propose an alternative inference approach to account for sample-splitting uncertainty, based on Neyman's randomization paradigm. Relative to our approach, their method is analytical and relies on a single split of the data. This provides a clear computational advantage, as performing multiple splits requires additional computation time.5 We find the approach very interesting, although we have two comments.
First, as we highlighted theoretically in our paper, our multiple-split approach outperforms the single-split approach in terms of estimation risk. Specifically, we formally established that our method has a lower mean absolute deviation (MAD). We provide empirical evidence for this claim in the computational experiment below.
Second, as emphasized in our paper, a key motivation for our approach is its natural protection against “data mining” (whether intentional or not). For example, a researcher might try a few (F) different Monte Carlo seeds and—for replicability purposes—retain the seed that produces the most favorable results. This “mining” behavior in single-split approaches significantly increases estimation risk. In contrast, our procedure is expected to remain highly stable and exhibits minimal to no distortion. We provide practical evidence for this point in the computational experiment below.
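The mechanics of this distortion can be shown in a stylized Monte Carlo experiment. This is our own toy illustration of the general point, not the paper's design: under a true effect of zero, a single half-sample estimate is unbiased but noisy, so reporting the most favorable of F seeds is biased upward, while the median over many splits of the same data barely moves.

```python
import numpy as np

def demo_seed_mining(n=400, n_seeds=20, n_splits=100, reps=200):
    """Monte Carlo under a true effect of zero: compare the average of the
    most favorable single-split estimate over `n_seeds` seeds ("mining")
    with the average of the median over `n_splits` splits."""
    rng = np.random.default_rng(0)
    mined, multi = [], []
    for _ in range(reps):
        y = rng.standard_normal(n)  # unit-level effect estimates, true mean 0

        def one_split(r):
            # single-split estimator: mean over a random half of the sample
            half = r.choice(n, n // 2, replace=False)
            return float(y[half].mean())

        # "mining": keep the largest estimate across n_seeds seeds
        mined.append(max(one_split(np.random.default_rng(s))
                         for s in range(n_seeds)))
        # multiple splits: median over n_splits splits of the same data
        multi.append(float(np.median([one_split(np.random.default_rng(10_000 + s))
                                      for s in range(n_splits)])))
    return float(np.mean(mined)), float(np.mean(multi))
```

Running this gives a clearly positive average for the mined estimator and an average near zero for the median-over-splits estimator, even though every underlying single-split estimate is unbiased.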
In Table II, we reuse the Monte Carlo design from the previous section, where the CATE is zero. The table reports the bias, standard deviation, and MAD of estimators of the difference in GATES, and rejection frequencies for the test that this parameter equals zero, based on single and multiple sample splits. Specifically, we compare the method of Imai and Li (2024), which uses three folds with cross-fitting and a single split (IMLI), to our method (CDDF), computed as the median of 100 splits, with 2/3 of the sample in the auxiliary set and 1/3 in the validation set.6 The columns labeled “Mining (F)” illustrate the risks of data mining when using estimators reliant on a single split of the data. These columns report results for the maximum of IMLI and CDDF over F different random seeds, emulating the behavior of a “mining” researcher searching (intentionally or not) for positive effects.
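For concreteness, the GATES-difference parameter compared in the table can be sketched as follows. This is a minimal illustration under a simple grouping convention, with all names (`proxy`, `gamma`, `gates_difference`) our own, not the IMLI or CDDF implementation: validation-sample units are grouped by the CATE proxy, GATES are group means of doubly robust scores, and the reported parameter is the top-minus-bottom group difference.

```python
import numpy as np

def gates_difference(proxy, gamma, n_groups=5):
    """Difference between the most- and least-affected GATES groups.

    proxy: CATE proxy values on the validation sample; gamma: doubly robust
    scores whose conditional mean is the CATE. Returns the top-minus-bottom
    group difference and its standard error.
    """
    order = np.argsort(proxy)                  # sort units by the proxy
    groups = np.array_split(order, n_groups)   # equal-sized proxy groups
    bottom, top = gamma[groups[0]], gamma[groups[-1]]
    diff = float(top.mean() - bottom.mean())
    se = float(np.sqrt(top.var(ddof=1) / len(top)
                       + bottom.var(ddof=1) / len(bottom)))
    return diff, se
```

When the proxy is pure noise (as under a zero CATE), the difference concentrates around zero; when the proxy tracks genuine heterogeneity, it separates the extreme groups.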
From Table II, we draw the following conclusions:
In summary, we conclude that performing multiple splits provides clear statistical advantages over single-split methods, provided the computational cost is not a significant concern. CDDF offers lower estimation risk, greater robustness to mining, and attractive inferential properties.
Developing reliable methods to uncover the presence and magnitude of heterogeneous treatment effects is an important task in modern econometrics and statistics. Our paper made a specific suggestion, and Wager (2024) and Imai and Li (2024) propose clever alternatives, which are computationally appealing because they do not require multiple splits. Unsurprisingly, these gains come with some costs, both in terms of theoretical requirements and of robustness to data mining. We see these approaches as very useful complements to the idea of multiple splits. Additional research on how to balance these trade-offs would be highly valuable.