Reply to: Comments on “Fisher–Schultz Lecture: Generic Machine Learning Inference on Heterogeneous Treatment Effects in Randomized Experiments, With an Application to Immunization in India”
Victor Chernozhukov, Mert Demirer, Esther Duflo, Iván Fernández-Val
Econometrica, Vol. 93, No. 4 (July 2025), 1177-1181. DOI: 10.3982/ECTA23706.
Abstract
We warmly thank Kosuke Imai, Michael Lingzhi Li, and Stefan Wager for their gracious and insightful comments. We are particularly encouraged that both pieces recognize the importance of the research agenda the lecture laid out, which we see as critical for applied researchers. It is also great to see that both underscore the potential of the basic approach we propose—targeting summary features of the CATE after proxy estimation with sample splitting.
We are also happy that both papers push us (and the reader) to continue thinking about the inference problem associated with sample splitting. We recognize that our current paper is only scratching the surface of this interesting agenda. Our proposal is certainly not the only option, and it is exciting that both papers provide and assess alternatives. Hopefully, this will generate even more work in this area.
One potential concern with our approach is that it is demanding in terms of data, since it relies on repeated splitting of the data into two parts: one used for CATE signal extraction and another used for post-processing. To examine potential improvements, Wager's discussion focuses on the special problem of testing the null effect—that is, whether the CATE function is zero. This is a special setting, since typical machine learning algorithms can in fact learn the zero function consistently even in high-dimensional settings.1 Nonetheless, the problem of testing the null of a zero CATE remains very important.
Fixing a single split of the data into K folds, Wager (2024) investigates relative gains in power generated by the sequential inference approach of Luedtke and van der Laan (2016). This approach uses progressively more data to estimate the “signal” and then generates a sequence of statistics to test whether the “signal” is zero. The statistics can be aggregated to form a “single-split” p-value using the martingale properties of the construction. Wager shows in Monte Carlo experiments (reproduced below) that this improves power over a method of taking the median p-value over K equal-sized folds (which is not the method we propose, but is a sensible benchmark). It also outperforms the “naïve” approach that relies on cross-fitting à la the debiased machine learning (DML) approach, which is asymptotically valid in this special setting but suffers from size distortions.2
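The martingale aggregation described above can be sketched in a few lines. This is a minimal illustration under our own simplifying assumptions, not Wager's or Luedtke and van der Laan's implementation: `gamma` stands for doubly robust scores whose conditional mean given X is the CATE, the default proxy learner is a simple ridge regression, and the function name is hypothetical. For fold k = 2, …, K, a proxy fitted on the first k − 1 folds is correlated with the scores on fold k; under the null each fold statistic is approximately standard normal given the past, so the scaled sum is approximately N(0, 1) by the martingale central limit theorem.

```python
import math
import numpy as np

def sequential_zero_cate_pvalue(X, gamma, n_folds=3, fit=None, seed=0):
    """Sequential (martingale-aggregated) test of the null of a zero CATE.

    X: (n, p) covariates; gamma: (n,) doubly robust scores whose
    conditional mean given X is the CATE. Returns a one-sided p-value.
    """
    rng = np.random.default_rng(seed)
    n = len(gamma)
    folds = np.array_split(rng.permutation(n), n_folds)
    if fit is None:
        def fit(X_tr, y_tr):
            # stand-in proxy learner: ridge regression of the scores on X
            A = X_tr.T @ X_tr + 1.0 * np.eye(X_tr.shape[1])
            beta = np.linalg.solve(A, X_tr.T @ y_tr)
            return lambda X_te: X_te @ beta
    z_stats = []
    for k in range(1, n_folds):
        train = np.concatenate(folds[:k])   # progressively more training data
        test = folds[k]
        proxy = fit(X[train], gamma[train])(X[test])
        t = proxy * gamma[test]             # E[t] = E[proxy(X) * CATE(X)], zero under the null
        z = math.sqrt(len(test)) * t.mean() / (t.std(ddof=1) + 1e-12)
        z_stats.append(z)
    # fold statistics form a martingale difference sequence under the null,
    # so their scaled sum is approximately N(0, 1)
    Z = sum(z_stats) / math.sqrt(len(z_stats))
    return 0.5 * math.erfc(Z / math.sqrt(2.0))
```

Any consistent proxy learner can be substituted for the ridge stand-in; the aggregation step is unchanged.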
We believe that Wager's proposal is potentially a fruitful complement to what we propose, since we can use the sequential estimation within our “multiple-split” approach. We now show that this combination generates further size and power improvements.
In what follows, we report results from a numerical simulation using the same experiment and implementation details as in Wager (2024).3 As in Wager, we consider the following “single-split” approaches: (a) naïve or DML approach; (b) 2-fold approach with 2/3 of the sample allocated to training and 1/3 to testing data; and (c) the sequential approach (with 3 equal folds). We compare these approaches to “multiple-split” versions of (a), (b), and (c).
The results in Table I, based on 10,000 simulation replications, show that, in line with Wager, the sequential approach increases power relative to the simple approach of using two folds of unequal size. Interestingly, the use of “multiple” splits makes the sequential approach even better: the frequency of false rejection is decreased dramatically while the power of rejecting the false null increases.
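The “multiple-split” versions wrap any single-split test in repeated random splits and aggregate the resulting p-values. A minimal sketch, with hypothetical names: one simple way to make the aggregation rigorous is the standard conservative bound that twice the median of arbitrarily dependent valid p-values is again a valid p-value.

```python
import numpy as np

def multi_split_pvalue(single_split_test, data, n_splits=100, seed=0):
    """Aggregate a single-split test over many random splits.

    `single_split_test(data, rng)` must perform its own random split using
    `rng` and return a p-value. Twice the median of arbitrarily dependent
    valid p-values is itself a valid p-value (a Markov-type bound), giving
    a simple conservative aggregation rule.
    """
    rng = np.random.default_rng(seed)
    pvals = [single_split_test(data, np.random.default_rng(rng.integers(2**32)))
             for _ in range(n_splits)]
    return min(1.0, 2.0 * float(np.median(pvals)))
```

The factor of 2 is the price of aggregating without modeling the dependence across splits; the stability gain from the median typically more than compensates.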
Finally, recall that we are testing here the null of a zero CATE. “Multiple” splitting fixes the size distortions of the “naïve” DML method, and it emerges as a very strong winner among all—it has the highest power and keeps the size well below the nominal level. Of course, we do not expect this superior performance of the naïve method to hold in more general settings of inference on CATE features, whenever the CATE function is not “special” enough to be learned quickly by ML (zero function, flat function, approximately sparse, etc.).
In summary, we believe that the idea of Wager (2024) of using martingale aggregation warrants further investigation, and we very much welcome any further research in this area.4
Imai and Li (2024) propose an alternative inference approach to account for sample-splitting uncertainty, based on Neyman's randomization paradigm. Relative to our approach, their method is analytical and relies on a single split of the data. This provides a clear computational advantage, as performing multiple splits requires additional computation time.5 We find the approach very interesting, although we have two comments.
First, as we highlighted theoretically in our paper, our multiple-split approach outperforms the single-split approach in terms of estimation risk. Specifically, we formally established that our method has a lower mean absolute deviation (MAD). We provide empirical evidence for this claim in the computational experiment below.
Second, as emphasized in our paper, a key motivation for our approach is its natural protection against “data mining” (whether intentional or not). For example, a researcher might try a few (F) different Monte Carlo seeds and—for replicability purposes—retain the seed that produces the most favorable results. This “mining” behavior in single-split approaches significantly increases estimation risk. In contrast, our procedure is expected to remain highly stable and exhibits minimal to no distortion. We provide practical evidence for this point in the computational experiment below.
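The mechanics of this distortion can be shown in a stylized Monte Carlo experiment. This is our own toy illustration of the general point, not the paper's design: under a true effect of zero, a single half-sample estimate is unbiased but noisy, so reporting the most favorable of F seeds is biased upward, while the median over many splits of the same data barely moves.

```python
import numpy as np

def demo_seed_mining(n=400, n_seeds=20, n_splits=100, reps=200):
    """Monte Carlo under a true effect of zero: compare the average of the
    most favorable single-split estimate over `n_seeds` seeds ("mining")
    with the average of the median over `n_splits` splits."""
    rng = np.random.default_rng(0)
    mined, multi = [], []
    for _ in range(reps):
        y = rng.standard_normal(n)  # unit-level effect estimates, true mean 0

        def one_split(r):
            # single-split estimator: mean over a random half of the sample
            half = r.choice(n, n // 2, replace=False)
            return float(y[half].mean())

        # "mining": keep the largest estimate across n_seeds seeds
        mined.append(max(one_split(np.random.default_rng(s))
                         for s in range(n_seeds)))
        # multiple splits: median over n_splits splits of the same data
        multi.append(float(np.median([one_split(np.random.default_rng(10_000 + s))
                                      for s in range(n_splits)])))
    return float(np.mean(mined)), float(np.mean(multi))
```

Running this gives a clearly positive average for the mined estimator and an average near zero for the median-over-splits estimator, even though every underlying single-split estimate is unbiased.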
In Table II, we reuse the Monte Carlo design from the previous section, where the CATE is zero. The table reports the bias, standard deviation, and MAD of estimators of the difference in GATES, and rejection frequencies for the test that this parameter equals zero, based on single and multiple sample splits. Specifically, we compare the method of Imai and Li (2024), which uses three folds with cross-fitting and a single split (IMLI), to our method (CDDF), computed as the median of 100 splits, with 2/3 of the sample in the auxiliary set and 1/3 in the validation set.6 The columns labeled “Mining (F)” illustrate the risks of data mining when using estimators reliant on a single split of the data. These columns report results for the maximum of IMLI and CDDF over F different random seeds, emulating the behavior of a “mining” researcher searching (intentionally or not) for positive effects.
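For concreteness, the GATES-difference parameter compared in the table can be sketched as follows. This is a minimal illustration under a simple grouping convention, with all names (`proxy`, `gamma`, `gates_difference`) our own, not the IMLI or CDDF implementation: validation-sample units are grouped by the CATE proxy, GATES are group means of doubly robust scores, and the reported parameter is the top-minus-bottom group difference.

```python
import numpy as np

def gates_difference(proxy, gamma, n_groups=5):
    """Difference between the most- and least-affected GATES groups.

    proxy: CATE proxy values on the validation sample; gamma: doubly robust
    scores whose conditional mean is the CATE. Returns the top-minus-bottom
    group difference and its standard error.
    """
    order = np.argsort(proxy)                  # sort units by the proxy
    groups = np.array_split(order, n_groups)   # equal-sized proxy groups
    bottom, top = gamma[groups[0]], gamma[groups[-1]]
    diff = float(top.mean() - bottom.mean())
    se = float(np.sqrt(top.var(ddof=1) / len(top)
                       + bottom.var(ddof=1) / len(bottom)))
    return diff, se
```

When the proxy is pure noise (as under a zero CATE), the difference concentrates around zero; when the proxy tracks genuine heterogeneity, it separates the extreme groups.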
From Table II, we draw the following conclusions:
In summary, we conclude that performing multiple splits provides clear statistical advantages over single-split methods, provided the computational cost is not a significant concern. CDDF offers lower estimation risk, greater robustness to mining, and attractive inferential properties.
Developing reliable methods to uncover the presence and magnitude of heterogeneous treatment effects is an important task in modern econometrics and statistics. Our paper made a specific suggestion, and Wager (2024) and Imai and Li (2024) propose clever alternatives, which are computationally appealing because they do not require multiple splits. Unsurprisingly, these gains come with some costs, both in terms of theoretical requirements and of robustness to data mining. We see these approaches as very useful complements to the idea of multiple splits. Additional research on how to balance these trade-offs would be highly valuable.