Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem

Indiana University Kelley School of Business Research Paper Series, Pub Date: 2019-02-22, DOI: 10.2139/ssrn.3339983

Mochen Yang, E. McFowland, Gordon Burtch, G. Adomavicius


Citations: 3

Abstract


Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem

Combining machine learning with econometric analysis is becoming increasingly prevalent in both research and practice. A common empirical strategy uses predictive modeling techniques to “mine” variables of interest from available data and then incorporates those variables into an econometric framework to estimate causal effects. However, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error. We propose a novel approach to mitigating these biases that leverages the random forest technique, using the forest not just for prediction but also to generate instrumental variables for bias correction. A random forest performs best when it is composed of trees that are individually accurate in their predictions yet make “different” mistakes, that is, have weakly correlated prediction errors. A key observation is that these properties are closely related to the relevance and exclusion requirements of valid instrumental variables. We design a data-driven procedure to select tuples of individual trees from a random forest, in which one tree serves as the endogenous covariate and the others serve as its instruments. Simulation experiments demonstrate the procedure's efficacy in mitigating estimation biases and its superior performance over alternative methods.
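The intuition behind using one tree as an instrument for another can be illustrated with a small simulation. The sketch below is not the authors' procedure and does not train a random forest: as a labeled assumption, it replaces the predictions of two trees with two hypothetical noisy proxies (`tree_a`, `tree_b`) of the same unobserved variable, with independent classical measurement errors. Under that assumption, regressing the outcome on one proxy gives a slope attenuated by the factor var(x) / (var(x) + var(e)), which is 0.5 here, while the simple instrumental-variable ratio that uses the second proxy as the instrument should land close to the true coefficient. All names and parameter values are illustrative.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, beta=2.0, seed=0):
    """Return (naive OLS slope, IV slope) for one simulated data set."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]      # unobserved variable

    # Assumption: two noisy proxies with independent errors stand in for
    # the predictions of two trees that make "different" mistakes.
    tree_a = [x + rng.gauss(0, 1) for x in x_true]    # endogenous covariate
    tree_b = [x + rng.gauss(0, 1) for x in x_true]    # instrument

    y = [beta * x + rng.gauss(0, 1) for x in x_true]  # outcome

    # Naive regression of y on the mismeasured covariate (biased toward 0).
    ols = cov(y, tree_a) / cov(tree_a, tree_a)
    # Simple IV ratio estimator with tree_b as the instrument for tree_a.
    iv = cov(y, tree_b) / cov(tree_a, tree_b)
    return ols, iv


if __name__ == "__main__":
    ols, iv = simulate()
    print(f"true beta = 2.0, naive OLS = {ols:.3f}, IV = {iv:.3f}")
```

Real tree predictions need not satisfy these idealized conditions, since their errors can be correlated with each other or with the regression error; this is presumably why the abstract describes a data-driven step for selecting which trees to pair.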
