Addressing sample selection bias for machine learning methods
Dylan Brewer, Alyssa Carlson
Journal of Applied Econometrics, published 2024-01-17
DOI: 10.1002/jae.3029
https://onlinelibrary.wiley.com/doi/10.1002/jae.3029
Citation count: 0
Abstract
We study approaches for adjusting machine learning methods when the training sample differs from the prediction sample on unobserved dimensions. The machine learning literature predominantly assumes selection only on observed dimensions, and the common remedies for selection on observables are to weight observations or to include the variables that influence selection. Simulation results show that selection on unobservables increases mean squared prediction error for popular machine learning algorithms. Common machine learning practices such as weighting or including variables that influence selection into the training or prediction sample often worsen sample selection bias. We propose two control function approaches that remove the effects of selection bias before training and find that they reduce mean squared prediction error in simulations. We apply these approaches to predicting the vote share of the incumbent in gubernatorial elections using previously observed re-election bids. We find that ignoring selection on unobservables leads to substantially higher predicted vote shares for the incumbent than when the control function approach is used.
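The abstract does not spell out the two control function approaches, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a textbook Heckman-style two-step: a probit for selection, an inverse Mills ratio as the control function, and plain OLS standing in for the machine learning learner. The simulated data, the instrument `z`, the correlation `rho`, and the helper names `solve`, `ols`, and `probit` are all hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist

N01 = NormalDist()  # standard normal, used for pdf and cdf

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * q for a, q in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def probit(Z, s, iters=25):
    """Probit maximum likelihood by Fisher scoring, starting from zero."""
    k = len(Z[0])
    g = [0.0] * k
    for _ in range(iters):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for zi, si in zip(Z, s):
            idx = sum(a * b for a, b in zip(g, zi))
            p = min(max(N01.cdf(idx), 1e-10), 1 - 1e-10)
            d = N01.pdf(idx)
            for i in range(k):
                score[i] += d * (si - p) / (p * (1 - p)) * zi[i]
                for j in range(k):
                    info[i][j] += d * d / (p * (1 - p)) * zi[i] * zi[j]
        g = [a + b for a, b in zip(g, solve(info, score))]
    return g

# Simulated population (assumed design): y = 1 + 2x + u, observed only if z + v > 0.
random.seed(0)
n, rho = 4000, 0.8
x = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]  # shifts selection, excluded from y
v = [random.gauss(0, 1) for _ in range(n)]
u = [rho * vi + (1 - rho**2) ** 0.5 * random.gauss(0, 1) for vi in v]
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]
s = [1 if zi + vi > 0 else 0 for zi, vi in zip(z, v)]
sel = [i for i in range(n) if s[i] == 1]  # the training sample

# Naive learner: fit on the selected sample and ignore selection.
b_naive = ols([[1.0, x[i]] for i in sel], [y[i] for i in sel])

# Control function: probit first stage, then add the inverse Mills ratio.
g = probit([[1.0, zi] for zi in z], s)
lam = {i: N01.pdf(g[0] + g[1] * z[i]) / N01.cdf(g[0] + g[1] * z[i]) for i in sel}
b_cf = ols([[1.0, x[i], lam[i]] for i in sel], [y[i] for i in sel])

# Predict for the whole population; the control function term is dropped
# because the unobservable has mean zero outside the selected sample.
mean_naive = sum(b_naive[0] + b_naive[1] * xi for xi in x) / n
mean_cf = sum(b_cf[0] + b_cf[1] * xi for xi in x) / n
print("naive intercept:", round(b_naive[0], 3), "mean prediction:", round(mean_naive, 3))
print("control function intercept:", round(b_cf[0], 3), "mean prediction:", round(mean_cf, 3))
```

In this assumed design the true intercept is 1, and positive selection on the unobservable pushes the naive fit upward, which mirrors the direction of the finding reported for incumbent vote shares. A flexible learner could replace `ols` in the second stage, but how the authors combine the control function with their algorithms is not stated in the abstract.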
About the journal:
The Journal of Applied Econometrics is an international journal published bi-monthly, plus 1 additional issue (total 7 issues). It aims to publish articles of high quality dealing with the application of existing as well as new econometric techniques to a wide variety of problems in economics and related subjects, covering topics in measurement, estimation, testing, forecasting, and policy analysis. The emphasis is on the careful and rigorous application of econometric techniques and the appropriate interpretation of the results. The economic content of the articles is stressed. A special feature of the Journal is its emphasis on the replicability of results by other researchers. To achieve this aim, authors are expected to make available a complete set of the data used as well as any specialised computer programs employed through a readily accessible medium, preferably in a machine-readable form. The use of microcomputers in applied research and transferability of data is emphasised. The Journal also features occasional sections of short papers re-evaluating previously published papers. The intention of the Journal of Applied Econometrics is to provide an outlet for innovative, quantitative research in economics which cuts across areas of specialisation, involves transferable techniques, and is easily replicable by other researchers. Contributions that introduce statistical methods that are applicable to a variety of economic problems are actively encouraged. The Journal also aims to publish review and survey articles that make recent developments in the field of theoretical and applied econometrics more readily accessible to applied economists in general.