{"title":"Bandit Algorithms for Policy Learning: Methods, Implementation, and Welfare-performance","authors":"Toru Kitagawa, Jeff Rowley","doi":"arxiv-2409.00379","DOIUrl":null,"url":null,"abstract":"Static supervised learning-in which experimental data serves as a training\nsample for the estimation of an optimal treatment assignment policy-is a\ncommonly assumed framework of policy learning. An arguably more realistic but\nchallenging scenario is a dynamic setting in which the planner performs\nexperimentation and exploitation simultaneously with subjects that arrive\nsequentially. This paper studies bandit algorithms for learning an optimal\nindividualised treatment assignment policy. Specifically, we study\napplicability of the EXP4.P (Exponential weighting for Exploration and\nExploitation with Experts) algorithm developed by Beygelzimer et al. (2011) to\npolicy learning. Assuming that the class of policies has a finite\nVapnik-Chervonenkis dimension and that the number of subjects to be allocated\nis known, we present a high probability welfare-regret bound of the algorithm.\nTo implement the algorithm, we use an incremental enumeration algorithm for\nhyperplane arrangements. We perform extensive numerical analysis to assess the\nalgorithm's sensitivity to its tuning parameters and its welfare-regret\nperformance. Further simulation exercises are calibrated to the National Job\nTraining Partnership Act (JTPA) Study sample to determine how the algorithm\nperforms when applied to economic data. Our findings highlight various\ncomputational challenges and suggest that the limited welfare gain from the\nalgorithm is due to substantial heterogeneity in causal effects in the JTPA\ndata.","PeriodicalId":501293,"journal":{"name":"arXiv - ECON - Econometrics","volume":"22 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - ECON - Econometrics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.00379","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Static supervised learning, in which experimental data serves as a training sample for the estimation of an optimal treatment assignment policy, is a commonly assumed framework of policy learning. An arguably more realistic but challenging scenario is a dynamic setting in which the planner performs experimentation and exploitation simultaneously with subjects that arrive sequentially. This paper studies bandit algorithms for learning an optimal individualised treatment assignment policy. Specifically, we study the applicability of the EXP4.P (Exponential weighting for Exploration and Exploitation with Experts) algorithm developed by Beygelzimer et al. (2011) to policy learning. Assuming that the class of policies has a finite Vapnik-Chervonenkis dimension and that the number of subjects to be allocated is known, we present a high-probability welfare-regret bound for the algorithm. To implement the algorithm, we use an incremental enumeration algorithm for hyperplane arrangements. We perform extensive numerical analysis to assess the algorithm's sensitivity to its tuning parameters and its welfare-regret performance. Further simulation exercises are calibrated to the National Job Training Partnership Act (JTPA) Study sample to determine how the algorithm performs when applied to economic data. Our findings highlight various computational challenges and suggest that the limited welfare gain from the algorithm is due to substantial heterogeneity in causal effects in the JTPA data.
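
The abstract does not reproduce the EXP4.P routine itself, so the following Python sketch illustrates, for readers unfamiliar with Beygelzimer et al. (2011), how exponential weighting over expert advice works in a sequential treatment-assignment loop; EXP4.P is known to achieve regret of order sqrt(K T ln(N / delta)) with probability at least 1 - delta when there are N experts and K arms. The function names, the contexts/rewards interface, and the tuning values are illustrative assumptions, not the authors' implementation; in particular, the paper's use of incremental hyperplane-arrangement enumeration to handle a rich policy class is not reflected here, since this sketch assumes a finite list of candidate policies.

```python
import numpy as np

def exp4p(contexts, rewards, policies, T, K, delta=0.05, seed=0):
    """Minimal sketch of EXP4.P (Beygelzimer et al., 2011).

    policies: list of functions mapping a context to a probability
              vector over the K treatment arms (the "experts"); in the
              policy-learning setting each expert is a candidate
              assignment rule.
    rewards:  function (t, arm) -> realised outcome in [0, 1], standing
              in for the welfare outcome of subject t under the chosen
              treatment (hypothetical interface).
    Assumes K * p_min <= 1, i.e. T is large enough relative to N and K.
    """
    rng = np.random.default_rng(seed)
    N = len(policies)                       # number of experts / policies
    p_min = np.sqrt(np.log(N) / (K * T))    # exploration floor
    gamma = np.sqrt(np.log(N / delta) / (K * T))
    w = np.ones(N)                          # expert weights

    for t in range(T):
        # Expert advice: N x K matrix of action distributions for subject t.
        xi = np.array([pol(contexts[t]) for pol in policies])
        # Mix the experts' advice and enforce the exploration floor.
        p = (1.0 - K * p_min) * (w @ xi) / w.sum() + p_min
        p = p / p.sum()                     # guard against rounding error
        a = rng.choice(K, p=p)              # assign a treatment
        r = rewards(t, a)                   # observe the outcome in [0, 1]

        # Importance-weighted reward estimate for each arm.
        r_hat = np.zeros(K)
        r_hat[a] = r / p[a]
        # Per-expert estimated reward and variance proxy.
        y_hat = xi @ r_hat
        v_hat = (xi / p).sum(axis=1)
        # Exponential-weight update with the high-probability correction.
        w = w * np.exp(0.5 * p_min * (y_hat + v_hat * gamma))

    return w / w.sum()                      # final mixture over policies
```

Under the assumed interface, calling exp4p with a list of candidate assignment rules returns the terminal mixture over those rules, which can then be used as the learned treatment-assignment policy; the exploration floor p_min and the correction term gamma are the tuning quantities whose sensitivity the paper's numerical analysis examines.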