Regret-Minimization in Risk-Averse Bandits

Shubhada Agrawal, S. Juneja, Wouter M. Koolen
{"title":"Regret-Minimization in Risk-Averse Bandits","authors":"Shubhada Agrawal, S. Juneja, Wouter M. Koolen","doi":"10.1109/ICC54714.2021.9703134","DOIUrl":null,"url":null,"abstract":"Classical regret minimization in a bandit frame-work involves a number of probability distributions or arms that are not known to the learner but that can be sampled from or pulled. The learner's aim is to sequentially pull these arms so as to maximize the number of times the best arm is pulled, or equivalently, minimize the regret associated with the sub-optimal pulls. Best is classically defined as the arm with the largest mean. Lower bounds on expected regret are well known, and lately, in great generality, efficient algorithms that match the lower bounds have been developed. In this paper we extend this methodology to a more general risk-reward set-up where the best arm corresponds to the one with the lowest average loss (negative of reward), with a multiple of Conditional-Value-at-Risk $(\\mathbf{CVaR})$ of the loss distribution added to it. $(\\mathbf{CVaR})$ is a popular tail risk measure. The settings where risk becomes an important consideration, typically involve heavy-tailed distributions. Unlike in most of the previous literature, we allow for all the distributions with a known uniform bound on the moment of order $(1+\\epsilon)$, allowing for heavy-tailed bandits. We extend the lower bound of the classical regret minimization setup to this setting and develop an index-based algorithm. Like the popular KL-UCB algorithm for the mean setting, our index is derived from the proposed lower bound, and is based on the empirical likelihood principle. We also propose anytime-valid confidence intervals for the mean-CVaR trade-off metric. En route, we develop concentration inequalities, which may be of independent interest.","PeriodicalId":382373,"journal":{"name":"2021 Seventh Indian Control Conference (ICC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 Seventh Indian Control Conference (ICC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICC54714.2021.9703134","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Classical regret minimization in a bandit framework involves a number of probability distributions, or arms, that are not known to the learner but that can be sampled from, or pulled. The learner's aim is to sequentially pull these arms so as to maximize the number of times the best arm is pulled, or equivalently, to minimize the regret associated with sub-optimal pulls. The best arm is classically defined as the one with the largest mean. Lower bounds on expected regret are well known, and lately, in great generality, efficient algorithms matching these lower bounds have been developed. In this paper we extend this methodology to a more general risk-reward set-up in which the best arm is the one with the lowest average loss (the negative of reward) plus a multiple of the Conditional Value-at-Risk ($\mathbf{CVaR}$) of the loss distribution, a popular tail-risk measure. Settings where risk becomes an important consideration typically involve heavy-tailed distributions. Unlike most of the previous literature, we allow all distributions with a known uniform bound on the moment of order $1+\epsilon$, thereby accommodating heavy-tailed bandits. We extend the lower bound of the classical regret-minimization setup to this setting and develop an index-based algorithm. Like the popular KL-UCB algorithm for the mean setting, our index is derived from the proposed lower bound and is based on the empirical-likelihood principle. We also propose anytime-valid confidence intervals for the mean-CVaR trade-off metric. En route, we develop concentration inequalities, which may be of independent interest.
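For concreteness, the mean-CVaR criterion described above can be written in the standard Rockafellar-Uryasev form; the tail level $\alpha$ and the multiplier $\kappa$ below are placeholders, since the abstract does not fix their values:

\[
  \mathrm{CVaR}_{\alpha}(L) = \inf_{x \in \mathbb{R}} \Bigl\{ x + \tfrac{1}{\alpha}\,\mathbb{E}\bigl[(L-x)^{+}\bigr] \Bigr\},
  \qquad
  c_i = \mathbb{E}[L_i] + \kappa\,\mathrm{CVaR}_{\alpha}(L_i),
\]

where $L_i$ denotes the loss of arm $i$ and the best arm is the one minimizing $c_i$. A minimal plug-in estimator of this criterion from loss samples is sketched below (an illustrative sketch only; the paper's actual index is derived from an empirical-likelihood lower bound and is not reproduced here):

    import numpy as np

    def empirical_cvar(losses, alpha):
        """Plug-in CVaR at tail level alpha: the Rockafellar-Uryasev form
        evaluated at the empirical (1 - alpha)-quantile of the losses."""
        losses = np.asarray(losses, dtype=float)
        var = np.quantile(losses, 1.0 - alpha)            # empirical Value-at-Risk
        return var + np.maximum(losses - var, 0.0).mean() / alpha

    def mean_cvar(losses, alpha=0.05, kappa=1.0):
        """Mean loss plus kappa times CVaR_alpha of the loss samples
        (alpha and kappa are illustrative placeholders, not the paper's choices)."""
        losses = np.asarray(losses, dtype=float)
        return losses.mean() + kappa * empirical_cvar(losses, alpha)

    # Hypothetical usage: rank two arms by estimated mean-CVaR loss (smaller is better).
    rng = np.random.default_rng(0)
    arm_a = rng.standard_t(df=3, size=10_000)             # heavy-tailed losses
    arm_b = rng.normal(loc=0.1, scale=1.0, size=10_000)   # light-tailed losses
    print(mean_cvar(arm_a), mean_cvar(arm_b))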