Nonstationary Stochastic Bandits: UCB Policies and Minimax Regret

Lai Wei, Vaibhav Srivastava

IEEE Open Journal of Control Systems, vol. 3, pp. 128-142, published 2024-03-05.
DOI: 10.1109/OJCSYS.2024.3372929
Full text: https://ieeexplore.ieee.org/document/10460198/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10460198
Citations: 0

Abstract

We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem, in which the reward distributions associated with the arms are assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined as the difference between the expected cumulative reward obtained by the policy and that obtained by an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward-distribution sequences satisfying the variation budget. We design Upper-Confidence Bound (UCB)-based policies using three different approaches, namely periodic resetting, a sliding observation window, and a discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achievable by any policy. We also relax the sub-Gaussian assumption on the reward distributions and develop robust versions of the proposed policies that handle heavy-tailed reward distributions while maintaining their performance guarantees.
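To make the setup concrete, the following is a minimal sketch of the variation-budget constraint and the regret described above, written in standard nonstationary-bandit notation that is assumed here rather than taken from the paper: \(\mu_i(t)\) is the mean reward of arm \(i\) at time \(t\), \(T\) is the horizon, \(V_T\) is the variation budget, and a policy \(\pi\) plays arm \(\pi_t\) at time \(t\).

```latex
% Sketch only; the paper's exact notation may differ.
% Variation-budget constraint on the reward-distribution sequence:
\sum_{t=1}^{T-1} \max_{i \in \{1,\dots,K\}} \bigl| \mu_i(t+1) - \mu_i(t) \bigr| \;\le\; V_T .

% Regret against the oracle that plays the arm with the largest mean at every step:
R_T^{\pi} \;=\; \sum_{t=1}^{T} \max_{i} \mu_i(t) \;-\; \mathbb{E}\!\left[ \sum_{t=1}^{T} \mu_{\pi_t}(t) \right].

% Minimax benchmark: the worst-case regret over all budget-feasible sequences,
% minimized over policies.
\inf_{\pi} \; \sup_{\{\mu_i(\cdot)\}\,:\,\text{budget } V_T} \; R_T^{\pi}.
```

As an illustration of the sliding-observation-window approach mentioned in the abstract, here is a hedged Python sketch of a generic sliding-window UCB index policy. The class name, window size, and confidence radius are illustrative assumptions and may differ from the policies analyzed in the paper.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """Illustrative sliding-window UCB for nonstationary bandits.

    A generic sketch of the sliding-observation-window idea; the exact
    index and window size used in the paper may differ.
    """

    def __init__(self, n_arms, window, confidence_scale=2.0):
        self.n_arms = n_arms
        self.window = window        # only the last `window` plays are kept
        self.scale = confidence_scale
        self.history = deque()      # (arm, reward) pairs, newest last

    def select_arm(self, t):
        """Return the arm to play at round t >= 1."""
        # Counts and reward sums restricted to the sliding window.
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward

        # Play any arm not yet sampled inside the window.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm

        # Otherwise pick the arm with the largest windowed UCB index.
        window_len = len(self.history)
        ucb = [
            sums[arm] / counts[arm]
            + math.sqrt(self.scale * math.log(min(t, window_len)) / counts[arm])
            for arm in range(self.n_arms)
        ]
        return max(range(self.n_arms), key=lambda a: ucb[a])

    def update(self, arm, reward):
        """Record the observed reward and drop observations outside the window."""
        self.history.append((arm, reward))
        if len(self.history) > self.window:
            self.history.popleft()
```

In a simulation one would call `select_arm(t)` at each round, observe the reward of the chosen arm, and pass it back through `update`; restricting the statistics to the most recent observations is what lets the index track slowly drifting means, in the same spirit as the sliding-window policy described above.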