Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-Stationary Rewards

Omar Besbes, Y. Gur, A. Zeevi
Operations Research eJournal (Journal Article), published 2014-05-13
DOI: 10.2139/ssrn.2436629 (https://doi.org/10.2139/ssrn.2436629)
Citations: 111

Abstract

In a multi-armed bandit (MAB) problem, a gambler must choose one of K arms at each round of play, each arm characterized by an unknown reward distribution. Reward realizations are observed only when an arm is selected, and the gambler's objective is to maximize cumulative expected earnings over a given horizon of play T. To do this, the gambler needs to acquire information about the arms (exploration) while simultaneously optimizing immediate rewards (exploitation). The price paid for this tradeoff is often referred to as the regret, and the main question is how small this price can be as a function of the horizon length T. This problem has been studied extensively for the case in which the reward distributions do not change over time, an assumption that supports a sharp characterization of the regret yet is often violated in practical settings. In this paper, we focus on an MAB formulation that allows for a broad range of temporal uncertainties in the rewards while still maintaining mathematical tractability. We fully characterize the (regret) complexity of this class of MAB problems by establishing a direct link between the extent of allowable reward "variation" and the minimal achievable regret. Our analysis draws some connections between two rather disparate strands of literature: the adversarial and the stochastic MAB frameworks.
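To make the setting concrete, the following is a minimal sketch of a bandit whose mean rewards change over time, together with a regret tally. Everything in it is an illustrative assumption rather than the paper's construction: the two-arm instance, the mean rewards that swap halfway through the horizon, the epsilon-greedy policy, and the choice to measure regret against the best arm at each round. It is meant only to show how a policy that pools all past observations can keep paying regret after the rewards shift.

```python
import random


def simulate(horizon=2000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a hypothetical two-arm non-stationary bandit.

    Returns the cumulative expected regret measured against the arm with
    the highest mean reward at each round (an assumed benchmark).
    """
    rng = random.Random(seed)

    def means(t):
        # Assumed instance: the arms' mean rewards swap at mid-horizon.
        return (0.8, 0.2) if t < horizon // 2 else (0.2, 0.8)

    counts = [0, 0]      # number of pulls per arm
    totals = [0.0, 0.0]  # sum of observed rewards per arm
    regret = 0.0

    for t in range(horizon):
        mu = means(t)
        if 0 in counts or rng.random() < epsilon:
            # Exploration: pick an arm uniformly at random.
            arm = rng.randrange(2)
        else:
            # Exploitation: pick the arm with the best empirical average.
            arm = max(range(2), key=lambda k: totals[k] / counts[k])

        # A reward is observed only for the selected arm (Bernoulli draw).
        reward = 1.0 if rng.random() < mu[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward

        # Per-round expected loss relative to the currently best arm.
        regret += max(mu) - mu[arm]

    return regret


if __name__ == "__main__":
    print(simulate())
```

Because the empirical averages here never discard old observations, the policy is slow to react once the means swap; a policy designed for changing rewards would need some mechanism for forgetting or restarting, which this sketch deliberately omits.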