{"title":"Bandits with switching costs: T2/3 regret","authors":"O. Dekel, Jian Ding, Tomer Koren, Y. Peres","doi":"10.1145/2591796.2591868","DOIUrl":null,"url":null,"abstract":"We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's T-round minimax regret in this setting is [EQUATION], thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of Θ(√T). The difference between these two rates provides the first indication that learning with bandit feedback can be significantly harder than learning with full information feedback (previous results only showed a different dependence on the number of actions, but not on T.) In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of [EQUATION]. Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is [EQUATION]. The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.","PeriodicalId":123501,"journal":{"name":"Proceedings of the forty-sixth annual ACM symposium on Theory of computing","volume":"47 43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"84","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the forty-sixth annual ACM symposium on Theory of computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2591796.2591868","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 84
Abstract
We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's T-round minimax regret in this setting is Θ̃(T^{2/3}), thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of Θ(√T). The difference between these two rates provides the first indication that learning with bandit feedback can be significantly harder than learning with full-information feedback (previous results only showed a different dependence on the number of actions, but not on T). In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of Θ̃(T^{2/3}). Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is Θ̃(T^{2/3}). The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.
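For intuition about the key construction, here is a minimal sketch of a multi-scale random walk in the spirit the abstract describes. It implements the standard deterministic variant, in which W_t takes a Gaussian step from an ancestor W_{ρ(t)} with ρ(t) = t − 2^{δ(t)}, where δ(t) is the number of trailing zeros of t in binary. The abstract's construction is a randomized refinement of this idea; the function name and parameters below are illustrative, not taken from the paper.

```python
import numpy as np

def multiscale_random_walk(T, sigma=1.0, seed=None):
    """Sample W_1, ..., W_T for a deterministic multi-scale walk (illustrative,
    not the paper's randomized construction).

    Each W_t = W_{rho(t)} + xi_t, where the parent rho(t) = t - 2^{delta(t)}
    and delta(t) is the number of trailing zeros in the binary expansion of t,
    and xi_t are i.i.d. N(0, sigma^2) increments.
    """
    rng = np.random.default_rng(seed)
    W = np.zeros(T + 1)  # W[0] = 0 serves as the root of the ancestor tree
    for t in range(1, T + 1):
        delta = (t & -t).bit_length() - 1  # number of trailing zeros of t
        parent = t - (1 << delta)          # rho(t) = t - 2^{delta(t)}
        W[t] = W[parent] + sigma * rng.standard_normal()
    return W[1:]

# Example: a walk of length 1024; each step depends on only O(log T) increments.
walk = multiscale_random_walk(1024, seed=0)
```

The point of the multi-scale parent structure is that every time step has at most O(log T) ancestors (stripping the lowest set bit of t at each hop) and any given point in time is crossed by at most one parent edge per scale, hence O(log T) in total. The walk can therefore fluctuate substantially while each individual observation reveals only a few of its Gaussian increments.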