Bandits with switching costs: T^{2/3} regret

O. Dekel, Jian Ding, Tomer Koren, Y. Peres
{"title":"有转换成本的强盗:T2/3后悔","authors":"O. Dekel, Jian Ding, Tomer Koren, Y. Peres","doi":"10.1145/2591796.2591868","DOIUrl":null,"url":null,"abstract":"We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's T-round minimax regret in this setting is [EQUATION], thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of Θ(√T). The difference between these two rates provides the first indication that learning with bandit feedback can be significantly harder than learning with full information feedback (previous results only showed a different dependence on the number of actions, but not on T.) In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of [EQUATION]. Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is [EQUATION]. The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.","PeriodicalId":123501,"journal":{"name":"Proceedings of the forty-sixth annual ACM symposium on Theory of computing","volume":"47 43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"84","resultStr":"{\"title\":\"Bandits with switching costs: T2/3 regret\",\"authors\":\"O. Dekel, Jian Ding, Tomer Koren, Y. Peres\",\"doi\":\"10.1145/2591796.2591868\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's T-round minimax regret in this setting is [EQUATION], thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of Θ(√T). The difference between these two rates provides the first indication that learning with bandit feedback can be significantly harder than learning with full information feedback (previous results only showed a different dependence on the number of actions, but not on T.) In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of [EQUATION]. Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is [EQUATION]. 
The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.\",\"PeriodicalId\":123501,\"journal\":{\"name\":\"Proceedings of the forty-sixth annual ACM symposium on Theory of computing\",\"volume\":\"47 43 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-10-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"84\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the forty-sixth annual ACM symposium on Theory of computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2591796.2591868\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the forty-sixth annual ACM symposium on Theory of computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2591796.2591868","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 84

Abstract

We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's T-round minimax regret in this setting is Θ̃(T^{2/3}), thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of Θ(√T). The difference between these two rates provides the first indication that learning with bandit feedback can be significantly harder than learning with full information feedback (previous results only showed a different dependence on the number of actions, but not on T). In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of Θ̃(T^{2/3}). Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is Θ̃(T^{2/3}). The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.
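Concretely, with k actions and per-round losses ℓ_t : {1, …, k} → [0, 1], the quantity being bounded is the player's cumulative loss plus the number of switches, measured against the best fixed action in hindsight (the standard switching-cost regret; the notation below is ours, not copied from the paper):

$$\mathrm{Regret}_T = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(x_t) + \sum_{t=2}^{T} \mathbb{1}\{x_t \neq x_{t-1}\}\right] - \min_{1 \le x \le k} \sum_{t=1}^{T} \ell_t(x).$$

The lower bound is driven by loss sequences built from the multi-scale random walk. The sketch below is a minimal Python rendering of one natural instantiation, assuming the parent function ρ(t) = t − 2^{δ(t)}, with δ(t) the exponent of the largest power of two dividing t; the variance σ, gap ε, and two-action setup are illustrative placeholders rather than the paper's tuned parameters.

```python
import numpy as np

def delta(t: int) -> int:
    """Exponent of the largest power of two dividing t (t >= 1)."""
    return (t & -t).bit_length() - 1

def multi_scale_random_walk(T: int, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Sample W_1, ..., W_T with W_t = W_{rho(t)} + xi_t, where
    rho(t) = t - 2**delta(t) and xi_t ~ N(0, sigma^2) i.i.d.

    Every W_t is a sum of at most O(log T) increments, so the walk
    looks "fresh" at every time scale simultaneously -- the property
    the lower-bound argument exploits.
    """
    W = np.zeros(T + 1)                      # W[0] = 0 is the root
    xi = rng.normal(0.0, sigma, size=T + 1)  # xi[0] is unused
    for t in range(1, T + 1):
        W[t] = W[t - (1 << delta(t))] + xi[t]
    return W[1:]

# Illustrative two-action loss sequence: both actions track the walk,
# and a hidden "good" action enjoys a small constant advantage eps.
rng = np.random.default_rng(0)
T, sigma, eps = 10_000, 0.05, 0.01          # placeholder values
W = multi_scale_random_walk(T, sigma, rng)
loss_baseline = np.clip(0.5 + W, 0.0, 1.0)
loss_good = np.clip(0.5 + W - eps, 0.0, 1.0)
```

Because consecutive losses are strongly correlated at every scale, a player can only detect the ε advantage by switching actions frequently, and each switch costs a unit; balancing these two forces is what yields the Θ̃(T^{2/3}) rate.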