{"title":"Power-of-2-arms for bandit learning with switching costs","authors":"Ming Shi, Xiaojun Lin, Lei Jiao","doi":"10.1145/3492866.3549720","DOIUrl":null,"url":null,"abstract":"Motivated by edge computing with artificial intelligence, in this paper we study a bandit-learning problem with switching costs. Existing results in the literature either incur [EQUATION] regret with bandit feedback, or rely on free full-feedback in order to reduce the regret to [EQUATION]. In contrast, we expand our study to incorporate two new factors. First, full feedback could incur a cost. Second, the player may choose 2 (or more) arms at a time, in which case she is free to use any one of the chosen arms to calculate loss, and switching costs are incurred only when she changes the set of chosen arms. For the setting where the player pulls only one arm at a time, our new regret lower-bound shows that, even when costly full-feedback is added, the [EQUATION] regret still cannot be improved. However, the dependence on the number of arms may be improved when the full-feedback cost is small. In contrast, for the setting where the player can choose 2 (or more) arms at a time, we provide a novel online learning algorithm that achieves a lower [EQUATION] regret. Further, our new algorithm does not need any full feedback at all. This sharp difference therefore reveals the surprising power of choosing 2 (or more) arms for this type of bandit-learning problems with switching costs. Both our new algorithm and regret analysis involve several new ideas, which may be of independent interest.","PeriodicalId":335155,"journal":{"name":"Proceedings of the Twenty-Third International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Twenty-Third International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3492866.3549720","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
Motivated by edge computing with artificial intelligence, in this paper we study a bandit-learning problem with switching costs. Existing results in the literature either incur $\tilde{\Theta}(T^{2/3})$ regret with bandit feedback, or rely on free full feedback to reduce the regret to $\tilde{O}(\sqrt{T})$. In contrast, we expand our study to incorporate two new factors. First, full feedback could incur a cost. Second, the player may choose 2 (or more) arms at a time, in which case she is free to use any one of the chosen arms to calculate the loss, and switching costs are incurred only when she changes the set of chosen arms. For the setting where the player pulls only one arm at a time, our new regret lower bound shows that, even when costly full feedback is added, the $\tilde{\Theta}(T^{2/3})$ regret still cannot be improved. However, the dependence on the number of arms may be improved when the full-feedback cost is small. In contrast, for the setting where the player can choose 2 (or more) arms at a time, we provide a novel online learning algorithm that achieves a lower $\tilde{O}(\sqrt{T})$ regret. Further, our new algorithm does not need any full feedback at all. This sharp difference therefore reveals the surprising power of choosing 2 (or more) arms for this type of bandit-learning problem with switching costs. Both our new algorithm and our regret analysis involve several new ideas, which may be of independent interest.
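To make the 2-arm setting concrete, the following is a minimal Python sketch of the interaction protocol the abstract describes: the player maintains a set of 2 arms, uses one of them to incur (and observe) a loss under bandit feedback, and pays a switching cost only when the chosen set itself changes. This is not the paper's algorithm; the batched switching schedule, the cost value, and the i.i.d. loss model are illustrative assumptions only.

```python
import numpy as np

# Hypothetical sketch of the two-arm bandit setting with switching costs.
# Assumptions (not from the paper): K, T, SWITCH_COST, random losses,
# and a naive batched update rule in place of the paper's algorithm.

rng = np.random.default_rng(0)
K = 5              # number of arms (assumed)
T = 1000           # time horizon (assumed)
SWITCH_COST = 1.0  # cost paid only when the chosen *set* changes (assumed value)

losses = rng.random((T, K))   # stand-in for an adversarial loss sequence

total_cost = 0.0
chosen_set = {0, 1}           # current set of 2 chosen arms
for t in range(T):
    # A real learner would update chosen_set adaptively; here we switch
    # only at batch boundaries to keep the switching cost visible.
    if t % 100 == 0 and t > 0:
        new_set = set(rng.choice(K, size=2, replace=False).tolist())
        if new_set != chosen_set:
            total_cost += SWITCH_COST  # no cost if the set is unchanged
        chosen_set = new_set
    used_arm = min(chosen_set)          # free to use either chosen arm
    total_cost += losses[t, used_arm]   # bandit feedback: only this loss is seen

# Regret is measured against the best single fixed arm in hindsight.
best_fixed = losses.sum(axis=0).min()
print("regret:", total_cost - best_fixed)
```

The point of the sketch is the accounting: because the loss may come from either arm in the chosen set, the learner can alternate within the set without ever triggering the switching cost, which is the structural advantage the abstract attributes to choosing 2 (or more) arms.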