Multi-armed Bandit Learning on a Graph

Tianpeng Zhang, Kasper Johansson, Na Li
{"title":"Multi-armed Bandit Learning on a Graph","authors":"Tianpeng Zhang, Kasper Johansson, Na Li","doi":"10.1109/CISS56502.2023.10089744","DOIUrl":null,"url":null,"abstract":"The multi-armed bandit(MAB) problem is a simple yet powerful framework that has been extensively studied in the context of decision-making under uncertainty. In many real-world applications, such as robotic applications, selecting an arm corresponds to a physical action that constrains the choices of the next available arms (actions). Motivated by this, we study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes. The graph defines the agent's freedom in selecting the next available nodes at each step. We assume the graph structure is fully available, but the reward distributions are unknown. Built upon an offline graph-based planning algorithm and the principle of optimism, we design a learning algorithm, G-UCB, that balances long-term exploration-exploitation using the principle of optimism. We show that our proposed algorithm achieves ${O}(\\sqrt{\\vert S\\vert T\\log(T)}+D\\vert S\\vert \\log T)$ learning regret, where $\\vert S\\vert$ is the number of nodes and $D$ is the diameter of the graph, which matches the theoretical lower bound $\\Omega(\\sqrt{\\vert S\\vert T})$ up to logarithmic factors. To our knowledge, this result is among the first tight regret bounds in non-episodic, un-discounted learning problems with known deterministic transitions. Numerical experiments confirm that our algorithm outperforms several benchmarks.","PeriodicalId":243775,"journal":{"name":"2023 57th Annual Conference on Information Sciences and Systems (CISS)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 57th Annual Conference on Information Sciences and Systems (CISS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CISS56502.2023.10089744","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The multi-armed bandit (MAB) problem is a simple yet powerful framework that has been extensively studied in the context of decision-making under uncertainty. In many real-world applications, such as robotics, selecting an arm corresponds to a physical action that constrains the choice of the next available arms (actions). Motivated by this, we study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes. The graph defines the agent's freedom in selecting the next available nodes at each step. We assume the graph structure is fully known, but the reward distributions are unknown. Building on an offline graph-based planning algorithm and the principle of optimism, we design a learning algorithm, G-UCB, that balances long-term exploration and exploitation. We show that our proposed algorithm achieves $O(\sqrt{\vert S\vert T\log T}+D\vert S\vert \log T)$ learning regret, where $\vert S\vert$ is the number of nodes and $D$ is the diameter of the graph, matching the theoretical lower bound $\Omega(\sqrt{\vert S\vert T})$ up to logarithmic factors. To our knowledge, this result is among the first tight regret bounds for non-episodic, undiscounted learning problems with known deterministic transitions. Numerical experiments confirm that our algorithm outperforms several benchmarks.
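The abstract does not spell out G-UCB's internals, but the optimism principle it names is easy to illustrate. Below is a minimal, hypothetical Python sketch of an optimism-driven graph-bandit loop: the agent keeps a UCB index per node, moves one step along a shortest path toward the most optimistic node, and collects the reward where it lands. The graph, reward means, and confidence-bonus constant are all illustrative assumptions; this is a sketch of the general idea, not the authors' exact algorithm.

```python
import math
import random
from collections import deque

# Hedged sketch of optimism-based learning on a graph bandit: known graph
# structure, unknown reward distributions. Not the paper's exact G-UCB.

def bfs_next_step(adj, start, goal):
    """Return the first node on a shortest path from start to goal (BFS)."""
    if start == goal:
        return start
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                if v == goal:
                    # Walk back until we reach the node adjacent to start.
                    while parent[v] != start:
                        v = parent[v]
                    return v
                queue.append(v)
    raise ValueError("goal unreachable from start")

def graph_ucb(adj, sample_reward, horizon, start):
    """Optimistic exploration: head toward the node with the highest
    UCB index, sampling the reward of each node visited on the way."""
    counts = {s: 0 for s in adj}
    means = {s: 0.0 for s in adj}
    node, total = start, 0.0
    for t in range(1, horizon + 1):
        # UCB index: empirical mean plus a confidence bonus;
        # unvisited nodes get a maximally optimistic index.
        def index(s):
            if counts[s] == 0:
                return float("inf")
            return means[s] + math.sqrt(2.0 * math.log(t) / counts[s])
        target = max(adj, key=index)
        # Move one step along a shortest path toward the optimistic target
        # (or stay put and keep sampling if we are already there).
        node = bfs_next_step(adj, node, target)
        r = sample_reward(node)
        counts[node] += 1
        means[node] += (r - means[node]) / counts[node]
        total += r
    return total, means

if __name__ == "__main__":
    # A small cycle graph with one high-reward node (all values assumed).
    adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    true_means = {0: 0.1, 1: 0.3, 2: 0.9, 3: 0.2}
    reward = lambda s: random.gauss(true_means[s], 0.1)
    total, est = graph_ucb(adj, reward, horizon=2000, start=0)
    est_rounded = {s: round(m, 2) for s, m in est.items()}
    print(f"total reward: {total:.1f}, estimated means: {est_rounded}")
```

Even this simplified loop exposes the cost the graph imposes relative to a standard bandit: reaching an optimistic node can take up to $D$ steps of travel, which is plausibly where the additive $D\vert S\vert \log T$ term in the paper's regret bound arises.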