Game-Theoretic Bandits for Network Optimization With High-Probability Swap-Regret Upper Bounds

IF 3 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE/ACM Transactions on Networking Pub Date : 2024-08-26 DOI:10.1109/TNET.2024.3444593

Zhiming Huang;Jianping Pan

Volume 32, Issue 6, pp. 4855-4870. Full text: https://ieeexplore.ieee.org/document/10645817/

Citations: 0

Abstract

In this paper, we study a multi-agent bandit problem in an unknown general-sum game repeated for a number of rounds (i.e., learning in a black-box game with bandit feedback), where the agents have no information about the underlying game structure and cannot observe each other’s actions or rewards. In each round, each agent plays an arm (i.e., action) from her own, possibly different, arm set (i.e., action set) and receives only the reward of the played arm, which is affected by the other agents’ actions. The objective of each agent is to minimize her own cumulative swap regret, a generic performance measure for online learning algorithms. Many network optimization problems, such as wireless medium access control and end-to-end congestion control, can be cast within the framework of this multi-agent bandit problem. We propose an online-mirror-descent-based algorithm and provide near-optimal high-probability swap-regret upper bounds based on refined martingale analyses; these bounds also bound the expected swap regret, rather than the pseudo-regret studied in the literature. Moreover, the high-probability bounds guarantee that correlated equilibria can be reached in a polynomial number of rounds if all agents play the algorithm. To assess the performance of the proposed algorithm, we conducted numerical experiments in the context of wireless medium access control and emulation experiments, implementing the algorithm in the Linux kernel, for end-to-end congestion control.
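For readers unfamiliar with the performance measure named in the abstract, the following is a minimal sketch of swap regret under its standard textbook definition: the largest extra reward an agent could have collected in hindsight by consistently replacing each arm she played with some other arm. The paper's exact formulation may differ. The function name `swap_regret` and the toy data are hypothetical, and the sketch assumes the full reward table is known in hindsight, which a bandit agent never observes.

```python
from typing import Sequence


def swap_regret(plays: Sequence[int], rewards: Sequence[Sequence[float]]) -> float:
    """Empirical swap regret of a play sequence (standard definition).

    plays[t]      -- index of the arm played in round t
    rewards[t][j] -- reward arm j would have earned in round t (hindsight only)
    """
    if len(plays) != len(rewards):
        raise ValueError("plays and rewards must cover the same rounds")
    if not plays:
        return 0.0
    n_arms = len(rewards[0])
    # gain[i][j]: total extra reward had every play of arm i been swapped for arm j.
    gain = [[0.0] * n_arms for _ in range(n_arms)]
    for arm, round_rewards in zip(plays, rewards):
        for j in range(n_arms):
            gain[arm][j] += round_rewards[j] - round_rewards[arm]
    # The best swap function picks, independently for each arm i, the best target j.
    # gain[i][i] is 0, so each term is non-negative.
    return sum(max(row) for row in gain)


# Hypothetical two-arm example over three rounds.
toy_plays = [0, 0, 1]
toy_rewards = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
print(swap_regret(toy_plays, toy_rewards))  # 3.0
```

In this toy run the best fixed arm in hindsight earns 2, so the external regret is 2, while swapping arm 0 for arm 1 and arm 1 for arm 0 earns 3. Swap regret is therefore the stronger benchmark, which is why driving it down for every agent is tied to correlated equilibria.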


Source journal

IEEE/ACM Transactions on Networking (Engineering & Technology: Telecommunications)

CiteScore

8.20

Self-citation rate

5.40%

Articles published

246

Review time

4-8 weeks

About the journal: The IEEE/ACM Transactions on Networking’s high-level objective is to publish high-quality, original research results derived from theoretical or experimental exploration of the area of communication/computer networking, covering all sorts of information transport networks over all sorts of physical layer technologies, both wireline (all kinds of guided media: e.g., copper, optical) and wireless (e.g., radio-frequency, acoustic (e.g., underwater), infra-red), or hybrids of these. The journal welcomes applied contributions reporting on novel experiences and experiments with actual systems.