Best-of-Both-Worlds Algorithms for Partial Monitoring

Taira Tsuchiya, Shinji Ito, Junya Honda
{"title":"Best-of-Both-Worlds Algorithms for Partial Monitoring","authors":"Taira Tsuchiya, Shinji Ito, J. Honda","doi":"10.48550/arXiv.2207.14550","DOIUrl":null,"url":null,"abstract":"This study considers the partial monitoring problem with $k$-actions and $d$-outcomes and provides the first best-of-both-worlds algorithms, whose regrets are favorably bounded both in the stochastic and adversarial regimes. In particular, we show that for non-degenerate locally observable games, the regret is $O(m^2 k^4 \\log(T) \\log(k_{\\Pi} T) / \\Delta_{\\min})$ in the stochastic regime and $O(m k^{2/3} \\sqrt{T \\log(T) \\log k_{\\Pi}})$ in the adversarial regime, where $T$ is the number of rounds, $m$ is the maximum number of distinct observations per action, $\\Delta_{\\min}$ is the minimum suboptimality gap, and $k_{\\Pi}$ is the number of Pareto optimal actions. Moreover, we show that for globally observable games, the regret is $O(c_{\\mathcal{G}}^2 \\log(T) \\log(k_{\\Pi} T) / \\Delta_{\\min}^2)$ in the stochastic regime and $O((c_{\\mathcal{G}}^2 \\log(T) \\log(k_{\\Pi} T))^{1/3} T^{2/3})$ in the adversarial regime, where $c_{\\mathcal{G}}$ is a game-dependent constant. We also provide regret bounds for a stochastic regime with adversarial corruptions. Our algorithms are based on the follow-the-regularized-leader framework and are inspired by the approach of exploration by optimization and the adaptive learning rate in the field of online learning with feedback graphs.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Algorithmic Learning Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2207.14550","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

This study considers the partial monitoring problem with $k$ actions and $d$ outcomes and provides the first best-of-both-worlds algorithms, whose regret is favorably bounded in both the stochastic and adversarial regimes. In particular, we show that for non-degenerate locally observable games, the regret is $O(m^2 k^4 \log(T) \log(k_{\Pi} T) / \Delta_{\min})$ in the stochastic regime and $O(m k^{2/3} \sqrt{T \log(T) \log k_{\Pi}})$ in the adversarial regime, where $T$ is the number of rounds, $m$ is the maximum number of distinct observations per action, $\Delta_{\min}$ is the minimum suboptimality gap, and $k_{\Pi}$ is the number of Pareto optimal actions. Moreover, we show that for globally observable games, the regret is $O(c_{\mathcal{G}}^2 \log(T) \log(k_{\Pi} T) / \Delta_{\min}^2)$ in the stochastic regime and $O((c_{\mathcal{G}}^2 \log(T) \log(k_{\Pi} T))^{1/3} T^{2/3})$ in the adversarial regime, where $c_{\mathcal{G}}$ is a game-dependent constant. We also provide regret bounds for a stochastic regime with adversarial corruptions. Our algorithms are based on the follow-the-regularized-leader (FTRL) framework, and are inspired by the exploration-by-optimization approach and by adaptive learning rates from the literature on online learning with feedback graphs.
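To make the algorithmic ingredients of the last sentence concrete, the sketch below shows FTRL with the negative Shannon entropy regularizer and a decreasing learning rate over $k$ actions. This is only a minimal illustration, not the paper's algorithm: it omits exploration by optimization and the partial-monitoring loss estimators, and it assumes (unrealistically for this setting) that the full loss vector is observed each round. The function names and the schedule $\eta_t = \sqrt{\log(k)/t}$ are illustrative assumptions.

```python
import numpy as np

def ftrl_distribution(cum_loss, eta):
    """Closed-form FTRL step with negative Shannon entropy:
    argmin_{p in simplex} <p, cum_loss> + (1/eta) * sum_i p_i log p_i
    = softmax(-eta * cum_loss)."""
    z = -eta * (cum_loss - cum_loss.min())  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def run_toy_ftrl(losses):
    """Toy full-information driver over a (T, k) array of loss vectors.
    In actual partial monitoring the learner only sees a feedback symbol
    and must build loss estimates from it (NOT shown here)."""
    rng = np.random.default_rng(0)
    T, k = losses.shape
    cum_loss = np.zeros(k)
    total = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(k) / t)        # illustrative decreasing schedule
        p = ftrl_distribution(cum_loss, eta)
        a = rng.choice(k, p=p)              # play a sampled action
        total += losses[t - 1, a]
        cum_loss += losses[t - 1]           # full-information update (toy)
    return total

if __name__ == "__main__":
    losses = np.random.default_rng(1).random((1000, 5))
    print(run_toy_ftrl(losses))
```

With the negative-entropy regularizer, the FTRL minimizer over the probability simplex has the closed-form softmax used above; in the paper's setting, the observed loss vector would be replaced by an unbiased estimate built from the game's feedback structure, and the learning rate would adapt to observed quantities rather than follow a fixed schedule.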