{"title":"改进的具有时变反馈图的对抗性强盗的高概率后悔","authors":"Haipeng Luo, Hanghang Tong, Mengxiao Zhang, Yuheng Zhang","doi":"10.48550/arXiv.2210.01376","DOIUrl":null,"url":null,"abstract":"We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\\widetilde{\\mathcal{O}}((\\sum_{t=1}^T\\alpha_t)^{1/2}+\\max_{t\\in[T]}\\alpha_t)$ with high probability, where $\\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015] which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\\mathcal{O}(\\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs\",\"authors\":\"Haipeng Luo, Hanghang Tong, Mengxiao Zhang, Yuheng Zhang\",\"doi\":\"10.48550/arXiv.2210.01376\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\\\\widetilde{\\\\mathcal{O}}((\\\\sum_{t=1}^T\\\\alpha_t)^{1/2}+\\\\max_{t\\\\in[T]}\\\\alpha_t)$ with high probability, where $\\\\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015] which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\\\\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\\\\mathcal{O}(\\\\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. 
Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.\",\"PeriodicalId\":267197,\"journal\":{\"name\":\"International Conference on Algorithmic Learning Theory\",\"volume\":\"31 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Algorithmic Learning Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2210.01376\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Algorithmic Learning Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2210.01376","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015], which only considers graphs where every node has a self-loop, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence, which can be prohibitively large for applications such as contextual bandits. Furthermore, we develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\mathcal{O}(\sqrt{KT})$ term via a refined analysis. Our algorithms are built on the online mirror descent framework, but importantly combine several techniques in an innovative way. Notably, while earlier works use optimistic biased loss estimators to achieve high-probability bounds, we find it important to use a pessimistic one for nodes without self-loops in a strongly observable graph.
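To make the quantities in the abstract concrete, below is a minimal Python sketch of two ingredients it refers to: a greedy estimate of the independence number $\alpha_t$ of a feedback graph, and one round of an entropic online-mirror-descent update driven by biased importance-weighted loss estimators (optimistic for nodes with self-loops, in the spirit of implicit exploration [Neu, 2015], and pessimistic for nodes without them). The estimator forms, the `gamma` bias parameter, and the function names are illustrative assumptions, not the paper's exact construction.

```python
# A minimal sketch, assuming numpy; the estimator forms below are
# illustrative and only match the *direction* of the biases described
# in the abstract, not the paper's exact algorithm.
import numpy as np

rng = np.random.default_rng(0)

def greedy_max_independent_set(G):
    """Size of a greedily built independent set of an undirected
    feedback graph (0/1 adjacency matrix). Computing the exact
    independence number alpha is NP-hard; the greedy set is
    independent, so this is a lower bound on alpha."""
    alive = set(range(G.shape[0]))
    size = 0
    while alive:
        # take a minimum-degree vertex among the remaining ones
        v = min(alive, key=lambda u: sum(G[u, w] for w in alive))
        size += 1
        alive -= {v} | {w for w in alive if G[v, w]}
    return size

def omd_graph_bandit_round(p, G, losses, eta=0.1, gamma=0.01):
    """One round of entropic OMD (exponential weights) with graph
    feedback: playing arm i reveals the loss of every j with G[i, j] = 1.
    Nodes with a self-loop get an optimistic (implicit-exploration)
    estimator; nodes without one get a pessimistic estimator."""
    K = len(p)
    it = rng.choice(K, p=p)
    observed = G[it].astype(bool)    # losses revealed this round
    W = G.T @ p                      # W[i] = P(loss of arm i is observed)
    lhat = np.zeros(K)
    for i in np.flatnonzero(observed):
        if G[i, i]:                  # self-loop: bias the estimate downward
            lhat[i] = losses[i] / (W[i] + gamma)
        else:                        # no self-loop: bias it upward
            lhat[i] = losses[i] / max(W[i] - gamma, gamma)
    p_new = p * np.exp(-eta * lhat)  # mirror-descent step, entropy regularizer
    return p_new / p_new.sum(), it
```

For example, on a complete graph with self-loops (full information), every loss is always observed, so the update reduces to exponential weights and the greedy routine correctly reports $\alpha = 1$:

```python
K = 4
G = np.ones((K, K), dtype=int)
p = np.ones(K) / K
for _ in range(100):
    p, _ = omd_graph_bandit_round(p, G, rng.uniform(size=K))
print(greedy_max_independent_set(G))  # prints 1
```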