Online Learning with Off-Policy Feedback

Germano Gabbianelli, M. Papini, Gergely Neu
{"title":"在线学习与非政策反馈","authors":"Germano Gabbianelli, M. Papini, Gergely Neu","doi":"10.48550/arXiv.2207.08956","DOIUrl":null,"url":null,"abstract":"We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback. In this sequential decision making problem, the learner cannot directly observe its rewards, but instead sees the ones obtained by another unknown policy run in parallel (behavior policy). Instead of a standard exploration-exploitation dilemma, the learner has to face another challenge in this setting: due to limited observations outside of their control, the learner may not be able to estimate the value of each policy equally well. To address this issue, we propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior policy, achieving improved performance against comparators that are well-covered by the observations. We also provide an extension to the setting of adversarial linear contextual bandits, and verify the theoretical guarantees via a set of experiments. Our key algorithmic idea is adapting the notion of pessimistic reward estimators that has been recently popular in the context of off-policy reinforcement learning.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Online Learning with Off-Policy Feedback\",\"authors\":\"Germano Gabbianelli, M. Papini, Gergely Neu\",\"doi\":\"10.48550/arXiv.2207.08956\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback. In this sequential decision making problem, the learner cannot directly observe its rewards, but instead sees the ones obtained by another unknown policy run in parallel (behavior policy). Instead of a standard exploration-exploitation dilemma, the learner has to face another challenge in this setting: due to limited observations outside of their control, the learner may not be able to estimate the value of each policy equally well. To address this issue, we propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior policy, achieving improved performance against comparators that are well-covered by the observations. We also provide an extension to the setting of adversarial linear contextual bandits, and verify the theoretical guarantees via a set of experiments. 
Our key algorithmic idea is adapting the notion of pessimistic reward estimators that has been recently popular in the context of off-policy reinforcement learning.\",\"PeriodicalId\":267197,\"journal\":{\"name\":\"International Conference on Algorithmic Learning Theory\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Algorithmic Learning Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2207.08956\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Algorithmic Learning Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2207.08956","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback. In this sequential decision making problem, the learner cannot directly observe its rewards, but instead sees the ones obtained by another unknown policy run in parallel (the behavior policy). Instead of a standard exploration-exploitation dilemma, the learner has to face another challenge in this setting: due to limited observations outside of their control, the learner may not be able to estimate the value of each policy equally well. To address this issue, we propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior policy, achieving improved performance against comparators that are well covered by the observations. We also provide an extension to the setting of adversarial linear contextual bandits, and verify the theoretical guarantees via a set of experiments. Our key algorithmic idea is adapting the notion of pessimistic reward estimators that has recently been popular in the context of off-policy reinforcement learning.
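The abstract does not spell out the estimator itself, but as a rough illustration of the pessimistic reward-estimation idea in the simplest multi-armed bandit setting, the sketch below forms an importance-weighted estimate of the reward observed for the behavior policy's action and adds a small constant to the denominator, so that rewards of poorly covered actions are under- rather than over-estimated. The function names, the `gamma` parameter, and the Exp3-style update are illustrative assumptions, not the algorithm from the paper.

```python
import numpy as np

def pessimistic_iw_estimate(reward, observed_action, behavior_probs, n_actions, gamma=0.1):
    """Importance-weighted reward estimate with a pessimistic bias.

    The learner only observes the reward of the action chosen by the behavior
    policy. Dividing by (behavior probability + gamma) systematically
    under-estimates rewards of actions the behavior policy rarely plays,
    which is one simple way to instantiate a pessimistic estimator.
    """
    r_hat = np.zeros(n_actions)
    r_hat[observed_action] = reward / (behavior_probs[observed_action] + gamma)
    return r_hat

def exp3_update(weights, r_hat, eta=0.05):
    """Exponential-weights (Exp3-style) update on the estimated rewards."""
    weights = weights * np.exp(eta * r_hat)
    return weights / weights.sum()

# Toy run: a uniform behavior policy generates the off-policy feedback.
rng = np.random.default_rng(0)
n_actions, T = 5, 1000
true_means = rng.uniform(size=n_actions)
behavior_probs = np.full(n_actions, 1.0 / n_actions)
weights = np.ones(n_actions) / n_actions

for t in range(T):
    b_t = rng.choice(n_actions, p=behavior_probs)   # action taken by the behavior policy
    reward = rng.binomial(1, true_means[b_t])        # only this reward is observed
    r_hat = pessimistic_iw_estimate(reward, b_t, behavior_probs, n_actions)
    weights = exp3_update(weights, r_hat)

print("learned policy:", np.round(weights, 3))
print("best action   :", int(np.argmax(true_means)))
```

In this toy run the behavior policy is uniform, so every action is equally well covered; the interesting regime described in the abstract is uneven coverage, where the pessimistic bias keeps the learner from over-valuing actions it rarely receives feedback on.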