Optimistic PAC Reinforcement Learning: the Instance-Dependent View

Andrea Tirinzoni, Aymen Al Marjani, E. Kaufmann
{"title":"乐观PAC强化学习:实例依赖的观点","authors":"Andrea Tirinzoni, Aymen Al Marjani, E. Kaufmann","doi":"10.48550/arXiv.2207.05852","DOIUrl":null,"url":null,"abstract":"Optimistic algorithms have been extensively studied for regret minimization in episodic tabular MDPs, both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2021) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal. On the technical side, our analysis is very simple thanks to a new\"target trick\"of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"83 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Optimistic PAC Reinforcement Learning: the Instance-Dependent View\",\"authors\":\"Andrea Tirinzoni, Aymen Al Marjani, E. Kaufmann\",\"doi\":\"10.48550/arXiv.2207.05852\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Optimistic algorithms have been extensively studied for regret minimization in episodic tabular MDPs, both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2021) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal. On the technical side, our analysis is very simple thanks to a new\\\"target trick\\\"of independent interest. 
We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.\",\"PeriodicalId\":267197,\"journal\":{\"name\":\"International Conference on Algorithmic Learning Theory\",\"volume\":\"83 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Algorithmic Learning Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2207.05852\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Algorithmic Learning Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2207.05852","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Optimistic algorithms have been extensively studied for regret minimization in episodic tabular MDPs, both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2021) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal. On the technical side, our analysis is very simple thanks to a new "target trick" of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.
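
The abstract fixes no notation, so as context, here is a standard formalization of the PAC RL objective it refers to; the symbols below are assumptions following the usual episodic tabular setup, not definitions taken from the paper. In an MDP with horizon $H$ and initial state $s_1$, an algorithm that stops after $\tau$ episodes and recommends a policy $\hat{\pi}$ is $(\varepsilon, \delta)$-PAC if

\[ \mathbb{P}\big( V_1^{\star}(s_1) - V_1^{\hat{\pi}}(s_1) \le \varepsilon \big) \ge 1 - \delta . \]

The instance-dependent question is then how the (expected) number of episodes $\mathbb{E}[\tau]$ scales with per-instance quantities such as the classical value gaps

\[ \Delta_h(s, a) = V_h^{\star}(s) - Q_h^{\star}(s, a) , \]

which are the "value gaps that appear in prior work" against which the paper's refined sub-optimality gaps are compared.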