Curious Explorer: A Provable Exploration Strategy in Policy Learning

Marco Miani, Maurizio Parton, Marco Romito
{"title":"Curious Explorer: A Provable Exploration Strategy in Policy Learning","authors":"Marco Miani;Maurizio Parton;Marco Romito","doi":"10.1109/TPAMI.2024.3460972","DOIUrl":null,"url":null,"abstract":"A coverage assumption is critical with policy gradient methods, because while the objective function is insensitive to updates in unlikely states, the agent may need improvements in those states to reach a nearly optimal payoff. However, this assumption can be unfeasible in certain environments, for instance in online learning, or when restarts are possible only from a fixed initial state. In these cases, classical policy gradient algorithms like REINFORCE can have poor convergence properties and sample efficiency. Curious Explorer is an iterative state space pure exploration strategy improving coverage of any restart distribution \n<inline-formula><tex-math>$\\rho$</tex-math></inline-formula>\n. Using \n<inline-formula><tex-math>$\\rho$</tex-math></inline-formula>\n and intrinsic rewards, Curious Explorer produces a sequence of policies, each one more exploratory than the previous one, and outputs a restart distribution with coverage based on the state visitation distribution of the exploratory policies. This paper main results are a theoretical upper bound on how often an optimal policy visits poorly visited states, and a bound on the error of the return obtained by REINFORCE without any coverage assumption. Finally, we conduct ablation studies with \n<monospace>REINFORCE</monospace>\n and \n<monospace>TRPO</monospace>\n in two hard-exploration tasks, to support the claim that Curious Explorer can improve the performance of very different policy gradient algorithms.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11422-11431"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10680592/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

A coverage assumption is critical with policy gradient methods, because while the objective function is insensitive to updates in unlikely states, the agent may need improvements in those states to reach a nearly optimal payoff. However, this assumption can be unfeasible in certain environments, for instance in online learning, or when restarts are possible only from a fixed initial state. In these cases, classical policy gradient algorithms like REINFORCE can have poor convergence properties and sample efficiency. Curious Explorer is an iterative, state-space pure-exploration strategy that improves the coverage of any restart distribution $\rho$. Using $\rho$ and intrinsic rewards, Curious Explorer produces a sequence of policies, each one more exploratory than the previous one, and outputs a restart distribution whose coverage is based on the state visitation distributions of the exploratory policies. This paper's main results are a theoretical upper bound on how often an optimal policy visits poorly visited states, and a bound on the error of the return obtained by REINFORCE without any coverage assumption. Finally, we conduct ablation studies with REINFORCE and TRPO in two hard-exploration tasks, to support the claim that Curious Explorer can improve the performance of very different policy gradient algorithms.
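The sketch below illustrates the kind of iterative loop the abstract describes, on a small tabular MDP: train a policy against an intrinsic reward that favors rarely visited states, estimate its state visitation distribution, mix that distribution into the restart distribution, and repeat. It is a minimal sketch based only on the abstract; the names (`curious_explorer`, `train_exploratory_policy`), the count-based intrinsic reward, the mixing rule, and the toy policy-improvement subroutine are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

# Minimal sketch of an iterative exploration loop on a toy tabular MDP.
# All specific design choices below are assumptions for illustration,
# not the construction proved correct in the paper.

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 10, 2, 30

# Random transition kernel: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))


def rollout(policy, start_dist, n_episodes=200):
    """Estimate the state visitation distribution of `policy` from `start_dist`."""
    visits = np.zeros(N_STATES)
    for _ in range(n_episodes):
        s = rng.choice(N_STATES, p=start_dist)
        for _ in range(HORIZON):
            visits[s] += 1
            a = rng.choice(N_ACTIONS, p=policy[s])
            s = rng.choice(N_STATES, p=P[s, a])
    return visits / visits.sum()


def train_exploratory_policy(intrinsic_reward, iters=50, lr=0.1):
    """Toy policy improvement toward states with high intrinsic reward
    (stands in for any RL subroutine maximizing the intrinsic return)."""
    logits = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(iters):
        q = P @ intrinsic_reward  # expected one-step intrinsic reward, shape (S, A)
        logits += lr * (q - q.mean(axis=1, keepdims=True))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


def curious_explorer(rho, n_rounds=5, mix=0.5):
    """Iteratively improve the coverage of the restart distribution `rho`."""
    restart = rho.copy()
    cumulative_visits = np.zeros(N_STATES)
    for _ in range(n_rounds):
        # Intrinsic reward: favor states visited rarely so far (assumed form).
        intrinsic = 1.0 / np.sqrt(1.0 + cumulative_visits)
        policy = train_exploratory_policy(intrinsic)
        d_pi = rollout(policy, restart)
        cumulative_visits += d_pi
        # Mix the exploratory visitation distribution into the restart distribution.
        restart = (1 - mix) * restart + mix * d_pi
    return restart


rho = np.zeros(N_STATES)
rho[0] = 1.0                  # restarts only from a fixed initial state
print(curious_explorer(rho))  # restart distribution with improved coverage
```

The output restart distribution can then be handed to a standard policy gradient method (e.g. REINFORCE or TRPO) in place of the original $\rho$, which is the usage pattern the ablation studies in the abstract refer to.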