Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences

Alan Chan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, A. Mahmood, Martha White
Journal: J. Mach. Learn. Res., volume 60 1, pages 253:1-253:79
Published: 2021-07-17
DOI: 10.7939/R3-M4YX-N678 (https://doi.org/10.7939/R3-M4YX-N678)
Citation count: 11

Abstract

Approximate Policy Iteration (API) algorithms alternate between (approximate) policy evaluation and (approximate) greedification. Many different approaches have been explored for approximate policy evaluation, but less is understood about approximate greedification and what choices guarantee policy improvement. In this work, we investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values. In particular, we investigate the difference between the forward and reverse KL divergences, with varying degrees of entropy regularization. We show that the reverse KL has stronger policy improvement guarantees, but that reducing the forward KL can result in a worse policy. We also demonstrate, however, that a large enough reduction of the forward KL can induce improvement under additional assumptions. Empirically, we show on simple continuous-action environments that the forward KL can induce more exploration, but at the cost of a more suboptimal policy. No significant differences were observed in the discrete-action setting or on a suite of benchmark problems. Throughout, we highlight that many policy gradient methods can be seen as an instance of API, with either the forward or reverse KL for the policy update, and discuss next steps for understanding and improving our policy optimization algorithms.
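
As a minimal sketch of the two objectives the abstract contrasts, the snippet below computes the forward KL, KL(Boltzmann || policy), and the reverse KL, KL(policy || Boltzmann), for a single state with three discrete actions. The action values, the temperature `tau`, and the two candidate policies are hypothetical values chosen for illustration; they are not taken from the paper's experiments, and the helper names are assumptions of this sketch.

```python
import numpy as np


def boltzmann(q_values, tau):
    """Boltzmann (softmax) distribution over action values at temperature tau."""
    z = (q_values - np.max(q_values)) / tau  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def forward_kl(policy, q_values, tau):
    """KL(Boltzmann || policy): the expectation is taken under the target."""
    return kl(boltzmann(q_values, tau), policy)


def reverse_kl(policy, q_values, tau):
    """KL(policy || Boltzmann): the expectation is taken under the policy."""
    return kl(policy, boltzmann(q_values, tau))


# Hypothetical example: two equally good actions and one poor action.
q_values = np.array([1.0, 0.0, 1.0])
tau = 0.1

committed = np.array([0.98, 0.01, 0.01])  # puts nearly all mass on one good action
uniform = np.full(3, 1.0 / 3.0)           # spreads mass over every action

for name, pi in [("committed", committed), ("uniform", uniform)]:
    print(name,
          "forward KL:", round(forward_kl(pi, q_values, tau), 3),
          "reverse KL:", round(reverse_kl(pi, q_values, tau), 3))
```

In this toy case the forward KL ranks the uniform policy above the committed one, while the reverse KL ranks them the other way round. That is in line with the abstract's observation that the forward KL can induce more exploration at the cost of a more suboptimal policy, though a single hand-picked state is only an illustration and not evidence for the paper's results.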