Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences

J. Mach. Learn. Res. Pub Date : 2021-07-17 DOI:10.7939/R3-M4YX-N678

Alan Chan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, A. Mahmood, Martha White


Citations: 11

Abstract

Approximate Policy Iteration (API) algorithms alternate between (approximate) policy evaluation and (approximate) greedification. Many different approaches have been explored for approximate policy evaluation, but less is understood about approximate greedification and what choices guarantee policy improvement. In this work, we investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values. In particular, we investigate the difference between the forward and reverse KL divergences, with varying degrees of entropy regularization. We show that the reverse KL has stronger policy improvement guarantees, but that reducing the forward KL can result in a worse policy. We also demonstrate, however, that a large enough reduction of the forward KL can induce improvement under additional assumptions. Empirically, we show on simple continuous-action environments that the forward KL can induce more exploration, but at the cost of a more suboptimal policy. No significant differences were observed in the discrete-action setting or on a suite of benchmark problems. Throughout, we highlight that many policy gradient methods can be seen as an instance of API, with either the forward or reverse KL for the policy update, and discuss next steps for understanding and improving our policy optimization algorithms.
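To make the two objectives concrete, here is a minimal sketch for a single state with a small discrete action set. It assumes the common convention that the reverse KL places the policy first, KL(policy || Boltzmann), and the forward KL places the Boltzmann target first, KL(Boltzmann || policy). The function names, the temperature parameter `tau`, and the example action values are illustrative assumptions, not code from the paper.

```python
import numpy as np

def boltzmann(q_values, tau):
    """Boltzmann (softmax) distribution over action values, temperature tau > 0."""
    z = (q_values - np.max(q_values)) / tau  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def kl(p, q):
    """KL(p || q) for discrete distributions; terms where p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def reverse_kl(policy, q_values, tau):
    # Assumed convention: the policy is the first argument.
    return kl(policy, boltzmann(q_values, tau))

def forward_kl(policy, q_values, tau):
    # Assumed convention: the Boltzmann target is the first argument.
    return kl(boltzmann(q_values, tau), policy)

# Hypothetical action values with two equally good actions.
q_values = np.array([1.0, 0.0, 1.0])
peaked_policy = np.array([0.98, 0.01, 0.01])  # commits to one good action
print(reverse_kl(peaked_policy, q_values, tau=0.5))
print(forward_kl(peaked_policy, q_values, tau=0.5))
```

In this toy case the policy that commits to a single good action is penalized more heavily by the forward KL than by the reverse KL, because the forward KL charges the policy for placing little mass on the other high-value action. This only illustrates the asymmetry between the two divergences; it does not reproduce the paper's experiments or guarantees.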
