Geometry and convergence of natural policy gradient methods

Johannes Müller, Guido Montúfar
{"title":"自然政策梯度方法的几何与收敛性","authors":"Johannes Müller, Guido Montúfar","doi":"10.1007/s41884-023-00106-z","DOIUrl":null,"url":null,"abstract":"Abstract We study the convergence of several natural policy gradient (NPG) methods in infinite-horizon discounted Markov decision processes with regular policy parametrizations. For a variety of NPGs and reward functions we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Kakade and Morimura and co-authors by observing that these arise from the Hessian geometries of conditional entropy and entropy respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions like log-barriers. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. This yields local quadratic convergence rates of these methods for step size equal to the inverse penalization strength.","PeriodicalId":93762,"journal":{"name":"Information geometry","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Geometry and convergence of natural policy gradient methods\",\"authors\":\"Johannes Müller, Guido Montúfar\",\"doi\":\"10.1007/s41884-023-00106-z\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract We study the convergence of several natural policy gradient (NPG) methods in infinite-horizon discounted Markov decision processes with regular policy parametrizations. For a variety of NPGs and reward functions we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Kakade and Morimura and co-authors by observing that these arise from the Hessian geometries of conditional entropy and entropy respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions like log-barriers. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. 
This yields local quadratic convergence rates of these methods for step size equal to the inverse penalization strength.\",\"PeriodicalId\":93762,\"journal\":{\"name\":\"Information geometry\",\"volume\":\"24 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information geometry\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s41884-023-00106-z\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information geometry","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s41884-023-00106-z","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

We study the convergence of several natural policy gradient (NPG) methods in infinite-horizon discounted Markov decision processes with regular policy parametrizations. For a variety of NPGs and reward functions we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Kakade and Morimura and co-authors by observing that these arise from the Hessian geometries of conditional entropy and entropy respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions like log-barriers. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. This yields local quadratic convergence rates of these methods for step size equal to the inverse penalization strength.
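
To make the objects in the abstract concrete, the following is a minimal sketch in our own notation (the symbols below are illustrative and not necessarily those of the paper): a natural policy gradient step preconditions the reward gradient with a metric G(θ); Kakade's choice of G is the Fisher information of the policy averaged over the discounted state distribution, and letting the step size tend to zero yields the gradient flows studied in the paper.

% Minimal sketch (illustrative notation, not necessarily the paper's).
% NPG update for a policy pi_theta with discounted reward R(theta),
% step size eta, and a pseudo-inverted metric G(theta)^+:
\[
  \theta_{k+1} \;=\; \theta_k \;+\; \eta\, G(\theta_k)^{+}\, \nabla_\theta R(\theta_k)
\]
% Kakade's metric: the Fisher information of the policy, averaged over the
% discounted state distribution d^{\pi_\theta}:
\[
  G_{\mathrm{Kakade}}(\theta)
  \;=\;
  \mathbb{E}_{s \sim d^{\pi_\theta}}\,
  \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
  \big[
    \nabla_\theta \log \pi_\theta(a \mid s)\,
    \nabla_\theta \log \pi_\theta(a \mid s)^{\top}
  \big]
\]
% As eta -> 0 the update becomes the flow
% \dot{\theta} = G(\theta)^{+} \nabla_\theta R(\theta); per the abstract,
% Kakade's and Morimura et al.'s metrics arise as Hessians of conditional
% entropy and entropy, respectively, which places these flows in the
% Hessian-geometry framework analyzed in the paper.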