Optimization Landscape of Policy Gradient Methods for Discrete-Time Static Output Feedback

Impact Factor: 9.4 · CAS Region 1 (Computer Science) · JCR Q1, Automation & Control Systems
Jingliang Duan;Jie Li;Xuyang Chen;Kai Zhao;Shengbo Eben Li;Lin Zhao
{"title":"Optimization Landscape of Policy Gradient Methods for Discrete-Time Static Output Feedback","authors":"Jingliang Duan;Jie Li;Xuyang Chen;Kai Zhao;Shengbo Eben Li;Lin Zhao","doi":"10.1109/TCYB.2023.3323316","DOIUrl":null,"url":null,"abstract":"In recent times, significant advancements have been made in delving into the optimization landscape of policy gradient methods for achieving optimal control in linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent since the underlying state of the system may not be fully observed in many practical settings. This article analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback (SOF) control in discrete-time LTI systems subject to quadratic cost. We begin by establishing crucial properties of the SOF cost, encompassing coercivity, \n<inline-formula> <tex-math>$L$ </tex-math></inline-formula>\n-smoothness, and \n<inline-formula> <tex-math>$M$ </tex-math></inline-formula>\n-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods, including the vanilla policy gradient method, the natural policy gradient method, and the Gauss–Newton method. Moreover, we provide proof that the vanilla policy gradient method exhibits linear convergence toward local minima when initialized near such minima. This article concludes by presenting numerical examples that validate our theoretical findings. These results not only characterize the performance of gradient descent for optimizing the SOF problem but also provide insights into the effectiveness of general policy gradient methods within the realm of reinforcement learning.","PeriodicalId":13112,"journal":{"name":"IEEE Transactions on Cybernetics","volume":"54 6","pages":"3588-3601"},"PeriodicalIF":9.4000,"publicationDate":"2023-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cybernetics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10297124/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Recent work has made significant advances in analyzing the optimization landscape of policy gradient methods for optimal control of linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent, since in many practical settings the underlying system state is not fully observed. This article analyzes the optimization landscape of policy gradient methods applied to static output feedback (SOF) control of discrete-time LTI systems with quadratic cost. We begin by establishing key properties of the SOF cost: coercivity, $L$-smoothness, and an $M$-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive new convergence guarantees (with nearly dimension-free rates) to stationary points for three policy gradient methods: the vanilla policy gradient method, the natural policy gradient method, and the Gauss–Newton method. Moreover, we prove that the vanilla policy gradient method converges linearly to a local minimum when initialized sufficiently close to one. The article concludes with numerical examples that validate the theoretical findings. These results not only characterize the performance of gradient descent for the SOF problem but also provide insight into the effectiveness of general policy gradient methods in reinforcement learning.
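To make the objective concrete: for the discrete-time LTI system $x_{t+1} = Ax_t + Bu_t$, $y_t = Cx_t$ with SOF policy $u_t = -Ky_t$, the quadratic cost can be evaluated as $J(K) = \operatorname{tr}(P_K \Sigma_0)$, where $P_K$ solves a discrete Lyapunov equation for the closed-loop matrix $A - BKC$, and the policy gradient takes the standard SOF-LQR form $\nabla J(K) = 2\big[(R + B^\top P_K B)KC - B^\top P_K A\big]\Sigma_K C^\top$, with $\Sigma_K$ the closed-loop state-correlation matrix. The Python sketch below evaluates this cost and gradient via two Lyapunov solves and runs a vanilla policy gradient descent. It is a minimal illustration of the setup described in the abstract, not code from the article; the function names (`sof_cost_and_grad`, `vanilla_pg`), the backtracking step-size rule, and the small example system are assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov


def sof_cost_and_grad(K, A, B, C, Q, R, Sigma0):
    """SOF-LQR cost J(K) = tr(P_K Sigma0) and its policy gradient.

    Closed loop: x_{t+1} = (A - B K C) x_t with u_t = -K C x_t.
    Returns (inf, None) if K is not stabilizing (spectral radius >= 1).
    """
    Acl = A - B @ K @ C
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1.0:
        return np.inf, None
    # Value matrix:             P = Acl' P Acl + Q + C'K'RKC
    P = solve_discrete_lyapunov(Acl.T, Q + C.T @ K.T @ R @ K @ C)
    # State-correlation matrix: S = Acl S Acl' + Sigma0
    S = solve_discrete_lyapunov(Acl, Sigma0)
    J = np.trace(P @ Sigma0)
    grad = 2.0 * ((R + B.T @ P @ B) @ K @ C - B.T @ P @ A) @ S @ C.T
    return J, grad


def vanilla_pg(K, A, B, C, Q, R, Sigma0, eta0=1e-2, iters=500):
    """Vanilla policy gradient K <- K - eta * grad J(K), with backtracking so
    that every iterate stays inside the stabilizing set (where J is finite)."""
    J, G = sof_cost_and_grad(K, A, B, C, Q, R, Sigma0)
    if G is None:
        raise ValueError("Initial gain must be stabilizing.")
    for _ in range(iters):
        eta = eta0
        while True:
            K_try = K - eta * G
            J_try, G_try = sof_cost_and_grad(K_try, A, B, C, Q, R, Sigma0)
            if J_try < J:              # accept: stabilizing and cost decreased
                K, J, G = K_try, J_try, G_try
                break
            eta *= 0.5                 # otherwise shrink the step
            if eta < 1e-12:
                return K, J            # numerically stationary
    return K, J


if __name__ == "__main__":
    # Hypothetical 2-state example (not from the article); A is Schur stable,
    # so K0 = 0 lies in the stabilizing set.
    A = np.array([[0.9, 0.3], [0.0, 0.8]])
    B = np.array([[0.0], [1.0]])
    C = np.array([[1.0, 0.0]])
    Q, R, Sigma0 = np.eye(2), np.eye(1), np.eye(2)
    K_star, J_star = vanilla_pg(np.zeros((1, 1)), A, B, C, Q, R, Sigma0)
    print("SOF gain:", K_star.ravel(), "cost:", float(J_star))
```

The natural policy gradient and Gauss–Newton methods analyzed in the article apply preconditioners to this same gradient; only the vanilla update is sketched here, and the backtracking line search is simply one way to keep the iterates inside the set of stabilizing gains.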
Source Journal
IEEE Transactions on Cybernetics (Computer Science, Artificial Intelligence; Computer Science, Cybernetics)
CiteScore: 25.40
Self-citation rate: 11.00%
Articles published: 1869
Journal scope: The scope of the IEEE Transactions on Cybernetics includes computational approaches to the field of cybernetics. Specifically, the Transactions welcomes papers on communication and control across machines, or between machines, humans, and organizations. The scope includes such areas as computational intelligence, computer vision, neural networks, genetic algorithms, machine learning, fuzzy systems, cognitive systems, decision making, and robotics, to the extent that they contribute to the theme of cybernetics or demonstrate an application of cybernetics principles.