Optimization Landscape of Policy Gradient Methods for Discrete-Time Static Output Feedback

Impact Factor: 9.4 · CAS Region 1 (Computer Science) · JCR Q1, Automation & Control Systems
Jingliang Duan;Jie Li;Xuyang Chen;Kai Zhao;Shengbo Eben Li;Lin Zhao
{"title":"Optimization Landscape of Policy Gradient Methods for Discrete-Time Static Output Feedback","authors":"Jingliang Duan;Jie Li;Xuyang Chen;Kai Zhao;Shengbo Eben Li;Lin Zhao","doi":"10.1109/TCYB.2023.3323316","DOIUrl":null,"url":null,"abstract":"In recent times, significant advancements have been made in delving into the optimization landscape of policy gradient methods for achieving optimal control in linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent since the underlying state of the system may not be fully observed in many practical settings. This article analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback (SOF) control in discrete-time LTI systems subject to quadratic cost. We begin by establishing crucial properties of the SOF cost, encompassing coercivity, \n<inline-formula> <tex-math>$L$ </tex-math></inline-formula>\n-smoothness, and \n<inline-formula> <tex-math>$M$ </tex-math></inline-formula>\n-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods, including the vanilla policy gradient method, the natural policy gradient method, and the Gauss–Newton method. Moreover, we provide proof that the vanilla policy gradient method exhibits linear convergence toward local minima when initialized near such minima. This article concludes by presenting numerical examples that validate our theoretical findings. These results not only characterize the performance of gradient descent for optimizing the SOF problem but also provide insights into the effectiveness of general policy gradient methods within the realm of reinforcement learning.","PeriodicalId":13112,"journal":{"name":"IEEE Transactions on Cybernetics","volume":"54 6","pages":"3588-3601"},"PeriodicalIF":9.4000,"publicationDate":"2023-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cybernetics","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10297124/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Recent work has made significant advances in analyzing the optimization landscape of policy gradient methods for optimal control of linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent, since in many practical settings the underlying system state is not fully observed. This article analyzes the optimization landscape of policy gradient methods applied to static output feedback (SOF) control of discrete-time LTI systems with quadratic cost. We begin by establishing key properties of the SOF cost: coercivity, $L$-smoothness, and an $M$-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive new convergence guarantees (with nearly dimension-free rates) to stationary points for three policy gradient methods: the vanilla policy gradient method, the natural policy gradient method, and the Gauss–Newton method. Moreover, we prove that the vanilla policy gradient method converges linearly to a local minimum when initialized sufficiently close to one. The article concludes with numerical examples that validate the theoretical findings. These results not only characterize the performance of gradient descent for the SOF problem but also provide insight into the effectiveness of general policy gradient methods in reinforcement learning.
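To make the objective concrete: for the discrete-time LTI system $x_{t+1} = Ax_t + Bu_t$, $y_t = Cx_t$ with SOF policy $u_t = -Ky_t$, the quadratic cost can be evaluated as $J(K) = \operatorname{tr}(P_K \Sigma_0)$, where $P_K$ solves a discrete Lyapunov equation for the closed-loop matrix $A - BKC$, and the policy gradient takes the standard SOF-LQR form $\nabla J(K) = 2\big[(R + B^\top P_K B)KC - B^\top P_K A\big]\Sigma_K C^\top$, with $\Sigma_K$ the closed-loop state-correlation matrix. The Python sketch below evaluates this cost and gradient via two Lyapunov solves and runs a vanilla policy gradient descent. It is a minimal illustration of the setup described in the abstract, not code from the article; the function names (`sof_cost_and_grad`, `vanilla_pg`), the backtracking step-size rule, and the small example system are assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov


def sof_cost_and_grad(K, A, B, C, Q, R, Sigma0):
    """SOF-LQR cost J(K) = tr(P_K Sigma0) and its policy gradient.

    Closed loop: x_{t+1} = (A - B K C) x_t with u_t = -K C x_t.
    Returns (inf, None) if K is not stabilizing (spectral radius >= 1).
    """
    Acl = A - B @ K @ C
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1.0:
        return np.inf, None
    # Value matrix:             P = Acl' P Acl + Q + C'K'RKC
    P = solve_discrete_lyapunov(Acl.T, Q + C.T @ K.T @ R @ K @ C)
    # State-correlation matrix: S = Acl S Acl' + Sigma0
    S = solve_discrete_lyapunov(Acl, Sigma0)
    J = np.trace(P @ Sigma0)
    grad = 2.0 * ((R + B.T @ P @ B) @ K @ C - B.T @ P @ A) @ S @ C.T
    return J, grad


def vanilla_pg(K, A, B, C, Q, R, Sigma0, eta0=1e-2, iters=500):
    """Vanilla policy gradient K <- K - eta * grad J(K), with backtracking so
    that every iterate stays inside the stabilizing set (where J is finite)."""
    J, G = sof_cost_and_grad(K, A, B, C, Q, R, Sigma0)
    if G is None:
        raise ValueError("Initial gain must be stabilizing.")
    for _ in range(iters):
        eta = eta0
        while True:
            K_try = K - eta * G
            J_try, G_try = sof_cost_and_grad(K_try, A, B, C, Q, R, Sigma0)
            if J_try < J:              # accept: stabilizing and cost decreased
                K, J, G = K_try, J_try, G_try
                break
            eta *= 0.5                 # otherwise shrink the step
            if eta < 1e-12:
                return K, J            # numerically stationary
    return K, J


if __name__ == "__main__":
    # Hypothetical 2-state example (not from the article); A is Schur stable,
    # so K0 = 0 lies in the stabilizing set.
    A = np.array([[0.9, 0.3], [0.0, 0.8]])
    B = np.array([[0.0], [1.0]])
    C = np.array([[1.0, 0.0]])
    Q, R, Sigma0 = np.eye(2), np.eye(1), np.eye(2)
    K_star, J_star = vanilla_pg(np.zeros((1, 1)), A, B, C, Q, R, Sigma0)
    print("SOF gain:", K_star.ravel(), "cost:", float(J_star))
```

The natural policy gradient and Gauss–Newton methods analyzed in the article apply preconditioners to this same gradient; only the vanilla update is sketched here, and the backtracking line search is simply one way to keep the iterates inside the set of stabilizing gains.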
Source Journal
IEEE Transactions on Cybernetics (Computer Science, Artificial Intelligence; Computer Science, Cybernetics)
CiteScore: 25.40
Self-citation rate: 11.00%
Articles published: 1869
Journal scope: The scope of the IEEE Transactions on Cybernetics includes computational approaches to the field of cybernetics. Specifically, the Transactions welcomes papers on communication and control across machines, or between machines, humans, and organizations. The scope includes such areas as computational intelligence, computer vision, neural networks, genetic algorithms, machine learning, fuzzy systems, cognitive systems, decision making, and robotics, to the extent that they contribute to the theme of cybernetics or demonstrate an application of cybernetics principles.