Bicriteria Policy Optimization for High-Accuracy Reinforcement Learning
Guojian Zhan, Xiangteng Zhang, Feihong Zhang, Letian Tao, Shengbo Eben Li
IEEE Transactions on Neural Networks and Learning Systems
DOI: 10.1109/tnnls.2025.3605362
Published: 2025-09-09
Abstract
In essence, reinforcement learning (RL) solves an optimal control problem (OCP) by employing a neural network (NN) to fit the optimal policy mapping states to actions. In complex control tasks, the accuracy of this policy approximation is often very low, leading to unsatisfactory control performance compared with online optimal controllers. A primary reason is that the landscape of the value function is typically rugged in most regions yet flat near its bottom, which hampers convergence to the minimum point. To address this issue, we develop a bicriteria policy optimization (BPO) algorithm, which leverages a few optimal demonstration trajectories to guide the policy search at the gradient level. Unlike the conventional problem definition, BPO solves a bicriteria OCP with two homomorphic objectives: one derived from the standard reward signals and the other aligning the policy with the demonstration trajectories. We introduce two co-state variables, one for each objective, and formulate two Hamiltonians for this bicriteria OCP. The resulting optimality condition preserves the minimum values of both Hamiltonians. Furthermore, we find that gradient conflict is a key obstacle to descending both Hamiltonians simultaneously, and its severity is negatively correlated with the inner product between the ideal and actual gradients. At each RL iteration, a minimax optimization problem is constructed to minimize the conflict between the two homomorphic objectives; its solution, used for policy updating, is referred to as the harmonic gradient. By converting the inner optimization loop into a linear program with a convex trust-region constraint, we simplify the problem into a single-loop maximization with much higher computational efficiency. Experiments on both linear and nonlinear control tasks validate the effectiveness of BPO in improving the accuracy of the policy network.
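The abstract's notion of gradient conflict can be illustrated with a minimal sketch: conflict occurs when the inner product between the reward-objective gradient and the demonstration-alignment gradient is negative, and a harmonized update seeks a direction that does not ascend either objective. The code below is an assumption-laden stand-in (a simple projection-based resolution in the spirit of gradient-surgery methods), not the paper's minimax/linear-programming computation of the harmonic gradient; all function names and the stand-in gradients are hypothetical.

```python
import numpy as np

def conflict(g_task: np.ndarray, g_demo: np.ndarray) -> float:
    """Inner product between the two objective gradients; a negative
    value indicates conflict, as described in the abstract."""
    return float(np.dot(g_task, g_demo))

def harmonized_update(g_task: np.ndarray, g_demo: np.ndarray) -> np.ndarray:
    """Illustrative conflict resolution (NOT BPO's minimax formulation):
    when the gradients conflict, project each onto the normal plane of
    the other before averaging, yielding a direction that is non-ascending
    for both objectives."""
    if conflict(g_task, g_demo) < 0.0:
        g_t = g_task - (np.dot(g_task, g_demo) / np.dot(g_demo, g_demo)) * g_demo
        g_d = g_demo - (np.dot(g_demo, g_task) / np.dot(g_task, g_task)) * g_task
        return 0.5 * (g_t + g_d)
    return 0.5 * (g_task + g_demo)

# Hypothetical usage with stand-in gradients of a policy network's parameters
rng = np.random.default_rng(0)
g_reward = rng.normal(size=8)  # gradient from the reward-based Hamiltonian
g_align = rng.normal(size=8)   # gradient from the demonstration-alignment Hamiltonian
step = harmonized_update(g_reward, g_align)
```

This sketch only conveys the geometric intuition that conflict is measured by the inner product of the two gradients; the paper instead solves a minimax problem with a convex trust-region constraint, reduced to a single-loop maximization via linear programming.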
About the journal:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.