{"title":"Game-Theoretic Constrained Policy Optimization for Safe Reinforcement Learning.","authors":"Changxin Zhang,Xinglong Zhang,Yixing Lan,Hao Gao,Xin Xu","doi":"10.1109/tnnls.2025.3586603","DOIUrl":null,"url":null,"abstract":"Safe reinforcement learning (RL) aims to optimize the task performance with safety guarantees. One common modeling scheme to study safe RL problems is the constrained Markov decision process (CMDP). However, current safe RL methods within the CMDP framework face challenges in tradeoffs among various objectives and gradient conflicts of policy updating. To cope with these challenges, this article presents a novel safe RL approach called game-theoretic constrained policy optimization (GCPO). The proposed approach formulates the CMDP problem as a general-sum Markov game with multiple players, where a task player seeks to maximize the reward objective, while constraint players aim to minimize constraint objectives until they are fulfilled. By doing so, GCPO adopts the learning mode with multiple subpolicies, each aligned with a distinct objective, that collectively constitute the overall behavior of the agent. The learning convergence of the GCPO can be ensured with the contraction mapping to the Nash equilibrium. Furthermore, a novel dominant timescale update rule is presented for multiplayer policy learning to guarantee constraint satisfaction. The learning convergence and constraint satisfaction of GCPO are theoretically analyzed. Consequently, GCPO eliminates the necessity of tuning tradeoff parameters and mitigates gradient conflicts during multiobjective policy updating. Experimental results show that GCPO outperforms state-of-the-art safe RL algorithms in a quadrotor trajectory tracking task and various high-dimensional robot locomotion benchmarks. Moreover, GCPO exhibits robustness to diverse scales of task rewards and constraint costs without the need for intricate tradeoffs.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"14 1","pages":""},"PeriodicalIF":8.9000,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tnnls.2025.3586603","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Safe reinforcement learning (RL) aims to optimize task performance while providing safety guarantees. A common modeling scheme for safe RL problems is the constrained Markov decision process (CMDP). However, current safe RL methods within the CMDP framework struggle with tradeoffs among multiple objectives and with gradient conflicts during policy updates. To address these challenges, this article presents a novel safe RL approach called game-theoretic constrained policy optimization (GCPO). The proposed approach formulates the CMDP problem as a multiplayer general-sum Markov game, in which a task player seeks to maximize the reward objective while constraint players minimize their constraint objectives until those constraints are satisfied. Accordingly, GCPO learns multiple subpolicies, each aligned with a distinct objective, that together constitute the overall behavior of the agent. The learning convergence of GCPO is ensured via a contraction mapping to the Nash equilibrium. Furthermore, a novel dominant-timescale update rule is presented for multiplayer policy learning to guarantee constraint satisfaction. Both the learning convergence and the constraint satisfaction of GCPO are analyzed theoretically. As a result, GCPO eliminates the need to tune tradeoff parameters and mitigates gradient conflicts during multiobjective policy updates. Experimental results show that GCPO outperforms state-of-the-art safe RL algorithms on a quadrotor trajectory-tracking task and on various high-dimensional robot locomotion benchmarks. Moreover, GCPO is robust to diverse scales of task rewards and constraint costs without requiring intricate tradeoffs.
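To make the learning scheme described in the abstract more concrete, below is a minimal toy sketch (not the authors' implementation) of a dominant-timescale update between a task player and a single constraint player. The objectives `task_reward` and `constraint_cost`, the composition rule in `action`, the threshold `limit`, and all step sizes are illustrative assumptions: the constraint subpolicy is updated on the dominant (faster) timescale whenever its cost exceeds the threshold, and the task subpolicy improves the reward only when the constraint is satisfied.

```python
import numpy as np

# Hypothetical toy illustration of a dominant-timescale schedule between two players.
# The objectives, composition rule, and step sizes are assumptions for illustration only.
theta_task = np.zeros(2)   # task sub-policy parameters
theta_con = np.zeros(2)    # constraint sub-policy parameters
limit = 1.0                # constraint threshold

def task_reward(a):        # maximized by the task player
    return a @ np.array([1.0, 0.5])

def constraint_cost(a):    # must stay below `limit`
    return np.sum(a ** 2)

def action(t_task, t_con): # overall behavior composed from both sub-policies
    return t_task + t_con

alpha_task, alpha_con = 0.05, 0.2  # constraint player moves on the faster (dominant) timescale

for it in range(200):
    a = action(theta_task, theta_con)
    if constraint_cost(a) > limit:
        # Dominant-timescale step: only the constraint player updates until its
        # constraint objective is driven back below the threshold.
        grad_con = 2 * a                  # gradient of constraint_cost w.r.t. theta_con
        theta_con -= alpha_con * grad_con
    else:
        # Constraint satisfied: the task player improves the reward objective.
        grad_task = np.array([1.0, 0.5])  # gradient of task_reward w.r.t. theta_task
        theta_task += alpha_task * grad_task

a = action(theta_task, theta_con)
print(f"reward={task_reward(a):.3f}  cost={constraint_cost(a):.3f} (limit={limit})")
```

In the paper's setting, each player would instead update a neural subpolicy from sampled trajectories rather than closed-form gradients of toy objectives, but the scheduling logic above is meant to convey the role the abstract attributes to the dominant-timescale rule: constraints are enforced before reward improvement, without a hand-tuned tradeoff parameter.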
About the Journal:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.