{"title":"Game-Theoretic Constrained Policy Optimization for Safe Reinforcement Learning.","authors":"Changxin Zhang,Xinglong Zhang,Yixing Lan,Hao Gao,Xin Xu","doi":"10.1109/tnnls.2025.3586603","DOIUrl":null,"url":null,"abstract":"Safe reinforcement learning (RL) aims to optimize the task performance with safety guarantees. One common modeling scheme to study safe RL problems is the constrained Markov decision process (CMDP). However, current safe RL methods within the CMDP framework face challenges in tradeoffs among various objectives and gradient conflicts of policy updating. To cope with these challenges, this article presents a novel safe RL approach called game-theoretic constrained policy optimization (GCPO). The proposed approach formulates the CMDP problem as a general-sum Markov game with multiple players, where a task player seeks to maximize the reward objective, while constraint players aim to minimize constraint objectives until they are fulfilled. By doing so, GCPO adopts the learning mode with multiple subpolicies, each aligned with a distinct objective, that collectively constitute the overall behavior of the agent. The learning convergence of the GCPO can be ensured with the contraction mapping to the Nash equilibrium. Furthermore, a novel dominant timescale update rule is presented for multiplayer policy learning to guarantee constraint satisfaction. The learning convergence and constraint satisfaction of GCPO are theoretically analyzed. Consequently, GCPO eliminates the necessity of tuning tradeoff parameters and mitigates gradient conflicts during multiobjective policy updating. Experimental results show that GCPO outperforms state-of-the-art safe RL algorithms in a quadrotor trajectory tracking task and various high-dimensional robot locomotion benchmarks. Moreover, GCPO exhibits robustness to diverse scales of task rewards and constraint costs without the need for intricate tradeoffs.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"14 1","pages":""},"PeriodicalIF":8.9000,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tnnls.2025.3586603","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Safe reinforcement learning (RL) aims to optimize task performance while providing safety guarantees. A common modeling scheme for safe RL problems is the constrained Markov decision process (CMDP). However, current safe RL methods within the CMDP framework struggle with tradeoffs among multiple objectives and with gradient conflicts during policy updates. To address these challenges, this article presents a novel safe RL approach called game-theoretic constrained policy optimization (GCPO). The proposed approach formulates the CMDP problem as a multiplayer general-sum Markov game, in which a task player seeks to maximize the reward objective while constraint players minimize their constraint objectives until those constraints are satisfied. Accordingly, GCPO learns multiple subpolicies, each aligned with a distinct objective, that together constitute the overall behavior of the agent. The learning convergence of GCPO is ensured via a contraction mapping to the Nash equilibrium. Furthermore, a novel dominant-timescale update rule is presented for multiplayer policy learning to guarantee constraint satisfaction. Both the learning convergence and the constraint satisfaction of GCPO are analyzed theoretically. As a result, GCPO eliminates the need to tune tradeoff parameters and mitigates gradient conflicts during multiobjective policy updates. Experimental results show that GCPO outperforms state-of-the-art safe RL algorithms on a quadrotor trajectory-tracking task and on various high-dimensional robot locomotion benchmarks. Moreover, GCPO is robust to diverse scales of task rewards and constraint costs without requiring intricate tradeoffs.
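To make the learning scheme described in the abstract more concrete, below is a minimal toy sketch (not the authors' implementation) of a dominant-timescale update between a task player and a single constraint player. The objectives `task_reward` and `constraint_cost`, the composition rule in `action`, the threshold `limit`, and all step sizes are illustrative assumptions: the constraint subpolicy is updated on the dominant (faster) timescale whenever its cost exceeds the threshold, and the task subpolicy improves the reward only when the constraint is satisfied.

```python
import numpy as np

# Hypothetical toy illustration of a dominant-timescale schedule between two players.
# The objectives, composition rule, and step sizes are assumptions for illustration only.
theta_task = np.zeros(2)   # task sub-policy parameters
theta_con = np.zeros(2)    # constraint sub-policy parameters
limit = 1.0                # constraint threshold

def task_reward(a):        # maximized by the task player
    return a @ np.array([1.0, 0.5])

def constraint_cost(a):    # must stay below `limit`
    return np.sum(a ** 2)

def action(t_task, t_con): # overall behavior composed from both sub-policies
    return t_task + t_con

alpha_task, alpha_con = 0.05, 0.2  # constraint player moves on the faster (dominant) timescale

for it in range(200):
    a = action(theta_task, theta_con)
    if constraint_cost(a) > limit:
        # Dominant-timescale step: only the constraint player updates until its
        # constraint objective is driven back below the threshold.
        grad_con = 2 * a                  # gradient of constraint_cost w.r.t. theta_con
        theta_con -= alpha_con * grad_con
    else:
        # Constraint satisfied: the task player improves the reward objective.
        grad_task = np.array([1.0, 0.5])  # gradient of task_reward w.r.t. theta_task
        theta_task += alpha_task * grad_task

a = action(theta_task, theta_con)
print(f"reward={task_reward(a):.3f}  cost={constraint_cost(a):.3f} (limit={limit})")
```

In the paper's setting, each player would instead update a neural subpolicy from sampled trajectories rather than closed-form gradients of toy objectives, but the scheduling logic above is meant to convey the role the abstract attributes to the dominant-timescale rule: constraints are enforced before reward improvement, without a hand-tuned tradeoff parameter.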
About the Journal:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.