Single-Loop Deep Actor-Critic for Constrained Reinforcement Learning With Provable Convergence

IF 4.6 | CAS Zone 2 (Engineering & Technology) | JCR Q1 (ENGINEERING, ELECTRICAL & ELECTRONIC)
Kexuan Wang;An Liu;Baishuo Lin
{"title":"具有可证明收敛性的受约束强化学习单环深度行为批判器","authors":"Kexuan Wang;An Liu;Baishuo Lin","doi":"10.1109/TSP.2024.3461963","DOIUrl":null,"url":null,"abstract":"Deep actor-critic (DAC) algorithms, which combine actor-critic with deep neural network (DNN), have been among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, the existing DAC algorithms are still not mature to solve realistic problems with non-convex stochastic constraints and high cost to interact with the environment. In this paper, we propose a single-loop DAC (SLDAC) algorithmic framework for general constrained reinforcement learning problems. In the actor module, the constrained stochastic successive convex approximation (CSSCA) method is applied to better handle the non-convex stochastic objective and constraints. In the critic module, the critic DNNs are only updated once or a few finite times for each iteration, which simplifies the algorithm to a single-loop framework. Moreover, the variance of the policy gradient estimation is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and computational complexity. Despite the biased policy gradient estimation incurred by the single-loop design and observation reuse, we prove that the SLDAC with a feasible initial point can converge to a Karush-Kuhn-Tuker (KKT) point of the original problem almost surely. Simulations show that the SLDAC algorithm can achieve superior performance with much lower interaction cost.","PeriodicalId":13330,"journal":{"name":"IEEE Transactions on Signal Processing","volume":"72 ","pages":"4871-4887"},"PeriodicalIF":4.6000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Single-Loop Deep Actor-Critic for Constrained Reinforcement Learning With Provable Convergence\",\"authors\":\"Kexuan Wang;An Liu;Baishuo Lin\",\"doi\":\"10.1109/TSP.2024.3461963\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep actor-critic (DAC) algorithms, which combine actor-critic with deep neural network (DNN), have been among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, the existing DAC algorithms are still not mature to solve realistic problems with non-convex stochastic constraints and high cost to interact with the environment. In this paper, we propose a single-loop DAC (SLDAC) algorithmic framework for general constrained reinforcement learning problems. In the actor module, the constrained stochastic successive convex approximation (CSSCA) method is applied to better handle the non-convex stochastic objective and constraints. In the critic module, the critic DNNs are only updated once or a few finite times for each iteration, which simplifies the algorithm to a single-loop framework. Moreover, the variance of the policy gradient estimation is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and computational complexity. Despite the biased policy gradient estimation incurred by the single-loop design and observation reuse, we prove that the SLDAC with a feasible initial point can converge to a Karush-Kuhn-Tuker (KKT) point of the original problem almost surely. 
Simulations show that the SLDAC algorithm can achieve superior performance with much lower interaction cost.\",\"PeriodicalId\":13330,\"journal\":{\"name\":\"IEEE Transactions on Signal Processing\",\"volume\":\"72 \",\"pages\":\"4871-4887\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Signal Processing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10681174/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10681174/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Deep actor-critic (DAC) algorithms, which combine actor-critic with deep neural networks (DNN), have been among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, the existing DAC algorithms are still not mature enough to solve realistic problems with non-convex stochastic constraints and a high cost of interacting with the environment. In this paper, we propose a single-loop DAC (SLDAC) algorithmic framework for general constrained reinforcement learning problems. In the actor module, the constrained stochastic successive convex approximation (CSSCA) method is applied to better handle the non-convex stochastic objective and constraints. In the critic module, the critic DNNs are only updated once or a finite number of times per iteration, which simplifies the algorithm to a single-loop framework. Moreover, the variance of the policy gradient estimation is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and computational complexity. Despite the biased policy gradient estimation incurred by the single-loop design and observation reuse, we prove that the SLDAC with a feasible initial point converges to a Karush-Kuhn-Tucker (KKT) point of the original problem almost surely. Simulations show that the SLDAC algorithm can achieve superior performance with much lower interaction cost.
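For readers who want a concrete picture of the single-loop structure described in the abstract, the following is a minimal toy sketch, not the paper's implementation: it assumes a hypothetical toy environment (`env_step`), uses linear critics in place of critic DNNs, a softmax policy with linear logits in place of an actor DNN, and a crude projection step as a stand-in for the CSSCA surrogate; all names and hyperparameters are illustrative assumptions. What it does mirror is that the critics are updated only once per iteration and that the policy gradient is estimated from a sliding window that reuses observations collected under older policies.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 2

def env_step(state, action):
    """Hypothetical toy environment: returns (next_state, reward, cost)."""
    next_state = 0.9 * state + 0.1 * rng.standard_normal(STATE_DIM)
    reward = -float(np.sum(state ** 2)) + 0.5 * action
    cost = float(action)                      # constraint signal: want E[cost] <= budget
    return next_state, reward, cost

def softmax_policy(theta, state):
    """Softmax policy over discrete actions with linear logits."""
    logits = theta @ state                    # theta: (N_ACTIONS, STATE_DIM)
    p = np.exp(logits - logits.max())
    return p / p.sum()

theta = np.zeros((N_ACTIONS, STATE_DIM))      # actor parameters
w_r = np.zeros(STATE_DIM)                     # reward critic (linear value function)
w_c = np.zeros(STATE_DIM)                     # cost critic
buffer = []                                   # observation reuse across iterations
gamma, lr_critic, lr_actor, cost_budget = 0.95, 0.05, 0.02, 0.6

state = rng.standard_normal(STATE_DIM)
for _ in range(2000):
    # --- one interaction with the environment per iteration (low interaction cost)
    probs = softmax_policy(theta, state)
    action = int(rng.choice(N_ACTIONS, p=probs))
    next_state, reward, cost = env_step(state, action)
    buffer.append((state.copy(), action, reward, cost, next_state.copy()))
    buffer = buffer[-200:]                    # sliding window keeps old-policy samples

    # --- critic module: a single TD(0) step per iteration (single-loop design)
    td_r = reward + gamma * (w_r @ next_state) - (w_r @ state)
    td_c = cost + gamma * (w_c @ next_state) - (w_c @ state)
    w_r += lr_critic * td_r * state
    w_c += lr_critic * td_c * state

    # --- actor module: (biased) policy-gradient estimates from reused observations
    g_r, g_c, avg_cost = np.zeros_like(theta), np.zeros_like(theta), 0.0
    for s, a, r, c, s2 in buffer:
        p = softmax_policy(theta, s)
        score = -np.outer(p, s)               # grad_theta log pi(a|s) for the softmax policy
        score[a] += s
        adv_r = r + gamma * (w_r @ s2) - (w_r @ s)   # TD-error advantage for the reward
        adv_c = c + gamma * (w_c @ s2) - (w_c @ s)   # TD-error advantage for the cost
        g_r += adv_r * score
        g_c += adv_c * score
        avg_cost += c
    n = len(buffer)
    g_r, g_c, avg_cost = g_r / n, g_c / n, avg_cost / n

    # crude surrogate step (stand-in for the paper's CSSCA surrogate): follow the
    # reward gradient, and push against the cost gradient when the running cost
    # estimate violates the budget.
    step = g_r if avg_cost <= cost_budget else g_r - 2.0 * g_c
    theta += lr_actor * step
    state = next_state

print("final average cost estimate:", avg_cost)
```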
Source journal: IEEE Transactions on Signal Processing (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 11.20
Self-citation rate: 9.30%
Articles per year: 310
Review time: 3.0 months
Journal description: The IEEE Transactions on Signal Processing covers novel theory, algorithms, performance analyses and applications of techniques for the processing, understanding, learning, retrieval, mining, and extraction of information from signals. The term "signal" includes, among others, audio, video, speech, image, communication, geophysical, sonar, radar, medical and musical signals. Examples of topics of interest include, but are not limited to, information processing and the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals.