Single-Loop Deep Actor-Critic for Constrained Reinforcement Learning With Provable Convergence

IF 4.6 | CAS Zone 2 (Engineering & Technology) | JCR Q1 (ENGINEERING, ELECTRICAL & ELECTRONIC)
Kexuan Wang;An Liu;Baishuo Lin
{"title":"具有可证明收敛性的受约束强化学习单环深度行为批判器","authors":"Kexuan Wang;An Liu;Baishuo Lin","doi":"10.1109/TSP.2024.3461963","DOIUrl":null,"url":null,"abstract":"Deep actor-critic (DAC) algorithms, which combine actor-critic with deep neural network (DNN), have been among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, the existing DAC algorithms are still not mature to solve realistic problems with non-convex stochastic constraints and high cost to interact with the environment. In this paper, we propose a single-loop DAC (SLDAC) algorithmic framework for general constrained reinforcement learning problems. In the actor module, the constrained stochastic successive convex approximation (CSSCA) method is applied to better handle the non-convex stochastic objective and constraints. In the critic module, the critic DNNs are only updated once or a few finite times for each iteration, which simplifies the algorithm to a single-loop framework. Moreover, the variance of the policy gradient estimation is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and computational complexity. Despite the biased policy gradient estimation incurred by the single-loop design and observation reuse, we prove that the SLDAC with a feasible initial point can converge to a Karush-Kuhn-Tuker (KKT) point of the original problem almost surely. Simulations show that the SLDAC algorithm can achieve superior performance with much lower interaction cost.","PeriodicalId":13330,"journal":{"name":"IEEE Transactions on Signal Processing","volume":"72 ","pages":"4871-4887"},"PeriodicalIF":4.6000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Single-Loop Deep Actor-Critic for Constrained Reinforcement Learning With Provable Convergence\",\"authors\":\"Kexuan Wang;An Liu;Baishuo Lin\",\"doi\":\"10.1109/TSP.2024.3461963\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep actor-critic (DAC) algorithms, which combine actor-critic with deep neural network (DNN), have been among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, the existing DAC algorithms are still not mature to solve realistic problems with non-convex stochastic constraints and high cost to interact with the environment. In this paper, we propose a single-loop DAC (SLDAC) algorithmic framework for general constrained reinforcement learning problems. In the actor module, the constrained stochastic successive convex approximation (CSSCA) method is applied to better handle the non-convex stochastic objective and constraints. In the critic module, the critic DNNs are only updated once or a few finite times for each iteration, which simplifies the algorithm to a single-loop framework. Moreover, the variance of the policy gradient estimation is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and computational complexity. Despite the biased policy gradient estimation incurred by the single-loop design and observation reuse, we prove that the SLDAC with a feasible initial point can converge to a Karush-Kuhn-Tuker (KKT) point of the original problem almost surely. 
Simulations show that the SLDAC algorithm can achieve superior performance with much lower interaction cost.\",\"PeriodicalId\":13330,\"journal\":{\"name\":\"IEEE Transactions on Signal Processing\",\"volume\":\"72 \",\"pages\":\"4871-4887\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Signal Processing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10681174/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10681174/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Deep actor-critic (DAC) algorithms, which combine actor-critic with deep neural networks (DNN), have been among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, the existing DAC algorithms are still not mature enough to solve realistic problems with non-convex stochastic constraints and a high cost of interacting with the environment. In this paper, we propose a single-loop DAC (SLDAC) algorithmic framework for general constrained reinforcement learning problems. In the actor module, the constrained stochastic successive convex approximation (CSSCA) method is applied to better handle the non-convex stochastic objective and constraints. In the critic module, the critic DNNs are only updated once or a finite number of times per iteration, which simplifies the algorithm to a single-loop framework. Moreover, the variance of the policy gradient estimation is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and computational complexity. Despite the biased policy gradient estimation incurred by the single-loop design and observation reuse, we prove that the SLDAC with a feasible initial point converges to a Karush-Kuhn-Tucker (KKT) point of the original problem almost surely. Simulations show that the SLDAC algorithm can achieve superior performance with much lower interaction cost.
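For readers who want a concrete picture of the single-loop structure described in the abstract, the following is a minimal toy sketch, not the paper's implementation: it assumes a hypothetical toy environment (`env_step`), uses linear critics in place of critic DNNs, a softmax policy with linear logits in place of an actor DNN, and a crude projection step as a stand-in for the CSSCA surrogate; all names and hyperparameters are illustrative assumptions. What it does mirror is that the critics are updated only once per iteration and that the policy gradient is estimated from a sliding window that reuses observations collected under older policies.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 2

def env_step(state, action):
    """Hypothetical toy environment: returns (next_state, reward, cost)."""
    next_state = 0.9 * state + 0.1 * rng.standard_normal(STATE_DIM)
    reward = -float(np.sum(state ** 2)) + 0.5 * action
    cost = float(action)                      # constraint signal: want E[cost] <= budget
    return next_state, reward, cost

def softmax_policy(theta, state):
    """Softmax policy over discrete actions with linear logits."""
    logits = theta @ state                    # theta: (N_ACTIONS, STATE_DIM)
    p = np.exp(logits - logits.max())
    return p / p.sum()

theta = np.zeros((N_ACTIONS, STATE_DIM))      # actor parameters
w_r = np.zeros(STATE_DIM)                     # reward critic (linear value function)
w_c = np.zeros(STATE_DIM)                     # cost critic
buffer = []                                   # observation reuse across iterations
gamma, lr_critic, lr_actor, cost_budget = 0.95, 0.05, 0.02, 0.6

state = rng.standard_normal(STATE_DIM)
for _ in range(2000):
    # --- one interaction with the environment per iteration (low interaction cost)
    probs = softmax_policy(theta, state)
    action = int(rng.choice(N_ACTIONS, p=probs))
    next_state, reward, cost = env_step(state, action)
    buffer.append((state.copy(), action, reward, cost, next_state.copy()))
    buffer = buffer[-200:]                    # sliding window keeps old-policy samples

    # --- critic module: a single TD(0) step per iteration (single-loop design)
    td_r = reward + gamma * (w_r @ next_state) - (w_r @ state)
    td_c = cost + gamma * (w_c @ next_state) - (w_c @ state)
    w_r += lr_critic * td_r * state
    w_c += lr_critic * td_c * state

    # --- actor module: (biased) policy-gradient estimates from reused observations
    g_r, g_c, avg_cost = np.zeros_like(theta), np.zeros_like(theta), 0.0
    for s, a, r, c, s2 in buffer:
        p = softmax_policy(theta, s)
        score = -np.outer(p, s)               # grad_theta log pi(a|s) for the softmax policy
        score[a] += s
        adv_r = r + gamma * (w_r @ s2) - (w_r @ s)   # TD-error advantage for the reward
        adv_c = c + gamma * (w_c @ s2) - (w_c @ s)   # TD-error advantage for the cost
        g_r += adv_r * score
        g_c += adv_c * score
        avg_cost += c
    n = len(buffer)
    g_r, g_c, avg_cost = g_r / n, g_c / n, avg_cost / n

    # crude surrogate step (stand-in for the paper's CSSCA surrogate): follow the
    # reward gradient, and push against the cost gradient when the running cost
    # estimate violates the budget.
    step = g_r if avg_cost <= cost_budget else g_r - 2.0 * g_c
    theta += lr_actor * step
    state = next_state

print("final average cost estimate:", avg_cost)
```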
Source journal: IEEE Transactions on Signal Processing (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 11.20
Self-citation rate: 9.30%
Articles per year: 310
Review time: 3.0 months
Journal description: The IEEE Transactions on Signal Processing covers novel theory, algorithms, performance analyses and applications of techniques for the processing, understanding, learning, retrieval, mining, and extraction of information from signals. The term "signal" includes, among others, audio, video, speech, image, communication, geophysical, sonar, radar, medical and musical signals. Examples of topics of interest include, but are not limited to, information processing and the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals.