An approximate policy iteration viewpoint of actor–critic algorithms

Impact Factor 4.8 · CAS Tier 2 (Computer Science) · JCR Q1 (Automation & Control Systems)
Zaiwei Chen, Siva Theja Maguluri
{"title":"An approximate policy iteration viewpoint of actor–critic algorithms","authors":"Zaiwei Chen ,&nbsp;Siva Theja Maguluri","doi":"10.1016/j.automatica.2025.112395","DOIUrl":null,"url":null,"abstract":"<div><div>In this work, we establish sample complexity guarantees for a broad class of policy-space algorithms for reinforcement learning. A policy-space algorithm comprises an actor for policy improvement and a critic for policy evaluation. For the actor, we analyze update rules such as softmax, <span><math><mi>ϵ</mi></math></span>-greedy, and the celebrated natural policy gradient (NPG). Unlike traditional gradient-based analyses, we view NPG as an approximate policy iteration method. This perspective allows us to leverage the Bellman operator’s properties to show that NPG (without regularization) achieves geometric convergence to a globally optimal policy with increasing stepsizes. For the critic, we study TD-learning with linear function approximation and off-policy sampling. To address the instability of TD-learning in this setting, we propose a stable framework using multi-step returns and generalized importance sampling factors, including two specific algorithms: <span><math><mi>λ</mi></math></span>-averaged <span><math><mi>Q</mi></math></span>-trace and two-sided <span><math><mi>Q</mi></math></span>-trace. We also provide a finite-sample analysis for the critic. Combining the geometric convergence of the actor with the finite-sample results of the critic, we establish for the first time an overall sample complexity of <span><math><mrow><mover><mrow><mi>O</mi></mrow><mrow><mo>̃</mo></mrow></mover><mrow><mo>(</mo><msup><mrow><mi>ϵ</mi></mrow><mrow><mo>−</mo><mn>2</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> for finding an optimal policy (up to a function approximation error) using policy-space methods under off-policy sampling and linear function approximation.</div></div>","PeriodicalId":55413,"journal":{"name":"Automatica","volume":"179 ","pages":"Article 112395"},"PeriodicalIF":4.8000,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Automatica","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0005109825002894","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

In this work, we establish sample complexity guarantees for a broad class of policy-space algorithms for reinforcement learning. A policy-space algorithm comprises an actor for policy improvement and a critic for policy evaluation. For the actor, we analyze update rules such as softmax, ϵ-greedy, and the celebrated natural policy gradient (NPG). Unlike traditional gradient-based analyses, we view NPG as an approximate policy iteration method. This perspective allows us to leverage the Bellman operator’s properties to show that NPG (without regularization) achieves geometric convergence to a globally optimal policy with increasing stepsizes. For the critic, we study TD-learning with linear function approximation and off-policy sampling. To address the instability of TD-learning in this setting, we propose a stable framework using multi-step returns and generalized importance sampling factors, including two specific algorithms: λ-averaged Q-trace and two-sided Q-trace. We also provide a finite-sample analysis for the critic. Combining the geometric convergence of the actor with the finite-sample results of the critic, we establish for the first time an overall sample complexity of Õ(ϵ⁻²) for finding an optimal policy (up to a function approximation error) using policy-space methods under off-policy sampling and linear function approximation.
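A minimal sketch of the algorithmic template described in the abstract, under illustrative assumptions (a toy random MDP and hypothetical names such as eta_k, rho_bar, n_steps); it is not the paper's λ-averaged Q-trace or two-sided Q-trace. The critic runs off-policy multi-step TD-learning with linear function approximation and clipped importance-sampling factors, and the actor performs the softmax/NPG update θ_{k+1} = θ_k + η_k·Q̂_k, equivalently π_{k+1}(a|s) ∝ π_k(a|s)·exp(η_k·Q̂_k(s,a)), which approaches a greedy policy-iteration step as the stepsizes η_k increase.

```python
# Illustrative actor-critic loop in the approximate-policy-iteration view.
# Not the paper's exact algorithms; parameter names and the MDP are assumptions.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Small random MDP: transition kernel P[s, a] (distribution over next states)
# and deterministic rewards R[s, a].
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.uniform(0.0, 1.0, size=(nS, nA))

# Linear function approximation: Q(s, a) ≈ phi(s, a)^T w (one-hot features here,
# so the approximation is tabular; any feature map could be substituted).
d = nS * nA
def phi(s, a):
    x = np.zeros(d)
    x[s * nA + a] = 1.0
    return x

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def critic_td(pi, behavior, w, n_steps=4, rho_bar=1.0, alpha=0.1, episodes=300):
    """Off-policy multi-step TD with clipped importance-sampling factors (illustrative)."""
    for _ in range(episodes):
        s = rng.integers(nS)
        traj = []
        for _ in range(n_steps):
            a = rng.choice(nA, p=behavior[s])
            traj.append((s, a, R[s, a]))
            s = rng.choice(nS, p=P[s, a])
        # Multi-step return with clipped (generalized) importance-sampling weights.
        G, rho_prod = 0.0, 1.0
        for t, (st, at, rt) in enumerate(traj):
            rho_prod *= min(rho_bar, pi[st, at] / behavior[st, at])
            G += rho_prod * (gamma ** t) * rt
        # Bootstrap with the expected value of Q under the target policy pi.
        G += rho_prod * (gamma ** n_steps) * sum(pi[s, b] * (phi(s, b) @ w) for b in range(nA))
        s0, a0, _ = traj[0]
        w = w + alpha * (G - phi(s0, a0) @ w) * phi(s0, a0)
    return w

theta = np.zeros((nS, nA))              # actor (softmax) parameters
w = np.zeros(d)                         # critic weights
behavior = np.full((nS, nA), 1.0 / nA)  # fixed off-policy behavior policy

for k in range(20):
    pi = softmax_policy(theta)
    w = critic_td(pi, behavior, w)      # policy evaluation (critic)
    Q_hat = w.reshape(nS, nA)           # Q-estimates from the linear critic
    eta_k = 2.0 ** k                    # increasing stepsizes push NPG toward policy iteration
    theta = theta + eta_k * Q_hat       # NPG/softmax: pi_{k+1} ∝ pi_k * exp(eta_k * Q_hat)

print(np.round(softmax_policy(theta), 2))  # policy becomes nearly deterministic (greedy-like)
```

With increasing stepsizes the softmax update concentrates probability on the actions maximizing the critic's Q-estimates, which is what motivates analyzing NPG through the Bellman operator rather than through standard gradient arguments.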
Source journal
Automatica (Engineering Technology - Engineering: Electrical & Electronic)
CiteScore: 10.70
Self-citation rate: 7.80%
Annual publication volume: 617
Review time: 5 months
About the journal: Automatica is a leading archival publication in the field of systems and control. The field encompasses today a broad set of areas and topics, and is thriving not only within itself but also in terms of its impact on other fields, such as communications, computers, biology, energy and economics. Since its inception in 1963, Automatica has kept abreast with the evolution of the field over the years, and has emerged as a leading publication driving the trends in the field. After being founded in 1963, Automatica became a journal of the International Federation of Automatic Control (IFAC) in 1969. It features a characteristic blend of theoretical and applied papers of archival, lasting value, reporting cutting-edge research results by authors across the globe. It features articles in distinct categories, including regular, brief and survey papers, technical communiqués, correspondence items, as well as reviews on published books of interest to the readership. It occasionally publishes special issues on emerging new topics or established mature topics of interest to a broad audience. Automatica solicits original high-quality contributions in all the categories listed above, and in all areas of systems and control interpreted in a broad sense and evolving constantly. They may be submitted directly to a subject editor or to the Editor-in-Chief if not sure about the subject area. Editorial procedures in place assure careful, fair, and prompt handling of all submitted articles. Accepted papers appear in the journal in the shortest time feasible given production time constraints.