n-step temporal difference learning with optimal n

IF 5.9 · CAS Zone 2 (Computer Science) · Q1 AUTOMATION & CONTROL SYSTEMS

Automatica Pub Date : 2025-06-18 DOI:10.1016/j.automatica.2025.112449

Lakshmi Mandal, Shalabh Bhatnagar

Automatica, volume 179, Article 112449 (Journal Article). https://www.sciencedirect.com/science/article/pii/S0005109825003437

Citations: 0

Abstract

We consider the problem of finding the optimal value of n in the n-step temporal difference (TD) learning algorithm. Our objective function for the optimization problem is the average root mean squared error (RMSE). We find the optimal n by resorting to a model-free optimization technique involving a one-simulation simultaneous perturbation stochastic approximation (SPSA) based procedure. Whereas SPSA is a zeroth-order continuous optimization procedure, we adapt it to the discrete optimization setting by using a random projection operator. We prove the asymptotic convergence of the recursion by showing that the sequence of n-updates obtained using zeroth-order stochastic gradient search converges almost surely to an internally chain transitive invariant set of an associated differential inclusion. This results in convergence of the discrete parameter sequence to the optimal n in n-step TD. Through experiments, we show that the optimal value of n is achieved with our SDPSA algorithm for arbitrary initial values. We further show using numerical evaluations that SDPSA outperforms the state-of-the-art discrete parameter stochastic optimization algorithm ‘Optimal Computing Budget Allocation (OCBA)’ on benchmark RL tasks.
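The idea in the abstract can be illustrated with a rough sketch. It is not the authors' exact recursion: the random-walk benchmark, the step-size schedule, and the names `random_projection`, `nstep_td_rmse`, and `sdpsa_sketch` are all assumptions made for illustration. The sketch keeps a continuous surrogate `theta` for n, perturbs it, rounds the perturbed value to a neighbouring integer at random, measures the average RMSE of n-step TD from a single simulation, and moves `theta` against the resulting zeroth-order gradient estimate.

```python
import math
import random


def random_projection(x, rng):
    """Round x to floor(x) or floor(x) + 1 at random so the expected result equals x."""
    lo = math.floor(x)
    return lo + 1 if rng.random() < (x - lo) else lo


def nstep_td_rmse(n, rng, episodes=10, alpha=0.1, num_states=19):
    """Undiscounted n-step TD prediction on a symmetric random walk (assumed benchmark).

    Returns the RMSE against the known true values, averaged over episodes.
    """
    left, right = 0, num_states + 1               # terminal states
    true_v = [2.0 * s / right - 1.0 for s in range(right + 1)]
    V = [0.0] * (right + 1)
    total = 0.0
    for _ in range(episodes):
        states, rewards = [right // 2], [0.0]     # start in the middle state
        T, t = math.inf, 0
        while True:
            if t < T:
                s_next = states[t] + (1 if rng.random() < 0.5 else -1)
                rewards.append(1.0 if s_next == right else -1.0 if s_next == left else 0.0)
                states.append(s_next)
                if s_next in (left, right):
                    T = t + 1
            tau = t - n + 1                       # time step whose value is updated
            if tau >= 0:
                G = sum(rewards[tau + 1:min(tau + n, T) + 1])
                if tau + n < T:
                    G += V[states[tau + n]]       # bootstrap from the current estimate
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
        sq = sum((V[s] - true_v[s]) ** 2 for s in range(1, right))
        total += math.sqrt(sq / num_states)
    return total / episodes


def sdpsa_sketch(n_init, n_max, iters, rng, delta=1.0, a=2.0):
    """One-simulation SPSA on a continuous surrogate theta of n (illustrative only)."""
    theta = float(n_init)
    for m in range(iters):
        perturb = rng.choice((-1.0, 1.0))         # symmetric Bernoulli perturbation
        shifted = min(max(theta + delta * perturb, 1.0), float(n_max))
        n = random_projection(shifted, rng)       # discrete n actually simulated
        cost = nstep_td_rmse(n, rng)              # single noisy cost measurement
        grad_est = cost / (delta * perturb)       # one-simulation gradient estimate
        theta = min(max(theta - a / (m + 1) * grad_est, 1.0), float(n_max))
    return theta
```

A call such as `sdpsa_sketch(n_init=8, n_max=16, iters=200, rng=random.Random(0))` returns the final `theta`, which can be rounded to obtain a candidate n. One-simulation estimates are noisy, so a short run like this should not be expected to settle on the optimum.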




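Source journal: Automatica (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 10.70

Self-citation rate: 7.80%

Articles published: 617

Review time: 5 months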

About the journal: Automatica is a leading archival publication in the field of systems and control. Today the field encompasses a broad set of areas and topics, and is thriving not only within itself but also in terms of its impact on other fields, such as communications, computers, biology, energy and economics. Since its inception in 1963, Automatica has kept abreast of the evolution of the field and has emerged as a leading publication driving its trends. It became a journal of the International Federation of Automatic Control (IFAC) in 1969. It features a characteristic blend of theoretical and applied papers of archival, lasting value, reporting cutting-edge research results by authors across the globe. It publishes articles in distinct categories, including regular, brief and survey papers, technical communiqués, correspondence items, as well as reviews of published books of interest to the readership. It occasionally publishes special issues on emerging new topics or established mature topics of interest to a broad audience. Automatica solicits original high-quality contributions in all the categories listed above, and in all areas of systems and control, interpreted in a broad sense and constantly evolving. Contributions may be submitted directly to a subject editor, or to the Editor-in-Chief if the subject area is unclear. Editorial procedures are in place to assure careful, fair, and prompt handling of all submitted articles. Accepted papers appear in the journal in the shortest time feasible given production time constraints.