Enhancing model learning in reinforcement learning through Q-function-guided trajectory alignment

IF 3.4 · CAS Zone 2 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Xin Du, Shan Zhong, Shengrong Gong, Yali Si, Zhenyu Qi
Applied Intelligence, vol. 55, no. 10. Published 2025-04-29. DOI: 10.1007/s10489-024-06083-9. Full text: https://link.springer.com/article/10.1007/s10489-024-06083-9
Citations: 0

Abstract

Model-based reinforcement learning (MBRL) methods hold great promise for achieving excellent sample efficiency by fitting a dynamics model to previously observed data and leveraging it for RL or planning. However, the resulting trajectories may diverge from real-world trajectories due to the accumulation of errors in multi-step model sampling, particularly over longer horizons. This undermines the performance of MBRL and significantly affects sample efficiency. Therefore, we present a trajectory alignment method that aligns simulated trajectories with their real counterparts from any random initial state and with adaptive length, enabling the preparation of paired real-simulated samples that minimize compounding errors. Additionally, we design a Q-function to estimate Q values for the paired real-simulated samples. Simulated samples whose Q-value difference from their real counterparts exceeds a given threshold are discarded, preventing the model from overfitting to erroneous samples. Experimental results demonstrate that both trajectory alignment and Q-function-guided sample filtration contribute to improving the learned policy and sample efficiency. Our method surpasses previous state-of-the-art model-based approaches in both sample efficiency and asymptotic performance across a series of challenging control tasks. The code is open source and available at https://github.com/duxin0618/qgtambpo.git.
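For intuition, the following minimal Python sketch illustrates the kind of Q-value-guided filtering described in the abstract: paired real and simulated transitions are scored with the same critic, and simulated samples whose Q-value deviates from their real counterpart by more than a threshold are dropped. The function name filter_simulated_samples, the q_net interface, and the batch layout are illustrative assumptions, not the interface of the released qgtambpo code.

import torch

def filter_simulated_samples(q_net, real_batch, sim_batch, threshold):
    # Illustrative sketch (assumed interface): q_net maps (state, action) -> Q-value;
    # real_batch and sim_batch are dicts of tensors with keys "state" and "action",
    # aligned element-wise so the i-th simulated sample pairs with the i-th real one.
    with torch.no_grad():
        q_real = q_net(real_batch["state"], real_batch["action"]).squeeze(-1)
        q_sim = q_net(sim_batch["state"], sim_batch["action"]).squeeze(-1)

    # Discard simulated samples whose Q-value diverges too far from the paired real
    # sample; such samples are likely corrupted by compounding model error.
    keep = (q_real - q_sim).abs() <= threshold
    filtered = {k: v[keep] for k, v in sim_batch.items()}
    return filtered, keep

In a full MBRL loop, a filter of this kind would sit between model rollout generation and the policy-optimization step, so that only simulated samples consistent with real experience reach the agent.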

Source journal
Applied Intelligence (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 6.60
Self-citation rate: 20.80%
Articles published: 1361
Review time: 5.9 months
Journal description: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.