Enhancing model learning in reinforcement learning through Q-function-guided trajectory alignment

IF 3.4 · CAS Zone 2 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Xin Du, Shan Zhong, Shengrong Gong, Yali Si, Zhenyu Qi
Applied Intelligence, vol. 55, no. 10. Published 2025-04-29. DOI: 10.1007/s10489-024-06083-9. Full text: https://link.springer.com/article/10.1007/s10489-024-06083-9
Citations: 0

Abstract

Model-based reinforcement learning (MBRL) methods hold great promise for achieving excellent sample efficiency by fitting a dynamics model to previously observed data and leveraging it for RL or planning. However, the resulting trajectories may diverge from real-world trajectories due to the accumulation of errors in multi-step model sampling, particularly over longer horizons. This undermines the performance of MBRL and significantly affects sample efficiency. Therefore, we present a trajectory alignment method that aligns simulated trajectories with their real counterparts from any random initial state and with adaptive length, enabling the preparation of paired real-simulated samples that minimize compounding errors. Additionally, we design a Q-function to estimate Q values for the paired real-simulated samples. Simulated samples whose Q-value difference from their real counterparts exceeds a given threshold are discarded, preventing the model from overfitting to erroneous samples. Experimental results demonstrate that both trajectory alignment and Q-function-guided sample filtration contribute to improving the learned policy and sample efficiency. Our method surpasses previous state-of-the-art model-based approaches in both sample efficiency and asymptotic performance across a series of challenging control tasks. The code is open source and available at https://github.com/duxin0618/qgtambpo.git.
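For intuition, the following minimal Python sketch illustrates the kind of Q-value-guided filtering described in the abstract: paired real and simulated transitions are scored with the same critic, and simulated samples whose Q-value deviates from their real counterpart by more than a threshold are dropped. The function name filter_simulated_samples, the q_net interface, and the batch layout are illustrative assumptions, not the interface of the released qgtambpo code.

import torch

def filter_simulated_samples(q_net, real_batch, sim_batch, threshold):
    # Illustrative sketch (assumed interface): q_net maps (state, action) -> Q-value;
    # real_batch and sim_batch are dicts of tensors with keys "state" and "action",
    # aligned element-wise so the i-th simulated sample pairs with the i-th real one.
    with torch.no_grad():
        q_real = q_net(real_batch["state"], real_batch["action"]).squeeze(-1)
        q_sim = q_net(sim_batch["state"], sim_batch["action"]).squeeze(-1)

    # Discard simulated samples whose Q-value diverges too far from the paired real
    # sample; such samples are likely corrupted by compounding model error.
    keep = (q_real - q_sim).abs() <= threshold
    filtered = {k: v[keep] for k, v in sim_batch.items()}
    return filtered, keep

In a full MBRL loop, a filter of this kind would sit between model rollout generation and the policy-optimization step, so that only simulated samples consistent with real experience reach the agent.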

Source journal
Applied Intelligence (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 6.60
Self-citation rate: 20.80%
Articles published: 1361
Review time: 5.9 months
Journal description: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.