Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning

Teng Yan, Zhendong Ruan, Yaobang Cai, Yu Han, Wenxian Li, Yang Zhang
{"title":"用于离线强化学习的 Q 值正则化决策 ConvFormer","authors":"Teng Yan, Zhendong Ruan, Yaobang Cai, Yu Han, Wenxian Li, Yang Zhang","doi":"arxiv-2409.08062","DOIUrl":null,"url":null,"abstract":"As a data-driven paradigm, offline reinforcement learning (Offline RL) has\nbeen formulated as sequence modeling, where the Decision Transformer (DT) has\ndemonstrated exceptional capabilities. Unlike previous reinforcement learning\nmethods that fit value functions or compute policy gradients, DT adjusts the\nautoregressive model based on the expected returns, past states, and actions,\nusing a causally masked Transformer to output the optimal action. However, due\nto the inconsistency between the sampled returns within a single trajectory and\nthe optimal returns across multiple trajectories, it is challenging to set an\nexpected return to output the optimal action and stitch together suboptimal\ntrajectories. Decision ConvFormer (DC) is easier to understand in the context\nof modeling RL trajectories within a Markov Decision Process compared to DT. We\npropose the Q-value Regularized Decision ConvFormer (QDC), which combines the\nunderstanding of RL trajectories by DC and incorporates a term that maximizes\naction values using dynamic programming methods during training. This ensures\nthat the expected returns of the sampled actions are consistent with the\noptimal returns. QDC achieves excellent performance on the D4RL benchmark,\noutperforming or approaching the optimal level in all tested environments. It\nparticularly demonstrates outstanding competitiveness in trajectory stitching\ncapability.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning\",\"authors\":\"Teng Yan, Zhendong Ruan, Yaobang Cai, Yu Han, Wenxian Li, Yang Zhang\",\"doi\":\"arxiv-2409.08062\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As a data-driven paradigm, offline reinforcement learning (Offline RL) has\\nbeen formulated as sequence modeling, where the Decision Transformer (DT) has\\ndemonstrated exceptional capabilities. Unlike previous reinforcement learning\\nmethods that fit value functions or compute policy gradients, DT adjusts the\\nautoregressive model based on the expected returns, past states, and actions,\\nusing a causally masked Transformer to output the optimal action. However, due\\nto the inconsistency between the sampled returns within a single trajectory and\\nthe optimal returns across multiple trajectories, it is challenging to set an\\nexpected return to output the optimal action and stitch together suboptimal\\ntrajectories. Decision ConvFormer (DC) is easier to understand in the context\\nof modeling RL trajectories within a Markov Decision Process compared to DT. We\\npropose the Q-value Regularized Decision ConvFormer (QDC), which combines the\\nunderstanding of RL trajectories by DC and incorporates a term that maximizes\\naction values using dynamic programming methods during training. This ensures\\nthat the expected returns of the sampled actions are consistent with the\\noptimal returns. QDC achieves excellent performance on the D4RL benchmark,\\noutperforming or approaching the optimal level in all tested environments. 
It\\nparticularly demonstrates outstanding competitiveness in trajectory stitching\\ncapability.\",\"PeriodicalId\":501031,\"journal\":{\"name\":\"arXiv - CS - Robotics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Robotics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.08062\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Robotics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08062","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

As a data-driven paradigm, offline reinforcement learning (offline RL) has been formulated as sequence modeling, where the Decision Transformer (DT) has demonstrated exceptional capabilities. Unlike earlier reinforcement learning methods that fit value functions or compute policy gradients, DT conditions an autoregressive model on expected returns, past states, and actions, using a causally masked Transformer to output the optimal action. However, because the returns sampled within a single trajectory are inconsistent with the optimal returns achievable across multiple trajectories, it is challenging to set an expected return that yields the optimal action and to stitch together suboptimal trajectories. Compared to DT, the Decision ConvFormer (DC) is easier to understand as a model of RL trajectories within a Markov Decision Process. We propose the Q-value Regularized Decision ConvFormer (QDC), which combines DC's treatment of RL trajectories with a training term that maximizes action values via dynamic programming, ensuring that the expected returns of the sampled actions are consistent with the optimal returns. QDC achieves excellent performance on the D4RL benchmark, outperforming or approaching the best level in all tested environments, and it demonstrates particularly strong trajectory-stitching capability.
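
The abstract describes QDC's objective as a sequence-modeling loss combined with a term that maximizes action values learned by dynamic programming. The sketch below illustrates that combination under assumed, simplified components: a causal 1-D convolutional action head standing in for the Decision ConvFormer, a small Q-network, and a hypothetical weighting coefficient eta. None of these module names, sizes, or values come from the paper; this is a minimal illustration of the training objective, not the authors' implementation.

```python
# Minimal sketch of a QDC-style training objective (hypothetical components).
import torch
import torch.nn as nn

class ConvDecisionHead(nn.Module):
    """Stand-in for a Decision ConvFormer-style model: a causal 1-D convolution
    over the token sequence, followed by a linear action head."""
    def __init__(self, token_dim: int, action_dim: int, kernel_size: int = 4):
        super().__init__()
        # Padding by (kernel_size - 1) and truncating the output keeps the
        # convolution causal: step t only sees tokens <= t.
        self.conv = nn.Conv1d(token_dim, token_dim, kernel_size,
                              padding=kernel_size - 1)
        self.action_head = nn.Linear(token_dim, action_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, token_dim)
        h = self.conv(tokens.transpose(1, 2))[..., : tokens.shape[1]]
        return self.action_head(h.transpose(1, 2))  # (batch, seq_len, action_dim)

class QNetwork(nn.Module):
    """Simple state-action value function used for the regularization term."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

def qdc_style_loss(policy, q_net, tokens, states, actions, eta: float = 0.5):
    """Sequence-modeling (behavior cloning) loss plus a term that pushes the
    predicted actions toward high Q-values; `eta` is a hypothetical weight."""
    pred_actions = policy(tokens)
    bc_loss = nn.functional.mse_loss(pred_actions, actions)  # imitate dataset actions
    q_term = -q_net(states, pred_actions).mean()             # maximize Q => minimize -Q
    return bc_loss + eta * q_term

if __name__ == "__main__":
    B, T, token_dim, state_dim, action_dim = 8, 10, 32, 17, 6
    policy = ConvDecisionHead(token_dim, action_dim)
    q_net = QNetwork(state_dim, action_dim)
    tokens = torch.randn(B, T, token_dim)
    states, actions = torch.randn(B, T, state_dim), torch.randn(B, T, action_dim)
    loss = qdc_style_loss(policy, q_net, tokens, states, actions)
    loss.backward()
    print(f"combined QDC-style loss: {loss.item():.4f}")
```

In this simplified view, the behavior-cloning term keeps the model consistent with the dataset's trajectories, while the Q-value term supplies the dynamic-programming signal that the abstract credits with aligning sampled returns with optimal returns and enabling trajectory stitching.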