PCDT: Pessimistic Critic Decision Transformer for Offline Reinforcement Learning

Xuesong Wang; Hengrui Zhang; Jiazhi Zhang; C. L. Philip Chen; Yuhu Cheng

IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 10, pp. 7247-7258, published 2025-07-11.
DOI: 10.1109/TSMC.2025.3583392 (https://ieeexplore.ieee.org/document/11078293/)
Impact Factor: 8.7 | JCR Q1 (Automation & Control Systems) | CAS Tier 1 (Computer Science)
Citations: 0

Abstract

The decision transformer (DT), as a conditional sequence modeling (CSM) approach, learns the action distribution for each state using historical information, such as trajectory returns, offering a supervised learning paradigm for offline reinforcement learning (Offline RL). However, because DT concentrates solely on individual trajectories with high returns-to-go, it neglects the potential for constructing optimal trajectories by combining sequences of different actions; in other words, traditional DT lacks the trajectory stitching capability. To address this concern, a novel pessimistic critic decision transformer (PCDT) for Offline RL is proposed. Our approach begins by pretraining a standard DT to explicitly capture behavior sequences. Next, we apply sequence importance sampling to penalize actions that significantly deviate from these behavior sequences, thereby constructing a pessimistic critic. Finally, Q-values are integrated into the policy update process, enabling the learned policy to approximate the behavior policy while favoring actions associated with the highest Q-values. Theoretical analysis shows that the sequence importance sampling in PCDT establishes a pessimistic lower bound, while the value optimality ensures that PCDT is capable of learning the optimal policy. Results on the D4RL benchmark tasks and ablation studies show that PCDT inherits the strengths of actor–critic (AC) and CSM methods, achieving the highest normalized scores on challenging sparse-reward and long-horizon tasks. Our code is available at https://github.com/Henry0132/PCDT.
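The abstract outlines a three-stage recipe: pretrain a behavior DT, build a pessimistic critic via sequence importance sampling, and then fold Q-values into the policy update. The sketch below is only a minimal illustration of that control flow in PyTorch under simplifying assumptions: the pretrained DT is replaced by a plain feed-forward network, the sequence importance weight is approximated by a per-transition distance to the behavior action, and all names, dimensions, and hyperparameters (behavior_dt, policy, critic, beta, q_weight) are hypothetical rather than taken from the authors' released code.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Small feed-forward network used as a stand-in for each component."""

    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)


state_dim, act_dim = 17, 6                 # hypothetical locomotion-style dimensions
behavior_dt = MLP(state_dim, act_dim)      # stand-in for the pretrained behavior DT (stage 1)
policy = MLP(state_dim, act_dim)
critic = MLP(state_dim + act_dim, 1)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)


def pcdt_step(states, actions, rewards, next_states, dones,
              gamma=0.99, beta=1.0, q_weight=1.0):
    # Stage 2: pessimistic critic. Actions far from the pretrained behavior
    # sequence get a small importance-style weight, which shrinks their TD
    # target (a crude per-transition proxy for sequence importance sampling).
    with torch.no_grad():
        behavior_act = behavior_dt(states)
        weight = torch.exp(-beta * (actions - behavior_act).pow(2).sum(-1, keepdim=True))
        next_act = policy(next_states)
        next_q = critic(torch.cat([next_states, next_act], dim=-1))
        target_q = weight * (rewards + gamma * (1.0 - dones) * next_q)
    q = critic(torch.cat([states, actions], dim=-1))
    critic_loss = (q - target_q).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Stage 3: policy update. Stay close to the dataset actions (behavior
    # cloning term) while preferring actions the critic scores highly.
    pred_act = policy(states)
    bc_loss = (pred_act - actions).pow(2).mean()
    q_loss = -critic(torch.cat([states, pred_act], dim=-1)).mean()
    policy_loss = bc_loss + q_weight * q_loss
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
    return critic_loss.item(), policy_loss.item()


# Usage on a random offline batch (batch size 32), just to show the expected shapes.
B = 32
s, a = torch.randn(B, state_dim), torch.randn(B, act_dim)
r, s2, d = torch.randn(B, 1), torch.randn(B, state_dim), torch.zeros(B, 1)
print(pcdt_step(s, a, r, s2, d))
```

The balance between the behavior-cloning term and the Q-value term (q_weight here) is the knob that trades off staying inside the data distribution against exploiting the critic; the paper's actual weighting and its sequence-level importance weights differ from this single-transition simplification.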
Source Journal

IEEE Transactions on Systems, Man, and Cybernetics: Systems
Categories: Automation & Control Systems; Computer Science, Cybernetics
CiteScore: 18.50
Self-citation rate: 11.50%
Annual publication volume: 812
Review time: 6 months
Journal description: The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.