PCDT: Pessimistic Critic Decision Transformer for Offline Reinforcement Learning

Xuesong Wang; Hengrui Zhang; Jiazhi Zhang; C. L. Philip Chen; Yuhu Cheng

IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 10, pp. 7247-7258, published 2025-07-11.
DOI: 10.1109/TSMC.2025.3583392 (https://ieeexplore.ieee.org/document/11078293/)
Impact Factor: 8.7 | JCR Q1 (Automation & Control Systems) | CAS Tier 1 (Computer Science)
Citations: 0

Abstract

The decision transformer (DT), as a conditional sequence modeling (CSM) approach, learns the action distribution for each state using historical information, such as trajectory returns, offering a supervised learning paradigm for offline reinforcement learning (Offline RL). However, because DT concentrates solely on individual trajectories with high returns-to-go, it neglects the potential for constructing optimal trajectories by combining sequences of different actions; in other words, traditional DT lacks the trajectory stitching capability. To address this concern, a novel pessimistic critic decision transformer (PCDT) for Offline RL is proposed. Our approach begins by pretraining a standard DT to explicitly capture behavior sequences. Next, we apply sequence importance sampling to penalize actions that significantly deviate from these behavior sequences, thereby constructing a pessimistic critic. Finally, Q-values are integrated into the policy update process, enabling the learned policy to approximate the behavior policy while favoring actions associated with the highest Q-values. Theoretical analysis shows that the sequence importance sampling in PCDT establishes a pessimistic lower bound, while the value optimality ensures that PCDT is capable of learning the optimal policy. Results on the D4RL benchmark tasks and ablation studies show that PCDT inherits the strengths of actor–critic (AC) and CSM methods, achieving the highest normalized scores on challenging sparse-reward and long-horizon tasks. Our code is available at https://github.com/Henry0132/PCDT.
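The abstract outlines a three-stage recipe: pretrain a behavior DT, build a pessimistic critic via sequence importance sampling, and then fold Q-values into the policy update. The sketch below is only a minimal illustration of that control flow in PyTorch under simplifying assumptions: the pretrained DT is replaced by a plain feed-forward network, the sequence importance weight is approximated by a per-transition distance to the behavior action, and all names, dimensions, and hyperparameters (behavior_dt, policy, critic, beta, q_weight) are hypothetical rather than taken from the authors' released code.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Small feed-forward network used as a stand-in for each component."""

    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)


state_dim, act_dim = 17, 6                 # hypothetical locomotion-style dimensions
behavior_dt = MLP(state_dim, act_dim)      # stand-in for the pretrained behavior DT (stage 1)
policy = MLP(state_dim, act_dim)
critic = MLP(state_dim + act_dim, 1)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)


def pcdt_step(states, actions, rewards, next_states, dones,
              gamma=0.99, beta=1.0, q_weight=1.0):
    # Stage 2: pessimistic critic. Actions far from the pretrained behavior
    # sequence get a small importance-style weight, which shrinks their TD
    # target (a crude per-transition proxy for sequence importance sampling).
    with torch.no_grad():
        behavior_act = behavior_dt(states)
        weight = torch.exp(-beta * (actions - behavior_act).pow(2).sum(-1, keepdim=True))
        next_act = policy(next_states)
        next_q = critic(torch.cat([next_states, next_act], dim=-1))
        target_q = weight * (rewards + gamma * (1.0 - dones) * next_q)
    q = critic(torch.cat([states, actions], dim=-1))
    critic_loss = (q - target_q).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Stage 3: policy update. Stay close to the dataset actions (behavior
    # cloning term) while preferring actions the critic scores highly.
    pred_act = policy(states)
    bc_loss = (pred_act - actions).pow(2).mean()
    q_loss = -critic(torch.cat([states, pred_act], dim=-1)).mean()
    policy_loss = bc_loss + q_weight * q_loss
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
    return critic_loss.item(), policy_loss.item()


# Usage on a random offline batch (batch size 32), just to show the expected shapes.
B = 32
s, a = torch.randn(B, state_dim), torch.randn(B, act_dim)
r, s2, d = torch.randn(B, 1), torch.randn(B, state_dim), torch.zeros(B, 1)
print(pcdt_step(s, a, r, s2, d))
```

The balance between the behavior-cloning term and the Q-value term (q_weight here) is the knob that trades off staying inside the data distribution against exploiting the critic; the paper's actual weighting and its sequence-level importance weights differ from this single-transition simplification.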
Source Journal

IEEE Transactions on Systems, Man, and Cybernetics: Systems
Categories: Automation & Control Systems; Computer Science, Cybernetics
CiteScore: 18.50
Self-citation rate: 11.50%
Annual publication volume: 812
Review time: 6 months
Journal description: The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.