PCDT: Pessimistic Critic Decision Transformer for Offline Reinforcement Learning
Xuesong Wang; Hengrui Zhang; Jiazhi Zhang; C. L. Philip Chen; Yuhu Cheng
IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 10, pp. 7247-7258
DOI: 10.1109/TSMC.2025.3583392 | Published: 2025-07-11 | https://ieeexplore.ieee.org/document/11078293/
Abstract
The decision transformer (DT), as a conditional sequence modeling (CSM) approach, learns the action distribution for each state from historical information, such as trajectory returns, offering a supervised learning paradigm for offline reinforcement learning (offline RL). However, because DT concentrates solely on individual trajectories with high returns-to-go, it neglects the potential for constructing optimal trajectories by combining sequences of different actions; in other words, traditional DT lacks the trajectory-stitching capability. To address this concern, a novel pessimistic critic decision transformer (PCDT) for offline RL is proposed. Our approach begins by pretraining a standard DT to explicitly capture behavior sequences. Next, we apply sequence importance sampling to penalize actions that deviate significantly from these behavior sequences, thereby constructing a pessimistic critic. Finally, Q-values are integrated into the policy update, enabling the learned policy to approximate the behavior policy while favoring actions associated with the highest Q-value. Theoretical analysis shows that the sequence importance sampling in PCDT establishes a pessimistic lower bound, while the value-optimality result ensures that PCDT can learn the optimal policy. Results on the D4RL benchmark tasks and ablation studies show that PCDT inherits the strengths of actor–critic (AC) and CSM methods, achieving the highest normalized scores on challenging sparse-reward and long-horizon tasks. Our code is available at https://github.com/Henry0132/PCDT.
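To make the three-step recipe in the abstract (pretrain a behavior DT, build a pessimistic critic that penalizes out-of-distribution actions, then update the policy toward high-Q actions while staying close to the behavior policy) more concrete, the following is a minimal illustrative sketch. It is based only on the abstract, not on the authors' released code: the MLP stand-ins for the Decision Transformer, the squared-drift term used here as a surrogate for sequence importance sampling, and all names and hyperparameters (behavior_dt, penalty_coef, q_coef, etc.) are assumptions for illustration. See https://github.com/Henry0132/PCDT for the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, hidden = 17, 6, 256  # arbitrary toy dimensions

# Stand-ins for the sequence models; the paper uses Decision Transformers over trajectories.
behavior_dt = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))
policy      = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))
critic      = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
critic_tgt  = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
critic_tgt.load_state_dict(critic.state_dict())

opt_q  = torch.optim.Adam(critic.parameters(), lr=3e-4)
opt_pi = torch.optim.Adam(policy.parameters(), lr=3e-4)

def pcdt_step(batch, gamma=0.99, penalty_coef=1.0, q_coef=1.0):
    """One hypothetical PCDT-style update on a batch of offline transitions."""
    s, a, r, s_next, done = batch

    # Pessimistic critic: lower the target value for actions that drift away from
    # the pretrained behavior DT (a crude stand-in for sequence importance sampling).
    with torch.no_grad():
        a_next = policy(s_next)
        drift = (a_next - behavior_dt(s_next)).pow(2).mean(dim=-1, keepdim=True)
        target_q = r + gamma * (1 - done) * (
            critic_tgt(torch.cat([s_next, a_next], dim=-1)) - penalty_coef * drift
        )
    q_loss = F.mse_loss(critic(torch.cat([s, a], dim=-1)), target_q)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # Policy update: approximate the behavior policy (BC term) while favoring
    # actions with high Q-values under the pessimistic critic.
    a_pi = policy(s)
    bc_loss = F.mse_loss(a_pi, a)
    q_term = critic(torch.cat([s, a_pi], dim=-1)).mean()
    pi_loss = bc_loss - q_coef * q_term
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
    return q_loss.item(), pi_loss.item()

# Toy usage with random tensors standing in for a D4RL minibatch.
B = 32
batch = (torch.randn(B, state_dim), torch.randn(B, action_dim),
         torch.randn(B, 1), torch.randn(B, state_dim), torch.zeros(B, 1))
print(pcdt_step(batch))
```

The sketch only mirrors the structure described in the abstract (pessimistic value target plus a behavior-regularized, Q-weighted policy update); the paper's sequence-level importance weights, transformer conditioning on returns-to-go, and theoretical lower bound are not reproduced here.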
Journal Introduction
The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.