Aggregated masked autoencoding for offline reinforcement learning

Impact Factor: 3.3 | CAS Tier 3 (Computer Science) | JCR Q2 (Computer Science, Artificial Intelligence)
Changqing Yuan, Yongfang Xie, Shiwen Xie, Zhaohui Tang, Zongze Wu
{"title":"Aggregated masked autoencoding for offline reinforcement learning","authors":"Changqing Yuan ,&nbsp;Yongfang Xie ,&nbsp;Shiwen Xie ,&nbsp;Zhaohui Tang ,&nbsp;Zongze Wu","doi":"10.1016/j.patrec.2025.08.007","DOIUrl":null,"url":null,"abstract":"<div><div>Viewing offline reinforcement learning (RL) as a sequence modeling problem has emerged as a new research trend. Recent approaches leverage self-supervised learning to improve sequence representations, yet most rely on state sequences for pretraining, thereby disrupting the intrinsic state–action coupling, which complicates the distinction of trajectory bifurcations caused by action quality differences. Moreover, actions from stochastic policies in offline datasets may cause low-quality state transitions to be mistakenly identified as salient information, hindering representation learning and degrading policy performance. To mitigate these issues, we propose aggregated masked future prediction (AMFP), a self-supervised learning framework for offline RL. AMFP introduces a new pretext task that combines weighted aggregation and masked autoencoding through global fusion tokens to perform aggregated masked reconstruction. The weighted aggregation mechanism is to assign higher weights to samples that are semantically similar to the anchor in the representation space, enabling the model to emphasize reliable state transitions and suppress misleading transitions from stochastic or low-quality actions. Meanwhile, the global fusion tokens serve a dual purpose: they facilitate the integration of weighted aggregation and masked autoencoding, and, after encoding, function as compressed representations of the state trajectory and implicit action-state coupling. The encoded representations are then utilized as the latent contextual factor to guide policy learning and improve robustness. Experimental evaluation on D4RL benchmarks demonstrates the effectiveness of our method in improving policy learning.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"197 ","pages":"Pages 312-318"},"PeriodicalIF":3.3000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167865525002867","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Viewing offline reinforcement learning (RL) as a sequence modeling problem has emerged as a new research trend. Recent approaches leverage self-supervised learning to improve sequence representations, yet most rely on state sequences for pretraining, thereby disrupting the intrinsic state–action coupling, which complicates the distinction of trajectory bifurcations caused by action quality differences. Moreover, actions from stochastic policies in offline datasets may cause low-quality state transitions to be mistakenly identified as salient information, hindering representation learning and degrading policy performance. To mitigate these issues, we propose aggregated masked future prediction (AMFP), a self-supervised learning framework for offline RL. AMFP introduces a new pretext task that combines weighted aggregation and masked autoencoding through global fusion tokens to perform aggregated masked reconstruction. The weighted aggregation mechanism assigns higher weights to samples that are semantically similar to the anchor in the representation space, enabling the model to emphasize reliable state transitions and suppress misleading transitions from stochastic or low-quality actions. Meanwhile, the global fusion tokens serve a dual purpose: they facilitate the integration of weighted aggregation and masked autoencoding, and, after encoding, function as compressed representations of the state trajectory and implicit state–action coupling. The encoded representations are then used as a latent contextual factor to guide policy learning and improve robustness. Experimental evaluation on D4RL benchmarks demonstrates the effectiveness of our method in improving policy learning.
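To make the weighted-aggregation idea concrete, the following is a minimal PyTorch sketch of similarity-weighted aggregation as described in the abstract: candidate sub-trajectory representations are weighted by their similarity to an anchor representation, so that samples resembling the anchor (presumably reliable transitions) dominate the aggregate. The function name, the use of cosine similarity, the softmax weighting, and the temperature parameter are illustrative assumptions; the abstract does not specify these details, and this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def weighted_aggregation(anchor: torch.Tensor,
                         candidates: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Similarity-weighted aggregation (illustrative sketch, not the paper's code).

    anchor:     (d,)   representation of the anchor sub-trajectory
    candidates: (n, d) representations of other sampled sub-trajectories
    returns:    (d,)   aggregate in which candidates similar to the anchor dominate
    """
    anchor_n = F.normalize(anchor, dim=-1)               # unit-norm anchor
    cand_n = F.normalize(candidates, dim=-1)             # unit-norm candidates
    sims = cand_n @ anchor_n                             # (n,) cosine similarities to the anchor
    weights = torch.softmax(sims / temperature, dim=0)   # higher weight for more similar samples
    return weights @ candidates                          # (d,) weighted sum of candidates

# Toy usage: aggregate 4 candidate representations of dimension 8
if __name__ == "__main__":
    torch.manual_seed(0)
    anchor = torch.randn(8)
    candidates = torch.randn(4, 8)
    agg = weighted_aggregation(anchor, candidates)
    print(agg.shape)  # torch.Size([8])
```

In the framework described above, such an aggregate would feed the masked-reconstruction pretext task through the global fusion tokens; that integration is architecture-specific and is not reproduced here.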
Source journal

Pattern Recognition Letters (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Articles per year: 287
Review time: 9.1 months
Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.