Aggregated masked autoencoding for offline reinforcement learning

Impact Factor: 3.3 | CAS Tier 3 (Computer Science) | JCR Q2 (Computer Science, Artificial Intelligence)
Changqing Yuan, Yongfang Xie, Shiwen Xie, Zhaohui Tang, Zongze Wu
{"title":"Aggregated masked autoencoding for offline reinforcement learning","authors":"Changqing Yuan ,&nbsp;Yongfang Xie ,&nbsp;Shiwen Xie ,&nbsp;Zhaohui Tang ,&nbsp;Zongze Wu","doi":"10.1016/j.patrec.2025.08.007","DOIUrl":null,"url":null,"abstract":"<div><div>Viewing offline reinforcement learning (RL) as a sequence modeling problem has emerged as a new research trend. Recent approaches leverage self-supervised learning to improve sequence representations, yet most rely on state sequences for pretraining, thereby disrupting the intrinsic state–action coupling, which complicates the distinction of trajectory bifurcations caused by action quality differences. Moreover, actions from stochastic policies in offline datasets may cause low-quality state transitions to be mistakenly identified as salient information, hindering representation learning and degrading policy performance. To mitigate these issues, we propose aggregated masked future prediction (AMFP), a self-supervised learning framework for offline RL. AMFP introduces a new pretext task that combines weighted aggregation and masked autoencoding through global fusion tokens to perform aggregated masked reconstruction. The weighted aggregation mechanism is to assign higher weights to samples that are semantically similar to the anchor in the representation space, enabling the model to emphasize reliable state transitions and suppress misleading transitions from stochastic or low-quality actions. Meanwhile, the global fusion tokens serve a dual purpose: they facilitate the integration of weighted aggregation and masked autoencoding, and, after encoding, function as compressed representations of the state trajectory and implicit action-state coupling. The encoded representations are then utilized as the latent contextual factor to guide policy learning and improve robustness. Experimental evaluation on D4RL benchmarks demonstrates the effectiveness of our method in improving policy learning.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"197 ","pages":"Pages 312-318"},"PeriodicalIF":3.3000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167865525002867","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Viewing offline reinforcement learning (RL) as a sequence modeling problem has emerged as a new research trend. Recent approaches leverage self-supervised learning to improve sequence representations, yet most rely on state sequences for pretraining, thereby disrupting the intrinsic state–action coupling, which complicates the distinction of trajectory bifurcations caused by action quality differences. Moreover, actions from stochastic policies in offline datasets may cause low-quality state transitions to be mistakenly identified as salient information, hindering representation learning and degrading policy performance. To mitigate these issues, we propose aggregated masked future prediction (AMFP), a self-supervised learning framework for offline RL. AMFP introduces a new pretext task that combines weighted aggregation and masked autoencoding through global fusion tokens to perform aggregated masked reconstruction. The weighted aggregation mechanism assigns higher weights to samples that are semantically similar to the anchor in the representation space, enabling the model to emphasize reliable state transitions and suppress misleading transitions from stochastic or low-quality actions. Meanwhile, the global fusion tokens serve a dual purpose: they facilitate the integration of weighted aggregation and masked autoencoding, and, after encoding, function as compressed representations of the state trajectory and implicit state–action coupling. The encoded representations are then used as a latent contextual factor to guide policy learning and improve robustness. Experimental evaluation on D4RL benchmarks demonstrates the effectiveness of our method in improving policy learning.
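To make the weighted-aggregation idea concrete, the following is a minimal PyTorch sketch of similarity-weighted aggregation as described in the abstract: candidate sub-trajectory representations are weighted by their similarity to an anchor representation, so that samples resembling the anchor (presumably reliable transitions) dominate the aggregate. The function name, the use of cosine similarity, the softmax weighting, and the temperature parameter are illustrative assumptions; the abstract does not specify these details, and this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def weighted_aggregation(anchor: torch.Tensor,
                         candidates: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Similarity-weighted aggregation (illustrative sketch, not the paper's code).

    anchor:     (d,)   representation of the anchor sub-trajectory
    candidates: (n, d) representations of other sampled sub-trajectories
    returns:    (d,)   aggregate in which candidates similar to the anchor dominate
    """
    anchor_n = F.normalize(anchor, dim=-1)               # unit-norm anchor
    cand_n = F.normalize(candidates, dim=-1)             # unit-norm candidates
    sims = cand_n @ anchor_n                             # (n,) cosine similarities to the anchor
    weights = torch.softmax(sims / temperature, dim=0)   # higher weight for more similar samples
    return weights @ candidates                          # (d,) weighted sum of candidates

# Toy usage: aggregate 4 candidate representations of dimension 8
if __name__ == "__main__":
    torch.manual_seed(0)
    anchor = torch.randn(8)
    candidates = torch.randn(4, 8)
    agg = weighted_aggregation(anchor, candidates)
    print(agg.shape)  # torch.Size([8])
```

In the framework described above, such an aggregate would feed the masked-reconstruction pretext task through the global fusion tokens; that integration is architecture-specific and is not reproduced here.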
Source journal

Pattern Recognition Letters (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Articles per year: 287
Review time: 9.1 months
Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.