Dual-head prediction and reconstruction with coarse-to-fine masks for visual reinforcement learning
Yun Zhou, Yuqiang Wu, Qiaoyun Wu, Chunyu Tan, Shu Zhan, Richang Hong
Neural Networks, vol. 194, art. 108149. Published 2025-09-29. DOI: https://doi.org/10.1016/j.neunet.2025.108149 (IF 6.3, JCR Q1, Computer Science, Artificial Intelligence)
Cited by: 0
Abstract
When experience is limited and input data are high-dimensional, effective representation learning plays a vital role in enabling visual reinforcement learning (RL) to excel across diverse tasks. To better leverage the agent's sampled trajectories during training, we introduce DPRM, a Dual-head Prediction and Reconstruction task with coarse-to-fine Masks for RL. DPRM addresses these challenges by integrating coarse-to-fine masks with a dual-head prediction-reconstruction (DHPR) architecture, complemented by a coordinate-based spatial coding strategy (CSCS). The CSCS enriches the spatial information of the observation state, facilitating the capture of motion changes between consecutive context states. Furthermore, the coarse-to-fine masks are gradually refined, guiding the downstream DHPR model to learn essential features and semantics more effectively. Built on a transformer architecture, DHPR introduces a novel triplet input token comprising two consecutive actions paired with an observation state. This design enables bidirectional prediction of past and future states from the temporal extremities while efficiently reconstructing masked latent features throughout state sequences. Experimental results on both continuous control (DeepMind Control Suite benchmarks) and discrete control (Atari) tasks demonstrate that the DPRM algorithm significantly enhances performance, yielding higher reward accumulation and faster convergence. Code is available here.
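The triplet input token and coarse-to-fine masking described above can be sketched as follows. This is a minimal illustrative reading, not the authors' implementation: the function names, the linear refinement schedule, and the exact (a_{t-1}, s_t, a_t) token layout are assumptions inferred from the abstract.

```python
def coarse_to_fine_mask_ratio(step, total_steps, start=0.75, end=0.25):
    """Hypothetical schedule: the fraction of masked latent features
    shrinks linearly from a coarse (high) ratio to a fine (low) one
    as training progresses, so the model sees progressively more of
    each state sequence."""
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac

def make_triplet_tokens(states, actions):
    """Pair each observation state s_t with its two surrounding
    consecutive actions (a_{t-1}, a_t), one plausible reading of the
    triplet input token fed to the transformer. `states` and `actions`
    are aligned sequences of equal length T (s_0..s_{T-1}, a_0..a_{T-1})."""
    triplets = []
    for t in range(1, len(states)):
        triplets.append((actions[t - 1], states[t], actions[t]))
    return triplets
```

Under this reading, the two actions flanking each state give the sequence model the temporal extremities needed for bidirectional prediction, while the schedule controls how aggressively latent features are hidden for reconstruction.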
Journal introduction:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.