Dynamic Weights and Prior Reward in Policy Fusion for Compound Agent Learning

IF 7.2 | CAS Tier 4 (Computer Science) | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Meng Xu, Yechao She, Yang Jin, Jianping Wang
{"title":"Dynamic Weights and Prior Reward in Policy Fusion for Compound Agent Learning","authors":"Meng Xu, Yechao She, Yang Jin, Jianping Wang","doi":"10.1145/3623405","DOIUrl":null,"url":null,"abstract":"<p>In Deep Reinforcement Learning (DRL) domain, a compound learning task is often decomposed into several sub-tasks in a divide-and-conquer manner, each trained separately and then fused concurrently to achieve the original task, referred to as policy fusion. However, the state-of-the-art (SOTA) policy fusion methods treat the importance of sub-tasks equally throughout the task process, eliminating the possibility of the agent relying on different sub-tasks at various stages. To address this limitation, we propose a generic policy fusion approach, referred to as Policy Fusion Learning with Dynamic Weights and Prior Reward (PFLDWPR), to automate the time-varying selection of sub-tasks. Specifically, PFLDWPR produces a time-varying one-hot vector for sub-tasks to dynamically select a suitable sub-task and mask the rest throughout the entire task process, enabling the fused strategy to optimally guide the agent in executing the compound task. The sub-tasks with the dynamic one-hot vector are then aggregated to obtain the action policy for the original task. Moreover, we collect sub-tasks’s rewards at the pre-training stage as a prior reward, which, along with the current reward, is used to train the policy fusion network. Thus, this approach reduces fusion bias by leveraging prior experience. Experimental results under three popular learning tasks demonstrate that the proposed method significantly improves three SOTA policy fusion methods in terms of task duration, episode reward, and score difference.</p>","PeriodicalId":48967,"journal":{"name":"ACM Transactions on Intelligent Systems and Technology","volume":"35 1","pages":""},"PeriodicalIF":7.2000,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Intelligent Systems and Technology","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3623405","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

In the Deep Reinforcement Learning (DRL) domain, a compound learning task is often decomposed into several sub-tasks in a divide-and-conquer manner, each trained separately and then fused to accomplish the original task, a process referred to as policy fusion. However, state-of-the-art (SOTA) policy fusion methods treat all sub-tasks as equally important throughout the task, ruling out the possibility of the agent relying on different sub-tasks at different stages. To address this limitation, we propose a generic policy fusion approach, referred to as Policy Fusion Learning with Dynamic Weights and Prior Reward (PFLDWPR), to automate the time-varying selection of sub-tasks. Specifically, PFLDWPR produces a time-varying one-hot vector over the sub-tasks that dynamically selects a suitable sub-task and masks the rest throughout the task, enabling the fused strategy to optimally guide the agent in executing the compound task. The sub-task policies weighted by the dynamic one-hot vector are then aggregated to obtain the action policy for the original task. Moreover, we collect the sub-tasks' rewards at the pre-training stage as a prior reward, which, together with the current reward, is used to train the policy fusion network; this reduces fusion bias by leveraging prior experience. Experimental results on three popular learning tasks demonstrate that the proposed method significantly outperforms three SOTA policy fusion methods in terms of task duration, episode reward, and score difference.
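The abstract describes the mechanism in enough detail to illustrate it with a small sketch. The code below is a minimal, self-contained toy (not the authors' implementation) of the two ideas: a selector that emits a time-varying one-hot mask over pre-trained sub-policies so that exactly one sub-task drives the fused action at each step, and a training signal that blends the current reward with a prior reward recorded during sub-task pre-training. All names (SubPolicy, FusionSelector, training_reward) and the mixing coefficient beta are illustrative assumptions, not identifiers or values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


class SubPolicy:
    """Stand-in for a pre-trained sub-task policy: maps a state to action logits."""

    def __init__(self, state_dim: int, n_actions: int):
        self.W = rng.normal(scale=0.1, size=(state_dim, n_actions))

    def action_logits(self, state: np.ndarray) -> np.ndarray:
        return state @ self.W


class FusionSelector:
    """Scores the sub-tasks at each step and emits a time-varying one-hot mask."""

    def __init__(self, state_dim: int, n_subtasks: int):
        self.W = rng.normal(scale=0.1, size=(state_dim, n_subtasks))

    def one_hot(self, state: np.ndarray) -> np.ndarray:
        scores = state @ self.W              # per-sub-task scores for this state
        mask = np.zeros_like(scores)
        mask[np.argmax(scores)] = 1.0        # keep one sub-task, mask the rest
        return mask


def fused_action(state, sub_policies, selector):
    """Aggregate sub-policy outputs under the dynamic one-hot mask."""
    mask = selector.one_hot(state)
    logits = np.stack([p.action_logits(state) for p in sub_policies])  # (k, n_actions)
    fused = (mask[:, None] * logits).sum(axis=0)  # only the selected sub-policy survives
    return int(np.argmax(fused)), mask


def training_reward(current_reward: float, prior_reward: float, beta: float = 0.5) -> float:
    """Blend the environment reward with a prior reward from the pre-training stage.

    beta is an assumed mixing coefficient; the paper's exact combination may differ.
    """
    return current_reward + beta * prior_reward


# Toy usage: three sub-policies over a 4-dimensional state and two actions.
state_dim, n_actions, n_subtasks = 4, 2, 3
subs = [SubPolicy(state_dim, n_actions) for _ in range(n_subtasks)]
selector = FusionSelector(state_dim, n_subtasks)

state = rng.normal(size=state_dim)
action, mask = fused_action(state, subs, selector)
reward = training_reward(current_reward=1.0, prior_reward=0.3)
print(f"selected sub-task={int(mask.argmax())}, action={action}, training reward={reward:.2f}")
```

In the actual method the selector network would itself be trained with DRL using the blended reward; the hard argmax here is only for illustration, and a differentiable selection (e.g., a Gumbel-softmax style relaxation) would typically be needed to train such a selector end-to-end.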

Source Journal
ACM Transactions on Intelligent Systems and Technology
Categories: Computer Science, Artificial Intelligence; Computer Science, Information Systems
CiteScore: 9.30
Self-citation rate: 2.00%
Articles published: 131
Journal Description: ACM Transactions on Intelligent Systems and Technology is a scholarly journal that publishes the highest quality papers on intelligent systems, applicable algorithms and technology with a multi-disciplinary perspective. An intelligent system is one that uses artificial intelligence (AI) techniques to offer important services (e.g., as a component of a larger system) to allow integrated systems to perceive, reason, learn, and act intelligently in the real world. ACM TIST is published six times a year. Each issue has 8-11 regular papers, with around 20 published journal pages or 10,000 words per paper. Additional references, proofs, graphs or detailed experimental results can be submitted as a separate appendix, while excessively lengthy papers will be rejected automatically. Authors can include online-only appendices for additional content of their published papers and are encouraged to share their code and/or data with other readers.