Attention-based Partial Decoupling of Policy and Value for Generalization in Reinforcement Learning

N. Nafi, Creighton Glasscock, W. Hsu
DOI: 10.1109/ICMLA55696.2022.00011
Published in: 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), December 2022
Citations: 4

Abstract

In this work, we introduce Attention-based Partially Decoupled Actor-Critic (APDAC), an actor-critic architecture for generalization in reinforcement learning, which partially separates the policy and the value functions. To learn directly from images, traditional actor-critic architectures use a shared network to represent the policy and value functions. While a shared representation allows parameter and feature sharing, it can also lead to overfitting that catastrophically damages generalization performance. On the other hand, two separate networks for policy and value can help to avoid overfitting and reduce the generalization gap, but at the cost of added complexity both in terms of architecture design and computation time. APDAC is a hybrid architecture that builds upon the combined strengths of both architectures by sharing the initial layer blocks of the network and separating the later ones for policy and value. APDAC incorporates an attention mechanism to enable robust representation learning. We present meaningful visualizations of the policy and value that explain the perception of the trained agent. Our empirical analysis, including an ablation study, shows that APDAC significantly outperforms the standard PPO baseline on the challenging RL generalization benchmark Procgen and achieves performance that is competitive with the recent state-of-the-art method (IDAAC) while using fewer convolutional layers and requiring less computational time. Our code is available at https://github.com/nasiknafi/apdac.
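To make the architectural idea concrete, the following is a minimal NumPy sketch of a partially decoupled actor-critic forward pass: a shared trunk, an attention re-weighting of the shared features, and separate later layers (heads) for policy and value. This is not the authors' implementation; the dense layers stand in for the paper's convolutional blocks, the feature-wise softmax attention is a simplification, and all sizes and names (`OBS`, `HID`, `ACT`, `forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: a flattened 64-dim observation and 15 discrete actions
# (Procgen's action space size).
OBS, HID, ACT = 64, 32, 15

# Shared trunk: one dense block standing in for the shared conv blocks.
W_shared = rng.normal(0, 0.1, (OBS, HID))

# Attention over shared features: a learned score per feature, softmax-normalized.
w_attn = rng.normal(0, 0.1, (HID,))

# Decoupled heads: separate later layers for policy and value.
W_pi = rng.normal(0, 0.1, (HID, ACT))
W_v = rng.normal(0, 0.1, (HID, 1))

def forward(obs):
    h = relu(obs @ W_shared)     # shared representation (trained by both losses)
    a = softmax(h * w_attn)      # attention weights over the shared features
    h_att = h * a                # re-weighted features fed to both heads
    logits = h_att @ W_pi        # policy head: action logits
    value = float(h_att @ W_v)   # value head: scalar state-value estimate
    return softmax(logits), value

obs = rng.normal(size=(OBS,))
pi, v = forward(obs)
print(pi.shape, v)
```

The gradient behavior is the point of the design: updates to `W_pi` and `W_v` never interfere with each other, while `W_shared` and `w_attn` still receive signal from both the policy and value losses, keeping the parameter count below that of two fully separate networks.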