Phasic parallel-network policy: a deep reinforcement learning framework based on action correlation

IF 3.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, THEORY & METHODS
Jiahao Li, Tianhan Gao, Qingwei Mi
{"title":"Phasic parallel-network policy: a deep reinforcement learning framework based on action correlation","authors":"Jiahao Li, Tianhan Gao, Qingwei Mi","doi":"10.1007/s00607-024-01329-3","DOIUrl":null,"url":null,"abstract":"<p>Reinforcement learning algorithms show significant variations in performance across different environments. Optimization for reinforcement learning thus becomes the major research task since the instability and unpredictability of the reinforcement learning algorithms have consistently hindered their generalization capabilities. In this study, we address this issue by optimizing the algorithm itself rather than environment-specific optimizations. We start by tackling the uncertainty caused by the mutual influence of original action interferences, aiming to enhance the overall performance. The <i>Phasic Parallel-Network Policy</i> (PPP), which is a deep reinforcement learning framework. It diverges from the traditional policy actor-critic method by grouping the action space based on action correlations. The PPP incorporates parallel network structures and combines network optimization strategies. With the assistance of the value network, the training process is divided into different specific stages, namely the Extra-group Policy Phase and the Inter-group Optimization Phase. PPP breaks through the traditional unit learning structure. The experimental results indicate that it not only optimizes training effectiveness but also reduces training steps, enhances sample efficiency, and significantly improves stability and generalization.</p>","PeriodicalId":10718,"journal":{"name":"Computing","volume":"34 1","pages":""},"PeriodicalIF":3.3000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00607-024-01329-3","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 0

Abstract

Reinforcement learning algorithms show significant variations in performance across different environments. Optimizing reinforcement learning itself thus becomes a major research task, since the instability and unpredictability of reinforcement learning algorithms have consistently hindered their generalization capabilities. In this study, we address this issue by optimizing the algorithm itself rather than applying environment-specific optimizations. We start by tackling the uncertainty caused by mutual interference among the original actions, aiming to enhance overall performance. We propose the Phasic Parallel-Network Policy (PPP), a deep reinforcement learning framework that diverges from the traditional actor-critic method by grouping the action space based on action correlations. PPP incorporates parallel network structures and combines them with network optimization strategies. With the assistance of the value network, the training process is divided into specific stages, namely the Extra-group Policy Phase and the Inter-group Optimization Phase. PPP breaks through the traditional unit learning structure. The experimental results indicate that it not only optimizes training effectiveness but also reduces training steps, enhances sample efficiency, and significantly improves stability and generalization.
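The abstract does not specify implementation details, but the structural idea it describes (a policy split into parallel heads over groups of correlated actions, alongside a value network that supports phase-wise training) can be illustrated with a minimal sketch. The PyTorch snippet below is our own illustration under assumed choices (discrete actions, hand-picked group sizes, a shared encoder); it is not the authors' implementation, and names such as ParallelGroupPolicy and group_sizes are hypothetical.

```python
# Minimal, hypothetical sketch of a parallel-network policy over correlated
# action groups. Group boundaries, layer sizes, and head structure are
# illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn


class ParallelGroupPolicy(nn.Module):
    """Shared encoder with one policy head per action group and a value head."""

    def __init__(self, obs_dim: int, group_sizes: list[int], hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # One parallel head per group of correlated actions.
        self.policy_heads = nn.ModuleList(
            nn.Linear(hidden, size) for size in group_sizes
        )
        # Value network used to guide the phased training stages.
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor):
        z = self.encoder(obs)
        # Each head produces logits over its own action group.
        group_logits = [head(z) for head in self.policy_heads]
        value = self.value_head(z)
        return group_logits, value


if __name__ == "__main__":
    # Example: 8-dimensional observations, actions split into groups of 3 and 2.
    policy = ParallelGroupPolicy(obs_dim=8, group_sizes=[3, 2])
    logits, value = policy(torch.randn(4, 8))
    print([l.shape for l in logits], value.shape)
```

In a two-phase scheme of the kind the abstract names, one could imagine first training the per-group heads against the shared value estimate (Extra-group Policy Phase) and then optimizing interactions across groups (Inter-group Optimization Phase); the abstract does not detail how these phases are scheduled.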


Source journal: Computing (Engineering & Technology, Computer Science: Theory & Methods)
CiteScore: 8.20
Self-citation rate: 2.70%
Articles per year: 107
Review time: 3 months
Journal description: Computing publishes original papers, short communications and surveys on all fields of computing. The contributions should be written in English and may be of theoretical or applied nature; the essential criteria are computational relevance and a systematic foundation of results.