Maximum Information Measure Policies in Reinforcement Learning with Deep Energy-Based Model

K. Sharma, Bhopendra Singh, Edwin Hernan Ramirez Asis, R. Regine, Suman Rajest S, V. P. Mishra
{"title":"Maximum Information Measure Policies in Reinforcement Learning with Deep Energy-Based Model","authors":"K. Sharma, Bhopendra Singh, Edwin Hernan Ramirez Asis, R. Regine, Suman Rajest S, V. P. Mishra","doi":"10.1109/ICCIKE51210.2021.9410756","DOIUrl":null,"url":null,"abstract":"we provided a framework for the acquisition of articulated electricity regulations for consistent states and actions, but it has only been attainable in summarised domains since then. Developers adapt our environment to learning maximum entropy policies, leading to a simple Q-learning service, which communicates the global optimum through a Boltzmann distribution. We could use previously approved amortized Stein perturbation theory logistic regression rather than estimated observations from that distribution form to obtain a stochastic diffusion network. In simulated studies with underwater and walking robots, we confirm that the entire algorithm's cost provides increased exploration or term frequency that allows the transfer of skills between tasks. We also draw a comparison to critical actor methods, which can represent on the accompanying energy-based model conducting approximate inference. Misleading multiplayer uses the recompense power to ensure that the user is further from either the evolutionary algorithms but has now evolved to become a massive task in developing intelligent exploration for deep reinforcement learning. In a misleading game, nearly all cutting-edge research techniques, including those qualify superstition yet, even with self-recompenses, which achieves enhanced outcomes in the sparse re-ward game, often easily collapse into global optimization traps. We are introducing another exploration tactic called Maximum Entropy Expand (MEE) to remedy this shortage (MEE). Based on entropy rewards but the off-actor-critical reinforced learning algorithm, we split the entity adventurer policy into two equal parts, namely, the target rule and the adventure policy. The explorer law is used to interact with the world, and the target rule is used to create trajectories, with the higher precision of the targets to be achieved as the goal of optimization. The optimization goal of the targeted approach is to maximize extrinsic rewards in order to achieve the global result. The ideal experience replay used to remove the catastrophic forgetting issue that leads to the operator's information becoming non-normalized during the off-exploitation period. To prevent the vulnerable, diverging, and generated by the dangerous triad, an on-policy form change is used specifically. Users analyse data likening our strategy with a region technique for deep learning, involving grid world experimentation techniques and deceptively recompense Dota 2 environments. 
The case illustrates that the MME strategy tends to be productive in escaping the current paper's coercive incentive trap and learning the correct strategic plan.","PeriodicalId":254711,"journal":{"name":"2021 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCIKE51210.2021.9410756","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 25

Abstract

We present a framework for learning expressive energy-based policies over continuous states and actions, which previously has been feasible only in tabular domains. We apply the framework to learning maximum-entropy policies, which yields a soft Q-learning procedure that expresses the optimal policy through a Boltzmann distribution. To obtain a stochastic sampling network that draws approximate samples from this distribution, we use amortized Stein variational gradient descent. In simulated studies with swimming and walking robots, we confirm that the resulting algorithm provides improved exploration and allows skills to be transferred between tasks. We also draw a comparison to actor-critic methods, which can be viewed as performing approximate inference in the corresponding energy-based model.

A deceptive game uses its reward signal to steer the agent away from the optimal policy, and such games have become a major challenge for intelligent exploration in deep reinforcement learning. In deceptive games, nearly all state-of-the-art exploration techniques, including those that add intrinsic self-rewards and achieve strong results in sparse-reward games, easily collapse into optimization traps. To remedy this shortcoming, we introduce an exploration strategy called Maximum Entropy Expand (MEE). Building on entropy rewards and an off-policy actor-critic reinforcement learning algorithm, we split the agent's policy into two parts: a target policy and an exploration policy. The exploration policy is used to interact with the environment, and the target policy is used to generate trajectories, with reaching the targets more precisely as the goal of optimization. The optimization objective of the target policy is to maximize extrinsic rewards in order to reach the global optimum. Experience replay is used to mitigate the catastrophic forgetting that would otherwise corrupt the agent's knowledge during off-policy training, and an on-policy transformation is applied to prevent the instability and divergence caused by the deadly triad. We evaluate our strategy against baseline deep reinforcement learning exploration techniques on Grid World experiments and deceptive-reward Dota 2 environments. The results show that the MEE strategy is effective at escaping the deceptive reward trap and learning the correct policy.
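As a rough, hedged sketch of the two mechanisms the abstract describes, the example below shows a Boltzmann policy, pi(a|s) proportional to exp(Q(s,a)/temperature), over a toy discrete action set, together with an MEE-style split between a high-temperature exploration policy that acts in the environment and a low-temperature target policy aimed at extrinsic reward. This is not the authors' implementation; the Q-values, temperatures, and function names are illustrative assumptions, and the paper's soft Q-learning formulation works with a deep energy-based model and continuous actions rather than a small Q-table.

```python
# Minimal sketch (illustrative only, not the paper's code): a Boltzmann policy
# pi(a|s) proportional to exp(Q(s, a) / temperature), plus an MEE-style split
# between an exploration policy (high temperature) and a target policy
# (low temperature). Q-values and temperatures here are made-up toy numbers.
import numpy as np

def boltzmann_policy(q_values: np.ndarray, temperature: float) -> np.ndarray:
    """Action probabilities proportional to exp(Q(s, a) / temperature)."""
    logits = q_values / temperature
    logits -= logits.max()            # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
q = np.array([1.0, 2.0, 0.5])         # toy Q-values for one state, three actions

# Exploration policy: high temperature keeps entropy high and spreads probability.
explore_probs = boltzmann_policy(q, temperature=5.0)
action = rng.choice(len(q), p=explore_probs)   # action taken in the environment

# Target policy: low temperature concentrates mass on the highest-value action.
target_probs = boltzmann_policy(q, temperature=0.1)

print("exploration policy:", np.round(explore_probs, 3))
print("target policy:     ", np.round(target_probs, 3))
print("sampled action:    ", action)
```

Raising the temperature (the entropy weight) flattens the action distribution and encourages exploration, while lowering it recovers a near-greedy target policy; in the paper, approximate samples from the continuous-action Boltzmann distribution are instead drawn with a learned stochastic sampling network.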