Maximum Information Measure Policies in Reinforcement Learning with Deep Energy-Based Model

K. Sharma, Bhopendra Singh, Edwin Hernan Ramirez Asis, R. Regine, Suman Rajest S, V. P. Mishra
{"title":"Maximum Information Measure Policies in Reinforcement Learning with Deep Energy-Based Model","authors":"K. Sharma, Bhopendra Singh, Edwin Hernan Ramirez Asis, R. Regine, Suman Rajest S, V. P. Mishra","doi":"10.1109/ICCIKE51210.2021.9410756","DOIUrl":null,"url":null,"abstract":"we provided a framework for the acquisition of articulated electricity regulations for consistent states and actions, but it has only been attainable in summarised domains since then. Developers adapt our environment to learning maximum entropy policies, leading to a simple Q-learning service, which communicates the global optimum through a Boltzmann distribution. We could use previously approved amortized Stein perturbation theory logistic regression rather than estimated observations from that distribution form to obtain a stochastic diffusion network. In simulated studies with underwater and walking robots, we confirm that the entire algorithm's cost provides increased exploration or term frequency that allows the transfer of skills between tasks. We also draw a comparison to critical actor methods, which can represent on the accompanying energy-based model conducting approximate inference. Misleading multiplayer uses the recompense power to ensure that the user is further from either the evolutionary algorithms but has now evolved to become a massive task in developing intelligent exploration for deep reinforcement learning. In a misleading game, nearly all cutting-edge research techniques, including those qualify superstition yet, even with self-recompenses, which achieves enhanced outcomes in the sparse re-ward game, often easily collapse into global optimization traps. We are introducing another exploration tactic called Maximum Entropy Expand (MEE) to remedy this shortage (MEE). Based on entropy rewards but the off-actor-critical reinforced learning algorithm, we split the entity adventurer policy into two equal parts, namely, the target rule and the adventure policy. The explorer law is used to interact with the world, and the target rule is used to create trajectories, with the higher precision of the targets to be achieved as the goal of optimization. The optimization goal of the targeted approach is to maximize extrinsic rewards in order to achieve the global result. The ideal experience replay used to remove the catastrophic forgetting issue that leads to the operator's information becoming non-normalized during the off-exploitation period. To prevent the vulnerable, diverging, and generated by the dangerous triad, an on-policy form change is used specifically. Users analyse data likening our strategy with a region technique for deep learning, involving grid world experimentation techniques and deceptively recompense Dota 2 environments. 
The case illustrates that the MME strategy tends to be productive in escaping the current paper's coercive incentive trap and learning the correct strategic plan.","PeriodicalId":254711,"journal":{"name":"2021 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCIKE51210.2021.9410756","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 25

Abstract

We present a framework for learning expressive energy-based policies over continuous states and actions, which previously has been feasible only in tabular domains. We apply the framework to learning maximum-entropy policies, which yields a soft Q-learning procedure that expresses the optimal policy through a Boltzmann distribution. To obtain a stochastic sampling network that draws approximate samples from this distribution, we use amortized Stein variational gradient descent. In simulated studies with swimming and walking robots, we confirm that the resulting algorithm provides improved exploration and allows skills to be transferred between tasks. We also draw a comparison to actor-critic methods, which can be viewed as performing approximate inference in the corresponding energy-based model.

A deceptive game uses its reward signal to steer the agent away from the optimal policy, and such games have become a major challenge for intelligent exploration in deep reinforcement learning. In deceptive games, nearly all state-of-the-art exploration techniques, including those that add intrinsic self-rewards and achieve strong results in sparse-reward games, easily collapse into optimization traps. To remedy this shortcoming, we introduce an exploration strategy called Maximum Entropy Expand (MEE). Building on entropy rewards and an off-policy actor-critic reinforcement learning algorithm, we split the agent's policy into two parts: a target policy and an exploration policy. The exploration policy is used to interact with the environment, and the target policy is used to generate trajectories, with reaching the targets more precisely as the goal of optimization. The optimization objective of the target policy is to maximize extrinsic rewards in order to reach the global optimum. Experience replay is used to mitigate the catastrophic forgetting that would otherwise corrupt the agent's knowledge during off-policy training, and an on-policy transformation is applied to prevent the instability and divergence caused by the deadly triad. We evaluate our strategy against baseline deep reinforcement learning exploration techniques on Grid World experiments and deceptive-reward Dota 2 environments. The results show that the MEE strategy is effective at escaping the deceptive reward trap and learning the correct policy.
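As a rough, hedged sketch of the two mechanisms the abstract describes, the example below shows a Boltzmann policy, pi(a|s) proportional to exp(Q(s,a)/temperature), over a toy discrete action set, together with an MEE-style split between a high-temperature exploration policy that acts in the environment and a low-temperature target policy aimed at extrinsic reward. This is not the authors' implementation; the Q-values, temperatures, and function names are illustrative assumptions, and the paper's soft Q-learning formulation works with a deep energy-based model and continuous actions rather than a small Q-table.

```python
# Minimal sketch (illustrative only, not the paper's code): a Boltzmann policy
# pi(a|s) proportional to exp(Q(s, a) / temperature), plus an MEE-style split
# between an exploration policy (high temperature) and a target policy
# (low temperature). Q-values and temperatures here are made-up toy numbers.
import numpy as np

def boltzmann_policy(q_values: np.ndarray, temperature: float) -> np.ndarray:
    """Action probabilities proportional to exp(Q(s, a) / temperature)."""
    logits = q_values / temperature
    logits -= logits.max()            # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(0)
q = np.array([1.0, 2.0, 0.5])         # toy Q-values for one state, three actions

# Exploration policy: high temperature keeps entropy high and spreads probability.
explore_probs = boltzmann_policy(q, temperature=5.0)
action = rng.choice(len(q), p=explore_probs)   # action taken in the environment

# Target policy: low temperature concentrates mass on the highest-value action.
target_probs = boltzmann_policy(q, temperature=0.1)

print("exploration policy:", np.round(explore_probs, 3))
print("target policy:     ", np.round(target_probs, 3))
print("sampled action:    ", action)
```

Raising the temperature (the entropy weight) flattens the action distribution and encourages exploration, while lowering it recovers a near-greedy target policy; in the paper, approximate samples from the continuous-action Boltzmann distribution are instead drawn with a learned stochastic sampling network.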