Playing games with reinforcement learning via perceiving orientation and exploring diversity

Dong Zhang, Le Yang, Haobin Shi, Fangqing Mou, Mengkai Hu
{"title":"Playing games with reinforcement learning via perceiving orientation and exploring diversity","authors":"Dong Zhang, Le Yang, Haobin Shi, Fangqing Mou, Mengkai Hu","doi":"10.1109/PIC.2017.8359509","DOIUrl":null,"url":null,"abstract":"The reinforcement learning can guide the agents to perform optimally under various complex environments. Although reinforcement learning has brought breakthrough for many domains, they are constrained by two bottlenecks: extremely delayed reward signal and the trade-off between diversity and speed. In this paper, we propose a novel framework to alleviate those two bottlenecks. For the delayed reward, we introduce a new term, named the orientation perception term, to calculate the award for each state. For a series of actions successfully leading to the target state, this term takes a difference to each state and assigns award to all states on the pathway, rather than only offers award to the target state. This mechanism allows the learning algorithm to percept the orientation information by distinguishing different states. For the trade-off between diversity and speed, we integrate the curriculum learning into the exploration process and propose the diversity exploration scheme. In the beginning, this scheme is prone to exploring the unexecuted action so as to discover the optimal action series. With the learning process carrying on, the scheme gradually relays more on the acquired knowledge and reduces the random probability. Such randomicity to certainty diversity exploration scheme guides the learning scheme to achieve proper balance between strategy diversity and convergency speed. We name the complete framework OpDe Reinforcement Learning and prove the algorithm convergence. Experiments on a standard platform demonstrate the effectiveness of the complete framework.","PeriodicalId":370588,"journal":{"name":"2017 International Conference on Progress in Informatics and Computing (PIC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 International Conference on Progress in Informatics and Computing (PIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PIC.2017.8359509","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Reinforcement learning can guide agents to act optimally in various complex environments. Although reinforcement learning has produced breakthroughs in many domains, it is constrained by two bottlenecks: extremely delayed reward signals and the trade-off between diversity and speed. In this paper, we propose a novel framework to alleviate both bottlenecks. For the delayed reward, we introduce a new term, named the orientation perception term, to compute the reward for each state. For a series of actions that successfully leads to the target state, this term takes a difference for each state and assigns reward to every state on the pathway, rather than offering reward only to the target state. This mechanism lets the learning algorithm perceive orientation information by distinguishing between different states. For the trade-off between diversity and speed, we integrate curriculum learning into the exploration process and propose a diversity exploration scheme. In the beginning, this scheme tends to explore unexecuted actions so as to discover the optimal action sequence. As learning proceeds, the scheme gradually relies more on the acquired knowledge and reduces the probability of random exploration. This randomness-to-certainty diversity exploration scheme guides the learner toward a proper balance between strategy diversity and convergence speed. We name the complete framework OpDe Reinforcement Learning and prove its convergence. Experiments on a standard platform demonstrate the effectiveness of the complete framework.
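
The abstract does not give the exact update rules, but its two mechanisms can be sketched in a simple tabular Q-learning setting. In the sketch below, the class name `OpDeStyleAgent`, the `distance_to_target` function, the linear epsilon decay, and the difference-based shaping bonus are illustrative assumptions rather than the authors' definitions.

```python
import random
from collections import defaultdict

# Hypothetical tabular Q-learner illustrating the two ideas from the abstract.
# The concrete formulas (linear decay of the exploration rate, the
# distance-difference shaping bonus) are assumptions for illustration only.

class OpDeStyleAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.9,
                 eps_start=1.0, eps_end=0.05, decay_steps=10_000):
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        self.eps_start, self.eps_end, self.decay_steps = eps_start, eps_end, decay_steps
        self.q = defaultdict(float)    # Q[(state, action)]
        self.tried = defaultdict(set)  # actions already executed per state
        self.step = 0

    def epsilon(self):
        # "Randomness to certainty": exploration rate decays as learning proceeds.
        frac = min(1.0, self.step / self.decay_steps)
        return self.eps_start + frac * (self.eps_end - self.eps_start)

    def act(self, state):
        self.step += 1
        if random.random() < self.epsilon():
            # Diversity exploration: prefer actions not yet executed in this state.
            untried = [a for a in self.actions if a not in self.tried[state]]
            action = random.choice(untried or self.actions)
        else:
            action = max(self.actions, key=lambda a: self.q[(state, a)])
        self.tried[state].add(action)
        return action

    def update_on_success(self, trajectory, distance_to_target):
        """Orientation-perception-style shaping: after a trajectory that reaches
        the target, every state on the pathway receives a reward proportional to
        the progress (difference in distance-to-target) made by its transition,
        instead of the target state alone receiving the reward."""
        for (s, a, s_next) in trajectory:
            shaped = distance_to_target(s) - distance_to_target(s_next)
            best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
            self.q[(s, a)] += self.alpha * (shaped + self.gamma * best_next - self.q[(s, a)])
```

The shaping step mirrors the idea that a successful pathway "takes a difference for each state": each transition on the pathway is credited by how much closer it moves the agent to the target, while `epsilon()` reproduces the randomness-to-certainty exploration schedule described in the abstract.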