Playing games with reinforcement learning via perceiving orientation and exploring diversity

Dong Zhang, Le Yang, Haobin Shi, Fangqing Mou, Mengkai Hu

2017 International Conference on Progress in Informatics and Computing (PIC), December 2017. DOI: 10.1109/PIC.2017.8359509
Reinforcement learning can guide agents to act optimally in a variety of complex environments. Although it has produced breakthroughs in many domains, it is constrained by two bottlenecks: extremely delayed reward signals and the trade-off between exploration diversity and convergence speed. In this paper, we propose a novel framework that alleviates both bottlenecks. For the delayed reward, we introduce a new term, named the orientation perception term, to compute the reward for each state. For a series of actions that successfully reaches the target state, this term differentiates among the states and assigns reward to every state on the pathway, rather than only to the target state. This mechanism lets the learning algorithm perceive orientation information by distinguishing between states. For the diversity-speed trade-off, we integrate curriculum learning into the exploration process and propose a diversity exploration scheme. Early in training, the scheme favors unexecuted actions so as to discover optimal action sequences; as learning proceeds, it relies increasingly on the acquired knowledge and reduces the random-action probability. This randomness-to-certainty exploration scheme guides learning toward a proper balance between strategy diversity and convergence speed. We name the complete framework OpDe Reinforcement Learning and prove its convergence. Experiments on a standard platform demonstrate the effectiveness of the complete framework.
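The abstract describes the two mechanisms only at a high level, so the sketch below is one plausible tabular Q-learning reading rather than the paper's implementation; the class and method names (OpDeAgent, shape_success, end_episode) and all constants are illustrative assumptions. It shows (1) an orientation-perception-style shaping step that spreads a successful episode's reward, differentiated by distance to the goal, over every state on the pathway, and (2) a randomness-to-certainty exploration rule that first favors unexecuted actions and then decays toward greedy action selection.

```python
import random
from collections import defaultdict

# Illustrative constants, not taken from the paper.
ALPHA, GAMMA = 0.1, 0.95  # learning rate, discount factor


class OpDeAgent:
    def __init__(self, actions, eps_start=1.0, eps_min=0.05, eps_decay=0.995):
        self.actions = actions
        self.q = defaultdict(float)    # Q[(state, action)]
        self.tried = defaultdict(set)  # actions already executed per state
        self.eps = eps_start           # random-action probability
        self.eps_min, self.eps_decay = eps_min, eps_decay

    def act(self, state):
        """Diversity exploration: early on, prefer actions never executed in
        this state; as eps decays, rely increasingly on acquired Q-values."""
        if random.random() < self.eps:
            untried = [a for a in self.actions if a not in self.tried[state]]
            action = random.choice(untried or self.actions)
        else:
            action = max(self.actions, key=lambda a: self.q[(state, a)])
        self.tried[state].add(action)
        return action

    def update(self, s, a, r, s_next):
        """Standard one-step Q-learning backup."""
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        self.q[(s, a)] += ALPHA * (r + GAMMA * best_next - self.q[(s, a)])

    def shape_success(self, trajectory, terminal_reward, decay=0.9):
        """Orientation perception term (assumed form): after a successful
        episode, assign a distance-differentiated share of the terminal
        reward to every (state, action) on the pathway, not just the last."""
        for k, (s, a) in enumerate(reversed(trajectory)):
            self.q[(s, a)] += ALPHA * terminal_reward * decay ** k

    def end_episode(self):
        """Shift from randomness toward certainty as learning proceeds."""
        self.eps = max(self.eps_min, self.eps * self.eps_decay)
```

A training loop would collect each episode's (state, action) pairs, call update at every step, invoke shape_success only when the episode actually reaches the target state, and call end_episode afterwards to decay the exploration rate.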