On Improving the Learning of Long-Term Historical Information for Tasks with Partial Observability

Xinwen Wang, Xin Li, Linjing Lai
{"title":"On Improving the Learning of Long-Term historical Information for Tasks with Partial Observability","authors":"Xinwen Wang, Xin Li, Linjing Lai","doi":"10.1109/DSC50466.2020.00042","DOIUrl":null,"url":null,"abstract":"Reinforcement learning (RL) has been recognized as the powerful tool to handle many real-work tasks of decision making, data mining and, information retrieval. Many well-developed RL algorithms have been developed, however tasks involved with partially observable environment, e.g, POMDPs (Partially Observable Markov Decision Processes) are still very challenging. Recent attempts to address this issue is to memorize the long-term historical information by using deep neural networks. And the common strategy is to leverage the recurrent networks, e.g., Long Short-Term Memory(LSTM), to retain/encode the historical information to estimate the true state of environments, given the partial observability. However, when confronted with rather long history dependent problems and irregular data sampling, the conventional LSTM is ill-suited for the problem and difficult to be trained due to the well-known gradient vanishing and the inadequacy of capturing long-term history. In this paper, we propose to utilize Phased LSTM to solve the POMDP tasks, which introduces an additional time gate to periodically update the memory cell, helping the neural framework to 1) maintain the information of the long-term, 2) and propagate the gradient better to facilitate the training of reinforcement learning model with recurrent structure. To further adapt to reinforcement learning and boost the performance, we also propose a Self-Phased LSTM with incorporating a periodic gate, which is able to generate a dynamic periodic gate to adjust automatically for more tasks, especially the notorious ones with sparse rewards. Our experimental results verify the effectiveness of leveraging on such Phased LSTM and Self-Phased LSTM for POMDP tasks.","PeriodicalId":423182,"journal":{"name":"2020 IEEE Fifth International Conference on Data Science in Cyberspace (DSC)","volume":"42 3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE Fifth International Conference on Data Science in Cyberspace (DSC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSC50466.2020.00042","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Reinforcement learning (RL) has been recognized as a powerful tool for handling many real-world tasks in decision making, data mining, and information retrieval. Many well-developed RL algorithms exist; however, tasks involving partially observable environments, e.g., POMDPs (Partially Observable Markov Decision Processes), remain very challenging. Recent attempts to address this issue memorize long-term historical information with deep neural networks. The common strategy is to leverage recurrent networks, e.g., Long Short-Term Memory (LSTM), to retain/encode the historical information and estimate the true state of the environment despite the partial observability. However, when confronted with problems that depend on rather long histories and with irregular data sampling, the conventional LSTM is ill-suited and difficult to train, owing to the well-known vanishing gradient problem and its inadequacy at capturing long-term history. In this paper, we propose to utilize the Phased LSTM to solve POMDP tasks; it introduces an additional time gate that periodically updates the memory cell, helping the neural framework to 1) retain long-term information and 2) propagate gradients better, which facilitates the training of reinforcement learning models with recurrent structure. To further adapt to reinforcement learning and boost performance, we also propose a Self-Phased LSTM that incorporates a periodic gate; it generates a dynamic periodic gate that adjusts automatically across tasks, especially the notoriously difficult ones with sparse rewards. Our experimental results verify the effectiveness of leveraging the Phased LSTM and Self-Phased LSTM for POMDP tasks.
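
To make the mechanism concrete, the following is a minimal NumPy sketch of the oscillating time gate of the Phased LSTM (Neil et al., 2016), on which the paper builds. The gate openness k_t follows a periodic schedule, so the cell state is only overwritten during a short "open" phase and is otherwise carried forward almost unchanged, which helps preserve long-term information and shortens gradient paths for irregularly sampled observations. All names and dimensions here (time_gate, phased_lstm_step, tau, s, r_on, alpha) are illustrative choices for this sketch, not the authors' implementation; the Self-Phased variant described in the abstract would additionally make the periodic gate dynamic rather than fixed.

# Minimal sketch of a Phased LSTM step, assuming the standard time-gate
# formulation; parameter names and toy dimensions are illustrative only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def time_gate(t, tau, s, r_on, alpha=1e-3):
    """Openness k_t of the time gate at timestamp t.

    tau  : oscillation period (one value per hidden unit)
    s    : phase shift of the oscillation
    r_on : fraction of the period during which the gate is open
    alpha: small leak so some gradient flows while the gate is closed
    """
    phi = ((t - s) % tau) / tau                      # phase in [0, 1)
    return np.where(phi < 0.5 * r_on, 2.0 * phi / r_on,
           np.where(phi < r_on, 2.0 - 2.0 * phi / r_on, alpha * phi))

def phased_lstm_step(x, h_prev, c_prev, t, W, U, b, tau, s, r_on):
    """One LSTM step whose state update is modulated by the time gate."""
    z = W @ x + U @ h_prev + b                       # stacked gate pre-activations
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)

    c_tilde = f * c_prev + i * g                     # standard LSTM candidate cell
    h_tilde = o * np.tanh(c_tilde)

    k = time_gate(t, tau, s, r_on)                   # per-unit gate openness
    c = k * c_tilde + (1.0 - k) * c_prev             # update only while the gate is open
    h = k * h_tilde + (1.0 - k) * h_prev
    return h, c

# Toy usage: 4 observation features, 8 hidden units, irregular timestamps.
rng = np.random.default_rng(0)
D, H = 4, 8
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
tau = np.exp(rng.uniform(1.0, 3.0, size=H))          # per-unit periods
s = rng.uniform(0.0, 100.0, size=H)                  # per-unit phase shifts
h, c = np.zeros(H), np.zeros(H)
for t in [0.0, 0.7, 2.3, 5.1]:                       # irregularly sampled observations
    x = rng.normal(size=D)
    h, c = phased_lstm_step(x, h, c, t, W, U, b, tau, s, r_on=0.1)

Because (1 - k_t) keeps most of the previous state when the gate is closed, the cell behaves like a leaky memory with sparse write windows, which is the property the paper exploits for long-horizon POMDP tasks.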