Mixed Reinforcement Learning for Partially Observable Markov Decision Process

L. Dung, T. Komeda, M. Takagi
{"title":"部分可观察马尔可夫决策过程的混合强化学习","authors":"L. Dung, T. Komeda, M. Takagi","doi":"10.1109/CIRA.2007.382910","DOIUrl":null,"url":null,"abstract":"Reinforcement learning has been widely used to solve problems with a little feedback from environment. Q learning can solve full observable Markov decision processes quite well. For partially observable Markov decision processes (POMDPs), a recurrent neural network (RNN) can be used to approximate Q values. However, learning time for these problems is typically very long. In this paper, Mixed Reinforcement Learning is presented to find an optimal policy for POMDPs in a shorter learning time. This method uses both a Q value table and a RNN. Q value table stores Q values for full observable states and the RNN approximates Q values for hidden states. An observable degree is calculated for each state while the agent explores the environment. If the observable degree is less than a threshold, the state is considered as a hidden state. Results of experiment in lighting grid world problem show that the proposed method enables an agent to acquire a policy, as good as the policy acquired by using only a RNN, with better learning performance.","PeriodicalId":301626,"journal":{"name":"2007 International Symposium on Computational Intelligence in Robotics and Automation","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Mixed Reinforcement Learning for Partially Observable Markov Decision Process\",\"authors\":\"L. Dung, T. Komeda, M. Takagi\",\"doi\":\"10.1109/CIRA.2007.382910\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Reinforcement learning has been widely used to solve problems with a little feedback from environment. Q learning can solve full observable Markov decision processes quite well. For partially observable Markov decision processes (POMDPs), a recurrent neural network (RNN) can be used to approximate Q values. However, learning time for these problems is typically very long. In this paper, Mixed Reinforcement Learning is presented to find an optimal policy for POMDPs in a shorter learning time. This method uses both a Q value table and a RNN. Q value table stores Q values for full observable states and the RNN approximates Q values for hidden states. An observable degree is calculated for each state while the agent explores the environment. If the observable degree is less than a threshold, the state is considered as a hidden state. 
Results of experiment in lighting grid world problem show that the proposed method enables an agent to acquire a policy, as good as the policy acquired by using only a RNN, with better learning performance.\",\"PeriodicalId\":301626,\"journal\":{\"name\":\"2007 International Symposium on Computational Intelligence in Robotics and Automation\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-06-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2007 International Symposium on Computational Intelligence in Robotics and Automation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CIRA.2007.382910\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2007 International Symposium on Computational Intelligence in Robotics and Automation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIRA.2007.382910","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Reinforcement learning has been widely used to solve problems with only limited feedback from the environment. Q-learning can solve fully observable Markov decision processes quite well. For partially observable Markov decision processes (POMDPs), a recurrent neural network (RNN) can be used to approximate Q values. However, learning time for these problems is typically very long. In this paper, Mixed Reinforcement Learning is presented to find an optimal policy for POMDPs in a shorter learning time. This method uses both a Q-value table and an RNN. The Q-value table stores Q values for fully observable states, and the RNN approximates Q values for hidden states. An observable degree is calculated for each state while the agent explores the environment. If the observable degree is less than a threshold, the state is considered a hidden state. Experimental results on the lighting grid-world problem show that the proposed method enables an agent to acquire a policy as good as the one acquired using only an RNN, with better learning performance.
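
The abstract describes the dispatch between the Q-value table and the RNN but not its implementation. The sketch below illustrates one plausible reading in Python: the sizes, the Elman-style RNN cell, the conflict-based proxy for the observable degree, and the threshold THETA are all illustrative assumptions, not details from the paper.

```python
import numpy as np

# Minimal sketch of the mixed Q-value scheme the abstract describes:
# a Q table serves fully observable states, a recurrent network serves
# hidden states, and a per-state "observable degree" decides which is used.

N_OBS, N_ACT, HIDDEN = 16, 4, 8     # assumed grid-world / network sizes
THETA = 0.5                          # assumed observable-degree threshold
ALPHA, GAMMA = 0.1, 0.95             # assumed learning rate and discount

rng = np.random.default_rng(0)
q_table = np.zeros((N_OBS, N_ACT))             # exact Q values, observable states
W_in  = rng.normal(0, 0.1, (HIDDEN, N_OBS))    # toy Elman-style RNN weights
W_rec = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
W_out = rng.normal(0, 0.1, (N_ACT, HIDDEN))

visits    = np.ones(N_OBS)    # exploration statistics used to estimate how
conflicts = np.zeros(N_OBS)   # often an observation behaved inconsistently
                              # (visits starts at 1 to avoid division by zero)

def observable_degree(obs: int) -> float:
    """One plausible proxy: fraction of visits without conflicting outcomes."""
    return 1.0 - conflicts[obs] / visits[obs]

def one_hot(i: int, n: int) -> np.ndarray:
    v = np.zeros(n)
    v[i] = 1.0
    return v

def rnn_step(h: np.ndarray, obs: int) -> np.ndarray:
    """Advance the recurrent memory; a tanh Elman cell as a stand-in."""
    return np.tanh(W_in @ one_hot(obs, N_OBS) + W_rec @ h)

def q_values(h: np.ndarray, obs: int) -> np.ndarray:
    """Dispatch: table lookup when the state looks observable, RNN otherwise."""
    if observable_degree(obs) >= THETA:
        return q_table[obs]
    return W_out @ h

# One step of the control loop (training of the RNN weights is omitted):
h = np.zeros(HIDDEN)
obs, action, reward, next_obs = 3, 1, 0.0, 7   # placeholder transition
h = rnn_step(h, next_obs)
target = reward + GAMMA * np.max(q_values(h, next_obs))
if observable_degree(obs) >= THETA:            # table entry gets a TD backup
    q_table[obs, action] += ALPHA * (target - q_table[obs, action])
```

The point of the split, as the abstract presents it, is that table entries converge quickly for states the agent can reliably identify, while the RNN's capacity is spent only on the genuinely ambiguous states, which is where the claimed learning-time saving comes from.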