Mixed Reinforcement Learning for Partially Observable Markov Decision Process

L. Dung, T. Komeda, M. Takagi
{"title":"部分可观察马尔可夫决策过程的混合强化学习","authors":"L. Dung, T. Komeda, M. Takagi","doi":"10.1109/CIRA.2007.382910","DOIUrl":null,"url":null,"abstract":"Reinforcement learning has been widely used to solve problems with a little feedback from environment. Q learning can solve full observable Markov decision processes quite well. For partially observable Markov decision processes (POMDPs), a recurrent neural network (RNN) can be used to approximate Q values. However, learning time for these problems is typically very long. In this paper, Mixed Reinforcement Learning is presented to find an optimal policy for POMDPs in a shorter learning time. This method uses both a Q value table and a RNN. Q value table stores Q values for full observable states and the RNN approximates Q values for hidden states. An observable degree is calculated for each state while the agent explores the environment. If the observable degree is less than a threshold, the state is considered as a hidden state. Results of experiment in lighting grid world problem show that the proposed method enables an agent to acquire a policy, as good as the policy acquired by using only a RNN, with better learning performance.","PeriodicalId":301626,"journal":{"name":"2007 International Symposium on Computational Intelligence in Robotics and Automation","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Mixed Reinforcement Learning for Partially Observable Markov Decision Process\",\"authors\":\"L. Dung, T. Komeda, M. Takagi\",\"doi\":\"10.1109/CIRA.2007.382910\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Reinforcement learning has been widely used to solve problems with a little feedback from environment. Q learning can solve full observable Markov decision processes quite well. For partially observable Markov decision processes (POMDPs), a recurrent neural network (RNN) can be used to approximate Q values. However, learning time for these problems is typically very long. In this paper, Mixed Reinforcement Learning is presented to find an optimal policy for POMDPs in a shorter learning time. This method uses both a Q value table and a RNN. Q value table stores Q values for full observable states and the RNN approximates Q values for hidden states. An observable degree is calculated for each state while the agent explores the environment. If the observable degree is less than a threshold, the state is considered as a hidden state. 
Results of experiment in lighting grid world problem show that the proposed method enables an agent to acquire a policy, as good as the policy acquired by using only a RNN, with better learning performance.\",\"PeriodicalId\":301626,\"journal\":{\"name\":\"2007 International Symposium on Computational Intelligence in Robotics and Automation\",\"volume\":\"15 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-06-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2007 International Symposium on Computational Intelligence in Robotics and Automation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CIRA.2007.382910\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2007 International Symposium on Computational Intelligence in Robotics and Automation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIRA.2007.382910","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Reinforcement learning has been widely used to solve problems with only limited feedback from the environment. Q-learning can solve fully observable Markov decision processes quite well. For partially observable Markov decision processes (POMDPs), a recurrent neural network (RNN) can be used to approximate Q values. However, learning time for these problems is typically very long. In this paper, Mixed Reinforcement Learning is presented to find an optimal policy for POMDPs in a shorter learning time. This method uses both a Q-value table and an RNN. The Q-value table stores Q values for fully observable states, and the RNN approximates Q values for hidden states. An observable degree is calculated for each state while the agent explores the environment. If the observable degree is less than a threshold, the state is considered a hidden state. Experimental results on the lighting grid-world problem show that the proposed method enables an agent to acquire a policy as good as the one acquired using only an RNN, with better learning performance.
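
The abstract describes the dispatch between the Q-value table and the RNN but not its implementation. The sketch below illustrates one plausible reading in Python: the sizes, the Elman-style RNN cell, the conflict-based proxy for the observable degree, and the threshold THETA are all illustrative assumptions, not details from the paper.

```python
import numpy as np

# Minimal sketch of the mixed Q-value scheme the abstract describes:
# a Q table serves fully observable states, a recurrent network serves
# hidden states, and a per-state "observable degree" decides which is used.

N_OBS, N_ACT, HIDDEN = 16, 4, 8     # assumed grid-world / network sizes
THETA = 0.5                          # assumed observable-degree threshold
ALPHA, GAMMA = 0.1, 0.95             # assumed learning rate and discount

rng = np.random.default_rng(0)
q_table = np.zeros((N_OBS, N_ACT))             # exact Q values, observable states
W_in  = rng.normal(0, 0.1, (HIDDEN, N_OBS))    # toy Elman-style RNN weights
W_rec = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
W_out = rng.normal(0, 0.1, (N_ACT, HIDDEN))

visits    = np.ones(N_OBS)    # exploration statistics used to estimate how
conflicts = np.zeros(N_OBS)   # often an observation behaved inconsistently
                              # (visits starts at 1 to avoid division by zero)

def observable_degree(obs: int) -> float:
    """One plausible proxy: fraction of visits without conflicting outcomes."""
    return 1.0 - conflicts[obs] / visits[obs]

def one_hot(i: int, n: int) -> np.ndarray:
    v = np.zeros(n)
    v[i] = 1.0
    return v

def rnn_step(h: np.ndarray, obs: int) -> np.ndarray:
    """Advance the recurrent memory; a tanh Elman cell as a stand-in."""
    return np.tanh(W_in @ one_hot(obs, N_OBS) + W_rec @ h)

def q_values(h: np.ndarray, obs: int) -> np.ndarray:
    """Dispatch: table lookup when the state looks observable, RNN otherwise."""
    if observable_degree(obs) >= THETA:
        return q_table[obs]
    return W_out @ h

# One step of the control loop (training of the RNN weights is omitted):
h = np.zeros(HIDDEN)
obs, action, reward, next_obs = 3, 1, 0.0, 7   # placeholder transition
h = rnn_step(h, next_obs)
target = reward + GAMMA * np.max(q_values(h, next_obs))
if observable_degree(obs) >= THETA:            # table entry gets a TD backup
    q_table[obs, action] += ALPHA * (target - q_table[obs, action])
```

The point of the split, as the abstract presents it, is that table entries converge quickly for states the agent can reliably identify, while the RNN's capacity is spent only on the genuinely ambiguous states, which is where the claimed learning-time saving comes from.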