Hierarchical approximate policy iteration with binary-tree state space decomposition.

IEEE Transactions on Neural Networks · Pub Date: 2011-12-01 · Epub Date: 2011-10-10 · DOI: 10.1109/TNN.2011.2168422
Xin Xu, Chunming Liu, Simon X Yang, Dewen Hu
{"title":"Hierarchical approximate policy iteration with binary-tree state space decomposition.","authors":"Xin Xu,&nbsp;Chunming Liu,&nbsp;Simon X Yang,&nbsp;Dewen Hu","doi":"10.1109/TNN.2011.2168422","DOIUrl":null,"url":null,"abstract":"<p><p>In recent years, approximate policy iteration (API) has attracted increasing attention in reinforcement learning (RL), e.g., least-squares policy iteration (LSPI) and its kernelized version, the kernel-based LSPI algorithm. However, it remains difficult for API algorithms to obtain near-optimal policies for Markov decision processes (MDPs) with large or continuous state spaces. To address this problem, this paper presents a hierarchical API (HAPI) method with binary-tree state space decomposition for RL in a class of absorbing MDPs, which can be formulated as time-optimal learning control tasks. In the proposed method, after collecting samples adaptively in the state space of the original MDP, a learning-based decomposition strategy of sample sets was designed to implement the binary-tree state space decomposition process. Then, API algorithms were used on the sample subsets to approximate local optimal policies of sub-MDPs. The original MDP was decomposed into a binary-tree structure of absorbing sub-MDPs, constructed during the learning process, thus, local near-optimal policies were approximated by API algorithms with reduced complexity and higher precision. Furthermore, because of the improved quality of local policies, the combined global policy performed better than the near-optimal policy obtained by a single API algorithm in the original MDP. Three learning control problems, including path-tracking control of a real mobile robot, were studied to evaluate the performance of the HAPI method. With the same setting for basis function selection and sample collection, the proposed HAPI obtained better near-optimal policies than previous API methods such as LSPI and KLSPI.</p>","PeriodicalId":13434,"journal":{"name":"IEEE transactions on neural networks","volume":"22 12","pages":"1863-77"},"PeriodicalIF":0.0000,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TNN.2011.2168422","citationCount":"37","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TNN.2011.2168422","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2011/10/10 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 37

Abstract

In recent years, approximate policy iteration (API) has attracted increasing attention in reinforcement learning (RL), e.g., least-squares policy iteration (LSPI) and its kernelized version, the kernel-based LSPI (KLSPI) algorithm. However, it remains difficult for API algorithms to obtain near-optimal policies for Markov decision processes (MDPs) with large or continuous state spaces. To address this problem, this paper presents a hierarchical API (HAPI) method with binary-tree state space decomposition for RL in a class of absorbing MDPs, which can be formulated as time-optimal learning control tasks. In the proposed method, after samples are collected adaptively in the state space of the original MDP, a learning-based decomposition strategy for the sample sets is used to carry out the binary-tree state space decomposition. API algorithms are then applied to the sample subsets to approximate local optimal policies of the sub-MDPs. The original MDP is thus decomposed into a binary-tree structure of absorbing sub-MDPs constructed during the learning process, so local near-optimal policies can be approximated by API algorithms with reduced complexity and higher precision. Furthermore, because of the improved quality of the local policies, the combined global policy performs better than the near-optimal policy obtained by a single API algorithm on the original MDP. Three learning control problems, including path-tracking control of a real mobile robot, were studied to evaluate the performance of the HAPI method. With the same settings for basis function selection and sample collection, the proposed HAPI obtained better near-optimal policies than previous API methods such as LSPI and KLSPI.
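
To make the pipeline concrete, the minimal sketch below illustrates the general idea in Python: sampled states are split recursively into a binary tree of regions, a local policy is fit on each leaf's sample subset (standing in for running LSPI/KLSPI on the corresponding sub-MDP), and the combined global policy dispatches each query state to the policy of the leaf that covers it. The widest-dimension/median split heuristic and the `fit_local_policy` placeholder are illustrative assumptions, not the paper's learning-based decomposition strategy or its exact API solver.

```python
"""Illustrative sketch of hierarchical API with binary-tree state space
decomposition. Splitting and local fitting rules are assumptions, not the
paper's exact algorithm."""
from dataclasses import dataclass
from typing import Callable, Optional

import numpy as np


@dataclass
class Node:
    samples: np.ndarray                 # states sampled inside this region
    policy: Optional[Callable] = None   # local policy (leaves only)
    split_dim: int = -1                 # state dimension used to split this node
    split_val: float = 0.0              # threshold on that dimension
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(states, fit_local_policy, depth=0, max_depth=3, min_samples=20):
    """Recursively split sampled states into a binary tree of regions and fit a
    local policy on every leaf (an assumed stand-in for solving each sub-MDP)."""
    node = Node(samples=states)
    if depth >= max_depth or len(states) < min_samples:
        node.policy = fit_local_policy(states)      # e.g., run LSPI on this subset
        return node
    spans = states.max(axis=0) - states.min(axis=0)
    node.split_dim = int(np.argmax(spans))          # split the widest dimension
    node.split_val = float(np.median(states[:, node.split_dim]))
    mask = states[:, node.split_dim] <= node.split_val
    if mask.all() or not mask.any():                # degenerate split: stay a leaf
        node.policy = fit_local_policy(states)
        return node
    node.left = build_tree(states[mask], fit_local_policy, depth + 1, max_depth, min_samples)
    node.right = build_tree(states[~mask], fit_local_policy, depth + 1, max_depth, min_samples)
    return node


def global_policy(root, state):
    """Combined policy: descend to the leaf whose region contains `state` and
    delegate to that leaf's local near-optimal policy."""
    node = root
    while node.policy is None:
        node = node.left if state[node.split_dim] <= node.split_val else node.right
    return node.policy(state)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = rng.uniform(-1.0, 1.0, size=(500, 2))   # simulated sample collection
    # Hypothetical local solver: a real implementation would run LSPI/KLSPI here.
    fit_dummy = lambda subset: (lambda s, m=subset[:, 0].mean(): int(s[0] > m))
    root = build_tree(states, fit_dummy)
    print(global_policy(root, np.array([0.3, -0.2])))
```

Dispatching through the tree keeps policy lookup proportional to the tree depth, and each local solver only ever sees the samples of its own region, which is where the reduced complexity and higher local precision described in the abstract come from.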
