Hierarchical approximate policy iteration with binary-tree state space decomposition.

IEEE Transactions on Neural Networks · Pub Date: 2011-12-01 · Epub Date: 2011-10-10 · DOI: 10.1109/TNN.2011.2168422
Xin Xu, Chunming Liu, Simon X Yang, Dewen Hu
{"title":"Hierarchical approximate policy iteration with binary-tree state space decomposition.","authors":"Xin Xu,&nbsp;Chunming Liu,&nbsp;Simon X Yang,&nbsp;Dewen Hu","doi":"10.1109/TNN.2011.2168422","DOIUrl":null,"url":null,"abstract":"<p><p>In recent years, approximate policy iteration (API) has attracted increasing attention in reinforcement learning (RL), e.g., least-squares policy iteration (LSPI) and its kernelized version, the kernel-based LSPI algorithm. However, it remains difficult for API algorithms to obtain near-optimal policies for Markov decision processes (MDPs) with large or continuous state spaces. To address this problem, this paper presents a hierarchical API (HAPI) method with binary-tree state space decomposition for RL in a class of absorbing MDPs, which can be formulated as time-optimal learning control tasks. In the proposed method, after collecting samples adaptively in the state space of the original MDP, a learning-based decomposition strategy of sample sets was designed to implement the binary-tree state space decomposition process. Then, API algorithms were used on the sample subsets to approximate local optimal policies of sub-MDPs. The original MDP was decomposed into a binary-tree structure of absorbing sub-MDPs, constructed during the learning process, thus, local near-optimal policies were approximated by API algorithms with reduced complexity and higher precision. Furthermore, because of the improved quality of local policies, the combined global policy performed better than the near-optimal policy obtained by a single API algorithm in the original MDP. Three learning control problems, including path-tracking control of a real mobile robot, were studied to evaluate the performance of the HAPI method. With the same setting for basis function selection and sample collection, the proposed HAPI obtained better near-optimal policies than previous API methods such as LSPI and KLSPI.</p>","PeriodicalId":13434,"journal":{"name":"IEEE transactions on neural networks","volume":"22 12","pages":"1863-77"},"PeriodicalIF":0.0000,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TNN.2011.2168422","citationCount":"37","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TNN.2011.2168422","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2011/10/10 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 37

Abstract

In recent years, approximate policy iteration (API) has attracted increasing attention in reinforcement learning (RL), e.g., least-squares policy iteration (LSPI) and its kernelized version, the kernel-based LSPI (KLSPI) algorithm. However, it remains difficult for API algorithms to obtain near-optimal policies for Markov decision processes (MDPs) with large or continuous state spaces. To address this problem, this paper presents a hierarchical API (HAPI) method with binary-tree state space decomposition for RL in a class of absorbing MDPs, which can be formulated as time-optimal learning control tasks. In the proposed method, after samples are collected adaptively in the state space of the original MDP, a learning-based decomposition strategy for the sample sets is used to carry out the binary-tree state space decomposition. API algorithms are then applied to the sample subsets to approximate local optimal policies of the sub-MDPs. The original MDP is thus decomposed into a binary-tree structure of absorbing sub-MDPs constructed during the learning process, so local near-optimal policies can be approximated by API algorithms with reduced complexity and higher precision. Furthermore, because of the improved quality of the local policies, the combined global policy performs better than the near-optimal policy obtained by a single API algorithm on the original MDP. Three learning control problems, including path-tracking control of a real mobile robot, were studied to evaluate the performance of the HAPI method. With the same settings for basis function selection and sample collection, the proposed HAPI obtained better near-optimal policies than previous API methods such as LSPI and KLSPI.
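
To make the pipeline concrete, the minimal sketch below illustrates the general idea in Python: sampled states are split recursively into a binary tree of regions, a local policy is fit on each leaf's sample subset (standing in for running LSPI/KLSPI on the corresponding sub-MDP), and the combined global policy dispatches each query state to the policy of the leaf that covers it. The widest-dimension/median split heuristic and the `fit_local_policy` placeholder are illustrative assumptions, not the paper's learning-based decomposition strategy or its exact API solver.

```python
"""Illustrative sketch of hierarchical API with binary-tree state space
decomposition. Splitting and local fitting rules are assumptions, not the
paper's exact algorithm."""
from dataclasses import dataclass
from typing import Callable, Optional

import numpy as np


@dataclass
class Node:
    samples: np.ndarray                 # states sampled inside this region
    policy: Optional[Callable] = None   # local policy (leaves only)
    split_dim: int = -1                 # state dimension used to split this node
    split_val: float = 0.0              # threshold on that dimension
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(states, fit_local_policy, depth=0, max_depth=3, min_samples=20):
    """Recursively split sampled states into a binary tree of regions and fit a
    local policy on every leaf (an assumed stand-in for solving each sub-MDP)."""
    node = Node(samples=states)
    if depth >= max_depth or len(states) < min_samples:
        node.policy = fit_local_policy(states)      # e.g., run LSPI on this subset
        return node
    spans = states.max(axis=0) - states.min(axis=0)
    node.split_dim = int(np.argmax(spans))          # split the widest dimension
    node.split_val = float(np.median(states[:, node.split_dim]))
    mask = states[:, node.split_dim] <= node.split_val
    if mask.all() or not mask.any():                # degenerate split: stay a leaf
        node.policy = fit_local_policy(states)
        return node
    node.left = build_tree(states[mask], fit_local_policy, depth + 1, max_depth, min_samples)
    node.right = build_tree(states[~mask], fit_local_policy, depth + 1, max_depth, min_samples)
    return node


def global_policy(root, state):
    """Combined policy: descend to the leaf whose region contains `state` and
    delegate to that leaf's local near-optimal policy."""
    node = root
    while node.policy is None:
        node = node.left if state[node.split_dim] <= node.split_val else node.right
    return node.policy(state)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = rng.uniform(-1.0, 1.0, size=(500, 2))   # simulated sample collection
    # Hypothetical local solver: a real implementation would run LSPI/KLSPI here.
    fit_dummy = lambda subset: (lambda s, m=subset[:, 0].mean(): int(s[0] > m))
    root = build_tree(states, fit_dummy)
    print(global_policy(root, np.array([0.3, -0.2])))
```

Dispatching through the tree keeps policy lookup proportional to the tree depth, and each local solver only ever sees the samples of its own region, which is where the reduced complexity and higher local precision described in the abstract come from.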
