Reinforcement learning in multidimensional continuous action spaces

Jason Pazis, M. Lagoudakis
{"title":"Reinforcement learning in multidimensional continuous action spaces","authors":"Jason Pazis, M. Lagoudakis","doi":"10.1109/ADPRL.2011.5967381","DOIUrl":null,"url":null,"abstract":"The majority of learning algorithms available today focus on approximating the state (V ) or state-action (Q) value function and efficient action selection comes as an afterthought. On the other hand, real-world problems tend to have large action spaces, where evaluating every possible action becomes impractical. This mismatch presents a major obstacle in successfully applying reinforcement learning to real-world problems. In this paper we present an effective approach to learning and acting in domains with multidimensional and/or continuous control variables where efficient action selection is embedded in the learning process. Instead of learning and representing the state or state-action value function of the MDP, we learn a value function over an implied augmented MDP, where states represent collections of actions in the original MDP and transitions represent choices eliminating parts of the action space at each step. Action selection in the original MDP is reduced to a binary search by the agent in the transformed MDP, with computational complexity logarithmic in the number of actions, or equivalently linear in the number of action dimensions. Our method can be combined with any discrete-action reinforcement learning algorithm for learning multidimensional continuous-action policies using a state value approximator in the transformed MDP. Our preliminary results with two well-known reinforcement learning algorithms (Least-Squares Policy Iteration and Fitted Q-Iteration) on two continuous action domains (1-dimensional inverted pendulum regulator, 2-dimensional bicycle balancing) demonstrate the viability and the potential of the proposed approach.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ADPRL.2011.5967381","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 32

Abstract

The majority of learning algorithms available today focus on approximating the state (V) or state-action (Q) value function, and efficient action selection comes as an afterthought. On the other hand, real-world problems tend to have large action spaces, where evaluating every possible action becomes impractical. This mismatch presents a major obstacle to successfully applying reinforcement learning to real-world problems. In this paper we present an effective approach to learning and acting in domains with multidimensional and/or continuous control variables, where efficient action selection is embedded in the learning process. Instead of learning and representing the state or state-action value function of the MDP, we learn a value function over an implied augmented MDP, where states represent collections of actions in the original MDP and transitions represent choices eliminating parts of the action space at each step. Action selection in the original MDP is reduced to a binary search by the agent in the transformed MDP, with computational complexity logarithmic in the number of actions, or equivalently linear in the number of action dimensions. Our method can be combined with any discrete-action reinforcement learning algorithm for learning multidimensional continuous-action policies using a state value approximator in the transformed MDP. Our preliminary results with two well-known reinforcement learning algorithms (Least-Squares Policy Iteration and Fitted Q-Iteration) on two continuous-action domains (1-dimensional inverted pendulum regulator, 2-dimensional bicycle balancing) demonstrate the viability and potential of the proposed approach.
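To make the action-selection idea concrete, the following Python sketch illustrates, under loose assumptions, how a learned value function over augmented states (the original state plus a remaining box of candidate actions) can turn action selection into a binary search over the action space. This is not the authors' implementation; the names `select_action`, `value_fn`, `lows`, `highs`, and `bits_per_dim` are hypothetical, and `value_fn` stands in for whatever approximator (e.g. one trained with LSPI or Fitted Q-Iteration in the transformed MDP) scores an augmented state.

```python
import numpy as np

def select_action(value_fn, state, lows, highs, bits_per_dim=8):
    """Illustrative sketch: pick a continuous action by repeatedly halving
    an action box and keeping the half the learned value function prefers.

    value_fn(state, lows, highs) is assumed to return the learned value of
    the augmented state (original state + current action box). The number
    of value queries is 2 * bits_per_dim * dims, i.e. linear in the number
    of action dimensions and logarithmic in the number of grid actions a
    uniform discretization of the same resolution would contain."""
    lows = np.asarray(lows, dtype=float).copy()
    highs = np.asarray(highs, dtype=float).copy()
    for _ in range(bits_per_dim):            # refinement passes
        for d in range(len(lows)):           # one binary split per dimension
            mid = 0.5 * (lows[d] + highs[d])

            lo_l, hi_l = lows.copy(), highs.copy()
            hi_l[d] = mid                    # lower half of dimension d
            lo_r, hi_r = lows.copy(), highs.copy()
            lo_r[d] = mid                    # upper half of dimension d

            if value_fn(state, lo_l, hi_l) >= value_fn(state, lo_r, hi_r):
                highs[d] = mid               # keep the lower half
            else:
                lows[d] = mid                # keep the upper half
    return 0.5 * (lows + highs)              # centre of the final box
```

For example, in the 1-dimensional inverted pendulum regulator the box would be the admissible torque interval, while in the 2-dimensional bicycle task it would cover both control variables; in either case the per-step cost grows with the number of dimensions rather than with the number of discretized actions.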