Directed exploration of policy space using support vector classifiers

2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) Pub Date : 2011-04-11 DOI:10.1109/ADPRL.2011.5967389

Ioannis Rexakis, M. Lagoudakis

{"title":"Directed exploration of policy space using support vector classifiers","authors":"Ioannis Rexakis, M. Lagoudakis","doi":"10.1109/ADPRL.2011.5967389","DOIUrl":null,"url":null,"abstract":"Good policies in reinforcement learning problems typically exhibit significant structure. Several recent learning approaches based on the approximate policy iteration scheme suggest the use of classifiers for capturing this structure and representing policies compactly. Nevertheless, the space of possible policies, even under such structured representations, is huge and needs to be explored carefully to avoid computationally expensive simulations (rollouts) needed to probe the improved policy and obtain training samples at various points over the state space. Regarding rollouts as a scarce resource, we propose a method for directed exploration of policy space using support vector classifiers. We use a collection of binary support vector classifiers to represent policies, whereby each of these classifiers corresponds to a single action and captures the parts of the state space where this action dominates over the other actions. After an initial training phase with rollouts uniformly distributed over the entire state space, we use the support vectors of the classifiers to identify the critical parts of the state space with boundaries between different action choices in the represented policy. The policy is subsequently improved by probing the state space only at points around the support vectors that are distributed perpendicularly to the separating border. This directed focus on critical parts of the state space iteratively leads to the gradual refinement and improvement of the underlying policy and delivers excellent control policies in only a few iterations with a conservative use of rollouts. We demonstrate the proposed approach on three standard reinforcement learning domains: inverted pendulum, mountain car, and acrobot.","PeriodicalId":406195,"journal":{"name":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ADPRL.2011.5967389","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

Abstract

Good policies in reinforcement learning problems typically exhibit significant structure. Several recent learning approaches based on the approximate policy iteration scheme suggest the use of classifiers for capturing this structure and representing policies compactly. Nevertheless, the space of possible policies, even under such structured representations, is huge and needs to be explored carefully to avoid computationally expensive simulations (rollouts) needed to probe the improved policy and obtain training samples at various points over the state space. Regarding rollouts as a scarce resource, we propose a method for directed exploration of policy space using support vector classifiers. We use a collection of binary support vector classifiers to represent policies, whereby each of these classifiers corresponds to a single action and captures the parts of the state space where this action dominates over the other actions. After an initial training phase with rollouts uniformly distributed over the entire state space, we use the support vectors of the classifiers to identify the critical parts of the state space with boundaries between different action choices in the represented policy. The policy is subsequently improved by probing the state space only at points around the support vectors that are distributed perpendicularly to the separating border. This directed focus on critical parts of the state space iteratively leads to the gradual refinement and improvement of the underlying policy and delivers excellent control policies in only a few iterations with a conservative use of rollouts. We demonstrate the proposed approach on three standard reinforcement learning domains: inverted pendulum, mountain car, and acrobot.

查看原文本刊更多论文

使用支持向量分类器对策略空间进行定向探索

在强化学习问题中，好的策略通常表现出显著的结构。最近几种基于近似策略迭代方案的学习方法建议使用分类器来捕获该结构并紧凑地表示策略。然而，即使在这种结构化表示下，可能策略的空间也是巨大的，需要仔细探索，以避免在状态空间的不同点上探测改进策略和获取训练样本所需的计算昂贵的模拟(推出)。考虑到部署是一种稀缺资源，我们提出了一种使用支持向量分类器对策略空间进行定向探索的方法。我们使用一组二进制支持向量分类器来表示策略，其中每个分类器对应于单个操作，并捕获状态空间中该操作优于其他操作的部分。在初始训练阶段均匀分布在整个状态空间之后，我们使用分类器的支持向量来识别状态空间的关键部分，并在表示策略中的不同动作选择之间设置边界。该策略随后通过仅探测垂直分布于分离边界的支持向量周围的点的状态空间而得到改进。这种对状态空间关键部分的直接关注迭代导致了底层策略的逐步细化和改进，并在少量迭代中通过保守地使用rollroll来交付出色的控制策略。我们在三个标准的强化学习领域:倒立摆、山地车和acrobot上展示了所提出的方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

自引率

0.00%

发文量