Multimodal Parameter-exploring Policy Gradients

Frank Sehnke, Alex Graves, Christian Osendorfer, J. Schmidhuber
{"title":"Multimodal Parameter-exploring Policy Gradients","authors":"Frank Sehnke, Alex Graves, Christian Osendorfer, J. Schmidhuber","doi":"10.1109/ICMLA.2010.24","DOIUrl":null,"url":null,"abstract":"Policy Gradients with Parameter-based Exploration (PGPE) is a novel model-free reinforcement learning method that alleviates the problem of high-variance gradient estimates encountered in normal policy gradient methods. It has been shown to drastically speed up convergence for several large-scale reinforcement learning tasks. However the independent normal distributions used by PGPE to search through parameter space are inadequate for some problems with multimodal reward surfaces. This paper extends the basic PGPE algorithm to use multimodal mixture distributions for each parameter, while remaining efficient. Experimental results on the Rastrigin function and the inverted pendulum benchmark demonstrate the advantages of this modification, with faster convergence to better optima.","PeriodicalId":336514,"journal":{"name":"2010 Ninth International Conference on Machine Learning and Applications","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 Ninth International Conference on Machine Learning and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2010.24","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

Policy Gradients with Parameter-based Exploration (PGPE) is a novel model-free reinforcement learning method that alleviates the problem of high-variance gradient estimates encountered in normal policy gradient methods. It has been shown to drastically speed up convergence for several large-scale reinforcement learning tasks. However, the independent normal distributions used by PGPE to search through parameter space are inadequate for some problems with multimodal reward surfaces. This paper extends the basic PGPE algorithm to use multimodal mixture distributions for each parameter, while remaining efficient. Experimental results on the Rastrigin function and the inverted pendulum benchmark demonstrate the advantages of this modification, with faster convergence to better optima.
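To make the idea concrete, below is a minimal NumPy sketch of PGPE with a per-parameter Gaussian mixture, run on the (negated) Rastrigin function mentioned in the abstract. Everything here is an illustrative assumption rather than the paper's exact algorithm: the function names (`pgpe_mixture`, `rastrigin_reward`), the learning rates, the moving-average baseline, and in particular the mixture handling (mixing weights are kept fixed and uniform, whereas the paper adapts the mixture distributions) are all simplifications. The mean and standard-deviation updates follow the standard PGPE gradient estimates, applied only to the component that generated each parameter sample.

```python
# Minimal PGPE-with-mixtures sketch (NumPy only). Illustrative assumptions
# throughout: fixed uniform mixing weights, moving-average baseline, and
# untuned learning rates; not a reproduction of the paper's update rules.
import numpy as np

rng = np.random.default_rng(0)

def rastrigin_reward(theta):
    # Negated Rastrigin function: maximum reward 0 at theta = 0.
    return -(10 * theta.size + np.sum(theta**2 - 10 * np.cos(2 * np.pi * theta)))

def pgpe_mixture(reward_fn, dim=2, components=2, iters=5000,
                 lr_mu=0.005, lr_sigma=0.002):
    # One Gaussian mixture per parameter: means mu[i, k] and stds sigma[i, k].
    # Mixing weights are implicit and uniform (a simplification).
    mu = rng.uniform(-5, 5, size=(dim, components))
    sigma = np.full((dim, components), 2.0)
    baseline = None
    idx = np.arange(dim)
    for _ in range(iters):
        # Sample one component per parameter, then sample theta from it.
        k = rng.integers(components, size=dim)
        eps = rng.standard_normal(dim) * sigma[idx, k]
        theta = mu[idx, k] + eps
        r = reward_fn(theta)
        # Moving-average reward baseline, as in basic PGPE.
        baseline = r if baseline is None else 0.9 * baseline + 0.1 * r
        adv = r - baseline
        # Standard PGPE updates, applied to the sampled component only:
        #   d_mu    = (r - b) * (theta - mu)
        #   d_sigma = (r - b) * ((theta - mu)^2 - sigma^2) / sigma
        mu[idx, k] += lr_mu * adv * eps
        sigma[idx, k] += lr_sigma * adv * (eps**2 - sigma[idx, k]**2) / sigma[idx, k]
        sigma = np.clip(sigma, 1e-3, None)  # keep exploration noise positive
    return mu, sigma

mu, sigma = pgpe_mixture(rastrigin_reward)
print("component means per parameter:\n", mu)
```

Note that sampling a single mixture component per parameter per rollout keeps the per-sample cost the same as standard PGPE, which is one plausible reading of the abstract's claim that the extension "remains efficient" while the separate components let different modes of the reward surface be tracked simultaneously.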