{"title":"Variance-Reduced Conservative Policy Iteration","authors":"Naman Agarwal, Brian Bullins, Karan Singh","doi":"10.48550/arXiv.2212.06283","DOIUrl":null,"url":null,"abstract":"We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $\\varepsilon$-functional local optimum from $O(\\varepsilon^{-4})$ to $O(\\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\\varepsilon$-global optimality after sampling $O(\\varepsilon^{-2})$ times, improving upon the previously established $O(\\varepsilon^{-3})$ sample requirement.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"354 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Algorithmic Learning Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2212.06283","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reduction-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing an $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
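To make the two ingredients named in the abstract concrete, below is a minimal Python sketch of Conservative Policy Iteration with a recursive, variance-reduced value estimate. It assumes a tabular MDP; `estimate_q` and `greedy_oracle` are hypothetical stand-ins for the rollout sampler and the ERM (greedy-policy) oracle over the policy class, and the difference-based correction is an illustrative SARAH/STORM-style estimator rather than the paper's exact construction.

```python
import numpy as np

def conservative_mix(pi, pi_greedy, alpha):
    """CPI update: pi_{t+1} = (1 - alpha) * pi_t + alpha * pi'."""
    return (1.0 - alpha) * pi + alpha * pi_greedy

def vr_cpi(estimate_q, greedy_oracle, n_states, n_actions,
           iters=50, alpha=0.1, big_batch=1000, small_batch=100):
    # estimate_q(pi, n_samples) -> (n_states, n_actions) Q-value estimate
    # greedy_oracle(q_hat)      -> policy maximizing q_hat within the class
    # Both are assumed interfaces, not part of the paper's notation.
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform start
    q_hat = estimate_q(pi, n_samples=big_batch)           # anchor estimate
    prev_pi = pi
    for t in range(iters):
        if t > 0:
            # Variance reduction: instead of re-estimating Q from scratch,
            # spend a small batch measuring the *change* in Q between
            # consecutive (close) policies and correct the running estimate.
            dq = (estimate_q(pi, n_samples=small_batch)
                  - estimate_q(prev_pi, n_samples=small_batch))
            q_hat = q_hat + dq
        prev_pi = pi
        # ERM / greedy step over the policy class, then conservative mixing,
        # which keeps successive policies close so the difference estimate
        # stays low-variance.
        pi_greedy = greedy_oracle(q_hat)
        pi = conservative_mix(pi, pi_greedy, alpha)
    return pi
```

The sketch shows why variance reduction helps here: because the CPI mixing step keeps $\pi_{t+1}$ within an $\alpha$-ball of $\pi_t$ in function space, the Q-value difference between consecutive policies is small and can be estimated with far fewer samples than a fresh estimate, which is the intuition behind the improved $O(\varepsilon^{-3})$ local (and $O(\varepsilon^{-2})$ global) sample complexity.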