{"title":"减方差保守策略迭代","authors":"Naman Agarwal, Brian Bullins, Karan Singh","doi":"10.48550/arXiv.2212.06283","DOIUrl":null,"url":null,"abstract":"We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $\\varepsilon$-functional local optimum from $O(\\varepsilon^{-4})$ to $O(\\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\\varepsilon$-global optimality after sampling $O(\\varepsilon^{-2})$ times, improving upon the previously established $O(\\varepsilon^{-3})$ sample requirement.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"354 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Variance-Reduced Conservative Policy Iteration\",\"authors\":\"Naman Agarwal, Brian Bullins, Karan Singh\",\"doi\":\"10.48550/arXiv.2212.06283\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $\\\\varepsilon$-functional local optimum from $O(\\\\varepsilon^{-4})$ to $O(\\\\varepsilon^{-3})$. 
Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\\\\varepsilon$-global optimality after sampling $O(\\\\varepsilon^{-2})$ times, improving upon the previously established $O(\\\\varepsilon^{-3})$ sample requirement.\",\"PeriodicalId\":267197,\"journal\":{\"name\":\"International Conference on Algorithmic Learning Theory\",\"volume\":\"354 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Algorithmic Learning Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2212.06283\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Algorithmic Learning Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2212.06283","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reduction-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing an $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
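The abstract does not include pseudocode, so as a rough illustration only, the sketch below shows the classical Conservative Policy Iteration mixture update (Kakade and Langford, 2002) that the paper's algorithm builds on: an ERM-style oracle proposes a new policy, which is mixed into the current policy with a small step size in function space. All names here (`cpi_mixture_update`, `erm_policy`, etc.) are hypothetical; the paper's actual contribution, the variance-reduced estimation inside this loop, is not reproduced here.

```python
import numpy as np

def cpi_mixture_update(policies, weights, new_policy, alpha):
    """One conservative update pi_{t+1} = (1 - alpha) * pi_t + alpha * pi'.

    The mixture is stored non-parametrically as a list of component
    policies with mixing weights, so the update lives in function
    space: it never touches any policy parameterization.
    """
    weights = [w * (1.0 - alpha) for w in weights]  # shrink old mass
    return policies + [new_policy], weights + [alpha]

def sample_action(policies, weights, state, rng):
    """Act with the mixture: draw a component policy according to its
    mixing weight, then query it on the current state."""
    idx = rng.choice(len(policies), p=np.asarray(weights))
    return policies[idx](state)

# Toy usage: start from a fixed base policy, then mix in a policy
# returned by a (hypothetical) ERM oracle with a small step size.
rng = np.random.default_rng(0)
policies, weights = [lambda s: 0], [1.0]
erm_policy = lambda s: 1  # stand-in for the ERM oracle's output
policies, weights = cpi_mixture_update(policies, weights, erm_policy, alpha=0.1)
action = sample_action(policies, weights, state=None, rng=rng)
```

Because the update is a convex combination of policies rather than a step in parameter space, convergence guarantees can be stated for the policy class itself, which is what lets the paper's sample-complexity bounds hold regardless of how the policies are parameterized.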