Variance-Reduced Conservative Policy Iteration

Naman Agarwal, Brian Bullins, Karan Singh
{"title":"减方差保守策略迭代","authors":"Naman Agarwal, Brian Bullins, Karan Singh","doi":"10.48550/arXiv.2212.06283","DOIUrl":null,"url":null,"abstract":"We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $\\varepsilon$-functional local optimum from $O(\\varepsilon^{-4})$ to $O(\\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\\varepsilon$-global optimality after sampling $O(\\varepsilon^{-2})$ times, improving upon the previously established $O(\\varepsilon^{-3})$ sample requirement.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"354 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Variance-Reduced Conservative Policy Iteration\",\"authors\":\"Naman Agarwal, Brian Bullins, Karan Singh\",\"doi\":\"10.48550/arXiv.2212.06283\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $\\\\varepsilon$-functional local optimum from $O(\\\\varepsilon^{-4})$ to $O(\\\\varepsilon^{-3})$. 
Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\\\\varepsilon$-global optimality after sampling $O(\\\\varepsilon^{-2})$ times, improving upon the previously established $O(\\\\varepsilon^{-3})$ sample requirement.\",\"PeriodicalId\":267197,\"journal\":{\"name\":\"International Conference on Algorithmic Learning Theory\",\"volume\":\"354 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Algorithmic Learning Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2212.06283\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Algorithmic Learning Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2212.06283","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reduction-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing an $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
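For orientation, the following is a minimal tabular sketch of the base Conservative Policy Iteration loop (Kakade & Langford, 2002) that the paper builds on, not the paper's variance-reduced algorithm. The exact policy evaluation and per-state argmax here stand in for the sampled, ERM-based improvement step the paper analyzes; all function and variable names are my own for illustration, and the comments indicate where variance-reduced advantage estimation would enter.

    import numpy as np

    def conservative_policy_iteration(P, R, gamma=0.9, alpha=0.1, iterations=200):
        """Tabular sketch of Conservative Policy Iteration (Kakade & Langford, 2002).

        P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
        Each round, a greedy improvement step (here an exact per-state argmax,
        standing in for an ERM reduction over a restricted policy class) is mixed
        into the current stochastic policy with step size alpha, so progress is
        measured in the space of policies rather than of parameters.
        """
        S, A, _ = P.shape
        policy = np.full((S, A), 1.0 / A)  # start from the uniform policy

        for _ in range(iterations):
            # Policy evaluation: solve (I - gamma * P_pi) V = R_pi for V.
            P_pi = np.einsum("sa,sab->sb", policy, P)
            R_pi = np.einsum("sa,sa->s", policy, R)
            V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

            # Q-values and advantages of the current policy.
            Q = R + gamma * np.einsum("sab,b->sa", P, V)
            advantage = Q - V[:, None]

            # "ERM" step: the advantage-maximizing deterministic policy.  In the
            # paper this step is an empirical risk minimization over a policy
            # class using sampled rollouts; the variance-reduced variant reuses
            # earlier advantage estimates (control-variate style) to cut the
            # number of fresh samples needed, which is the source of the
            # O(eps^-4) -> O(eps^-3) improvement stated in the abstract.
            greedy = np.zeros_like(policy)
            greedy[np.arange(S), advantage.argmax(axis=1)] = 1.0

            # Conservative mixture update keeps the new policy near the old one.
            policy = (1 - alpha) * policy + alpha * greedy

        return policy

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        S, A = 5, 3
        P = rng.dirichlet(np.ones(S), size=(S, A))  # random transition kernel
        R = rng.uniform(size=(S, A))                # random rewards
        print(np.round(conservative_policy_iteration(P, R), 3))

The small mixture step size alpha is what makes the update "conservative": it bounds how far the state-visitation distribution can shift per iteration, which is what lets the functional local-optimality guarantees go through.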