{"title":"利用线性函数逼近进行可证明高效的无限视距平均回报强化学习","authors":"Woojin Chae, Dabeen Lee","doi":"arxiv-2409.10772","DOIUrl":null,"url":null,"abstract":"This paper proposes a computationally tractable algorithm for learning\ninfinite-horizon average-reward linear Markov decision processes (MDPs) and\nlinear mixture MDPs under the Bellman optimality condition. While guaranteeing\ncomputational efficiency, our algorithm for linear MDPs achieves the best-known\nregret upper bound of\n$\\widetilde{\\mathcal{O}}(d^{3/2}\\mathrm{sp}(v^*)\\sqrt{T})$ over $T$ time steps\nwhere $\\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$\nis the dimension of the feature mapping. For linear mixture MDPs, our algorithm\nattains a regret bound of\n$\\widetilde{\\mathcal{O}}(d\\cdot\\mathrm{sp}(v^*)\\sqrt{T})$. The algorithm\napplies novel techniques to control the covering number of the value function\nclass and the span of optimistic estimators of the value function, which is of\nindependent interest.","PeriodicalId":501286,"journal":{"name":"arXiv - MATH - Optimization and Control","volume":"3 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation\",\"authors\":\"Woojin Chae, Dabeen Lee\",\"doi\":\"arxiv-2409.10772\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes a computationally tractable algorithm for learning\\ninfinite-horizon average-reward linear Markov decision processes (MDPs) and\\nlinear mixture MDPs under the Bellman optimality condition. While guaranteeing\\ncomputational efficiency, our algorithm for linear MDPs achieves the best-known\\nregret upper bound of\\n$\\\\widetilde{\\\\mathcal{O}}(d^{3/2}\\\\mathrm{sp}(v^*)\\\\sqrt{T})$ over $T$ time steps\\nwhere $\\\\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$\\nis the dimension of the feature mapping. For linear mixture MDPs, our algorithm\\nattains a regret bound of\\n$\\\\widetilde{\\\\mathcal{O}}(d\\\\cdot\\\\mathrm{sp}(v^*)\\\\sqrt{T})$. The algorithm\\napplies novel techniques to control the covering number of the value function\\nclass and the span of optimistic estimators of the value function, which is of\\nindependent interest.\",\"PeriodicalId\":501286,\"journal\":{\"name\":\"arXiv - MATH - Optimization and Control\",\"volume\":\"3 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - MATH - Optimization and Control\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.10772\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - MATH - Optimization and Control","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10772","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation
This paper proposes a computationally tractable algorithm for learning
infinite-horizon average-reward linear Markov decision processes (MDPs) and
linear mixture MDPs under the Bellman optimality condition. While guaranteeing
computational efficiency, our algorithm for linear MDPs achieves the best-known
regret upper bound of
$\widetilde{\mathcal{O}}(d^{3/2}\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps
where $\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$
is the dimension of the feature mapping. For linear mixture MDPs, our algorithm
attains a regret bound of
$\widetilde{\mathcal{O}}(d\cdot\mathrm{sp}(v^*)\sqrt{T})$. The algorithm
applies novel techniques to control the covering number of the value function
class and the span of optimistic estimators of the value function; these
techniques are of independent interest.
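
For context, here is a minimal sketch of the setting behind these bounds, written from the standard definitions used in the linear-MDP literature (not taken verbatim from the paper): transitions and rewards are assumed linear in a known $d$-dimensional feature map, $\mathrm{sp}(v^*)$ is the span (max minus min) of the optimal bias function, and regret is measured against the optimal long-run average reward.

% Standard linear MDP structure: transitions and rewards are linear in a
% known feature map \phi : S \times A \to \mathbb{R}^d (an assumption of the
% model class, stated here for illustration).
\[
  P(s' \mid s, a) = \langle \phi(s,a), \mu(s') \rangle, \qquad
  r(s, a) = \langle \phi(s,a), \theta \rangle .
\]
% Span of the optimal bias function v^*, the quantity scaling the bounds:
\[
  \mathrm{sp}(v^*) = \max_{s} v^*(s) - \min_{s} v^*(s).
\]
% Regret over T steps against the optimal average reward J^*; the abstract
% states it is bounded by \widetilde{\mathcal{O}}(d^{3/2}\,\mathrm{sp}(v^*)\sqrt{T})
% for linear MDPs and \widetilde{\mathcal{O}}(d\,\mathrm{sp}(v^*)\sqrt{T}) for
% linear mixture MDPs.
\[
  \mathrm{Regret}(T) = \sum_{t=1}^{T} \bigl( J^* - r(s_t, a_t) \bigr).
\]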