基于一般函数逼近的强化学习模型选择

Machine learning and knowledge discovery in databases : European Conference, ECML PKDD ... : proceedings. ECML PKDD (Conference) Pub Date : 2022-07-06 DOI:10.48550/arXiv.2207.02992

Avishek Ghosh, Sayak Ray Chowdhury

{"title":"基于一般函数逼近的强化学习模型选择","authors":"Avishek Ghosh, Sayak Ray Chowdhury","doi":"10.48550/arXiv.2207.02992","DOIUrl":null,"url":null,"abstract":"We consider model selection for classic Reinforcement Learning (RL) environments -- Multi Armed Bandits (MABs) and Markov Decision Processes (MDPs) -- under general function approximations. In the model selection framework, we do not know the function classes, denoted by $\\mathcal{F}$ and $\\mathcal{M}$, where the true models -- reward generating function for MABs and and transition kernel for MDPs -- lie, respectively. Instead, we are given $M$ nested function (hypothesis) classes such that true models are contained in at-least one such class. In this paper, we propose and analyze efficient model selection algorithms for MABs and MDPs, that \\emph{adapt} to the smallest function class (among the nested $M$ classes) containing the true underlying model. Under a separability assumption on the nested hypothesis classes, we show that the cumulative regret of our adaptive algorithms match to that of an oracle which knows the correct function classes (i.e., $\\cF$ and $\\cM$) a priori. Furthermore, for both the settings, we show that the cost of model selection is an additive term in the regret having weak (logarithmic) dependence on the learning horizon $T$.","PeriodicalId":74091,"journal":{"name":"Machine learning and knowledge discovery in databases : European Conference, ECML PKDD ... : proceedings. ECML PKDD (Conference)","volume":"1 1","pages":"148-164"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Model Selection in Reinforcement Learning with General Function Approximations\",\"authors\":\"Avishek Ghosh, Sayak Ray Chowdhury\",\"doi\":\"10.48550/arXiv.2207.02992\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider model selection for classic Reinforcement Learning (RL) environments -- Multi Armed Bandits (MABs) and Markov Decision Processes (MDPs) -- under general function approximations. In the model selection framework, we do not know the function classes, denoted by $\\\\mathcal{F}$ and $\\\\mathcal{M}$, where the true models -- reward generating function for MABs and and transition kernel for MDPs -- lie, respectively. Instead, we are given $M$ nested function (hypothesis) classes such that true models are contained in at-least one such class. In this paper, we propose and analyze efficient model selection algorithms for MABs and MDPs, that \\\\emph{adapt} to the smallest function class (among the nested $M$ classes) containing the true underlying model. Under a separability assumption on the nested hypothesis classes, we show that the cumulative regret of our adaptive algorithms match to that of an oracle which knows the correct function classes (i.e., $\\\\cF$ and $\\\\cM$) a priori. Furthermore, for both the settings, we show that the cost of model selection is an additive term in the regret having weak (logarithmic) dependence on the learning horizon $T$.\",\"PeriodicalId\":74091,\"journal\":{\"name\":\"Machine learning and knowledge discovery in databases : European Conference, ECML PKDD ... : proceedings. ECML PKDD (Conference)\",\"volume\":\"1 1\",\"pages\":\"148-164\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Machine learning and knowledge discovery in databases : European Conference, ECML PKDD ... : proceedings. ECML PKDD (Conference)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2207.02992\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine learning and knowledge discovery in databases : European Conference, ECML PKDD ... : proceedings. ECML PKDD (Conference)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2207.02992","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

我们考虑经典强化学习(RL)环境的模型选择-多武装强盗(MABs)和马尔可夫决策过程(mdp) -在一般函数近似下。在模型选择框架中，我们不知道用$\mathcal{F}$和$\mathcal{M}$表示的函数类，它们分别是真正的模型——mab的奖励生成函数和mdp的转换内核。相反，我们得到$M$嵌套函数(假设)类，这样真实的模型至少包含在一个这样的类中。在本文中，我们提出并分析了mab和mdp的有效模型选择算法，该算法\emph{适应}到包含真实底层模型的最小函数类(在嵌套的$M$类中)。在嵌套假设类的可分性假设下，我们证明了自适应算法的累积遗憾与先验地知道正确函数类(即$\cF$和$\cM$)的oracle的累积遗憾相匹配。此外，对于这两种设置，我们表明模型选择的成本是遗憾中的一个附加项，对学习视界具有弱(对数)依赖性$T$。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Model Selection in Reinforcement Learning with General Function Approximations

We consider model selection for classic Reinforcement Learning (RL) environments -- Multi Armed Bandits (MABs) and Markov Decision Processes (MDPs) -- under general function approximations. In the model selection framework, we do not know the function classes, denoted by $\mathcal{F}$ and $\mathcal{M}$, where the true models -- reward generating function for MABs and and transition kernel for MDPs -- lie, respectively. Instead, we are given $M$ nested function (hypothesis) classes such that true models are contained in at-least one such class. In this paper, we propose and analyze efficient model selection algorithms for MABs and MDPs, that \emph{adapt} to the smallest function class (among the nested $M$ classes) containing the true underlying model. Under a separability assumption on the nested hypothesis classes, we show that the cumulative regret of our adaptive algorithms match to that of an oracle which knows the correct function classes (i.e., $\cF$ and $\cM$) a priori. Furthermore, for both the settings, we show that the cost of model selection is an additive term in the regret having weak (logarithmic) dependence on the learning horizon $T$.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Machine learning and knowledge discovery in databases : European Conference, ECML PKDD ... : proceedings. ECML PKDD (Conference)

自引率

0.00%

发文量