An ϵ-Greedy Multiarmed Bandit Approach to Markov Decision Processes

IF 0.9 Q4 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS
Stats · Pub Date: 2023-01-01 · DOI: 10.3390/stats6010006
Isa Muqattash, Jiaqiao Hu
{"title":"一种ϵ-Greedy马尔可夫决策过程的多臂强盗方法","authors":"Isa Muqattash, Jiaqiao Hu","doi":"10.3390/stats6010006","DOIUrl":null,"url":null,"abstract":"We present REGA, a new adaptive-sampling-based algorithm for the control of finite-horizon Markov decision processes (MDPs) with very large state spaces and small action spaces. We apply a variant of the ϵ-greedy multiarmed bandit algorithm to each stage of the MDP in a recursive manner, thus computing an estimation of the “reward-to-go” value at each stage of the MDP. We provide a finite-time analysis of REGA. In particular, we provide a bound on the probability that the approximation error exceeds a given threshold, where the bound is given in terms of the number of samples collected at each stage of the MDP. We empirically compare REGA against another sampling-based algorithm called RASA by running simulations against the SysAdmin benchmark problem with 210 states. The results show that REGA and RASA achieved similar performance. Moreover, REGA and RASA empirically outperformed an implementation of the algorithm that uses the “original” ϵ-greedy algorithm that commonly appears in the literature.","PeriodicalId":93142,"journal":{"name":"Stats","volume":" ","pages":""},"PeriodicalIF":0.9000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An ϵ-Greedy Multiarmed Bandit Approach to Markov Decision Processes\",\"authors\":\"Isa Muqattash, Jiaqiao Hu\",\"doi\":\"10.3390/stats6010006\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present REGA, a new adaptive-sampling-based algorithm for the control of finite-horizon Markov decision processes (MDPs) with very large state spaces and small action spaces. We apply a variant of the ϵ-greedy multiarmed bandit algorithm to each stage of the MDP in a recursive manner, thus computing an estimation of the “reward-to-go” value at each stage of the MDP. We provide a finite-time analysis of REGA. In particular, we provide a bound on the probability that the approximation error exceeds a given threshold, where the bound is given in terms of the number of samples collected at each stage of the MDP. We empirically compare REGA against another sampling-based algorithm called RASA by running simulations against the SysAdmin benchmark problem with 210 states. The results show that REGA and RASA achieved similar performance. 
Moreover, REGA and RASA empirically outperformed an implementation of the algorithm that uses the “original” ϵ-greedy algorithm that commonly appears in the literature.\",\"PeriodicalId\":93142,\"journal\":{\"name\":\"Stats\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.9000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Stats\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/stats6010006\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Stats","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/stats6010006","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

We present REGA, a new adaptive-sampling-based algorithm for the control of finite-horizon Markov decision processes (MDPs) with very large state spaces and small action spaces. We apply a variant of the ϵ-greedy multiarmed bandit algorithm to each stage of the MDP in a recursive manner, thus computing an estimate of the “reward-to-go” value at each stage of the MDP. We provide a finite-time analysis of REGA. In particular, we provide a bound on the probability that the approximation error exceeds a given threshold, where the bound is given in terms of the number of samples collected at each stage of the MDP. We empirically compare REGA against another sampling-based algorithm called RASA by running simulations against the SysAdmin benchmark problem with 2^10 states. The results show that REGA and RASA achieved similar performance. Moreover, REGA and RASA empirically outperformed an implementation of the algorithm that uses the “original” ϵ-greedy algorithm that commonly appears in the literature.
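The abstract gives only a high-level description of this recursive scheme. As a rough illustration, the sketch below shows the generic idea of running an ϵ-greedy bandit over the actions at each stage and recursing on sampled next states to estimate the reward-to-go. It is not the authors' REGA algorithm (which uses its own ϵ-greedy variant, exploration schedule, and sample-allocation rule); the simulator interface `sim(state, action, stage) -> (reward, next_state)`, the fixed exploration rate `epsilon`, and the per-stage budget `num_samples` are illustrative assumptions.

```python
import random


def estimate_value(sim, state, stage, horizon, actions, num_samples, epsilon):
    """ϵ-greedy bandit estimate of the reward-to-go from `state` at `stage`.

    `sim(state, action, stage)` is assumed to return a (reward, next_state)
    sample from the MDP's generative model.
    """
    if stage >= horizon:
        return 0.0

    counts = {a: 0 for a in actions}   # number of pulls per action (arm)
    q_hat = {a: 0.0 for a in actions}  # running mean Q-value per action

    for _ in range(num_samples):
        # ϵ-greedy arm selection: explore uniformly with probability ϵ,
        # otherwise pull the empirically best arm so far.
        if random.random() < epsilon or all(c == 0 for c in counts.values()):
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q_hat[a])

        # Simulate one transition, then recurse to estimate the value of the
        # sampled next state at the next stage.
        reward, next_state = sim(state, action, stage)
        future = estimate_value(sim, next_state, stage + 1, horizon,
                                actions, num_samples, epsilon)

        # Incremental update of the running mean for the pulled arm.
        counts[action] += 1
        q_hat[action] += (reward + future - q_hat[action]) / counts[action]

    # The estimated reward-to-go is the value of the empirically best arm.
    return max(q_hat.values())


if __name__ == "__main__":
    # Toy two-state, two-action chain used only to exercise the sketch.
    def toy_sim(state, action, stage):
        reward = 1.0 if state == action else 0.0
        return reward, action

    print(estimate_value(toy_sim, state=0, stage=0, horizon=3,
                         actions=[0, 1], num_samples=20, epsilon=0.1))
```

Note that the recursion multiplies the per-stage budget across stages, so the total number of simulated transitions grows roughly like num_samples^horizon; this is why algorithms of this family target problems with small action spaces and short horizons, while the size of the state space does not enter the sampling cost.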
Source journal: Stats — CiteScore 0.60 · Self-citation rate 0.00% · Review time 7 weeks