{"title":"观测不完善的不安定强盗的低复杂度算法","authors":"Keqin Liu, Richard Weber, Chengzhong Zhang","doi":"10.1007/s00186-024-00868-x","DOIUrl":null,"url":null,"abstract":"<p>We consider a class of restless bandit problems that finds a broad application area in reinforcement learning and stochastic optimization. We consider <i>N</i> independent discrete-time Markov processes, each of which had two possible states: 1 and 0 (‘good’ and ‘bad’). Only if a process is both in state 1 and observed to be so does reward accrue. The aim is to maximize the expected discounted sum of returns over the infinite horizon subject to a constraint that only <i>M</i> <span>\\((<N)\\)</span> processes may be observed at each step. Observation is error-prone: there are known probabilities that state 1 (0) will be observed as 0 (1). From this one knows, at any time <i>t</i>, a probability that process <i>i</i> is in state 1. The resulting system may be modeled as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems with even finite state spaces are PSPACE-HARD in general. We propose a novel approach for simplifying the dynamic programming equations of this class of restless bandits and develop a low-complexity algorithm that achieves a strong performance and is readily extensible to the general restless bandit model with observation errors. Under certain conditions, we establish the existence (indexability) of Whittle index and its equivalence to our algorithm. When those conditions do not hold, we show by numerical experiments the near-optimal performance of our algorithm in the general parametric space. Furthermore, we theoretically prove the optimality of our algorithm for homogeneous systems.</p>","PeriodicalId":49862,"journal":{"name":"Mathematical Methods of Operations Research","volume":null,"pages":null},"PeriodicalIF":0.9000,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Low-complexity algorithm for restless bandits with imperfect observations\",\"authors\":\"Keqin Liu, Richard Weber, Chengzhong Zhang\",\"doi\":\"10.1007/s00186-024-00868-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>We consider a class of restless bandit problems that finds a broad application area in reinforcement learning and stochastic optimization. We consider <i>N</i> independent discrete-time Markov processes, each of which had two possible states: 1 and 0 (‘good’ and ‘bad’). Only if a process is both in state 1 and observed to be so does reward accrue. The aim is to maximize the expected discounted sum of returns over the infinite horizon subject to a constraint that only <i>M</i> <span>\\\\((<N)\\\\)</span> processes may be observed at each step. Observation is error-prone: there are known probabilities that state 1 (0) will be observed as 0 (1). From this one knows, at any time <i>t</i>, a probability that process <i>i</i> is in state 1. The resulting system may be modeled as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems with even finite state spaces are PSPACE-HARD in general. We propose a novel approach for simplifying the dynamic programming equations of this class of restless bandits and develop a low-complexity algorithm that achieves a strong performance and is readily extensible to the general restless bandit model with observation errors. 
Under certain conditions, we establish the existence (indexability) of Whittle index and its equivalence to our algorithm. When those conditions do not hold, we show by numerical experiments the near-optimal performance of our algorithm in the general parametric space. Furthermore, we theoretically prove the optimality of our algorithm for homogeneous systems.</p>\",\"PeriodicalId\":49862,\"journal\":{\"name\":\"Mathematical Methods of Operations Research\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.9000,\"publicationDate\":\"2024-09-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Mathematical Methods of Operations Research\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1007/s00186-024-00868-x\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"MATHEMATICS, APPLIED\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Mathematical Methods of Operations Research","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1007/s00186-024-00868-x","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MATHEMATICS, APPLIED","Score":null,"Total":0}
Abstract
We consider a class of restless bandit problems that finds broad application in reinforcement learning and stochastic optimization. We consider N independent discrete-time Markov processes, each of which has two possible states: 1 and 0 (‘good’ and ‘bad’). Only if a process is both in state 1 and observed to be so does reward accrue. The aim is to maximize the expected discounted sum of returns over the infinite horizon, subject to the constraint that only M (< N) processes may be observed at each step. Observation is error-prone: there are known probabilities that state 1 (0) will be observed as 0 (1). From this, one knows at any time t the probability that process i is in state 1. The resulting system may be modeled as a restless multi-armed bandit problem with an information state space of uncountable cardinality. Restless bandit problems with even finite state spaces are PSPACE-hard in general. We propose a novel approach for simplifying the dynamic programming equations of this class of restless bandits and develop a low-complexity algorithm that achieves strong performance and is readily extensible to the general restless bandit model with observation errors. Under certain conditions, we establish the existence (indexability) of the Whittle index and its equivalence to our algorithm. When those conditions do not hold, we show by numerical experiments the near-optimal performance of our algorithm in the general parameter space. Furthermore, we theoretically prove the optimality of our algorithm for homogeneous systems.
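The "information state" in the abstract is the posterior probability that each process is in state 1, corrected by Bayes' rule after each error-prone observation and propagated through the Markov dynamics. The following Python sketch illustrates that belief update and the M-of-N observation constraint using a simple myopic (greedy) selection rule; all parameter names (p11, p01, eps0, eps1) and the greedy rule are illustrative assumptions and do not reproduce the paper's index-based algorithm.

```python
import numpy as np

# Minimal illustrative sketch, not the paper's algorithm or notation.
N = 5        # number of independent two-state Markov processes ("arms")
M = 2        # number of processes that may be observed at each step (M < N)
p11 = 0.8    # P(state 1 at t+1 | state 1 at t), assumed common to all arms
p01 = 0.3    # P(state 1 at t+1 | state 0 at t)
eps0 = 0.1   # P(observe 0 | true state 1)  -- false-negative probability
eps1 = 0.05  # P(observe 1 | true state 0)  -- false-positive probability

def predict(b):
    """Propagate the belief b = P(state 1) one step through the Markov chain."""
    return b * p11 + (1.0 - b) * p01

def bayes_update(b, obs):
    """Correct the belief b after an error-prone observation obs in {0, 1}."""
    if obs == 1:
        num = b * (1.0 - eps0)                # truly 1, correctly read as 1
        den = num + (1.0 - b) * eps1          # or truly 0, misread as 1
    else:
        num = b * eps0                        # truly 1, misread as 0
        den = num + (1.0 - b) * (1.0 - eps1)  # or truly 0, correctly read as 0
    return num / den

rng = np.random.default_rng(0)
beliefs = np.full(N, 0.5)              # initial information (belief) state
states = rng.integers(0, 2, size=N)    # hidden true states, used for simulation only

for t in range(3):
    # Myopic policy: observe the M arms with the highest belief of being in state 1.
    chosen = set(np.argsort(beliefs)[-M:])
    reward = 0.0
    for i in range(N):
        if i in chosen:
            # Error-prone observation of the hidden state.
            if states[i] == 1:
                obs = 0 if rng.random() < eps0 else 1
            else:
                obs = 1 if rng.random() < eps1 else 0
            if states[i] == 1 and obs == 1:
                reward += 1.0          # reward accrues only when state 1 is observed as 1
            beliefs[i] = bayes_update(beliefs[i], obs)
        # All arms are "restless": states evolve whether observed or not.
        p = p11 if states[i] == 1 else p01
        states[i] = 1 if rng.random() < p else 0
        beliefs[i] = predict(beliefs[i])
    print(f"t={t}: reward={reward}, beliefs={np.round(beliefs, 3)}")
```

In the paper's approach, the arms to observe are chosen by an index computed from each belief state rather than by the myopic rule above; under the conditions stated in the abstract, that index coincides with the Whittle index.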
Journal Description
This peer-reviewed journal publishes original and high-quality articles on important mathematical and computational aspects of operations research, in particular in the areas of continuous and discrete mathematical optimization, stochastics, and game theory. Theoretically oriented papers are expected to include explicit motivations of assumptions and results, while application-oriented papers need to contain substantial mathematical contributions. Suggestions for algorithms should be accompanied by numerical evidence for their superiority over state-of-the-art methods. Articles must be of interest to a large audience in operations research, written in clear and correct English, and typeset in LaTeX. A special section contains invited tutorial papers on advanced mathematical or computational aspects of operations research, aiming at making such methodologies accessible to a wider audience.
All papers are refereed. The emphasis is on originality, quality, and importance.