Approximation Algorithms for Partial-Information Based Stochastic Control with Markovian Rewards

S. Guha, Kamesh Munagala
{"title":"Approximation Algorithms for Partial-Information Based Stochastic Control with Markovian Rewards","authors":"S. Guha, Kamesh Munagala","doi":"10.1109/FOCS.2007.12","DOIUrl":null,"url":null,"abstract":"We consider a variant of the classic multi-armed bandit problem (MAB), which we call feedback MAB, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process with known parameters. The evolution of the Markov chain happens irrespective of whether the arm is played, and furthermore, the exact state of the Markov chain is only revealed to the player when the arm is played and the reward observed. At most one arm (or in general, M arms) can be played any time step. The goal is to design a policy for playing the arms in order to maximize the infinite horizon time average expected reward. This problem is an instance of a partially observable Markov decision process (POMDP), and a special case of the notoriously intractable \"restless bandit\" problem. Unlike the stochastic MAB problem, the feedback MAB problem does not admit to greedy index-based optimal policies. Vie state of the system at any time step encodes the beliefs about the states of different arms, and the policy decisions change these beliefs - this aspect complicates the design and analysis of simple algorithms. We design a constant factor approximation to the feedback MAB problem by solving and rounding a natural LP relaxation to this problem. As far as we are aware, this is the first approximation algorithm for a POMDP problem.","PeriodicalId":197431,"journal":{"name":"48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2007-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"59","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FOCS.2007.12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 59

Abstract

We consider a variant of the classic multi-armed bandit problem (MAB), which we call feedback MAB, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process with known parameters. The evolution of the Markov chain happens irrespective of whether the arm is played, and furthermore, the exact state of the Markov chain is only revealed to the player when the arm is played and the reward observed. At most one arm (or in general, M arms) can be played at any time step. The goal is to design a policy for playing the arms in order to maximize the infinite horizon time-average expected reward. This problem is an instance of a partially observable Markov decision process (POMDP), and a special case of the notoriously intractable "restless bandit" problem. Unlike the stochastic MAB problem, the feedback MAB problem does not admit greedy index-based optimal policies. The state of the system at any time step encodes the beliefs about the states of different arms, and the policy decisions change these beliefs - this aspect complicates the design and analysis of simple algorithms. We design a constant factor approximation to the feedback MAB problem by solving and rounding a natural LP relaxation to this problem. As far as we are aware, this is the first approximation algorithm for a POMDP problem.
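To make the belief-state dynamics described above concrete, here is a minimal sketch (not taken from the paper) of how the posterior probability that a single arm is "on" evolves between plays. It assumes a two-state chain with transition probabilities `p_on` (off to on) and `p_off` (on to off); these parameter names are introduced here purely for illustration.

```python
def evolve_belief(belief_on: float, p_on: float, p_off: float, steps: int = 1) -> float:
    """Propagate Pr[arm is 'on'] forward `steps` time steps while the arm is not played.

    The underlying two-state Markov chain evolves whether or not the arm is
    observed, so the belief is pushed through the transition kernel each step.
    """
    for _ in range(steps):
        belief_on = belief_on * (1.0 - p_off) + (1.0 - belief_on) * p_on
    return belief_on


def observe(arm_state_is_on: bool) -> float:
    """Playing the arm reveals its exact state, collapsing the belief to 0 or 1."""
    return 1.0 if arm_state_is_on else 0.0


# Example: an arm last observed 'on' three steps ago, with p_on = 0.1 and
# p_off = 0.2, is now believed to be 'on' with probability ~0.562.
print(evolve_belief(1.0, p_on=0.1, p_off=0.2, steps=3))
```

Because the belief only depends on the last observed state and the time elapsed since that observation, a policy's decisions (which arm to play) determine which beliefs get reset to 0 or 1, which is the coupling between actions and information that the abstract highlights.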