POLICY IMPROVEMENT IN MARKOV DECISION PROCESSES AND MARKOV POTENTIAL THEORY

M. Yasuda
{"title":"POLICY IMPROVEMENT IN MARKOV DECISION PROCESSES AND MARKOV POTENTIAL THEORY","authors":"M. Yasuda","doi":"10.5109/13123","DOIUrl":null,"url":null,"abstract":"A connection between Markov Decision Process (MDP) and Markov potential theory has two sides. One is the potential theoritic development of MDP and the other is the alternative proof of the results in MDP owing to Markov potential theory. Shaufele [12] belongs to the later, but it seems interesting from the standpoint of the mathematical programming to establish the development of MDP by using certain potential notion. Several approaches have been tried. Watanabe [16] interpreted the monotonicity of Howard's iteration [8] in the relation to the a dual problem of Linear Programming. By the property of a potential kernel, Furukawa [6] and Aso and Kimura [1] proved a policy improvement. A formulation of MDP by potential theoretic notion has been tried by Hordijk [7]. In many cases it is restricted to a transient potential theory because its analysis is simpler. In this paper we shall define a new potential in order to serve a general policy improvement. Our aim is to expose theorems which are available to several cases of MDP. By the potential theoretic terms, we can interpret policy improvements of MDP as follows ; The increase of rewards in MDP consists of the potential with a charge of an increment of the policy improvement and a regular function. If it is transient, then the potential is reduced to the ordinary one and the regular function equals zero. Hence this consists with that of Watanabe [16]. The merit of the potential is that it connects the policy improvement with the increment of rewards. We shall consider the following cost criteria of MDP ; (1) discounted case, (2) average case, (3) nearly optimal case and (4) sensitive discounted case. Case (1) and (2) are representitive and discussed by many authors. Especially we list up Howard [7] and Blackwell [2], [3] for (1) and Howard [8] and Derman [4], [5] for (2). Case (3) is due to Blackwell [2]. Extending case (3), case (4) is studied by Miller and Veinott [11] and Veinott [14], [15].","PeriodicalId":287765,"journal":{"name":"Bulletin of Mathematical Statistics","volume":"164 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1978-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bulletin of Mathematical Statistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5109/13123","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

A connection between Markov Decision Processes (MDP) and Markov potential theory has two sides. One is the potential theoretic development of MDP, and the other is the alternative proof of results in MDP by means of Markov potential theory. Shaufele [12] belongs to the latter, but it seems interesting from the standpoint of mathematical programming to establish the development of MDP by using a certain potential notion. Several approaches have been tried. Watanabe [16] interpreted the monotonicity of Howard's iteration [8] in relation to the dual problem of Linear Programming. Using the property of a potential kernel, Furukawa [6] and Aso and Kimura [1] proved a policy improvement. A formulation of MDP in potential theoretic terms has been attempted by Hordijk [7]. In many cases the treatment is restricted to transient potential theory because its analysis is simpler. In this paper we shall define a new potential in order to serve a general policy improvement. Our aim is to present theorems that are applicable to several cases of MDP. In potential theoretic terms, we can interpret policy improvements of MDP as follows: the increase of rewards in MDP consists of the potential whose charge is the increment of the policy improvement, plus a regular function. If the process is transient, then the potential reduces to the ordinary one and the regular function equals zero; hence this is consistent with the result of Watanabe [16]. The merit of the potential is that it connects the policy improvement with the increment of rewards. We shall consider the following cost criteria of MDP: (1) the discounted case, (2) the average case, (3) the nearly optimal case, and (4) the sensitive discounted case. Cases (1) and (2) are representative and have been discussed by many authors; in particular we cite Howard [7] and Blackwell [2], [3] for (1), and Howard [8] and Derman [4], [5] for (2). Case (3) is due to Blackwell [2]. Extending case (3), case (4) is studied by Miller and Veinott [11] and Veinott [14], [15].
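The decomposition described in the abstract can be made concrete in the discounted case. The following is a minimal sketch; the notation ($P_\sigma$ for the transition kernel of policy $\sigma$, $r_\sigma$ for its one-step reward, $\beta \in [0,1)$ for the discount factor, $v_\pi$ for the value of policy $\pi$) is ours, chosen for illustration, and is not taken from the paper itself:

\[
v_\sigma - v_\pi
= \underbrace{\Bigl(\sum_{t \ge 0} \beta^{t} P_\sigma^{t}\Bigr)}_{G_\sigma:\ \text{potential kernel}}
  \underbrace{\bigl(r_\sigma + \beta P_\sigma v_\pi - v_\pi\bigr)}_{c:\ \text{charge, the increment of the improvement step}}
\]

This identity follows by writing $v_\sigma - v_\pi = c + \beta P_\sigma (v_\sigma - v_\pi)$ and iterating. If $\sigma$ improves on $\pi$, the charge $c$ is nonnegative and $G_\sigma$ is a positive operator, which recovers the monotonicity of Howard's iteration. In this transient (discounted) setting the regular part of the decomposition vanishes, matching the abstract's remark; in the general, non-transient cases one expects an additional regular term $h$ with $P_\sigma h = h$.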