Interval Dominance based Structural Results for Markov Decision Process
V. Krishnamurthy
DOI: 10.48550/arXiv.2203.10618
Published: 2022-03-20
Citations: 0
Abstract
Structural results impose sufficient conditions on the model parameters of a Markov decision process (MDP) so that the optimal policy is an increasing function of the underlying state. The classical assumptions for MDP structural results require supermodularity of the rewards and transition probabilities. However, supermodularity does not hold in many applications. This paper uses a sufficient condition for interval dominance (called (I)), proposed in the microeconomics literature, to obtain structural results for MDPs under more general conditions. We present several MDP examples where supermodularity does not hold, yet (I) holds, and so the optimal policy is monotone; these include sigmoidal rewards (arising in prospect theory for human decision making) and bi-diagonal and perturbed bi-diagonal transition matrices (in optimal allocation problems). We also consider MDPs with TP3 transition matrices and concave value functions. Finally, reinforcement learning algorithms that exploit the differential sparse structure of the optimal monotone policy are discussed.
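To make the notion of a monotone optimal policy concrete, here is a minimal toy sketch (not from the paper): a 5-state, 2-action MDP whose reward is supermodular, so value iteration yields a policy that is nondecreasing in the state. All numbers and the transition model are illustrative assumptions.

```python
import numpy as np

# Toy 5-state, 2-action MDP (illustrative only; not taken from the paper).
# Reward r(x, u) = x*u - 0.5*u is supermodular: r(x,1) - r(x,0) = x - 0.5
# is increasing in x, so the classical theory predicts a monotone policy.
nX, gamma = 5, 0.9
r = np.array([[0.0, x - 0.5] for x in range(nX)])  # r[x, u]

# Action-independent random-walk transitions (keeps the sketch minimal).
P = np.zeros((nX, nX))
for x in range(nX):
    P[x, max(x - 1, 0)] += 0.5
    P[x, min(x + 1, nX - 1)] += 0.5

# Value iteration for the discounted infinite-horizon problem.
V = np.zeros(nX)
for _ in range(500):
    Q = r + gamma * (P @ V)[:, None]   # Q[x, u]
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)
# The optimal policy is increasing in the state, as the structural result asserts.
assert np.all(np.diff(policy) >= 0)
print(policy)  # -> [0 1 1 1 1]
```

Because the transitions here do not depend on the action, the argmax is driven by the reward alone; the paper's interest is precisely the harder settings where supermodularity of rewards and transitions fails and condition (I) must be invoked instead.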