Learning and Planning for Time-Varying MDPs Using Maximum Likelihood Estimation.

Journal of Machine Learning Research (JMLR). Pub Date: 2021-01-01. Epub Date: 2021-02-01

Melkior Ornik, Ufuk Topcu

Pages: 1-40

PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8739185/pdf/

Citations: 0

Abstract

This paper proposes a formal approach to online learning and planning for agents operating in a priori unknown, time-varying environments. The proposed method computes the maximally likely model of the environment, given the observations about the environment made by an agent earlier in the system run and assuming knowledge of a bound on the maximal rate of change of system dynamics. Such an approach generalizes the estimation method commonly used in learning algorithms for unknown Markov decision processes with time-invariant transition probabilities, but is also able to quickly and correctly identify the system dynamics following a change. Based on the proposed method, we generalize the exploration bonuses used in learning for time-invariant Markov decision processes by introducing a notion of uncertainty in a learned time-varying model, and develop a control policy for time-varying Markov decision processes based on the exploitation and exploration trade-off. We demonstrate the proposed methods on four numerical examples: a patrolling task with a change in system dynamics, a two-state MDP with periodically changing outcomes of actions, a wind flow estimation task, and a multi-armed bandit problem with periodically changing probabilities of different rewards.
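The estimation idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a single state-action pair with a binary outcome, restricts the unknown probability to a uniform grid, and finds the most likely probability sequence whose change per time step is at most a known bound `eps`, using dynamic programming. The names `loglik`, `mle_path`, `eps`, and `grid_size` are hypothetical and chosen only for this illustration.

```python
import math


def loglik(p, outcome):
    """Log-likelihood of one binary outcome under success probability p."""
    if outcome is None:  # assumption: None marks a time step with no observation
        return 0.0
    q = p if outcome == 1 else 1.0 - p
    return math.log(q) if q > 0.0 else float("-inf")


def mle_path(outcomes, eps, grid_size=101):
    """Most likely sequence p_0..p_{T-1} on a uniform grid over [0, 1],
    subject to |p_{t+1} - p_t| <= eps. Illustrative helper only."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    # Number of grid cells the probability may move in one time step.
    reach = int(eps * (grid_size - 1) + 1e-9)

    # score[k]: best log-likelihood of any feasible path ending at grid[k].
    score = [loglik(p, outcomes[0]) for p in grid]
    back = []
    for outcome in outcomes[1:]:
        new_score, pointers = [], []
        for k, p in enumerate(grid):
            lo, hi = max(0, k - reach), min(grid_size - 1, k + reach)
            best = max(range(lo, hi + 1), key=score.__getitem__)
            pointers.append(best)
            new_score.append(score[best] + loglik(p, outcome))
        back.append(pointers)
        score = new_score

    # Backtrack from the best final grid point.
    k = max(range(grid_size), key=score.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [grid[k] for k in path]


if __name__ == "__main__":
    # Dynamics change halfway through the run.
    observations = [0] * 20 + [1] * 20
    estimate = mle_path(observations, eps=0.1)
    print(estimate[0], estimate[-1])
```

With `eps = 0` the probability cannot change, so the sketch collapses to the usual time-invariant estimate: `mle_path([1, 0, 1, 1], 0.0)` returns 0.75, the empirical frequency, at every step. A larger `eps` lets the estimate follow a change in the dynamics, at the cost of trusting old observations less.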
