Reinforcement Learning for Omega-Regular Specifications on Continuous-Time MDP

A. Falah, Shibashis Guha, Ashutosh Trivedi
{"title":"Reinforcement Learning for Omega-Regular Specifications on Continuous-Time MDP","authors":"A. Falah, Shibashis Guha, Ashutosh Trivedi","doi":"10.48550/arXiv.2303.09528","DOIUrl":null,"url":null,"abstract":"Continuous-time Markov decision processes (CTMDPs) are canonical models to express sequential decision-making under dense-time and stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm-of-choice to compute optimal decision sequence. RL, on the other hand, requires the learning objective to be encoded as scalar reward signals. Since doing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or automata formalism) to scalar rewards for discrete-time Markov decision processes. Unfortunately, no automatic translation exists for CTMDPs.\n \n We consider CTMDP environments against the learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics where the goal of the learner is to optimize the long-run expected average time spent in the ''good states'' of the automaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating it on some popular CTMDP benchmarks with omega-regular objectives.","PeriodicalId":239898,"journal":{"name":"International Conference on Automated Planning and Scheduling","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Automated Planning and Scheduling","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2303.09528","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Continuous-time Markov decision processes (CTMDPs) are canonical models for expressing sequential decision-making in dense-time, stochastic environments. When the stochastic evolution of the environment is available only via sampling, model-free reinforcement learning (RL) is the algorithm of choice for computing an optimal decision sequence. RL, however, requires the learning objective to be encoded as scalar reward signals. Since performing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in a logic or automata formalism) into scalar rewards for discrete-time Markov decision processes. Unfortunately, no automatic translation exists for CTMDPs. We consider CTMDP environments with learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in the popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics, where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics, where the goal of the learner is to optimize the long-run expected average time spent in the "good states" of the automaton. We present an approach enabling a correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on some popular CTMDP benchmarks with omega-regular objectives.
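As a quick formalisation of the two semantics described in the abstract (the notation below is ours, not the paper's): writing X_t for the product state of the CTMDP and the specification automaton at time t, Good for the set of good product states, and sigma for a strategy, one natural reading is

```latex
% Satisfaction semantics: maximise the probability of spending a positive
% (limiting) fraction of time in the good states (notation assumed).
\sup_{\sigma}\ \Pr^{\sigma}\!\left[\ \liminf_{T\to\infty}\ \frac{1}{T}\int_{0}^{T}\mathbf{1}_{\mathrm{Good}}(X_t)\,dt \;>\; 0\ \right]

% Expectation semantics: maximise the long-run expected average time
% spent in the good states.
\sup_{\sigma}\ \liminf_{T\to\infty}\ \frac{1}{T}\,\mathbb{E}^{\sigma}\!\left[\int_{0}^{T}\mathbf{1}_{\mathrm{Good}}(X_t)\,dt\right]
```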
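To make the reward-translation idea concrete, here is a minimal, hypothetical Python sketch; all names, the product construction, and the dwell-time reward convention are our assumptions, not the paper's actual construction. It wraps a sampled CTMDP together with a deterministic automaton over its state labels and emits, as the scalar reward, the sojourn time spent while the automaton is in a good state, so that an off-the-shelf average-reward RL algorithm would optimise something like the expectation semantics above.

```python
import random


class ProductRewardEnv:
    """Hypothetical wrapper (not the paper's construction): pairs a sampled
    CTMDP with a deterministic automaton over its state labels and returns,
    as the scalar reward, the sojourn time spent while the automaton sits in
    a 'good' state."""

    def __init__(self, ctmdp_step, automaton_step, label, good, init_state, init_aut):
        self.ctmdp_step = ctmdp_step          # (s, a) -> (s', sojourn_time), sampled
        self.automaton_step = automaton_step  # (q, letter) -> q'
        self.label = label                    # s -> letter read by the automaton
        self.good = good                      # set of good automaton states
        self.state = (init_state, init_aut)   # current product state (s, q)

    def step(self, action):
        s, q = self.state
        q_now = self.automaton_step(q, self.label(s))    # automaton state while dwelling in s
        s_next, sojourn = self.ctmdp_step(s, action)     # sample sojourn time and next state
        reward = sojourn if q_now in self.good else 0.0  # time spent in good states
        self.state = (s_next, q_now)
        return self.state, reward, sojourn


if __name__ == "__main__":
    # Toy two-state CTMDP: action "stay" keeps the process in s0 (labelled "a"),
    # anything else moves it to s1 (labelled "b"); sojourn times are exponential.
    def ctmdp_step(s, a):
        return ("s0" if a == "stay" else "s1"), random.expovariate(2.0)

    env = ProductRewardEnv(
        ctmdp_step,
        automaton_step=lambda q, letter: "q_good" if letter == "a" else "q_bad",
        label=lambda s: "a" if s == "s0" else "b",
        good={"q_good"},
        init_state="s0",
        init_aut="q_good",
    )

    total_reward = total_time = 0.0
    for _ in range(10_000):
        _, r, dt = env.step("stay")
        total_reward += r
        total_time += dt
    # Fraction of simulated time spent in good states; always choosing "stay" gives ~1.0.
    print(f"long-run fraction of time in good states: {total_reward / total_time:.3f}")
```

Running the toy example with the always-"stay" policy prints a fraction of 1.0, since the process then dwells entirely in good product states; a policy that leaves s0 accumulates sojourn time that earns no reward.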