Multi-objective ω-Regular Reinforcement Learning

IF 1.4 · CAS Tier 4 (Computer Science) · JCR Q3 · COMPUTER SCIENCE, SOFTWARE ENGINEERING
Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak
{"title":"多目标ω-常规强化学习","authors":"Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak","doi":"https://dl.acm.org/doi/10.1145/3605950","DOIUrl":null,"url":null,"abstract":"<p>The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) <i>weighted preference</i>, where the decision maker provides scalar weights for various objectives, and (2) <i>lexicographic preference</i>, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both <i>faithful</i> (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and <i>effective</i> (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, <span>Mungojerrie</span>, and we present an experimental evaluation of our technique on benchmark learning problems.</p>","PeriodicalId":50432,"journal":{"name":"Formal Aspects of Computing","volume":"23 1","pages":""},"PeriodicalIF":1.4000,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-objective ω-Regular Reinforcement Learning\",\"authors\":\"Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak\",\"doi\":\"https://dl.acm.org/doi/10.1145/3605950\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) <i>weighted preference</i>, where the decision maker provides scalar weights for various objectives, and (2) <i>lexicographic preference</i>, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both <i>faithful</i> (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and <i>effective</i> (RL quickly converges to optimal strategies). 
We have implemented the translations in a formal reinforcement learning tool, <span>Mungojerrie</span>, and we present an experimental evaluation of our technique on benchmark learning problems.</p>\",\"PeriodicalId\":50432,\"journal\":{\"name\":\"Formal Aspects of Computing\",\"volume\":\"23 1\",\"pages\":\"\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2023-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Formal Aspects of Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/https://dl.acm.org/doi/10.1145/3605950\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Formal Aspects of Computing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/https://dl.acm.org/doi/10.1145/3605950","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.
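The abstract defines the two preference models precisely enough to illustrate how they can rank policies differently. The following sketch is purely illustrative and is not the paper's reward translation: it assumes the probability of satisfying each ω-regular objective is already known for two hypothetical policies (all numbers and weights below are made up), and compares the policies under a weighted and a lexicographic preference.

```python
# Illustrative sketch only (not the paper's construction): compare two policies
# by their per-objective satisfaction probabilities under the two preference
# models described in the abstract. All probabilities and weights are made up.

def weighted_value(sat_probs, weights):
    """Weighted preference: a scalar-weighted sum of satisfaction probabilities."""
    return sum(w * p for w, p in zip(weights, sat_probs))

def lexicographic_key(sat_probs):
    """Lexicographic preference: objectives are compared in priority order, so any
    gain on a higher-priority objective outweighs any gain on lower-priority ones.
    Python's tuple comparison is lexicographic, so the tuple itself serves as the key."""
    return tuple(sat_probs)

# Hypothetical satisfaction probabilities for three ω-regular objectives,
# listed from highest to lowest priority.
policy_a = [0.90, 0.40, 0.70]
policy_b = [0.85, 0.95, 0.95]
weights  = [0.5, 0.3, 0.2]

better_weighted = "A" if weighted_value(policy_a, weights) > weighted_value(policy_b, weights) else "B"
better_lex = "A" if lexicographic_key(policy_a) > lexicographic_key(policy_b) else "B"
print("Preferred under weighted preference:     ", better_weighted)
print("Preferred under lexicographic preference:", better_lex)
```

In this made-up instance the weighted preference favours policy B (its weighted sum is 0.90 versus 0.71 for A), while the lexicographic preference favours policy A because A does better on the highest-priority objective; this is the sense in which "any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one."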

Source journal
Formal Aspects of Computing (Engineering & Technology — Computer Science: Software Engineering)
CiteScore: 3.30
Self-citation rate: 0.00%
Articles published: 17
Review time: >12 weeks
Journal description: This journal aims to publish contributions at the junction of theory and practice. The objective is to disseminate applicable research. Thus new theoretical contributions are welcome where they are motivated by potential application; applications of existing formalisms are of interest if they show something novel about the approach or application. In particular, the scope of Formal Aspects of Computing includes: well-founded notations for the description of systems; verifiable design methods; elucidation of fundamental computational concepts; approaches to fault-tolerant design; theorem-proving support; state-exploration tools; formal underpinning of widely used notations and methods; formal approaches to requirements analysis.