Multi-objective ω-Regular Reinforcement Learning

IF 1.4 · CAS Tier 4 (Computer Science) · JCR Q3 · COMPUTER SCIENCE, SOFTWARE ENGINEERING
Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak
{"title":"多目标ω-常规强化学习","authors":"Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak","doi":"https://dl.acm.org/doi/10.1145/3605950","DOIUrl":null,"url":null,"abstract":"<p>The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) <i>weighted preference</i>, where the decision maker provides scalar weights for various objectives, and (2) <i>lexicographic preference</i>, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both <i>faithful</i> (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and <i>effective</i> (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, <span>Mungojerrie</span>, and we present an experimental evaluation of our technique on benchmark learning problems.</p>","PeriodicalId":50432,"journal":{"name":"Formal Aspects of Computing","volume":"23 1","pages":""},"PeriodicalIF":1.4000,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-objective ω-Regular Reinforcement Learning\",\"authors\":\"Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak\",\"doi\":\"https://dl.acm.org/doi/10.1145/3605950\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) <i>weighted preference</i>, where the decision maker provides scalar weights for various objectives, and (2) <i>lexicographic preference</i>, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both <i>faithful</i> (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and <i>effective</i> (RL quickly converges to optimal strategies). 
We have implemented the translations in a formal reinforcement learning tool, <span>Mungojerrie</span>, and we present an experimental evaluation of our technique on benchmark learning problems.</p>\",\"PeriodicalId\":50432,\"journal\":{\"name\":\"Formal Aspects of Computing\",\"volume\":\"23 1\",\"pages\":\"\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2023-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Formal Aspects of Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/https://dl.acm.org/doi/10.1145/3605950\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Formal Aspects of Computing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/https://dl.acm.org/doi/10.1145/3605950","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.
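The abstract defines the two preference models precisely enough to illustrate how they can rank policies differently. The following sketch is purely illustrative and is not the paper's reward translation: it assumes the probability of satisfying each ω-regular objective is already known for two hypothetical policies (all numbers and weights below are made up), and compares the policies under a weighted and a lexicographic preference.

```python
# Illustrative sketch only (not the paper's construction): compare two policies
# by their per-objective satisfaction probabilities under the two preference
# models described in the abstract. All probabilities and weights are made up.

def weighted_value(sat_probs, weights):
    """Weighted preference: a scalar-weighted sum of satisfaction probabilities."""
    return sum(w * p for w, p in zip(weights, sat_probs))

def lexicographic_key(sat_probs):
    """Lexicographic preference: objectives are compared in priority order, so any
    gain on a higher-priority objective outweighs any gain on lower-priority ones.
    Python's tuple comparison is lexicographic, so the tuple itself serves as the key."""
    return tuple(sat_probs)

# Hypothetical satisfaction probabilities for three ω-regular objectives,
# listed from highest to lowest priority.
policy_a = [0.90, 0.40, 0.70]
policy_b = [0.85, 0.95, 0.95]
weights  = [0.5, 0.3, 0.2]

better_weighted = "A" if weighted_value(policy_a, weights) > weighted_value(policy_b, weights) else "B"
better_lex = "A" if lexicographic_key(policy_a) > lexicographic_key(policy_b) else "B"
print("Preferred under weighted preference:     ", better_weighted)
print("Preferred under lexicographic preference:", better_lex)
```

In this made-up instance the weighted preference favours policy B (its weighted sum is 0.90 versus 0.71 for A), while the lexicographic preference favours policy A because A does better on the highest-priority objective; this is the sense in which "any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one."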

Source journal
Formal Aspects of Computing (Engineering & Technology — Computer Science: Software Engineering)
CiteScore: 3.30
Self-citation rate: 0.00%
Articles published: 17
Review time: >12 weeks
Journal description: This journal aims to publish contributions at the junction of theory and practice. The objective is to disseminate applicable research. Thus new theoretical contributions are welcome where they are motivated by potential application; applications of existing formalisms are of interest if they show something novel about the approach or application. In particular, the scope of Formal Aspects of Computing includes: well-founded notations for the description of systems; verifiable design methods; elucidation of fundamental computational concepts; approaches to fault-tolerant design; theorem-proving support; state-exploration tools; formal underpinning of widely used notations and methods; formal approaches to requirements analysis.