Non-maximizing policies that fulfill multi-criterion aspirations in expectation

Simon Dima, Simon Fischer, Jobst Heitzig, Joss Oliver
{"title":"Non-maximizing policies that fulfill multi-criterion aspirations in expectation","authors":"Simon Dima, Simon Fischer, Jobst Heitzig, Joss Oliver","doi":"arxiv-2408.04385","DOIUrl":null,"url":null,"abstract":"In dynamic programming and reinforcement learning, the policy for the\nsequential decision making of an agent in a stochastic environment is usually\ndetermined by expressing the goal as a scalar reward function and seeking a\npolicy that maximizes the expected total reward. However, many goals that\nhumans care about naturally concern multiple aspects of the world, and it may\nnot be obvious how to condense those into a single reward function.\nFurthermore, maximization suffers from specification gaming, where the obtained\npolicy achieves a high expected total reward in an unintended way, often taking\nextreme or nonsensical actions. Here we consider finite acyclic Markov Decision Processes with multiple\ndistinct evaluation metrics, which do not necessarily represent quantities that\nthe user wants to be maximized. We assume the task of the agent is to ensure\nthat the vector of expected totals of the evaluation metrics falls into some\ngiven convex set, called the aspiration set. Our algorithm guarantees that this\ntask is fulfilled by using simplices to approximate feasibility sets and\npropagate aspirations forward while ensuring they remain feasible. It has\ncomplexity linear in the number of possible state-action-successor triples and\npolynomial in the number of evaluation metrics. Moreover, the explicitly\nnon-maximizing nature of the chosen policy and goals yields additional degrees\nof freedom, which can be used to apply heuristic safety criteria to the choice\nof actions. 
We discuss several such safety criteria that aim to steer the agent\ntowards more conservative behavior.","PeriodicalId":501188,"journal":{"name":"arXiv - ECON - Theoretical Economics","volume":"370 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - ECON - Theoretical Economics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.04385","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

In dynamic programming and reinforcement learning, the policy for the sequential decision making of an agent in a stochastic environment is usually determined by expressing the goal as a scalar reward function and seeking a policy that maximizes the expected total reward. However, many goals that humans care about naturally concern multiple aspects of the world, and it may not be obvious how to condense those into a single reward function. Furthermore, maximization suffers from specification gaming, where the obtained policy achieves a high expected total reward in an unintended way, often taking extreme or nonsensical actions. Here we consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants to be maximized. We assume the task of the agent is to ensure that the vector of expected totals of the evaluation metrics falls into some given convex set, called the aspiration set. Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible. It has complexity linear in the number of possible state-action-successor triples and polynomial in the number of evaluation metrics. Moreover, the explicitly non-maximizing nature of the chosen policy and goals yields additional degrees of freedom, which can be used to apply heuristic safety criteria to the choice of actions. We discuss several such safety criteria that aim to steer the agent towards more conservative behavior.
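The core idea of fulfilling an aspiration "in expectation" can be illustrated in a deliberately stripped-down setting. The sketch below is not the paper's algorithm (which propagates simplex approximations of feasibility sets through a multi-step MDP); it is a hypothetical one-step analogue with two evaluation metrics and a point aspiration, where the agent randomizes between two actions so that the expected metric vector lands exactly on the aspiration. The function name and setup are illustrative assumptions, not taken from the paper.

```python
def mix_for_aspiration(v_a, v_b, aspiration):
    """Return the probability p of taking action a (1 - p for action b)
    so that p*v_a + (1-p)*v_b equals `aspiration` componentwise.

    v_a, v_b: expected metric vectors of the two actions.
    aspiration: target expected metric vector; must lie on the line
    segment between v_a and v_b for a solution to exist.
    """
    # Solve p*v_a + (1-p)*v_b = aspiration using the first coordinate
    # where the two actions' expectations differ.
    for va, vb, t in zip(v_a, v_b, aspiration):
        if va != vb:
            p = (t - vb) / (va - vb)
            break
    else:
        raise ValueError("actions have identical expected metrics")
    if not (0.0 <= p <= 1.0):
        raise ValueError("aspiration does not lie between the two actions")
    # Check that the same p satisfies every metric, i.e. the aspiration
    # really is on the segment (otherwise no single mixture works).
    for va, vb, t in zip(v_a, v_b, aspiration):
        if abs(p * va + (1 - p) * vb - t) > 1e-9:
            raise ValueError("aspiration is off the segment between actions")
    return p

# Action a yields expected metrics (4, 0); action b yields (0, 2).
# The aspiration (2, 1) is their midpoint, so an even mixture works.
p = mix_for_aspiration((4, 0), (0, 2), (2, 1))
print(p)  # 0.5
```

Note that the returned policy is explicitly non-maximizing: any aspiration point inside the feasible segment is acceptable, and the leftover freedom in choosing which actions to mix is what the paper proposes to spend on heuristic safety criteria.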