Non-maximizing policies that fulfill multi-criterion aspirations in expectation
Simon Dima, Simon Fischer, Jobst Heitzig, Joss Oliver
Journal: arXiv - ECON - Theoretical Economics
Publication date: 2024-08-08
DOI: https://doi.org/arxiv-2408.04385
Citations: 0
Abstract
In dynamic programming and reinforcement learning, the policy for the sequential decision making of an agent in a stochastic environment is usually determined by expressing the goal as a scalar reward function and seeking a policy that maximizes the expected total reward. However, many goals that humans care about naturally concern multiple aspects of the world, and it may not be obvious how to condense those into a single reward function. Furthermore, maximization suffers from specification gaming, where the obtained policy achieves a high expected total reward in an unintended way, often taking extreme or nonsensical actions.

Here we consider finite acyclic Markov Decision Processes with multiple distinct evaluation metrics, which do not necessarily represent quantities that the user wants to be maximized. We assume the task of the agent is to ensure that the vector of expected totals of the evaluation metrics falls into some given convex set, called the aspiration set. Our algorithm guarantees that this task is fulfilled by using simplices to approximate feasibility sets and propagate aspirations forward while ensuring they remain feasible. It has complexity linear in the number of possible state-action-successor triples and polynomial in the number of evaluation metrics. Moreover, the explicitly non-maximizing nature of the chosen policy and goals yields additional degrees of freedom, which can be used to apply heuristic safety criteria to the choice of actions. We discuss several such safety criteria that aim to steer the agent towards more conservative behavior.
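The simplex-based bookkeeping described in the abstract can be illustrated with a minimal sketch. Assuming two evaluation metrics (so feasibility sets are approximated by triangles, i.e. 2-simplices) and using hypothetical helper names (`barycentric`, `is_feasible`, `propagate`) that are not taken from the paper, the following shows how an aspiration point can be checked for feasibility via barycentric coordinates, and how it can be split into successor aspirations whose probability-weighted average reproduces the current aspiration in expectation:

```python
# Illustrative sketch only: 2 evaluation metrics, feasibility sets
# approximated by triangles (2-simplices). Names are hypothetical,
# not from the paper's algorithm.

def barycentric(v0, v1, v2, p):
    """Barycentric coordinates of point p w.r.t. triangle (v0, v1, v2)."""
    det = (v1[1] - v2[1]) * (v0[0] - v2[0]) + (v2[0] - v1[0]) * (v0[1] - v2[1])
    a = ((v1[1] - v2[1]) * (p[0] - v2[0]) + (v2[0] - v1[0]) * (p[1] - v2[1])) / det
    b = ((v2[1] - v0[1]) * (p[0] - v2[0]) + (v0[0] - v2[0]) * (p[1] - v2[1])) / det
    return a, b, 1.0 - a - b

def is_feasible(simplex, p, eps=1e-9):
    """p lies inside the simplex iff all barycentric coordinates are in [0, 1]."""
    return all(-eps <= w <= 1.0 + eps for w in barycentric(*simplex, p))

def propagate(aspiration, probs, successor_aspirations, eps=1e-9):
    """Check that successor aspirations average back to the current one."""
    d = len(aspiration)
    avg = tuple(sum(q * a[i] for q, a in zip(probs, successor_aspirations))
                for i in range(d))
    return all(abs(avg[i] - aspiration[i]) < eps for i in range(d))

# Feasibility set of the current state, approximated by a triangle:
simplex = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
aspiration = (1.0, 1.0)
print(is_feasible(simplex, aspiration))  # True: (1, 1) lies inside the triangle

# An action leads to two successors with probability 1/2 each; the chosen
# successor aspirations (0, 0) and (2, 2) average to the current aspiration,
# so the expectation constraint is preserved one step forward.
print(propagate(aspiration, (0.5, 0.5), ((0.0, 0.0), (2.0, 2.0))))  # True
```

In the general setting of the paper, the same idea runs in d dimensions with (d+1)-vertex simplices, which is what yields the stated complexity polynomial in the number of evaluation metrics; the slack in choosing the successor aspirations is the degree of freedom the safety criteria exploit.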