Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization

L. N. Alegre, A. Bazzan, Diederik M. Roijers, Ann Nowé, Bruno C. da Silva
{"title":"基于广义策略改进优先级的样本高效多目标学习","authors":"L. N. Alegre, A. Bazzan, Diederik M. Roijers, Ann Now'e, Bruno C. da Silva","doi":"10.48550/arXiv.2301.07784","DOIUrl":null,"url":null,"abstract":"Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $\\epsilon$-optimal solution (for a bounded $\\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state and action spaces.","PeriodicalId":326727,"journal":{"name":"Adaptive Agents and Multi-Agent Systems","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization\",\"authors\":\"L. N. Alegre, A. Bazzan, Diederik M. Roijers, Ann Now'e, Bruno C. da Silva\",\"doi\":\"10.48550/arXiv.2301.07784\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $\\\\epsilon$-optimal solution (for a bounded $\\\\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. 
We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state and action spaces.\",\"PeriodicalId\":326727,\"journal\":{\"name\":\"Adaptive Agents and Multi-Agent Systems\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Adaptive Agents and Multi-Agent Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2301.07784\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Adaptive Agents and Multi-Agent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2301.07784","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $\epsilon$-optimal solution (for a bounded $\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state and action spaces.
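For context, GPI evaluates a set of previously learned policies under a new preference vector w and acts greedily with respect to the best of them; the prioritization schemes in the abstract are derived from the improvement gap this induces. The sketch below is a minimal tabular illustration under assumed data structures: the names `q_sets`, `train_ws`, and the `gpi_priority` heuristic are all hypothetical simplifications, not the authors' formally derived algorithm.

```python
import numpy as np

def gpi_action(q_sets, w, state):
    """Generalized Policy Improvement (GPI): act greedily w.r.t. the best
    scalarized Q-value across all previously learned policies, i.e.
    argmax_a max_i  w . Q^i(state, a)."""
    # q_sets: list of arrays with shape (n_states, n_actions, n_objectives)
    scalarized = np.stack([q[state] @ w for q in q_sets])  # (n_policies, n_actions)
    return int(scalarized.max(axis=0).argmax())

def gpi_priority(q_sets, train_ws, w, states):
    """Hypothetical priority score for a candidate preference w: the average
    gap between the GPI policy's value and the best single base policy's
    value. A large gap suggests w is a promising preference to train on."""
    gains = []
    for s in states:
        # Value each base policy attains at s under w, acting greedily
        # w.r.t. the weight w_i it was originally trained on.
        base_vals = [q[s][int((q[s] @ wi).argmax())] @ w
                     for q, wi in zip(q_sets, train_ws)]
        # GPI value: best scalarized Q over all policies and actions.
        gpi_val = max(float((q[s] @ w).max()) for q in q_sets)
        gains.append(gpi_val - max(base_vals))
    return float(np.mean(gains))

# Toy usage: 2 base policies, 3 states, 2 actions, 2 objectives.
rng = np.random.default_rng(0)
q_sets = [rng.random((3, 2, 2)) for _ in range(2)]
train_ws = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w_new = np.array([0.5, 0.5])
print(gpi_action(q_sets, w_new, state=0))
print(gpi_priority(q_sets, train_ws, w_new, states=[0, 1, 2]))
```

Note that the GPI value can never fall below the best base policy's value, so the score above is non-negative; preferences where it is largest are those for which the current policy set leaves the most room for improvement.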