{"title":"基于偏好的多目标强化学习","authors":"Ni Mu;Yao Luan;Qing-Shan Jia","doi":"10.1109/TASE.2025.3589271","DOIUrl":null,"url":null,"abstract":"Multi-objective reinforcement learning (MORL) is a structured approach for optimizing tasks with multiple objectives. However, it often relies on pre-defined reward functions, which can be hard to design for balancing conflicting goals and may lead to oversimplification. PReferences can serve as more flexible and intuitive decision-making guidance, eliminating the need for complicated reward design. This paper introduces preference-based MORL (Pb-MORL), which formalizes the integration of preferences into the MORL framework. We theoretically prove that preferences can derive policies across the entire Pareto frontier. To guide policy optimization using preferences, our method constructs a multi-objective reward model that aligns with the given preferences. We further provide theoretical proof to show that optimizing this reward model is equivalent to training the Pareto optimal policy. Extensive experiments in benchmark multi-objective tasks, a multi-energy management task, and an autonomous driving task on a multi-line highway show that our method performs competitively, surpassing the oracle method, which uses the ground truth reward function. This highlights its potential for practical applications in complex real-world systems. Note to Practitioners—Decision-making problems with multiple conflicting objectives are common in real-world applications, e.g., energy management must balance system lifespan, charge-discharge cycles, and energy procurement costs; autonomous driving vehicles must balance safety, speed, and passenger comfort. While multi-objective reinforcement learning (MORL) is an effective framework for these problems, its dependence on pre-defined reward functions can limit its application in complex situations, as designing a reward function often fails to capture the full complexity of the task fully. This paper introduces preference-based MORL (Pb-MORL), which utilizes user preference data to optimize policies, thereby eliminating the complexity of reward design. Specifically, we construct a multi-objective reward model that aligns with user preferences and demonstrate that optimizing this model can derive Pareto optimal solutions. Pb-MORL is effective, easy to deploy, and is expected to be applied in complex systems, e.g., multi-energy management through preference feedback and adaptive autonomous driving policies for diverse situations.","PeriodicalId":51060,"journal":{"name":"IEEE Transactions on Automation Science and Engineering","volume":"22 ","pages":"18737-18749"},"PeriodicalIF":6.4000,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Preference-Based Multi-Objective Reinforcement Learning\",\"authors\":\"Ni Mu;Yao Luan;Qing-Shan Jia\",\"doi\":\"10.1109/TASE.2025.3589271\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multi-objective reinforcement learning (MORL) is a structured approach for optimizing tasks with multiple objectives. However, it often relies on pre-defined reward functions, which can be hard to design for balancing conflicting goals and may lead to oversimplification. PReferences can serve as more flexible and intuitive decision-making guidance, eliminating the need for complicated reward design. 
This paper introduces preference-based MORL (Pb-MORL), which formalizes the integration of preferences into the MORL framework. We theoretically prove that preferences can derive policies across the entire Pareto frontier. To guide policy optimization using preferences, our method constructs a multi-objective reward model that aligns with the given preferences. We further provide theoretical proof to show that optimizing this reward model is equivalent to training the Pareto optimal policy. Extensive experiments in benchmark multi-objective tasks, a multi-energy management task, and an autonomous driving task on a multi-line highway show that our method performs competitively, surpassing the oracle method, which uses the ground truth reward function. This highlights its potential for practical applications in complex real-world systems. Note to Practitioners—Decision-making problems with multiple conflicting objectives are common in real-world applications, e.g., energy management must balance system lifespan, charge-discharge cycles, and energy procurement costs; autonomous driving vehicles must balance safety, speed, and passenger comfort. While multi-objective reinforcement learning (MORL) is an effective framework for these problems, its dependence on pre-defined reward functions can limit its application in complex situations, as designing a reward function often fails to capture the full complexity of the task fully. This paper introduces preference-based MORL (Pb-MORL), which utilizes user preference data to optimize policies, thereby eliminating the complexity of reward design. Specifically, we construct a multi-objective reward model that aligns with user preferences and demonstrate that optimizing this model can derive Pareto optimal solutions. Pb-MORL is effective, easy to deploy, and is expected to be applied in complex systems, e.g., multi-energy management through preference feedback and adaptive autonomous driving policies for diverse situations.\",\"PeriodicalId\":51060,\"journal\":{\"name\":\"IEEE Transactions on Automation Science and Engineering\",\"volume\":\"22 \",\"pages\":\"18737-18749\"},\"PeriodicalIF\":6.4000,\"publicationDate\":\"2025-07-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Automation Science and Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11080487/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Automation Science and Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11080487/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Multi-objective reinforcement learning (MORL) is a structured approach for optimizing tasks with multiple objectives. However, it often relies on pre-defined reward functions, which can be difficult to design when balancing conflicting goals and may lead to oversimplification. Preferences can serve as more flexible and intuitive decision-making guidance, eliminating the need for complicated reward design. This paper introduces preference-based MORL (Pb-MORL), which formalizes the integration of preferences into the MORL framework. We theoretically prove that preferences can be used to derive policies across the entire Pareto frontier. To guide policy optimization using preferences, our method constructs a multi-objective reward model that aligns with the given preferences. We further prove that optimizing this reward model is equivalent to training the Pareto-optimal policy. Extensive experiments on benchmark multi-objective tasks, a multi-energy management task, and an autonomous driving task on a multi-lane highway show that our method performs competitively, surpassing the oracle method that uses the ground-truth reward function. This highlights its potential for practical application in complex real-world systems.

Note to Practitioners: Decision-making problems with multiple conflicting objectives are common in real-world applications; for example, energy management must balance system lifespan, charge-discharge cycles, and energy procurement costs, while autonomous vehicles must balance safety, speed, and passenger comfort. Although multi-objective reinforcement learning (MORL) is an effective framework for these problems, its dependence on pre-defined reward functions can limit its application in complex situations, as a hand-designed reward function often fails to capture the full complexity of the task. This paper introduces preference-based MORL (Pb-MORL), which leverages user preference data to optimize policies, thereby eliminating the complexity of reward design. Specifically, we construct a multi-objective reward model that aligns with user preferences and show that optimizing this model yields Pareto-optimal solutions. Pb-MORL is effective, easy to deploy, and expected to find application in complex systems, e.g., multi-energy management guided by preference feedback and autonomous driving policies that adapt to diverse situations.
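The abstract describes learning a multi-objective (vector-valued) reward model from preference feedback and then optimizing a policy against it. The paper's own algorithm and proofs are not reproduced here; the sketch below shows one common way such a model could be fit, assuming each preference over two trajectory segments is elicited under a known objective-weight vector and modeled with a Bradley-Terry likelihood. The names `VectorRewardModel` and `preference_loss`, the network sizes, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of fitting a preference-aligned multi-objective reward model.
# Assumptions (not from the paper): preferences are given under a known weight
# vector over objectives, and are modeled with a Bradley-Terry likelihood over
# scalarized segment returns.
import torch
import torch.nn as nn


class VectorRewardModel(nn.Module):
    """Maps (state, action) pairs to a k-dimensional reward vector."""

    def __init__(self, obs_dim: int, act_dim: int, num_objectives: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_objectives),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))  # shape (..., k)


def preference_loss(model: VectorRewardModel,
                    seg_a: tuple[torch.Tensor, torch.Tensor],
                    seg_b: tuple[torch.Tensor, torch.Tensor],
                    weights: torch.Tensor,
                    label: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss; label = 1.0 if segment A is preferred under `weights`."""
    # Sum predicted vector rewards over each segment, then scalarize with weights.
    ret_a = (model(*seg_a).sum(dim=0) * weights).sum()
    ret_b = (model(*seg_b).sum(dim=0) * weights).sum()
    logit = ret_a - ret_b
    return nn.functional.binary_cross_entropy_with_logits(logit, label)


if __name__ == "__main__":
    obs_dim, act_dim, k, T = 4, 2, 3, 10
    model = VectorRewardModel(obs_dim, act_dim, k)
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)

    # One synthetic preference query: two length-T segments plus a label.
    seg_a = (torch.randn(T, obs_dim), torch.randn(T, act_dim))
    seg_b = (torch.randn(T, obs_dim), torch.randn(T, act_dim))
    w = torch.tensor([0.5, 0.3, 0.2])   # hypothetical objective weights
    label = torch.tensor(1.0)           # "A preferred over B"

    loss = preference_loss(model, seg_a, seg_b, w, label)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"preference loss: {loss.item():.4f}")
```

In a full pipeline, the learned vector reward would then be scalarized or conditioned on different weight vectors so that standard RL training can recover policies along the Pareto frontier, in the spirit of the abstract's claim; the details of how Pb-MORL does this are in the paper itself.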
Journal introduction:
The IEEE Transactions on Automation Science and Engineering (T-ASE) publishes fundamental papers on Automation, emphasizing scientific results that advance efficiency, quality, productivity, and reliability. T-ASE encourages interdisciplinary approaches from computer science, control systems, electrical engineering, mathematics, mechanical engineering, operations research, and other fields. T-ASE welcomes results relevant to industries such as agriculture, biotechnology, healthcare, home automation, maintenance, manufacturing, pharmaceuticals, retail, security, service, supply chains, and transportation. T-ASE addresses a research community willing to integrate knowledge across disciplines and industries. For this purpose, each paper includes a Note to Practitioners that summarizes how its results can be applied or how they might be extended to apply in practice.