{"title":"基于偏好的多目标强化学习","authors":"Ni Mu;Yao Luan;Qing-Shan Jia","doi":"10.1109/TASE.2025.3589271","DOIUrl":null,"url":null,"abstract":"Multi-objective reinforcement learning (MORL) is a structured approach for optimizing tasks with multiple objectives. However, it often relies on pre-defined reward functions, which can be hard to design for balancing conflicting goals and may lead to oversimplification. PReferences can serve as more flexible and intuitive decision-making guidance, eliminating the need for complicated reward design. This paper introduces preference-based MORL (Pb-MORL), which formalizes the integration of preferences into the MORL framework. We theoretically prove that preferences can derive policies across the entire Pareto frontier. To guide policy optimization using preferences, our method constructs a multi-objective reward model that aligns with the given preferences. We further provide theoretical proof to show that optimizing this reward model is equivalent to training the Pareto optimal policy. Extensive experiments in benchmark multi-objective tasks, a multi-energy management task, and an autonomous driving task on a multi-line highway show that our method performs competitively, surpassing the oracle method, which uses the ground truth reward function. This highlights its potential for practical applications in complex real-world systems. Note to Practitioners—Decision-making problems with multiple conflicting objectives are common in real-world applications, e.g., energy management must balance system lifespan, charge-discharge cycles, and energy procurement costs; autonomous driving vehicles must balance safety, speed, and passenger comfort. While multi-objective reinforcement learning (MORL) is an effective framework for these problems, its dependence on pre-defined reward functions can limit its application in complex situations, as designing a reward function often fails to capture the full complexity of the task fully. This paper introduces preference-based MORL (Pb-MORL), which utilizes user preference data to optimize policies, thereby eliminating the complexity of reward design. Specifically, we construct a multi-objective reward model that aligns with user preferences and demonstrate that optimizing this model can derive Pareto optimal solutions. Pb-MORL is effective, easy to deploy, and is expected to be applied in complex systems, e.g., multi-energy management through preference feedback and adaptive autonomous driving policies for diverse situations.","PeriodicalId":51060,"journal":{"name":"IEEE Transactions on Automation Science and Engineering","volume":"22 ","pages":"18737-18749"},"PeriodicalIF":6.4000,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Preference-Based Multi-Objective Reinforcement Learning\",\"authors\":\"Ni Mu;Yao Luan;Qing-Shan Jia\",\"doi\":\"10.1109/TASE.2025.3589271\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multi-objective reinforcement learning (MORL) is a structured approach for optimizing tasks with multiple objectives. However, it often relies on pre-defined reward functions, which can be hard to design for balancing conflicting goals and may lead to oversimplification. PReferences can serve as more flexible and intuitive decision-making guidance, eliminating the need for complicated reward design. 
This paper introduces preference-based MORL (Pb-MORL), which formalizes the integration of preferences into the MORL framework. We theoretically prove that preferences can derive policies across the entire Pareto frontier. To guide policy optimization using preferences, our method constructs a multi-objective reward model that aligns with the given preferences. We further provide theoretical proof to show that optimizing this reward model is equivalent to training the Pareto optimal policy. Extensive experiments in benchmark multi-objective tasks, a multi-energy management task, and an autonomous driving task on a multi-line highway show that our method performs competitively, surpassing the oracle method, which uses the ground truth reward function. This highlights its potential for practical applications in complex real-world systems. Note to Practitioners—Decision-making problems with multiple conflicting objectives are common in real-world applications, e.g., energy management must balance system lifespan, charge-discharge cycles, and energy procurement costs; autonomous driving vehicles must balance safety, speed, and passenger comfort. While multi-objective reinforcement learning (MORL) is an effective framework for these problems, its dependence on pre-defined reward functions can limit its application in complex situations, as designing a reward function often fails to capture the full complexity of the task fully. This paper introduces preference-based MORL (Pb-MORL), which utilizes user preference data to optimize policies, thereby eliminating the complexity of reward design. Specifically, we construct a multi-objective reward model that aligns with user preferences and demonstrate that optimizing this model can derive Pareto optimal solutions. Pb-MORL is effective, easy to deploy, and is expected to be applied in complex systems, e.g., multi-energy management through preference feedback and adaptive autonomous driving policies for diverse situations.\",\"PeriodicalId\":51060,\"journal\":{\"name\":\"IEEE Transactions on Automation Science and Engineering\",\"volume\":\"22 \",\"pages\":\"18737-18749\"},\"PeriodicalIF\":6.4000,\"publicationDate\":\"2025-07-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Automation Science and Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11080487/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Automation Science and Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11080487/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Multi-objective reinforcement learning (MORL) is a structured approach for optimizing tasks with multiple objectives. However, it often relies on pre-defined reward functions, which can be difficult to design when balancing conflicting goals and may lead to oversimplification. Preferences can serve as more flexible and intuitive decision-making guidance, eliminating the need for complicated reward design. This paper introduces preference-based MORL (Pb-MORL), which formalizes the integration of preferences into the MORL framework. We theoretically prove that preferences can be used to derive policies across the entire Pareto frontier. To guide policy optimization using preferences, our method constructs a multi-objective reward model that aligns with the given preferences. We further prove that optimizing this reward model is equivalent to training the Pareto-optimal policy. Extensive experiments on benchmark multi-objective tasks, a multi-energy management task, and an autonomous driving task on a multi-lane highway show that our method performs competitively, surpassing the oracle method that uses the ground-truth reward function. This highlights its potential for practical application in complex real-world systems.

Note to Practitioners: Decision-making problems with multiple conflicting objectives are common in real-world applications; for example, energy management must balance system lifespan, charge-discharge cycles, and energy procurement costs, while autonomous vehicles must balance safety, speed, and passenger comfort. Although multi-objective reinforcement learning (MORL) is an effective framework for these problems, its dependence on pre-defined reward functions can limit its application in complex situations, as a hand-designed reward function often fails to capture the full complexity of the task. This paper introduces preference-based MORL (Pb-MORL), which leverages user preference data to optimize policies, thereby eliminating the complexity of reward design. Specifically, we construct a multi-objective reward model that aligns with user preferences and show that optimizing this model yields Pareto-optimal solutions. Pb-MORL is effective, easy to deploy, and expected to find application in complex systems, e.g., multi-energy management guided by preference feedback and autonomous driving policies that adapt to diverse situations.
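The abstract describes learning a multi-objective (vector-valued) reward model from preference feedback and then optimizing a policy against it. The paper's own algorithm and proofs are not reproduced here; the sketch below shows one common way such a model could be fit, assuming each preference over two trajectory segments is elicited under a known objective-weight vector and modeled with a Bradley-Terry likelihood. The names `VectorRewardModel` and `preference_loss`, the network sizes, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of fitting a preference-aligned multi-objective reward model.
# Assumptions (not from the paper): preferences are given under a known weight
# vector over objectives, and are modeled with a Bradley-Terry likelihood over
# scalarized segment returns.
import torch
import torch.nn as nn


class VectorRewardModel(nn.Module):
    """Maps (state, action) pairs to a k-dimensional reward vector."""

    def __init__(self, obs_dim: int, act_dim: int, num_objectives: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_objectives),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))  # shape (..., k)


def preference_loss(model: VectorRewardModel,
                    seg_a: tuple[torch.Tensor, torch.Tensor],
                    seg_b: tuple[torch.Tensor, torch.Tensor],
                    weights: torch.Tensor,
                    label: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss; label = 1.0 if segment A is preferred under `weights`."""
    # Sum predicted vector rewards over each segment, then scalarize with weights.
    ret_a = (model(*seg_a).sum(dim=0) * weights).sum()
    ret_b = (model(*seg_b).sum(dim=0) * weights).sum()
    logit = ret_a - ret_b
    return nn.functional.binary_cross_entropy_with_logits(logit, label)


if __name__ == "__main__":
    obs_dim, act_dim, k, T = 4, 2, 3, 10
    model = VectorRewardModel(obs_dim, act_dim, k)
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)

    # One synthetic preference query: two length-T segments plus a label.
    seg_a = (torch.randn(T, obs_dim), torch.randn(T, act_dim))
    seg_b = (torch.randn(T, obs_dim), torch.randn(T, act_dim))
    w = torch.tensor([0.5, 0.3, 0.2])   # hypothetical objective weights
    label = torch.tensor(1.0)           # "A preferred over B"

    loss = preference_loss(model, seg_a, seg_b, w, label)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"preference loss: {loss.item():.4f}")
```

In a full pipeline, the learned vector reward would then be scalarized or conditioned on different weight vectors so that standard RL training can recover policies along the Pareto frontier, in the spirit of the abstract's claim; the details of how Pb-MORL does this are in the paper itself.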
Journal introduction:
The IEEE Transactions on Automation Science and Engineering (T-ASE) publishes fundamental papers on Automation, emphasizing scientific results that advance efficiency, quality, productivity, and reliability. T-ASE encourages interdisciplinary approaches from computer science, control systems, electrical engineering, mathematics, mechanical engineering, operations research, and other fields. T-ASE welcomes results relevant to industries such as agriculture, biotechnology, healthcare, home automation, maintenance, manufacturing, pharmaceuticals, retail, security, service, supply chains, and transportation. T-ASE addresses a research community willing to integrate knowledge across disciplines and industries. For this purpose, each paper includes a Note to Practitioners that summarizes how its results can be applied or how they might be extended to apply in practice.