Yudan Wang;Peiyao Xiao;Hao Ban;Kaiyi Ji;Shaofeng Zou
{"title":"回避冲突多目标强化学习的理论研究","authors":"Yudan Wang;Peiyao Xiao;Hao Ban;Kaiyi Ji;Shaofeng Zou","doi":"10.1109/TIT.2025.3581454","DOIUrl":null,"url":null,"abstract":"Multi-objective reinforcement learning (MORL) has shown great promise in many real-world applications. Existing MORL algorithms often aim to learn a policy that optimizes individual objective functions simultaneously with a given prior preference (or weights) on different objectives. However, these methods often suffer from the issue of <italic>gradient conflict</i> such that the objectives with larger gradients dominate the update direction, resulting in a performance degeneration on other objectives. In this paper, we develop a novel dynamic weighting multi-objective actor-critic algorithm (MOAC) under two options of sub-procedures named as conflict-avoidant (CA) and faster convergence (FC) in objective weight updates. MOAC-CA aims to find a CA update direction that maximizes the minimum value improvement among objectives, and MOAC-FC targets at a much faster convergence rate. We provide a comprehensive finite-time convergence analysis for both algorithms. We show that MOAC-CA can find a <inline-formula> <tex-math>$\\epsilon +\\epsilon _{\\text {app}}$ </tex-math></inline-formula>-accurate Pareto stationary policy using <inline-formula> <tex-math>$\\mathcal {O}({\\epsilon ^{-5}})$ </tex-math></inline-formula> samples, while ensuring a small <inline-formula> <tex-math>$\\epsilon +\\sqrt {\\epsilon _{\\text {app}}}$ </tex-math></inline-formula>-level CA distance (defined as the distance to the CA direction), where <inline-formula> <tex-math>$\\epsilon _{\\text {app}}$ </tex-math></inline-formula> is the function approximation error. The analysis also shows that MOAC-FC improves the sample complexity to <inline-formula> <tex-math>$\\mathcal {O}(\\epsilon ^{-3})$ </tex-math></inline-formula>, but with a constant-level CA distance. Our experiments on MT10 demonstrate the improved performance of our algorithms over existing MORL methods with fixed preference.","PeriodicalId":13494,"journal":{"name":"IEEE Transactions on Information Theory","volume":"71 9","pages":"7254-7269"},"PeriodicalIF":2.9000,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Theoretical Study of Conflict-Avoidant Multi-Objective Reinforcement Learning\",\"authors\":\"Yudan Wang;Peiyao Xiao;Hao Ban;Kaiyi Ji;Shaofeng Zou\",\"doi\":\"10.1109/TIT.2025.3581454\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multi-objective reinforcement learning (MORL) has shown great promise in many real-world applications. Existing MORL algorithms often aim to learn a policy that optimizes individual objective functions simultaneously with a given prior preference (or weights) on different objectives. However, these methods often suffer from the issue of <italic>gradient conflict</i> such that the objectives with larger gradients dominate the update direction, resulting in a performance degeneration on other objectives. In this paper, we develop a novel dynamic weighting multi-objective actor-critic algorithm (MOAC) under two options of sub-procedures named as conflict-avoidant (CA) and faster convergence (FC) in objective weight updates. MOAC-CA aims to find a CA update direction that maximizes the minimum value improvement among objectives, and MOAC-FC targets at a much faster convergence rate. 
We provide a comprehensive finite-time convergence analysis for both algorithms. We show that MOAC-CA can find a <inline-formula> <tex-math>$\\\\epsilon +\\\\epsilon _{\\\\text {app}}$ </tex-math></inline-formula>-accurate Pareto stationary policy using <inline-formula> <tex-math>$\\\\mathcal {O}({\\\\epsilon ^{-5}})$ </tex-math></inline-formula> samples, while ensuring a small <inline-formula> <tex-math>$\\\\epsilon +\\\\sqrt {\\\\epsilon _{\\\\text {app}}}$ </tex-math></inline-formula>-level CA distance (defined as the distance to the CA direction), where <inline-formula> <tex-math>$\\\\epsilon _{\\\\text {app}}$ </tex-math></inline-formula> is the function approximation error. The analysis also shows that MOAC-FC improves the sample complexity to <inline-formula> <tex-math>$\\\\mathcal {O}(\\\\epsilon ^{-3})$ </tex-math></inline-formula>, but with a constant-level CA distance. Our experiments on MT10 demonstrate the improved performance of our algorithms over existing MORL methods with fixed preference.\",\"PeriodicalId\":13494,\"journal\":{\"name\":\"IEEE Transactions on Information Theory\",\"volume\":\"71 9\",\"pages\":\"7254-7269\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2025-06-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Information Theory\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11044347/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Information Theory","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11044347/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Theoretical Study of Conflict-Avoidant Multi-Objective Reinforcement Learning
Multi-objective reinforcement learning (MORL) has shown great promise in many real-world applications. Existing MORL algorithms often aim to learn a policy that optimizes individual objective functions simultaneously with a given prior preference (or weights) on different objectives. However, these methods often suffer from the issue of gradient conflict, in which the objectives with larger gradients dominate the update direction, resulting in performance degradation on other objectives. In this paper, we develop a novel dynamic-weighting multi-objective actor-critic algorithm (MOAC) with two options of sub-procedures for the objective weight updates, named conflict-avoidant (CA) and faster convergence (FC). MOAC-CA aims to find a CA update direction that maximizes the minimum value improvement among objectives, while MOAC-FC targets a much faster convergence rate. We provide a comprehensive finite-time convergence analysis for both algorithms. We show that MOAC-CA can find an $\epsilon +\epsilon _{\text {app}}$-accurate Pareto stationary policy using $\mathcal {O}({\epsilon ^{-5}})$ samples, while ensuring a small $\epsilon +\sqrt {\epsilon _{\text {app}}}$-level CA distance (defined as the distance to the CA direction), where $\epsilon _{\text {app}}$ is the function approximation error. The analysis also shows that MOAC-FC improves the sample complexity to $\mathcal {O}(\epsilon ^{-3})$, but with a constant-level CA distance. Our experiments on MT10 demonstrate the improved performance of our algorithms over existing MORL methods with fixed preference.
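To make the conflict-avoidant idea concrete, the sketch below shows one common way such a CA direction can be computed from per-objective policy gradients: choose simplex weights that minimize the norm of the weighted gradient (an MGDA-style min-norm problem), whose solution maximizes the minimum per-objective improvement up to a quadratic penalty on the step size. This is a minimal illustrative sketch of the general technique under these assumptions, not the paper's MOAC implementation; the function names and the projected-gradient solver are illustrative choices.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def conflict_avoidant_direction(grads, n_steps=500, lr=0.05):
    """Given an (m, d) array of per-objective gradients, find simplex weights w
    minimizing (1/2) * ||sum_i w_i g_i||^2 by projected gradient descent.
    The weighted gradient is then an MGDA-style conflict-avoidant direction:
    it maximizes the minimum improvement <g_i, d> across objectives, up to a
    quadratic regularizer on ||d||."""
    m = grads.shape[0]
    gram = grads @ grads.T          # pairwise inner products of the gradients
    w = np.full(m, 1.0 / m)         # start from uniform objective weights
    for _ in range(n_steps):
        w = project_to_simplex(w - lr * (gram @ w))  # gradient of the quadratic is (G w)
    return w, grads.T @ w           # dynamic weights and the CA update direction

# Toy usage: two conflicting 2-D "policy gradients".
g = np.array([[1.0, 0.2],
              [-0.8, 0.6]])
w, d = conflict_avoidant_direction(g)
print("weights:", w)
print("CA direction:", d)
print("per-objective improvements:", g @ d)  # balanced rather than dominated by one objective
```

In the actor-critic setting studied in the paper, exact gradients would be replaced by stochastic estimates obtained from the critics; the sketch only illustrates the deterministic core of a CA-style weight update.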
Journal Introduction:
The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.