Yudan Wang;Peiyao Xiao;Hao Ban;Kaiyi Ji;Shaofeng Zou
{"title":"回避冲突多目标强化学习的理论研究","authors":"Yudan Wang;Peiyao Xiao;Hao Ban;Kaiyi Ji;Shaofeng Zou","doi":"10.1109/TIT.2025.3581454","DOIUrl":null,"url":null,"abstract":"Multi-objective reinforcement learning (MORL) has shown great promise in many real-world applications. Existing MORL algorithms often aim to learn a policy that optimizes individual objective functions simultaneously with a given prior preference (or weights) on different objectives. However, these methods often suffer from the issue of <italic>gradient conflict</i> such that the objectives with larger gradients dominate the update direction, resulting in a performance degeneration on other objectives. In this paper, we develop a novel dynamic weighting multi-objective actor-critic algorithm (MOAC) under two options of sub-procedures named as conflict-avoidant (CA) and faster convergence (FC) in objective weight updates. MOAC-CA aims to find a CA update direction that maximizes the minimum value improvement among objectives, and MOAC-FC targets at a much faster convergence rate. We provide a comprehensive finite-time convergence analysis for both algorithms. We show that MOAC-CA can find a <inline-formula> <tex-math>$\\epsilon +\\epsilon _{\\text {app}}$ </tex-math></inline-formula>-accurate Pareto stationary policy using <inline-formula> <tex-math>$\\mathcal {O}({\\epsilon ^{-5}})$ </tex-math></inline-formula> samples, while ensuring a small <inline-formula> <tex-math>$\\epsilon +\\sqrt {\\epsilon _{\\text {app}}}$ </tex-math></inline-formula>-level CA distance (defined as the distance to the CA direction), where <inline-formula> <tex-math>$\\epsilon _{\\text {app}}$ </tex-math></inline-formula> is the function approximation error. The analysis also shows that MOAC-FC improves the sample complexity to <inline-formula> <tex-math>$\\mathcal {O}(\\epsilon ^{-3})$ </tex-math></inline-formula>, but with a constant-level CA distance. Our experiments on MT10 demonstrate the improved performance of our algorithms over existing MORL methods with fixed preference.","PeriodicalId":13494,"journal":{"name":"IEEE Transactions on Information Theory","volume":"71 9","pages":"7254-7269"},"PeriodicalIF":2.9000,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Theoretical Study of Conflict-Avoidant Multi-Objective Reinforcement Learning\",\"authors\":\"Yudan Wang;Peiyao Xiao;Hao Ban;Kaiyi Ji;Shaofeng Zou\",\"doi\":\"10.1109/TIT.2025.3581454\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multi-objective reinforcement learning (MORL) has shown great promise in many real-world applications. Existing MORL algorithms often aim to learn a policy that optimizes individual objective functions simultaneously with a given prior preference (or weights) on different objectives. However, these methods often suffer from the issue of <italic>gradient conflict</i> such that the objectives with larger gradients dominate the update direction, resulting in a performance degeneration on other objectives. In this paper, we develop a novel dynamic weighting multi-objective actor-critic algorithm (MOAC) under two options of sub-procedures named as conflict-avoidant (CA) and faster convergence (FC) in objective weight updates. MOAC-CA aims to find a CA update direction that maximizes the minimum value improvement among objectives, and MOAC-FC targets at a much faster convergence rate. 
We provide a comprehensive finite-time convergence analysis for both algorithms. We show that MOAC-CA can find a <inline-formula> <tex-math>$\\\\epsilon +\\\\epsilon _{\\\\text {app}}$ </tex-math></inline-formula>-accurate Pareto stationary policy using <inline-formula> <tex-math>$\\\\mathcal {O}({\\\\epsilon ^{-5}})$ </tex-math></inline-formula> samples, while ensuring a small <inline-formula> <tex-math>$\\\\epsilon +\\\\sqrt {\\\\epsilon _{\\\\text {app}}}$ </tex-math></inline-formula>-level CA distance (defined as the distance to the CA direction), where <inline-formula> <tex-math>$\\\\epsilon _{\\\\text {app}}$ </tex-math></inline-formula> is the function approximation error. The analysis also shows that MOAC-FC improves the sample complexity to <inline-formula> <tex-math>$\\\\mathcal {O}(\\\\epsilon ^{-3})$ </tex-math></inline-formula>, but with a constant-level CA distance. Our experiments on MT10 demonstrate the improved performance of our algorithms over existing MORL methods with fixed preference.\",\"PeriodicalId\":13494,\"journal\":{\"name\":\"IEEE Transactions on Information Theory\",\"volume\":\"71 9\",\"pages\":\"7254-7269\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2025-06-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Information Theory\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11044347/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Information Theory","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11044347/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Theoretical Study of Conflict-Avoidant Multi-Objective Reinforcement Learning
Multi-objective reinforcement learning (MORL) has shown great promise in many real-world applications. Existing MORL algorithms often aim to learn a policy that optimizes individual objective functions simultaneously with a given prior preference (or weights) on different objectives. However, these methods often suffer from the issue of gradient conflict, in which the objectives with larger gradients dominate the update direction, resulting in performance degradation on other objectives. In this paper, we develop a novel dynamic-weighting multi-objective actor-critic algorithm (MOAC) with two options of sub-procedures for the objective weight updates, named conflict-avoidant (CA) and faster convergence (FC). MOAC-CA aims to find a CA update direction that maximizes the minimum value improvement among objectives, while MOAC-FC targets a much faster convergence rate. We provide a comprehensive finite-time convergence analysis for both algorithms. We show that MOAC-CA can find an $\epsilon +\epsilon _{\text {app}}$-accurate Pareto stationary policy using $\mathcal {O}({\epsilon ^{-5}})$ samples, while ensuring a small $\epsilon +\sqrt {\epsilon _{\text {app}}}$-level CA distance (defined as the distance to the CA direction), where $\epsilon _{\text {app}}$ is the function approximation error. The analysis also shows that MOAC-FC improves the sample complexity to $\mathcal {O}(\epsilon ^{-3})$, but with a constant-level CA distance. Our experiments on MT10 demonstrate the improved performance of our algorithms over existing MORL methods with fixed preference.
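To make the conflict-avoidant idea concrete, the sketch below shows one common way such a CA direction can be computed from per-objective policy gradients: choose simplex weights that minimize the norm of the weighted gradient (an MGDA-style min-norm problem), whose solution maximizes the minimum per-objective improvement up to a quadratic penalty on the step size. This is a minimal illustrative sketch of the general technique under these assumptions, not the paper's MOAC implementation; the function names and the projected-gradient solver are illustrative choices.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def conflict_avoidant_direction(grads, n_steps=500, lr=0.05):
    """Given an (m, d) array of per-objective gradients, find simplex weights w
    minimizing (1/2) * ||sum_i w_i g_i||^2 by projected gradient descent.
    The weighted gradient is then an MGDA-style conflict-avoidant direction:
    it maximizes the minimum improvement <g_i, d> across objectives, up to a
    quadratic regularizer on ||d||."""
    m = grads.shape[0]
    gram = grads @ grads.T          # pairwise inner products of the gradients
    w = np.full(m, 1.0 / m)         # start from uniform objective weights
    for _ in range(n_steps):
        w = project_to_simplex(w - lr * (gram @ w))  # gradient of the quadratic is (G w)
    return w, grads.T @ w           # dynamic weights and the CA update direction

# Toy usage: two conflicting 2-D "policy gradients".
g = np.array([[1.0, 0.2],
              [-0.8, 0.6]])
w, d = conflict_avoidant_direction(g)
print("weights:", w)
print("CA direction:", d)
print("per-objective improvements:", g @ d)  # balanced rather than dominated by one objective
```

In the actor-critic setting studied in the paper, exact gradients would be replaced by stochastic estimates obtained from the critics; the sketch only illustrates the deterministic core of a CA-style weight update.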
Journal Introduction:
The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.