Zeroth-Order Actor–Critic: An Evolutionary Framework for Sequential Decision Problems

IF 11.7 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Yuheng Lei;Yao Lyu;Guojian Zhan;Tao Zhang;Jiangtao Li;Jianyu Chen;Shengbo Eben Li;Sifa Zheng
{"title":"零阶行为批判者:序列决策问题的进化框架","authors":"Yuheng Lei;Yao Lyu;Guojian Zhan;Tao Zhang;Jiangtao Li;Jianyu Chen;Shengbo Eben Li;Sifa Zheng","doi":"10.1109/TEVC.2025.3529503","DOIUrl":null,"url":null,"abstract":"Evolutionary algorithms (EAs) have shown promise in solving sequential decision problems (SDPs) by simplifying them to static optimization problems and searching for the optimal policy parameters in a zeroth-order way. Despite their versatility, EAs often suffer from high sample complexity due to neglecting underlying temporal structures. In contrast, reinforcement learning (RL) methods typically formulate SDPs as Markov decision process (MDP). Although more sample efficient than EAs, RL methods are restricted to differentiable policies and prone to getting stuck in local optima. To address these issues, we propose a novel evolutionary framework zeroth-order actor-critic (ZOAC). We propose to use stepwise exploration in parameter space and theoretically derive the zeroth-order policy gradient. We further utilize the actor-critic architecture to effectively leverage the Markov property of SDPs and reduce the variance of gradient estimators. In each iteration, ZOAC collects trajectories with parameter space exploration, and alternates between first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM). We evaluate the effectiveness of ZOAC on a challenging multilane driving task optimizing the parameters in a rule-based, nondifferentiable driving policy that consists of three submodules: 1) behavior selection; 2) path planning; and 3) trajectory tracking. We also compare it with gradient-based RL methods on three Gymnasium tasks, optimizing neural network policies with thousands of parameters. Experimental results demonstrate the strong capability of ZOAC in solving SDPs. ZOAC significantly outperforms EAs that treat the problem as static optimization and matches the performance of gradient-based RL methods even without first-order information, in terms of total average return across tasks.","PeriodicalId":13206,"journal":{"name":"IEEE Transactions on Evolutionary Computation","volume":"29 2","pages":"555-569"},"PeriodicalIF":11.7000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Zeroth-Order Actor–Critic: An Evolutionary Framework for Sequential Decision Problems\",\"authors\":\"Yuheng Lei;Yao Lyu;Guojian Zhan;Tao Zhang;Jiangtao Li;Jianyu Chen;Shengbo Eben Li;Sifa Zheng\",\"doi\":\"10.1109/TEVC.2025.3529503\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Evolutionary algorithms (EAs) have shown promise in solving sequential decision problems (SDPs) by simplifying them to static optimization problems and searching for the optimal policy parameters in a zeroth-order way. Despite their versatility, EAs often suffer from high sample complexity due to neglecting underlying temporal structures. In contrast, reinforcement learning (RL) methods typically formulate SDPs as Markov decision process (MDP). Although more sample efficient than EAs, RL methods are restricted to differentiable policies and prone to getting stuck in local optima. To address these issues, we propose a novel evolutionary framework zeroth-order actor-critic (ZOAC). We propose to use stepwise exploration in parameter space and theoretically derive the zeroth-order policy gradient. 
We further utilize the actor-critic architecture to effectively leverage the Markov property of SDPs and reduce the variance of gradient estimators. In each iteration, ZOAC collects trajectories with parameter space exploration, and alternates between first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM). We evaluate the effectiveness of ZOAC on a challenging multilane driving task optimizing the parameters in a rule-based, nondifferentiable driving policy that consists of three submodules: 1) behavior selection; 2) path planning; and 3) trajectory tracking. We also compare it with gradient-based RL methods on three Gymnasium tasks, optimizing neural network policies with thousands of parameters. Experimental results demonstrate the strong capability of ZOAC in solving SDPs. ZOAC significantly outperforms EAs that treat the problem as static optimization and matches the performance of gradient-based RL methods even without first-order information, in terms of total average return across tasks.\",\"PeriodicalId\":13206,\"journal\":{\"name\":\"IEEE Transactions on Evolutionary Computation\",\"volume\":\"29 2\",\"pages\":\"555-569\"},\"PeriodicalIF\":11.7000,\"publicationDate\":\"2025-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Evolutionary Computation\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10841436/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Evolutionary Computation","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10841436/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Evolutionary algorithms (EAs) have shown promise in solving sequential decision problems (SDPs) by simplifying them to static optimization problems and searching for the optimal policy parameters in a zeroth-order way. Despite their versatility, EAs often suffer from high sample complexity because they neglect the underlying temporal structure. In contrast, reinforcement learning (RL) methods typically formulate SDPs as Markov decision processes (MDPs). Although more sample-efficient than EAs, RL methods are restricted to differentiable policies and prone to getting stuck in local optima. To address these issues, we propose a novel evolutionary framework, zeroth-order actor-critic (ZOAC). We propose stepwise exploration in parameter space and theoretically derive the zeroth-order policy gradient. We further utilize the actor-critic architecture to effectively leverage the Markov property of SDPs and reduce the variance of the gradient estimators. In each iteration, ZOAC collects trajectories with parameter-space exploration and alternates between first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM). We evaluate the effectiveness of ZOAC on a challenging multilane driving task, optimizing the parameters of a rule-based, nondifferentiable driving policy that consists of three submodules: 1) behavior selection; 2) path planning; and 3) trajectory tracking. We also compare it with gradient-based RL methods on three Gymnasium tasks, optimizing neural network policies with thousands of parameters. Experimental results demonstrate the strong capability of ZOAC in solving SDPs. In terms of total average return across tasks, ZOAC significantly outperforms EAs that treat the problem as static optimization and matches the performance of gradient-based RL methods even without first-order information.
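The abstract refers to a theoretically derived zeroth-order policy gradient. As a rough illustration only (this is the generic Gaussian-smoothing identity used in evolution-strategies-style methods, not the paper's own derivation, which incorporates stepwise exploration and a critic), a gradient of the smoothed objective can be estimated from function evaluations alone:

```latex
% Generic zeroth-order gradient identity: the gradient of a Gaussian-smoothed
% objective requires only evaluations of J, not its derivatives.
\nabla_{\theta}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\bigl[J(\theta + \sigma\epsilon)\bigr]
  = \frac{1}{\sigma}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\bigl[J(\theta + \sigma\epsilon)\,\epsilon\bigr]
```

Here J is the (possibly nondifferentiable) return of the policy with parameters θ, and σ controls the perturbation scale.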
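To make the iteration described in the abstract concrete, below is a minimal, hypothetical sketch of a ZOAC-style loop written only from that description; the function names, hyperparameters, and the normalized-return baseline used in place of a learned critic are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a ZOAC-style training loop (illustration only).
import numpy as np

def zoac_sketch(env_rollout, theta, n_iters=100, pop=16, sigma=0.1, lr=0.01):
    """env_rollout(params) -> (total_return, trajectory) for one episode."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_iters):
        # 1) Parameter-space exploration: perturb the policy parameters and
        #    roll out each perturbed policy to collect returns/trajectories.
        eps = np.random.randn(pop, theta.size)
        returns = np.empty(pop)
        for i in range(pop):
            returns[i], _trajectory = env_rollout(theta + sigma * eps[i])

        # 2) Policy evaluation (PEV): ZOAC fits a critic on the trajectories
        #    with first-order updates; a normalized return baseline is used
        #    here only as a crude stand-in for that critic.
        advantages = (returns - returns.mean()) / (returns.std() + 1e-8)

        # 3) Zeroth-order policy improvement (PIM): evolution-strategies style
        #    gradient estimate built from function evaluations only.
        grad = (advantages[:, None] * eps).mean(axis=0) / sigma
        theta = theta + lr * grad
    return theta
```

Per the abstract, the actual method differs from this sketch in two key ways: exploration is stepwise in parameter space rather than one perturbation per whole episode, and PEV trains a critic with first-order updates to exploit the Markov property and reduce the variance of the zeroth-order gradient estimate.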
Source journal
IEEE Transactions on Evolutionary Computation (Engineering & Technology - Computer Science: Theory & Methods)
CiteScore: 21.90
Self-citation rate: 9.80%
Articles published: 196
Review time: 3.6 months
Journal description: The IEEE Transactions on Evolutionary Computation is published by the IEEE Computational Intelligence Society on behalf of 13 societies: Circuits and Systems; Computer; Control Systems; Engineering in Medicine and Biology; Industrial Electronics; Industry Applications; Lasers and Electro-Optics; Oceanic Engineering; Power Engineering; Robotics and Automation; Signal Processing; Social Implications of Technology; and Systems, Man, and Cybernetics. The journal publishes original papers in evolutionary computation and related areas such as nature-inspired algorithms, population-based methods, optimization, and hybrid systems. It welcomes both purely theoretical papers and application papers that provide general insights into these areas of computation.