{"title":"零阶行为批判者:序列决策问题的进化框架","authors":"Yuheng Lei;Yao Lyu;Guojian Zhan;Tao Zhang;Jiangtao Li;Jianyu Chen;Shengbo Eben Li;Sifa Zheng","doi":"10.1109/TEVC.2025.3529503","DOIUrl":null,"url":null,"abstract":"Evolutionary algorithms (EAs) have shown promise in solving sequential decision problems (SDPs) by simplifying them to static optimization problems and searching for the optimal policy parameters in a zeroth-order way. Despite their versatility, EAs often suffer from high sample complexity due to neglecting underlying temporal structures. In contrast, reinforcement learning (RL) methods typically formulate SDPs as Markov decision process (MDP). Although more sample efficient than EAs, RL methods are restricted to differentiable policies and prone to getting stuck in local optima. To address these issues, we propose a novel evolutionary framework zeroth-order actor-critic (ZOAC). We propose to use stepwise exploration in parameter space and theoretically derive the zeroth-order policy gradient. We further utilize the actor-critic architecture to effectively leverage the Markov property of SDPs and reduce the variance of gradient estimators. In each iteration, ZOAC collects trajectories with parameter space exploration, and alternates between first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM). We evaluate the effectiveness of ZOAC on a challenging multilane driving task optimizing the parameters in a rule-based, nondifferentiable driving policy that consists of three submodules: 1) behavior selection; 2) path planning; and 3) trajectory tracking. We also compare it with gradient-based RL methods on three Gymnasium tasks, optimizing neural network policies with thousands of parameters. Experimental results demonstrate the strong capability of ZOAC in solving SDPs. ZOAC significantly outperforms EAs that treat the problem as static optimization and matches the performance of gradient-based RL methods even without first-order information, in terms of total average return across tasks.","PeriodicalId":13206,"journal":{"name":"IEEE Transactions on Evolutionary Computation","volume":"29 2","pages":"555-569"},"PeriodicalIF":11.7000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Zeroth-Order Actor–Critic: An Evolutionary Framework for Sequential Decision Problems\",\"authors\":\"Yuheng Lei;Yao Lyu;Guojian Zhan;Tao Zhang;Jiangtao Li;Jianyu Chen;Shengbo Eben Li;Sifa Zheng\",\"doi\":\"10.1109/TEVC.2025.3529503\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Evolutionary algorithms (EAs) have shown promise in solving sequential decision problems (SDPs) by simplifying them to static optimization problems and searching for the optimal policy parameters in a zeroth-order way. Despite their versatility, EAs often suffer from high sample complexity due to neglecting underlying temporal structures. In contrast, reinforcement learning (RL) methods typically formulate SDPs as Markov decision process (MDP). Although more sample efficient than EAs, RL methods are restricted to differentiable policies and prone to getting stuck in local optima. To address these issues, we propose a novel evolutionary framework zeroth-order actor-critic (ZOAC). We propose to use stepwise exploration in parameter space and theoretically derive the zeroth-order policy gradient. 
We further utilize the actor-critic architecture to effectively leverage the Markov property of SDPs and reduce the variance of gradient estimators. In each iteration, ZOAC collects trajectories with parameter space exploration, and alternates between first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM). We evaluate the effectiveness of ZOAC on a challenging multilane driving task optimizing the parameters in a rule-based, nondifferentiable driving policy that consists of three submodules: 1) behavior selection; 2) path planning; and 3) trajectory tracking. We also compare it with gradient-based RL methods on three Gymnasium tasks, optimizing neural network policies with thousands of parameters. Experimental results demonstrate the strong capability of ZOAC in solving SDPs. ZOAC significantly outperforms EAs that treat the problem as static optimization and matches the performance of gradient-based RL methods even without first-order information, in terms of total average return across tasks.\",\"PeriodicalId\":13206,\"journal\":{\"name\":\"IEEE Transactions on Evolutionary Computation\",\"volume\":\"29 2\",\"pages\":\"555-569\"},\"PeriodicalIF\":11.7000,\"publicationDate\":\"2025-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Evolutionary Computation\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10841436/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Evolutionary Computation","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10841436/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Zeroth-Order Actor–Critic: An Evolutionary Framework for Sequential Decision Problems
Evolutionary algorithms (EAs) have shown promise in solving sequential decision problems (SDPs) by simplifying them into static optimization problems and searching for the optimal policy parameters in a zeroth-order way. Despite their versatility, EAs often suffer from high sample complexity because they neglect the underlying temporal structure. In contrast, reinforcement learning (RL) methods typically formulate SDPs as Markov decision processes (MDPs). Although more sample-efficient than EAs, RL methods are restricted to differentiable policies and prone to getting stuck in local optima. To address these issues, we propose zeroth-order actor–critic (ZOAC), a novel evolutionary framework. ZOAC uses stepwise exploration in parameter space, for which we theoretically derive a zeroth-order policy gradient. We further adopt the actor–critic architecture to exploit the Markov property of SDPs and reduce the variance of the gradient estimator. In each iteration, ZOAC collects trajectories through parameter-space exploration and alternates between first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM). We evaluate the effectiveness of ZOAC on a challenging multilane driving task, optimizing the parameters of a rule-based, nondifferentiable driving policy that consists of three submodules: 1) behavior selection; 2) path planning; and 3) trajectory tracking. We also compare it with gradient-based RL methods on three Gymnasium tasks, optimizing neural network policies with thousands of parameters. Experimental results demonstrate the strong capability of ZOAC in solving SDPs: in terms of total average return across tasks, ZOAC significantly outperforms EAs that treat the problem as static optimization and matches the performance of gradient-based RL methods even without first-order information.
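To make the alternation between PEV and PIM concrete, below is a minimal sketch of the zeroth-order actor–critic idea on a toy one-dimensional control problem. This is not the authors' implementation: the dynamics, the linear critic with quadratic features, and all names and hyperparameters (`sigma`, `alpha`, `beta`, `gamma`, `T`) are assumptions made purely for illustration.

```python
# A minimal sketch of the zeroth-order actor-critic scheme described in the
# abstract. NOT the authors' code: the toy dynamics, the linear critic, and
# every hyperparameter here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def step(s, u):
    """Toy 1-D dynamics and reward (assumed for illustration only)."""
    s_next = 0.9 * s + 0.1 * u + 0.01 * rng.standard_normal()
    return s_next, -s_next ** 2

def act(theta, s):
    """Linear policy u = theta . [s, 1]; stands in for any
    (possibly nondifferentiable) parameterized policy."""
    return theta[0] * s + theta[1]

def features(s):
    """Critic features, so the value estimate is V(s) ~= w . phi(s)."""
    return np.array([1.0, s, s * s])

theta = np.zeros(2)          # actor (policy) parameters
w = np.zeros(3)              # critic (value function) weights
sigma, alpha, beta, gamma, T = 0.1, 0.05, 0.1, 0.99, 50

for iteration in range(300):
    s = 1.0
    Phi, targets = [], []
    grad = np.zeros_like(theta)
    for t in range(T):
        # Stepwise exploration: a fresh parameter perturbation each step.
        eps = rng.standard_normal(theta.size)
        s_next, r = step(s, act(theta + sigma * eps, s))
        v, v_next = features(s) @ w, features(s_next) @ w
        # Data for first-order PEV: bootstrapped one-step value targets.
        Phi.append(features(s))
        targets.append(r + gamma * v_next)
        # Zeroth-order PIM: the TD advantage weights this step's perturbation.
        grad += (r + gamma * v_next - v) * eps / sigma
        s = s_next
    # PEV: least-squares critic step toward the bootstrapped targets.
    Phi, targets = np.array(Phi), np.array(targets)
    w += beta * np.linalg.lstsq(Phi, targets - Phi @ w, rcond=None)[0]
    # PIM: ascend the zeroth-order policy-gradient estimate.
    theta += alpha * grad / T

print("learned policy parameters:", theta)
```

Note the division of labor: the critic update is first-order (a regression toward value targets), while the policy update never differentiates through the policy or the dynamics, which is what allows the same scheme to handle rule-based, nondifferentiable controllers such as the driving policy evaluated in the paper.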
Journal Introduction:
The IEEE Transactions on Evolutionary Computation is published by the IEEE Computational Intelligence Society on behalf of 13 societies: Circuits and Systems; Computer; Control Systems; Engineering in Medicine and Biology; Industrial Electronics; Industry Applications; Lasers and Electro-Optics; Oceanic Engineering; Power Engineering; Robotics and Automation; Signal Processing; Social Implications of Technology; and Systems, Man, and Cybernetics. The journal publishes original papers in evolutionary computation and related areas such as nature-inspired algorithms, population-based methods, optimization, and hybrid systems. It welcomes both purely theoretical papers and application papers that provide general insights into these areas of computation.