APR-ES: Adaptive Penalty-Reward Based Evolution Strategy for Deep Reinforcement Learning
Dongdong Wang, Siyang Lu, Xiang Wei, Mingquan Wang, Yandong Li, Liqiang Wang
DOI: 10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00079
Published: 2022-12-01
Citations: 0
Abstract
As a black-box optimization approach, the derivative-free evolution strategy (ES) has drawn considerable attention by virtue of its low sensitivity and high scalability. It rivals Markov Decision Process based reinforcement learning and can even improve rewards more efficiently in complex scenarios. However, existing derivative-free ES still suffers from slow convergence at the early training stage and limited exploration at the late convergence stage. Inspired by the human learning process, we propose a new scheme that extends ES by exploiting prior knowledge to guide the search, thereby accelerating early exploitation and improving later exploration. At the early training stage, Drift-Plus-Penalty (DPP), a penalty-based optimization scheme, is reformulated to boost penalty learning and reduce regret. Alongside DPP-directed evolution, reward learning with Thompson sampling (TS) is progressively strengthened to explore global optima at the late training stage. The scheme is validated with extensive experiments on a variety of benchmarks, including numerical problems, physics environments, and games. By virtue of its imitation of the human learning process, it outperforms state-of-the-art ES on these benchmarks by a large margin.
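To make the baseline concrete, the sketch below shows a vanilla derivative-free ES of the kind the paper extends: parameters are perturbed with Gaussian noise, fitness is evaluated per perturbation, and a search-gradient step is taken from rank-normalized rewards. This is only an illustrative minimal ES; the paper's DPP penalty term and Thompson-sampling reward learning are not reproduced here, and all function and parameter names are our own.

```python
import numpy as np

def evolution_strategy(fitness, theta0, sigma=0.1, lr=0.02,
                       population=50, iterations=200, seed=0):
    """Minimal vanilla ES: estimate a search gradient from Gaussian
    perturbations of the current parameter vector (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        # Sample Gaussian perturbations around the current parameters.
        eps = rng.standard_normal((population, theta.size))
        rewards = np.array([fitness(theta + sigma * e) for e in eps])
        # Rank-normalize rewards for robustness to reward scale.
        ranks = rewards.argsort().argsort()
        weights = ranks / (population - 1) - 0.5
        # Ascend the estimated search gradient.
        theta = theta + lr / (population * sigma) * (weights @ eps)
    return theta

# Usage: maximize the negative sphere function (optimum at the origin).
best = evolution_strategy(lambda x: -np.sum(x ** 2), theta0=np.ones(5))
```

Because the update uses only fitness evaluations, never gradients of the objective, the same loop applies unchanged to non-differentiable rewards, which is the black-box property the abstract refers to.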