Two-stage population based training method for deep reinforcement learning

Yinda Zhou, W. Liu, Bin Li
{"title":"Two-stage population based training method for deep reinforcement learning","authors":"Yinda Zhou, W. Liu, Bin Li","doi":"10.1145/3318265.3318294","DOIUrl":null,"url":null,"abstract":"Deep reinforcement learning (DRL) methods has been widely applied on more and more challenging learning tasks, and achieved excellent performance. However, the efficiency of deep reinforcement learning is notoriously sensitive to their own hyperparameter configuration. The optimization process of deep reinforcement learning is highly dynamic and non-stationary, rather than a simple fitting process. So, its optimal hyperparameter should be adaptively adjusted according to the current learning process, rather than using a fixed set of hyperparameter configurations from beginning to end. DeepMind innovatively proposed a population based training (PBT) method for deep reinforcement learning, which achieved hyperparameter adaptation and made the model better trained. However, we assume that at the early stage when the learning model has little knowledge of the environment, frequent hyperparameter change will not be helpful for the model to learn efficiently, while learning with a reasonable fixed hyperparameter configuration will help the model obtain necessary knowledge as quick as possible, which we consider is more important for reinforcement learning at early stage. In this paper, we verified our hypothesis through experiments, and a Two-Stage Population Based Training (TS-PBT) method is proposed, which is a more efficient population based training method for deep reinforcement learning. Experiments show that at the same computational budget, our TS-PBT method makes the final performance of the model significantly better than the PBT method. TS-PBT achieved 40%, 310%, 2%, 53%, 30% and 38% performance improvement over PBT separately in six test environments.","PeriodicalId":241692,"journal":{"name":"Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications","volume":"183 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3318265.3318294","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Deep reinforcement learning (DRL) methods have been widely applied to increasingly challenging learning tasks and have achieved excellent performance. However, the efficiency of deep reinforcement learning is notoriously sensitive to its hyperparameter configuration. The optimization process of deep reinforcement learning is highly dynamic and non-stationary rather than a simple fitting process, so the optimal hyperparameters should be adjusted adaptively according to the current learning progress rather than fixed from beginning to end. DeepMind proposed a population based training (PBT) method for deep reinforcement learning that achieves hyperparameter adaptation and trains models more effectively. However, we hypothesize that at the early stage, when the learning model has little knowledge of the environment, frequent hyperparameter changes do not help the model learn efficiently, whereas learning with a reasonable fixed hyperparameter configuration helps the model acquire the necessary knowledge as quickly as possible, which we consider more important for reinforcement learning at the early stage. In this paper, we verify this hypothesis through experiments and propose a Two-Stage Population Based Training (TS-PBT) method, a more efficient population based training method for deep reinforcement learning. Experiments show that, under the same computational budget, TS-PBT yields significantly better final model performance than PBT, achieving improvements of 40%, 310%, 2%, 53%, 30%, and 38% over PBT in six test environments.
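The abstract describes the two-stage idea only at a high level. The following is a minimal, hypothetical sketch of what such a training loop could look like: stage one trains every population member with its initial, fixed hyperparameters, and stage two switches to standard PBT-style exploit/explore. The toy objective, the hyperparameter (a single learning rate), the population size, and the stage-switch step are illustrative assumptions, not the authors' exact procedure.

```python
import random

# Hypothetical sketch of a two-stage population based training (TS-PBT) loop.
# All constants and the toy "training" objective are illustrative assumptions.

POPULATION_SIZE = 8
TOTAL_STEPS = 200
STAGE_ONE_STEPS = 80      # Stage 1: hyperparameters stay fixed, no exploit/explore.
EXPLOIT_INTERVAL = 10     # Stage 2: PBT-style exploit/explore every N steps.


def make_worker():
    """Create one population member: hyperparameters, model state, and score."""
    return {
        "lr": random.uniform(1e-4, 1e-2),   # example hyperparameter
        "theta": 0.0,                        # stand-in for network weights
        "score": float("-inf"),
    }


def train_step(worker):
    """Toy update standing in for one DRL training iteration."""
    worker["theta"] += worker["lr"] * random.gauss(1.0, 0.5)
    worker["score"] = -abs(worker["theta"] - 1.0)  # toy objective: reach theta == 1


def exploit_and_explore(population):
    """Standard PBT step: bottom workers copy a top worker, then perturb hyperparameters."""
    ranked = sorted(population, key=lambda w: w["score"], reverse=True)
    quarter = max(1, len(ranked) // 4)
    top, bottom = ranked[:quarter], ranked[-quarter:]
    for worker in bottom:
        source = random.choice(top)
        worker["theta"] = source["theta"]                          # exploit: copy weights
        worker["lr"] = source["lr"] * random.choice([0.8, 1.2])    # explore: perturb hyperparameter


population = [make_worker() for _ in range(POPULATION_SIZE)]
for step in range(1, TOTAL_STEPS + 1):
    for worker in population:
        train_step(worker)
    # Stage 1 keeps each worker's initial hyperparameters fixed; stage 2 runs PBT.
    if step > STAGE_ONE_STEPS and step % EXPLOIT_INTERVAL == 0:
        exploit_and_explore(population)

best = max(population, key=lambda w: w["score"])
print(f"best score {best['score']:.4f} with lr {best['lr']:.5f}")
```

Under this reading, the only change relative to plain PBT is the guard on `STAGE_ONE_STEPS`, which delays exploit/explore until the population has accumulated some knowledge of the environment.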