Fine-tuning Deep Reinforcement Learning Policies with r-STDP for Domain Adaptation

Mahmoud Akl, Yulia Sandamirskaya, Deniz Ergene, Florian Walter, Alois Knoll
{"title":"基于r-STDP的深度强化学习策略微调","authors":"Mahmoud Akl, Yulia Sandamirskaya, Deniz Ergene, Florian Walter, Alois Knoll","doi":"10.1145/3546790.3546804","DOIUrl":null,"url":null,"abstract":"Using deep reinforcement learning policies that are trained in simulation on real robotic platforms requires fine-tuning due to discrepancies between simulated and real environments. Multiple methods like domain randomization and system identification have been suggested to overcome this problem. However, sim-to-real transfer remains an open problem in robotics and deep reinforcement learning. In this paper, we present a spiking neural network (SNN) alternative for dealing with the sim-to-real problem. In particular, we train SNNs with backpropagation using surrogate gradients and the (Deep Q-Network) DQN algorithm to solve two classical control reinforcement learning tasks. The performance of the trained DQNs degrades when evaluated on randomized versions of the environments used during training. To compensate for the drop in performance, we apply the biologically plausible reward-modulated spike timing dependent plasticity (r-STDP) learning rule. Our results show that r-STDP can be successfully utilized to restore the network’s ability to solve the task. Furthermore, since r-STDP can be directly implemented on neuromorphic hardware, we believe it provides a promising neuromorphic solution to the sim-to-real problem.","PeriodicalId":104528,"journal":{"name":"Proceedings of the International Conference on Neuromorphic Systems 2022","volume":"149 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Fine-tuning Deep Reinforcement Learning Policies with r-STDP for Domain Adaptation\",\"authors\":\"Mahmoud Akl, Yulia Sandamirskaya, Deniz Ergene, Florian Walter, Alois Knoll\",\"doi\":\"10.1145/3546790.3546804\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Using deep reinforcement learning policies that are trained in simulation on real robotic platforms requires fine-tuning due to discrepancies between simulated and real environments. Multiple methods like domain randomization and system identification have been suggested to overcome this problem. However, sim-to-real transfer remains an open problem in robotics and deep reinforcement learning. In this paper, we present a spiking neural network (SNN) alternative for dealing with the sim-to-real problem. In particular, we train SNNs with backpropagation using surrogate gradients and the (Deep Q-Network) DQN algorithm to solve two classical control reinforcement learning tasks. The performance of the trained DQNs degrades when evaluated on randomized versions of the environments used during training. To compensate for the drop in performance, we apply the biologically plausible reward-modulated spike timing dependent plasticity (r-STDP) learning rule. Our results show that r-STDP can be successfully utilized to restore the network’s ability to solve the task. 
Furthermore, since r-STDP can be directly implemented on neuromorphic hardware, we believe it provides a promising neuromorphic solution to the sim-to-real problem.\",\"PeriodicalId\":104528,\"journal\":{\"name\":\"Proceedings of the International Conference on Neuromorphic Systems 2022\",\"volume\":\"149 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the International Conference on Neuromorphic Systems 2022\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3546790.3546804\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on Neuromorphic Systems 2022","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3546790.3546804","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Using deep reinforcement learning policies that are trained in simulation on real robotic platforms requires fine-tuning due to discrepancies between simulated and real environments. Multiple methods like domain randomization and system identification have been suggested to overcome this problem. However, sim-to-real transfer remains an open problem in robotics and deep reinforcement learning. In this paper, we present a spiking neural network (SNN) alternative for dealing with the sim-to-real problem. In particular, we train SNNs with backpropagation using surrogate gradients and the Deep Q-Network (DQN) algorithm to solve two classical control reinforcement learning tasks. The performance of the trained DQNs degrades when evaluated on randomized versions of the environments used during training. To compensate for the drop in performance, we apply the biologically plausible reward-modulated spike-timing-dependent plasticity (r-STDP) learning rule. Our results show that r-STDP can be successfully utilized to restore the network's ability to solve the task. Furthermore, since r-STDP can be directly implemented on neuromorphic hardware, we believe it provides a promising neuromorphic solution to the sim-to-real problem.
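The abstract's training phase, backpropagation through an SNN with surrogate gradients, hinges on replacing the non-differentiable spike nonlinearity with a smooth stand-in during the backward pass. Below is a minimal PyTorch sketch of this idea using a SuperSpike-style fast-sigmoid surrogate; the class name, steepness constant, and threshold convention are illustrative assumptions, not the paper's implementation.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike nonlinearity with a fast-sigmoid surrogate gradient.

    Forward: emit a spike (1.0) when the threshold-shifted membrane
    potential is positive. Backward: replace the ill-defined Heaviside
    derivative with the smooth fast-sigmoid derivative so that errors
    can backpropagate through spiking layers.
    """

    SCALE = 10.0  # surrogate steepness; illustrative value, not from the paper

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0.0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Derivative of the fast sigmoid 1 / (scale*|v| + 1), squared form
        surrogate = 1.0 / (SurrogateSpike.SCALE * v.abs() + 1.0) ** 2
        return grad_output * surrogate

spike_fn = SurrogateSpike.apply  # usable inside any LIF/DQN forward pass
```

Inside a DQN, such a spiking layer would take the place of the usual ReLU activations, letting the standard Q-learning loss be minimized end to end.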
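The fine-tuning phase relies on r-STDP, a three-factor rule. In its generic form (a sketch; the paper's exact trace dynamics and constants are not given here), each synapse accumulates an eligibility trace driven by pre/post spike timing, and the trace is converted into a weight change only when a scalar reward signal arrives:

$$
e_{ij}(t) = \lambda\, e_{ij}(t-1) + \mathrm{STDP}\!\left(\Delta t_{ij}\right),
\qquad
\Delta w_{ij} = \eta\, R(t)\, e_{ij}(t),
$$

where $\Delta t_{ij}$ is the pre/post spike-time difference at synapse $ij$, $\lambda \in [0, 1)$ is the trace decay, $\eta$ is a learning rate, and $R(t)$ is the reward. Because the update uses only locally available spike times and a global scalar reward, it maps directly onto the on-chip learning rules of neuromorphic hardware, which is the deployment path the abstract alludes to.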