How to Efficiently Train Your AI Agent? Characterizing and Evaluating Deep Reinforcement Learning on Heterogeneous Platforms

Yuan Meng, Yang Yang, S. Kuppannagari, R. Kannan, V. Prasanna
{"title":"如何有效地训练你的AI代理?异构平台上深度强化学习的表征与评价","authors":"Yuan Meng, Yang Yang, S. Kuppannagari, R. Kannan, V. Prasanna","doi":"10.1109/HPEC43674.2020.9286150","DOIUrl":null,"url":null,"abstract":"Deep Reinforcement Learning (Deep RL) is a key technology in several domains such as self-driving cars, robotics, surveillance, etc. In Deep RL, using a Deep Neural Network model, an agent learns how to interact with the environment to achieve a certain goal. The efficiency of running a Deep RL algorithm on a hardware architecture is dependent upon several factors including (1) the suitability of the hardware architecture for kernels and computation patterns which are fundamental to Deep RL; (2) the capability of the hardware architecture's memory hierarchy to minimize data-communication latency; and (3) the ability of the hardware architecture to hide overheads introduced by the deeply nested highly irregular computation characteristics in Deep RL algorithms. GPUs have been popular for accelerating RL algorithms, however, they fail to optimally satisfy the above-mentioned requirements. A few recent works have developed highly customized accelerators for specific Deep RL algorithms. However, they cannot be generalized easily to the plethora of Deep RL algorithms and DNN model choices that are available. In this paper, we explore the possibility of developing a unified framework that can accelerate a wide range of Deep RL algorithms including variations in training methods or DNN model structures. We take one step towards this goal by defining a domain-specific high-level abstraction for a widely used broad class of Deep RL algorithms - on-policy Deep RL. Furthermore, we provide a systematic analysis of the performance of state-of-the-art on-policy Deep RL algorithms on CPU-GPU and CPU-FPGA platforms. We target two representative algorithms - PPO and A2C, for application areas - robotics and games. we show that a FPGA-based custom accelerator achieves up to 24× (PPO) and 8× (A2C) speedups on training tasks, and 17× (PPO) and 2.1 × (A2C) improvements on overall throughput, respectively.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"How to Efficiently Train Your AI Agent? Characterizing and Evaluating Deep Reinforcement Learning on Heterogeneous Platforms\",\"authors\":\"Yuan Meng, Yang Yang, S. Kuppannagari, R. Kannan, V. Prasanna\",\"doi\":\"10.1109/HPEC43674.2020.9286150\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep Reinforcement Learning (Deep RL) is a key technology in several domains such as self-driving cars, robotics, surveillance, etc. In Deep RL, using a Deep Neural Network model, an agent learns how to interact with the environment to achieve a certain goal. The efficiency of running a Deep RL algorithm on a hardware architecture is dependent upon several factors including (1) the suitability of the hardware architecture for kernels and computation patterns which are fundamental to Deep RL; (2) the capability of the hardware architecture's memory hierarchy to minimize data-communication latency; and (3) the ability of the hardware architecture to hide overheads introduced by the deeply nested highly irregular computation characteristics in Deep RL algorithms. 
GPUs have been popular for accelerating RL algorithms, however, they fail to optimally satisfy the above-mentioned requirements. A few recent works have developed highly customized accelerators for specific Deep RL algorithms. However, they cannot be generalized easily to the plethora of Deep RL algorithms and DNN model choices that are available. In this paper, we explore the possibility of developing a unified framework that can accelerate a wide range of Deep RL algorithms including variations in training methods or DNN model structures. We take one step towards this goal by defining a domain-specific high-level abstraction for a widely used broad class of Deep RL algorithms - on-policy Deep RL. Furthermore, we provide a systematic analysis of the performance of state-of-the-art on-policy Deep RL algorithms on CPU-GPU and CPU-FPGA platforms. We target two representative algorithms - PPO and A2C, for application areas - robotics and games. we show that a FPGA-based custom accelerator achieves up to 24× (PPO) and 8× (A2C) speedups on training tasks, and 17× (PPO) and 2.1 × (A2C) improvements on overall throughput, respectively.\",\"PeriodicalId\":168544,\"journal\":{\"name\":\"2020 IEEE High Performance Extreme Computing Conference (HPEC)\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE High Performance Extreme Computing Conference (HPEC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPEC43674.2020.9286150\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC43674.2020.9286150","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

Deep Reinforcement Learning (Deep RL) is a key technology in several domains such as self-driving cars, robotics, and surveillance. In Deep RL, an agent uses a Deep Neural Network model to learn how to interact with the environment to achieve a certain goal. The efficiency of running a Deep RL algorithm on a hardware architecture depends on several factors, including (1) the suitability of the hardware architecture for the kernels and computation patterns that are fundamental to Deep RL; (2) the capability of the hardware architecture's memory hierarchy to minimize data-communication latency; and (3) the ability of the hardware architecture to hide the overheads introduced by the deeply nested, highly irregular computation characteristics of Deep RL algorithms. GPUs have been popular for accelerating RL algorithms; however, they fail to optimally satisfy the above requirements. A few recent works have developed highly customized accelerators for specific Deep RL algorithms, but these cannot be easily generalized to the plethora of available Deep RL algorithms and DNN model choices. In this paper, we explore the possibility of developing a unified framework that can accelerate a wide range of Deep RL algorithms, including variations in training methods and DNN model structures. We take one step towards this goal by defining a domain-specific high-level abstraction for a widely used broad class of Deep RL algorithms: on-policy Deep RL. Furthermore, we provide a systematic analysis of the performance of state-of-the-art on-policy Deep RL algorithms on CPU-GPU and CPU-FPGA platforms. We target two representative algorithms, PPO and A2C, in two application areas, robotics and games. We show that an FPGA-based custom accelerator achieves up to 24× (PPO) and 8× (A2C) speedups on training tasks, and 17× (PPO) and 2.1× (A2C) improvements in overall throughput, respectively.
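
To make the on-policy training pattern discussed in the abstract concrete, the sketch below shows a minimal A2C-style iteration in PyTorch: a rollout phase in which the current policy interacts with the environment, followed by a training phase that performs one gradient update on the freshly collected data before discarding it. This is an illustrative sketch only, not the paper's framework or accelerator code; the environment, network, and hyperparameters (ToyEnv, ActorCritic, rollout_len, and so on) are invented for the example.

```python
# Minimal, self-contained on-policy (A2C-style) training loop sketch.
# Illustrative only: ToyEnv, ActorCritic, and all hyperparameters are made up.
import torch
import torch.nn as nn

class ToyEnv:
    """Tiny stand-in environment: 4-dim observation, 2 discrete actions."""
    def reset(self):
        self.state = torch.randn(4)
        return self.state
    def step(self, action):
        # Arbitrary toy reward: action 1 is "correct" when the state mean is positive.
        correct = (action == 1) == bool(self.state.mean() > 0)
        reward = 1.0 if correct else 0.0
        self.state = torch.randn(4)
        return self.state, reward

class ActorCritic(nn.Module):
    """Shared body with a policy head (action logits) and a value head."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy = nn.Linear(hidden, n_actions)
        self.value = nn.Linear(hidden, 1)
    def forward(self, obs):
        h = self.body(obs)
        return self.policy(h), self.value(h).squeeze(-1)

env, net = ToyEnv(), ActorCritic()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
gamma, rollout_len = 0.99, 16

obs = env.reset()
for iteration in range(100):
    # 1) Rollout phase: the current policy interacts with the environment
    #    (inference-dominated; no parameter updates happen here).
    log_probs, values, rewards = [], [], []
    for _ in range(rollout_len):
        logits, value = net(obs)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        values.append(value)
        rewards.append(reward)

    # 2) Compute discounted returns and advantages from the fresh rollout.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)
    values = torch.stack(values)
    advantages = returns - values.detach()

    # 3) Training phase: one gradient update on the just-collected data,
    #    which is then discarded -- the defining property of on-policy RL.
    policy_loss = -(torch.stack(log_probs) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The tight alternation between inference-heavy environment rollouts and small, latency-sensitive model updates shown here is representative of the computation pattern the paper characterizes on CPU-GPU and CPU-FPGA platforms.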