LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning

2022 International Conference on Field-Programmable Technology (ICFPT). Pub Date: 2022-10-29. DOI: 10.1109/ICFPT56656.2022.9974543

Jenny Yang, Jaeuk Kim, Joo-Young Kim


Abstract

Multi-agent reinforcement learning (MARL) is a powerful technology for building interactive artificial intelligence systems in applications such as multi-robot control and self-driving cars. Unlike supervised models and single-agent reinforcement learning, which actively exploit network pruning, it is unclear how pruning works in MARL given its cooperative and interactive characteristics. In this paper, we present a real-time sparse training acceleration system named LearningGroup, which applies network pruning to MARL training for the first time using an algorithm/architecture co-design approach. We create sparsity using a weight grouping algorithm and propose an on-chip sparse data encoding loop (OSEL) that enables fast encoding with an efficient implementation. Based on the OSEL's encoding format, LearningGroup performs efficient weight compression and allocates the computation workload to multiple cores, where each core handles multiple sparse rows of the weight matrix simultaneously with vector processing units. As a result, the LearningGroup system reduces the cycle time and memory footprint of sparse data generation by up to 5.72x and 6.81x, respectively. Its FPGA accelerator shows 257.40-3629.48 GFLOPS throughput and 7.10-100.12 GFLOPS/W energy efficiency under various MARL conditions, which is 7.13x higher throughput and 12.43x better energy efficiency than an Nvidia Titan RTX GPU, thanks to the fully on-chip training and the highly optimized dataflow and data format provided by the FPGA. Most importantly, the accelerator shows a speedup of up to 12.52x for processing sparse data over the dense case, which is the highest among state-of-the-art sparse training accelerators.
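The abstract mentions three steps: creating sparsity by grouping weights, compressing the sparse rows, and allocating those rows to multiple cores. The sketch below illustrates only that general idea in Python. It is not the OSEL format or the authors' grouping algorithm, neither of which is specified here. It assumes a hypothetical mask in which a weight is kept only when its row and column share a group, a simple (column indices, values) row encoding, and a greedy heaviest-first assignment of rows to cores. All function names are invented for illustration.

```python
import heapq


def group_mask(row_groups, col_groups):
    """Hypothetical mask: keep weight (i, j) only if row i and column j share a group."""
    return [[rg == cg for cg in col_groups] for rg in row_groups]


def compress_rows(weights, mask):
    """Store each row as (column_indices, values) for the unmasked entries only."""
    rows = []
    for w_row, m_row in zip(weights, mask):
        cols = [j for j, keep in enumerate(m_row) if keep]
        rows.append((cols, [w_row[j] for j in cols]))
    return rows


def allocate_rows(rows, num_cores):
    """Greedy balancing: hand the next-heaviest row to the least-loaded core."""
    heap = [(0, core) for core in range(num_cores)]  # entries are (load, core id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_cores)]
    # Visit rows from most to fewest nonzeros.
    order = sorted(range(len(rows)), key=lambda i: len(rows[i][0]), reverse=True)
    for i in order:
        load, core = heapq.heappop(heap)
        assignment[core].append(i)
        heapq.heappush(heap, (load + len(rows[i][0]), core))
    return assignment


def sparse_matvec(rows, x):
    """Multiply the compressed matrix by a dense vector x."""
    return [sum(v * x[j] for j, v in zip(cols, vals)) for cols, vals in rows]


if __name__ == "__main__":
    weights = [[1.0, 2.0, 3.0, 4.0],
               [5.0, 6.0, 7.0, 8.0],
               [9.0, 10.0, 11.0, 12.0]]
    # Assumed group labels: three output rows, four input columns.
    mask = group_mask(row_groups=[0, 1, 0], col_groups=[0, 0, 1, 0])
    rows = compress_rows(weights, mask)
    print("compressed rows:", rows)
    print("rows per core:", allocate_rows(rows, num_cores=2))
    print("W @ x:", sparse_matvec(rows, [1.0, 1.0, 1.0, 1.0]))
```

Balancing by nonzero count rather than by row count matters because sparse rows differ in how much work they carry. A hardware design would perform this allocation in its own dataflow rather than with a software heap.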
