DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training

Angelo Garofalo;Yvan Tortorella;Matteo Perotti;Luca Valente;Alessandro Nadalini;Luca Benini;Davide Rossi;Francesco Conti
{"title":"DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training","authors":"Angelo Garofalo;Yvan Tortorella;Matteo Perotti;Luca Valente;Alessandro Nadalini;Luca Benini;Davide Rossi;Francesco Conti","doi":"10.1109/OJSSCS.2022.3210082","DOIUrl":null,"url":null,"abstract":"On-chip deep neural network (DNN) inference and training at the Extreme-Edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are promising solutions to meet the challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of eight RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost the performance and efficiency on key compute-intensive DNN kernels, the cluster is enriched with three digital accelerators: 1) a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); 2) a minimal overhead datamover to marshal 1–32-b data on-the-fly; and 3) a 16-b floating-point tensor product engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65-nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency—enough to enable on-chip floating-point training at competitive speed coupled with ultralow power quantized inference.","PeriodicalId":100633,"journal":{"name":"IEEE Open Journal of the Solid-State Circuits Society","volume":"2 ","pages":"231-243"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8782712/9733783/09903915.pdf","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Solid-State Circuits Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/9903915/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

On-chip deep neural network (DNN) inference and training at the Extreme-Edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are promising solutions to meet the challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of eight RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost the performance and efficiency on key compute-intensive DNN kernels, the cluster is enriched with three digital accelerators: 1) a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); 2) a minimal overhead datamover to marshal 1–32-b data on-the-fly; and 3) a 16-b floating-point tensor product engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65-nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency—enough to enable on-chip floating-point training at competitive speed coupled with ultralow power quantized inference.
DARKSIDE:一个用于极端边缘片上DNN推理和训练的异构RISC-V计算集群
片上深度神经网络(DNN)推理和极限边缘训练(TinyML)对延迟、吞吐量、准确性和灵活性提出了严格的要求。异构集群将DSP增强内核的灵活性与专用加速器的性能和能量提升相结合,是应对这一挑战的有前景的解决方案。我们介绍了DARKSIDE,这是一种片上系统,具有八个RISC-V核心的异构集群,并使用2-b到32-b混合精度整数算法进行了增强。为了提高关键计算密集型DNN内核的性能和效率,该集群配备了三个数字加速器:1)用于低数据重用深度卷积内核的专用引擎(高达30 MAC/周期);2) 一个最小开销的数据移动器,用于动态编组1–32-b数据;以及3)用于拼接矩阵乘法加速的16-b浮点张量积引擎(TPE)。DARKSIDE采用65nm CMOS技术实现。当处理2-b整数DNN内核时,该集群实现了65 GOPS的峰值整数性能和835 GOPS/W的峰值效率。当针对浮点张量运算时,TPE提供高达18.2 GFLOPS的性能或300 GFLOPS/W的效率,足以实现具有竞争力的速度下的片上浮点训练以及超低功耗量化推理。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信