AI-PiM—Extending the RISC-V processor with Processing-in-Memory functional units for AI inference at the edge of IoT

IF 1.9 · Q3 · Engineering, Electrical & Electronic
Vaibhav Verma, M. Stan
{"title":"AI-PiM—Extending the RISC-V processor with Processing-in-Memory functional units for AI inference at the edge of IoT","authors":"Vaibhav Verma, M. Stan","doi":"10.3389/felec.2022.898273","DOIUrl":null,"url":null,"abstract":"The recent advances in Artificial Intelligence (AI) achieving “better-than-human” accuracy in a variety of tasks such as image classification and the game of Go have come at the cost of exponential increase in the size of artificial neural networks. This has lead to AI hardware solutions becoming severely memory-bound and scrambling to keep-up with the ever increasing “von Neumann bottleneck”. Processing-in-Memory (PiM) architectures offer an excellent solution to ease the von Neumann bottleneck by embedding compute capabilities inside the memory and reducing the data traffic between the memory and the processor. But PiM accelerators break the standard von Neumann programming model by fusing memory and compute operations together which impedes their integration in the standard computing stack. There is an urgent requirement for system-level solutions to take full advantage of PiM accelerators for end-to-end acceleration of AI applications. This article presents AI-PiM as a solution to bridge this research gap. AI-PiM proposes a hardware, ISA and software co-design methodology which allows integration of PiM accelerators in the RISC-V processor pipeline as functional execution units. AI-PiM also extends the RISC-V ISA with custom instructions which directly target the PiM functional units resulting in their tight integration with the processor. This tight integration is especially important for edge AI devices which need to process both AI and non-AI tasks on the same hardware due to area, power, size and cost constraints. AI-PiM ISA extensions expose the PiM hardware functionality to software programmers allowing efficient mapping of applications to the PiM hardware. AI-PiM adds support for custom ISA extensions to the complete software stack including compiler, assembler, linker, simulator and profiler to ensure programmability and evaluation with popular AI domain-specific languages and frameworks like TensorFlow, PyTorch, MXNet, Keras etc. AI-PiM improves the performance for vector-matrix multiplication (VMM) kernel by 17.63x and provides a mean speed-up of 2.74x for MLPerf Tiny benchmark compared to RV64IMC RISC-V baseline. AI-PiM also speeds-up MLPerf Tiny benchmark inference cycles by 2.45x (average) compared to state-of-the-art Arm Cortex-A72 processor.","PeriodicalId":73081,"journal":{"name":"Frontiers in electronics","volume":null,"pages":null},"PeriodicalIF":1.9000,"publicationDate":"2022-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in electronics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/felec.2022.898273","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 2

Abstract

The recent advances in Artificial Intelligence (AI), achieving "better-than-human" accuracy in tasks such as image classification and the game of Go, have come at the cost of an exponential increase in the size of artificial neural networks. This has led to AI hardware solutions becoming severely memory-bound and scrambling to keep up with the ever-increasing "von Neumann bottleneck". Processing-in-Memory (PiM) architectures offer an excellent way to ease the von Neumann bottleneck by embedding compute capabilities inside the memory, reducing data traffic between the memory and the processor. However, PiM accelerators break the standard von Neumann programming model by fusing memory and compute operations, which impedes their integration into the standard computing stack. There is an urgent need for system-level solutions that take full advantage of PiM accelerators for end-to-end acceleration of AI applications. This article presents AI-PiM as a solution to bridge this research gap. AI-PiM proposes a hardware, ISA, and software co-design methodology that integrates PiM accelerators into the RISC-V processor pipeline as functional execution units. AI-PiM also extends the RISC-V ISA with custom instructions that directly target the PiM functional units, resulting in their tight integration with the processor. This tight integration is especially important for edge AI devices, which must process both AI and non-AI tasks on the same hardware due to area, power, size, and cost constraints. The AI-PiM ISA extensions expose the PiM hardware functionality to software programmers, allowing efficient mapping of applications to the PiM hardware. AI-PiM adds support for the custom ISA extensions to the complete software stack, including the compiler, assembler, linker, simulator, and profiler, to ensure programmability and evaluation with popular AI domain-specific languages and frameworks such as TensorFlow, PyTorch, MXNet, and Keras. AI-PiM improves the performance of the vector-matrix multiplication (VMM) kernel by 17.63x and provides a mean speedup of 2.74x on the MLPerf Tiny benchmark compared to an RV64IMC RISC-V baseline. AI-PiM also speeds up MLPerf Tiny inference cycles by 2.45x on average compared to the state-of-the-art Arm Cortex-A72 processor.
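
To make the ISA-extension idea concrete, here is a minimal sketch of how a custom PiM instruction could be exposed to C code on a RISC-V core. It uses the GNU assembler's .insn directive to emit a raw R-type instruction in the custom-0 opcode space (0x0B), which the RISC-V specification reserves for non-standard extensions. The name pim_vmm, the funct3/funct7 values, and the operand convention (registers carrying buffer addresses) are illustrative assumptions, not the actual encodings defined in the paper.

    #include <stdint.h>

    /* Hypothetical wrapper for a custom PiM vector-matrix multiply
       (VMM) instruction. Encoding fields and operand semantics are
       assumptions for illustration only. */
    static inline uint64_t pim_vmm(uint64_t vec_addr, uint64_t mat_addr)
    {
        uint64_t result;
        /* .insn r <opcode>, <funct3>, <funct7>, rd, rs1, rs2
           emits an R-type instruction the stock assembler has no
           mnemonic for; 0x0B is the custom-0 opcode. */
        __asm__ volatile(".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                         : "=r"(result)
                         : "r"(vec_addr), "r"(mat_addr));
        return result;
    }

In the paper's co-design flow the compiler, assembler, simulator, and profiler are themselves extended so that PiM instructions get first-class mnemonics; the .insn fallback shown here is simply a way to emit such an instruction without a modified toolchain.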