An End-to-End DNN Compiler for Processing-In-Memory Accelerators

IF 2.7 | Zone 3 (Computer Science) | JCR Q2: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | Pub Date: 2024-11-12 | DOI: 10.1109/TCAD.2024.3496847

Xiaotian Sun;Xinyu Wang;Wanqian Li;Yinhe Han;Xiaoming Chen

Volume 44, Issue 5, Pages 1745-1759. Source: https://ieeexplore.ieee.org/document/10750525/

Citations: 0

Abstract


PIMCOMP: An End-to-End DNN Compiler for Processing-In-Memory Accelerators

Over the past decade, numerous processing-in-memory (PIM) accelerators built on different devices, micro-architectures, and interfaces have been proposed to accelerate deep neural networks (DNNs). Deploying DNNs onto PIM-based accelerators is the key to exploiting PIM’s high performance and energy efficiency. The scale of DNN models, the diversity of PIM accelerators, and the complexity of deployment put manual deployment far out of reach, so an automatic deployment methodology is indispensable. In this work, we propose PIMCOMP, an end-to-end DNN compiler tailored for PIM accelerators that achieves efficient deployment of DNN models on PIM hardware. PIMCOMP adapts to various PIM architectures by using an abstract, configurable PIM accelerator template with a set of pseudo instructions, which serve as a high-level abstraction of the hardware’s fundamental functionalities. Through a generic multilevel optimization framework, PIMCOMP performs an end-to-end conversion from a high-level DNN description to pseudo instructions, which can be further converted to specific hardware intrinsics/primitives. The compilation addresses two critical issues in PIM-accelerated inference from a system perspective: 1) resource utilization and 2) dataflow scheduling. PIMCOMP adopts a flexible unfolding format to reshape and partition convolutional layers, adopts a weight-layout-guided computation-storage-mapping approach to enhance resource utilization, and balances the system’s computation, memory access, and communication characteristics. For dataflow scheduling, we design two scheduling algorithms with different interlayer pipeline granularities to support varying application scenarios while ensuring high computational parallelism. Experiments demonstrate that PIMCOMP improves throughput, latency, and energy efficiency across various architectures. PIMCOMP is open-sourced at https://github.com/sunxt99/PIMCOMP-NN.
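
To make the resource-utilization issue more concrete, the following is a minimal sketch of how a convolutional layer might be unfolded into a weight matrix and tiled onto fixed-size crossbar arrays. It is not PIMCOMP's algorithm: the names ConvLayer, unfolded_shape, and crossbar_tiling are hypothetical, and the sketch assumes a common im2col-style unfolding (one flattened filter per column) with one weight stored per crossbar cell.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical description of a convolutional layer's weights."""
    in_channels: int
    out_channels: int
    kernel_h: int
    kernel_w: int


def unfolded_shape(layer: ConvLayer) -> tuple[int, int]:
    """Shape of the weight matrix when each filter is flattened into a column.

    Assumption: im2col-style unfolding, so rows = kernel_h * kernel_w * in_channels
    and columns = out_channels.
    """
    rows = layer.kernel_h * layer.kernel_w * layer.in_channels
    cols = layer.out_channels
    return rows, cols


def crossbar_tiling(layer: ConvLayer, xbar_rows: int, xbar_cols: int) -> tuple[int, int, float]:
    """Tile the unfolded weight matrix onto crossbars of a fixed size.

    Returns (row_tiles, col_tiles, utilization), where utilization is the
    fraction of allocated crossbar cells that actually hold a weight.
    Assumption: one weight per cell (no bit slicing, no replication).
    """
    rows, cols = unfolded_shape(layer)
    row_tiles = math.ceil(rows / xbar_rows)
    col_tiles = math.ceil(cols / xbar_cols)
    allocated_cells = row_tiles * col_tiles * xbar_rows * xbar_cols
    utilization = (rows * cols) / allocated_cells
    return row_tiles, col_tiles, utilization


if __name__ == "__main__":
    layer = ConvLayer(in_channels=64, out_channels=128, kernel_h=3, kernel_w=3)
    print(unfolded_shape(layer))
    print(crossbar_tiling(layer, xbar_rows=128, xbar_cols=128))
```

In this example a 3x3 convolution with 64 input and 128 output channels unfolds to a 576 x 128 matrix, which needs five 128 x 128 crossbars along the row dimension and leaves part of the last one empty. Cells left unused in this way are the kind of waste that a choice of unfolding format and partitioning can reduce.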


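Source journal

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 5.60

Self-citation rate: 13.80%

Articles published: 500

Review time: 7 months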

About the journal: The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical, and logical design, including planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design, and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.