FAMS: A FrAmework of Memory-Centric Mapping for DNNs on Systolic Array Accelerators

IF 2.8; Zone 2, Engineering and Technology; JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Pub Date: 2025-01-16. DOI: 10.1109/TVLSI.2024.3522326

Hao Sun; Junzhong Shen; Tian Zhang; Zhongyi Tang; Changwu Zhang; Yuhang Li; Yang Shi; Hengzhu Liu

Vol. 33, No. 4, pp. 976-989. Journal Article. https://ieeexplore.ieee.org/document/10843963/

Citations: 0

Abstract

In recent years, deep neural networks (DNNs) have experienced rapid development. These DNNs demonstrate significant variations in architecture and scale, creating a substantial demand for domain-specific accelerators that are optimized for both high performance and low energy consumption. Systolic array accelerators, due to their efficient dataflow and parallel processing capabilities, offer significant advantages when performing computations for DNNs. Existing studies frequently overlook various hardware constraints in systolic array accelerators when representing mapping strategies. This oversight includes ignoring the differences in delays between communication and computation operations, as well as overlooking the capacities of multilevel memory hierarchies. Such omissions can lead to inaccuracies in predicting accelerator performance and inefficiencies in system design. We propose the FAMS framework, which introduces a memory-centric notation capable of fully representing the mapping of DNN operations on systolic array accelerators. Memory-centric notation moves away from the idealized assumptions of previous notations and considers various hardware constraints, thereby expanding the effective design and mapping spaces. The FAMS framework also includes a cycle-accurate simulator, which takes the hardware configurations, task descriptions, and mapping strategy represented by memory-centric notation as inputs, providing various metrics such as latency and energy consumption. The experimental results demonstrate that our proposed FAMS framework reduces latency by up to 29.7% and increases throughput by 42.4% compared to the state-of-the-art TENET framework. Additionally, under hardware configurations with a MAC delay of 2 and 3 clock cycles, the FAMS framework enhances performance by 12.0% and 25.4%, respectively.
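The latency and throughput figures quoted above can be related with a little arithmetic. The sketch below assumes that throughput is inversely proportional to latency for a fixed workload; the abstract does not say how the two figures were measured, so this is an illustrative consistency check, not the authors' method. The function name `throughput_gain_from_latency_reduction` is hypothetical.

```python
def throughput_gain_from_latency_reduction(reduction: float) -> float:
    """Fractional throughput gain implied by a fractional latency reduction.

    Assumption (not stated in the abstract): throughput = work / latency
    with the amount of work held fixed, so the two are inversely related.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    # New latency is (1 - reduction) times the old one, so throughput
    # scales by 1 / (1 - reduction).
    return 1.0 / (1.0 - reduction) - 1.0


# 29.7% is the latency reduction reported in the abstract.
gain = throughput_gain_from_latency_reduction(0.297)
print(f"Implied throughput gain: {gain:.1%}")  # prints "Implied throughput gain: 42.2%"
```

Under that assumption, a 29.7% latency reduction implies a throughput gain of about 42.2%, close to the 42.4% the abstract reports. The small gap may come from rounding or from the two figures being taken on different workloads.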




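Source journal

IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Engineering and Technology; Engineering: Electrical and Electronic)

CiteScore: 6.40

Self-citation rate: 7.10%

Articles published: 187

Review time: 3.6 months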

About the journal: The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels. To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and systems-level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.