Hao Sun;Junzhong Shen;Tian Zhang;Zhongyi Tang;Changwu Zhang;Yuhang Li;Yang Shi;Hengzhu Liu
{"title":"FAMS: A FrAmework of Memory-Centric Mapping for DNNs on Systolic Array Accelerators","authors":"Hao Sun;Junzhong Shen;Tian Zhang;Zhongyi Tang;Changwu Zhang;Yuhang Li;Yang Shi;Hengzhu Liu","doi":"10.1109/TVLSI.2024.3522326","DOIUrl":null,"url":null,"abstract":"In recent years, deep neural networks (DNNs) have experienced rapid development. These DNNs demonstrate significant variations in architecture and scale, creating a substantial demand for domain-specific accelerators that are optimized for both high performance and low energy consumption. Systolic array accelerators, due to their efficient dataflow and parallel processing capabilities, offer significant advantages when performing computations for DNNs. Existing studies frequently overlook various hardware constraints in systolic array accelerators when representing mapping strategies. This oversight includes ignoring the differences in delays between communication and computation operations, as well as overlooking the capacities of multilevel memory hierarchies. Such omissions can lead to inaccuracies in predicting accelerator performance and inefficiencies in system design. We propose the FAMS framework, which introduces a memory-centric notation capable of fully representing the mapping of DNN operations on systolic array accelerators. Memory-centric notation moves away from the idealized assumptions of previous notations and considers various hardware constraints, thereby expanding the effective design and mapping spaces. The FAMS framework also includes a cycle-accurate simulator, which takes the hardware configurations, task descriptions, and mapping strategy represented by memory-centric notation as inputs, providing various metrics such as latency and energy consumption. The experimental results demonstrate that our proposed FAMS framework reduces latency by up to 29.7% and increases throughput by 42.4% compared to the state-of-the-art TENET framework. Additionally, under hardware configurations with a MAC delay of 2 and 3 clock cycles, the FAMS framework enhances performance by 12.0% and 25.4%, respectively.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 4","pages":"976-989"},"PeriodicalIF":2.8000,"publicationDate":"2025-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10843963/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
In recent years, deep neural networks (DNNs) have experienced rapid development. These DNNs demonstrate significant variations in architecture and scale, creating a substantial demand for domain-specific accelerators that are optimized for both high performance and low energy consumption. Systolic array accelerators, due to their efficient dataflow and parallel processing capabilities, offer significant advantages when performing computations for DNNs. Existing studies frequently overlook various hardware constraints in systolic array accelerators when representing mapping strategies. This oversight includes ignoring the differences in delays between communication and computation operations, as well as overlooking the capacities of multilevel memory hierarchies. Such omissions can lead to inaccuracies in predicting accelerator performance and inefficiencies in system design. We propose the FAMS framework, which introduces a memory-centric notation capable of fully representing the mapping of DNN operations on systolic array accelerators. Memory-centric notation moves away from the idealized assumptions of previous notations and considers various hardware constraints, thereby expanding the effective design and mapping spaces. The FAMS framework also includes a cycle-accurate simulator, which takes the hardware configurations, task descriptions, and mapping strategy represented by memory-centric notation as inputs, providing various metrics such as latency and energy consumption. The experimental results demonstrate that our proposed FAMS framework reduces latency by up to 29.7% and increases throughput by 42.4% compared to the state-of-the-art TENET framework. Additionally, under hardware configurations with a MAC delay of 2 and 3 clock cycles, the FAMS framework enhances performance by 12.0% and 25.4%, respectively.
期刊介绍:
The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
To address this critical area through a common forum, the IEEE Transactions on VLSI Systems have been founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.