用于加速变压器前馈网络的高效两级流水线内存计算宏程序

IF 2.8 2区工程技术 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Pub Date : 2024-07-29 DOI:10.1109/TVLSI.2024.3432403

Heng Zhang;Wenhe Yin;Sunan He;Yuan Du;Li Du

{"title":"用于加速变压器前馈网络的高效两级流水线内存计算宏程序","authors":"Heng Zhang;Wenhe Yin;Sunan He;Yuan Du;Li Du","doi":"10.1109/TVLSI.2024.3432403","DOIUrl":null,"url":null,"abstract":"Transformer architectures have achieved state-of-the-art performance in various applications. However, deploying transformer models on resource-constrained platforms is still challenging due to its dynamic workloads, intensive computations, and substantial memory access. In this article, we propose a two-stage pipelined compute-in-memory (CIM) macro for effectively deploying and accelerating the feed-forward network (FFN) layers of transformer models. Two independent CIM arrays are designed to execute the two distinct linear projections in FFN layers, which are interconnected by co-designed analog rectified linear unit (ReLU) circuits to realize the nonlinear activation function. The analog multiply-and-add (MAC) results from the first CIM array are streamed directly to the analog ReLU circuits, and subsequently to the next CIM array for performing another linear projection. This architecture eliminates the need for analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) for internal results’ staging, thereby enhancing overall macro efficiency and reducing computing latency. A proof-of-concept macro is fabricated using TSMC 65-nm process and achieves 4.096 TOPS peak throughput, 4.39 TOPS/mm2 area efficiency, and 49.83 TOPS/W energy efficiency. To map transformer models onto the proposed macro, we quantize the FFN layers of BERTMINI model under per-token granularity for activations and per-tensor granularity for weights using quantization-aware training (QAT), which exhibits excellent accuracy across multiple benchmarks.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 10","pages":"1889-1899"},"PeriodicalIF":2.8000,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An Efficient Two-Stage Pipelined Compute-in-Memory Macro for Accelerating Transformer Feed-Forward Networks\",\"authors\":\"Heng Zhang;Wenhe Yin;Sunan He;Yuan Du;Li Du\",\"doi\":\"10.1109/TVLSI.2024.3432403\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer architectures have achieved state-of-the-art performance in various applications. However, deploying transformer models on resource-constrained platforms is still challenging due to its dynamic workloads, intensive computations, and substantial memory access. In this article, we propose a two-stage pipelined compute-in-memory (CIM) macro for effectively deploying and accelerating the feed-forward network (FFN) layers of transformer models. Two independent CIM arrays are designed to execute the two distinct linear projections in FFN layers, which are interconnected by co-designed analog rectified linear unit (ReLU) circuits to realize the nonlinear activation function. The analog multiply-and-add (MAC) results from the first CIM array are streamed directly to the analog ReLU circuits, and subsequently to the next CIM array for performing another linear projection. This architecture eliminates the need for analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) for internal results’ staging, thereby enhancing overall macro efficiency and reducing computing latency. A proof-of-concept macro is fabricated using TSMC 65-nm process and achieves 4.096 TOPS peak throughput, 4.39 TOPS/mm2 area efficiency, and 49.83 TOPS/W energy efficiency. To map transformer models onto the proposed macro, we quantize the FFN layers of BERTMINI model under per-token granularity for activations and per-tensor granularity for weights using quantization-aware training (QAT), which exhibits excellent accuracy across multiple benchmarks.\",\"PeriodicalId\":13425,\"journal\":{\"name\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"volume\":\"32 10\",\"pages\":\"1889-1899\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-07-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10614387/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10614387/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

摘要

变压器架构已在各种应用中实现了最先进的性能。然而，由于其动态工作负载、密集计算和大量内存访问，在资源受限的平台上部署变压器模型仍具有挑战性。在本文中，我们提出了一种两级流水线内存计算（CIM）宏，用于有效部署和加速变压器模型的前馈网络（FFN）层。我们设计了两个独立的 CIM 阵列来执行前馈网络层中两个不同的线性投影，它们通过共同设计的模拟整流线性单元 (ReLU) 电路相互连接，以实现非线性激活函数。第一个 CIM 阵列的模拟乘加（MAC）结果直接流向模拟 ReLU 电路，随后流向下一个 CIM 阵列，以执行另一个线性投影。这种架构无需使用模数转换器（ADC）和数模转换器（DAC）进行内部结果分期，从而提高了整体宏效率并减少了计算延迟。概念验证宏采用台积电 65 纳米工艺制造，实现了 4.096 TOPS 的峰值吞吐量、4.39 TOPS/mm2 的面积效率和 49.83 TOPS/W 的能效。为了将变压器模型映射到所提出的宏上，我们使用量化感知训练（QAT）对 BERTMINI 模型的 FFN 层进行量化，激活度按标记粒度计算，权重按张量粒度计算。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

An Efficient Two-Stage Pipelined Compute-in-Memory Macro for Accelerating Transformer Feed-Forward Networks

Transformer architectures have achieved state-of-the-art performance in various applications. However, deploying transformer models on resource-constrained platforms is still challenging due to its dynamic workloads, intensive computations, and substantial memory access. In this article, we propose a two-stage pipelined compute-in-memory (CIM) macro for effectively deploying and accelerating the feed-forward network (FFN) layers of transformer models. Two independent CIM arrays are designed to execute the two distinct linear projections in FFN layers, which are interconnected by co-designed analog rectified linear unit (ReLU) circuits to realize the nonlinear activation function. The analog multiply-and-add (MAC) results from the first CIM array are streamed directly to the analog ReLU circuits, and subsequently to the next CIM array for performing another linear projection. This architecture eliminates the need for analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) for internal results’ staging, thereby enhancing overall macro efficiency and reducing computing latency. A proof-of-concept macro is fabricated using TSMC 65-nm process and achieves 4.096 TOPS peak throughput, 4.39 TOPS/mm2 area efficiency, and 49.83 TOPS/W energy efficiency. To map transformer models onto the proposed macro, we quantize the FFN layers of BERTMINI model under per-token granularity for activations and per-tensor granularity for weights using quantization-aware training (QAT), which exhibits excellent accuracy across multiple benchmarks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Very Large Scale Integration (VLSI) Systems 工程技术-工程：电子与电气

CiteScore

6.40

自引率

7.10%

发文量

187

审稿时长

3.6 months

期刊介绍： The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels. To address this critical area through a common forum, the IEEE Transactions on VLSI Systems have been founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.