{"title":"基于多模态变压器的SRAM和HBM混合内存计算体系结构","authors":"Xiangqu Fu;Jinshan Yue;Muhammad Faizan;Zhi Li;Qiang Huo;Feng Zhang","doi":"10.1109/TCSI.2025.3561245","DOIUrl":null,"url":null,"abstract":"Multimodal Transformer (MMT) algorithms have become the state-of-the-art for multimodal tasks such as image captioning. The Encoder-Decoder (E-D) structure, consisting of Encoder, Decoder-causal, and Decoder-cross components, provides a flexible and effective framework for multimodal tasks. However, previous accelerators mainly focus on the dataflow and hardware optimization of the Encoder, which fails to accelerate the entire E-D structure efficiently. There remain three challenges: 1) the lack of pipeline and multicore optimization at the module, layer, and E-D level; 2) the Decoder-causal and Decoder-cross computations have lower arithmetic intensity compared to the Encoder, requiring a better solution for the varying arithmetic intensities; and 3) the autoregressive algorithm in Decoder-causal leads to redundant KV Cache accesses and considerable idle power. In this paper, <italic>SHMT</i>, an SRAM and HBM hybrid computing-in-memory (CIM) architecture, is designed to efficiently support multimodal Transformers with three key contributions: 1) a multi-level pipelined multicore scheme, including pipeline optimization across E-D layer-head-module levels and a multicore network-on-chip (NoC) architecture, to reduce inference latency and off-chip accesses; 2) a heterogeneous SRAM-HBM architecture, utilizing high-density HBM-CIM for low-arithmetic-intensity (LAI) parts and high-performance SRAM-CIM for high-arithmetic-intensity (HAI) parts; and 3) by integrating KV Cache with zero-padding in SRAM-CIM, SHMT eliminates redundant read-write operations in KV Cache, reducing idle power consumption. Experiment results show that SHMT achieves <inline-formula> <tex-math>$212\\times $ </tex-math></inline-formula> speedup, reduces energy consumption by <inline-formula> <tex-math>$208\\times \\sim 2000\\times $ </tex-math></inline-formula> per token, and achieves <inline-formula> <tex-math>$13.3\\times $ </tex-math></inline-formula> higher energy efficiency compared to NVIDIA A100 GPU.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 6","pages":"2712-2725"},"PeriodicalIF":5.2000,"publicationDate":"2025-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SHMT: An SRAM and HBM Hybrid Computing-in-Memory Architecture With Optimized KV Cache for Multimodal Transformer\",\"authors\":\"Xiangqu Fu;Jinshan Yue;Muhammad Faizan;Zhi Li;Qiang Huo;Feng Zhang\",\"doi\":\"10.1109/TCSI.2025.3561245\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal Transformer (MMT) algorithms have become the state-of-the-art for multimodal tasks such as image captioning. The Encoder-Decoder (E-D) structure, consisting of Encoder, Decoder-causal, and Decoder-cross components, provides a flexible and effective framework for multimodal tasks. However, previous accelerators mainly focus on the dataflow and hardware optimization of the Encoder, which fails to accelerate the entire E-D structure efficiently. 
There remain three challenges: 1) the lack of pipeline and multicore optimization at the module, layer, and E-D level; 2) the Decoder-causal and Decoder-cross computations have lower arithmetic intensity compared to the Encoder, requiring a better solution for the varying arithmetic intensities; and 3) the autoregressive algorithm in Decoder-causal leads to redundant KV Cache accesses and considerable idle power. In this paper, <italic>SHMT</i>, an SRAM and HBM hybrid computing-in-memory (CIM) architecture, is designed to efficiently support multimodal Transformers with three key contributions: 1) a multi-level pipelined multicore scheme, including pipeline optimization across E-D layer-head-module levels and a multicore network-on-chip (NoC) architecture, to reduce inference latency and off-chip accesses; 2) a heterogeneous SRAM-HBM architecture, utilizing high-density HBM-CIM for low-arithmetic-intensity (LAI) parts and high-performance SRAM-CIM for high-arithmetic-intensity (HAI) parts; and 3) by integrating KV Cache with zero-padding in SRAM-CIM, SHMT eliminates redundant read-write operations in KV Cache, reducing idle power consumption. Experiment results show that SHMT achieves <inline-formula> <tex-math>$212\\\\times $ </tex-math></inline-formula> speedup, reduces energy consumption by <inline-formula> <tex-math>$208\\\\times \\\\sim 2000\\\\times $ </tex-math></inline-formula> per token, and achieves <inline-formula> <tex-math>$13.3\\\\times $ </tex-math></inline-formula> higher energy efficiency compared to NVIDIA A100 GPU.\",\"PeriodicalId\":13039,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems I: Regular Papers\",\"volume\":\"72 6\",\"pages\":\"2712-2725\"},\"PeriodicalIF\":5.2000,\"publicationDate\":\"2025-03-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems I: Regular Papers\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10993508/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems I: Regular Papers","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10993508/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
SHMT: An SRAM and HBM Hybrid Computing-in-Memory Architecture With Optimized KV Cache for Multimodal Transformer
Multimodal Transformer (MMT) algorithms have become the state of the art for multimodal tasks such as image captioning. The Encoder-Decoder (E-D) structure, consisting of Encoder, Decoder-causal, and Decoder-cross components, provides a flexible and effective framework for multimodal tasks. However, previous accelerators mainly focus on the dataflow and hardware optimization of the Encoder and fail to accelerate the entire E-D structure efficiently. Three challenges remain: 1) the lack of pipeline and multicore optimization at the module, layer, and E-D levels; 2) the Decoder-causal and Decoder-cross computations have lower arithmetic intensity than the Encoder, requiring a better solution for the varying arithmetic intensities; and 3) the autoregressive algorithm in Decoder-causal leads to redundant KV Cache accesses and considerable idle power. In this paper, SHMT, an SRAM and HBM hybrid computing-in-memory (CIM) architecture, is designed to efficiently support multimodal Transformers with three key contributions: 1) a multi-level pipelined multicore scheme, including pipeline optimization across the E-D, layer-head, and module levels and a multicore network-on-chip (NoC) architecture, to reduce inference latency and off-chip accesses; 2) a heterogeneous SRAM-HBM architecture, utilizing high-density HBM-CIM for low-arithmetic-intensity (LAI) parts and high-performance SRAM-CIM for high-arithmetic-intensity (HAI) parts; and 3) a zero-padded KV Cache integrated into SRAM-CIM, which eliminates redundant KV Cache read-write operations and reduces idle power consumption. Experimental results show that, compared to an NVIDIA A100 GPU, SHMT achieves a 212× speedup, reduces per-token energy consumption by 208× to 2000×, and delivers 13.3× higher energy efficiency.
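To make the arithmetic-intensity argument concrete, the following sketch (not from the paper; the shapes, INT8 data width, and threshold are assumed) estimates FLOPs per byte for an encoder-style full-sequence projection versus a single-token Decoder-causal projection, and applies a simple HAI/LAI rule of the kind the heterogeneous SRAM-HBM mapping relies on.

```python
# Hypothetical sketch: arithmetic-intensity-based mapping of Transformer matmuls
# onto SRAM-CIM (high intensity) or HBM-CIM (low intensity).
# Shapes, the INT8 assumption, and the threshold are illustrative only.

def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 1) -> float:
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

def map_to_cim(m: int, k: int, n: int, threshold: float = 32.0) -> str:
    """Choose a CIM macro with a simple high/low arithmetic-intensity rule."""
    ai = arithmetic_intensity(m, k, n)
    return f"AI={ai:.1f} -> " + ("SRAM-CIM (HAI)" if ai >= threshold else "HBM-CIM (LAI)")

# Encoder: projects a whole 197-token sequence at once -> high arithmetic intensity.
print(map_to_cim(m=197, k=768, n=768))
# Decoder-causal: autoregressive, one token per step -> low arithmetic intensity.
print(map_to_cim(m=1, k=768, n=768))
```

Under a rule of this kind, the Encoder's full-sequence matmuls land on high-performance SRAM-CIM while the single-token Decoder-causal and Decoder-cross matmuls land on high-density HBM-CIM, matching the HAI/LAI split described in the abstract.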
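The zero-padded KV Cache idea can also be illustrated with a minimal sketch, assuming a cache pre-allocated to the maximum decode length: each autoregressive step writes only the new token's K/V row in place and masks the still-empty (zero-padded) slots, so past entries are never re-written. The function names, shapes, and NumPy implementation are illustrative, not the paper's SRAM-CIM design.

```python
import numpy as np

# Hypothetical illustration of a pre-allocated ("zero-padded") KV cache for
# autoregressive decoding: past keys/values stay in place, only the new
# token's row is written each step, so no redundant re-reads or re-writes occur.

MAX_LEN, D = 16, 8
k_cache = np.zeros((MAX_LEN, D), dtype=np.float32)  # zero-padded up front
v_cache = np.zeros((MAX_LEN, D), dtype=np.float32)

def decode_step(step: int, q: np.ndarray, k_new: np.ndarray, v_new: np.ndarray) -> np.ndarray:
    """One causal-attention step using the in-place cache."""
    k_cache[step] = k_new           # single in-place write, no cache copy
    v_cache[step] = v_new
    scores = k_cache @ q / np.sqrt(D)
    scores[step + 1:] = -np.inf     # mask the still-zero (padded) future slots
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

rng = np.random.default_rng(0)
for t in range(4):
    out = decode_step(t, rng.standard_normal(D), rng.standard_normal(D), rng.standard_normal(D))
print(out.shape)  # (8,)
```

The abstract's point is that SHMT realizes this in-place behavior inside the SRAM-CIM array itself, which is what removes the redundant KV Cache traffic and the associated idle power.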
Journal introduction:
TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
- Circuits: Analog, Digital and Mixed Signal Circuits and Systems
- Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems
- Power Electronics and Systems
- Software for Analog-and-Logic Circuits and Systems
- Control aspects of Circuits and Systems