{"title":"基于多模态变压器的SRAM和HBM混合内存计算体系结构","authors":"Xiangqu Fu;Jinshan Yue;Muhammad Faizan;Zhi Li;Qiang Huo;Feng Zhang","doi":"10.1109/TCSI.2025.3561245","DOIUrl":null,"url":null,"abstract":"Multimodal Transformer (MMT) algorithms have become the state-of-the-art for multimodal tasks such as image captioning. The Encoder-Decoder (E-D) structure, consisting of Encoder, Decoder-causal, and Decoder-cross components, provides a flexible and effective framework for multimodal tasks. However, previous accelerators mainly focus on the dataflow and hardware optimization of the Encoder, which fails to accelerate the entire E-D structure efficiently. There remain three challenges: 1) the lack of pipeline and multicore optimization at the module, layer, and E-D level; 2) the Decoder-causal and Decoder-cross computations have lower arithmetic intensity compared to the Encoder, requiring a better solution for the varying arithmetic intensities; and 3) the autoregressive algorithm in Decoder-causal leads to redundant KV Cache accesses and considerable idle power. In this paper, <italic>SHMT</i>, an SRAM and HBM hybrid computing-in-memory (CIM) architecture, is designed to efficiently support multimodal Transformers with three key contributions: 1) a multi-level pipelined multicore scheme, including pipeline optimization across E-D layer-head-module levels and a multicore network-on-chip (NoC) architecture, to reduce inference latency and off-chip accesses; 2) a heterogeneous SRAM-HBM architecture, utilizing high-density HBM-CIM for low-arithmetic-intensity (LAI) parts and high-performance SRAM-CIM for high-arithmetic-intensity (HAI) parts; and 3) by integrating KV Cache with zero-padding in SRAM-CIM, SHMT eliminates redundant read-write operations in KV Cache, reducing idle power consumption. Experiment results show that SHMT achieves <inline-formula> <tex-math>$212\\times $ </tex-math></inline-formula> speedup, reduces energy consumption by <inline-formula> <tex-math>$208\\times \\sim 2000\\times $ </tex-math></inline-formula> per token, and achieves <inline-formula> <tex-math>$13.3\\times $ </tex-math></inline-formula> higher energy efficiency compared to NVIDIA A100 GPU.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 6","pages":"2712-2725"},"PeriodicalIF":5.2000,"publicationDate":"2025-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SHMT: An SRAM and HBM Hybrid Computing-in-Memory Architecture With Optimized KV Cache for Multimodal Transformer\",\"authors\":\"Xiangqu Fu;Jinshan Yue;Muhammad Faizan;Zhi Li;Qiang Huo;Feng Zhang\",\"doi\":\"10.1109/TCSI.2025.3561245\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal Transformer (MMT) algorithms have become the state-of-the-art for multimodal tasks such as image captioning. The Encoder-Decoder (E-D) structure, consisting of Encoder, Decoder-causal, and Decoder-cross components, provides a flexible and effective framework for multimodal tasks. However, previous accelerators mainly focus on the dataflow and hardware optimization of the Encoder, which fails to accelerate the entire E-D structure efficiently. 
There remain three challenges: 1) the lack of pipeline and multicore optimization at the module, layer, and E-D level; 2) the Decoder-causal and Decoder-cross computations have lower arithmetic intensity compared to the Encoder, requiring a better solution for the varying arithmetic intensities; and 3) the autoregressive algorithm in Decoder-causal leads to redundant KV Cache accesses and considerable idle power. In this paper, <italic>SHMT</i>, an SRAM and HBM hybrid computing-in-memory (CIM) architecture, is designed to efficiently support multimodal Transformers with three key contributions: 1) a multi-level pipelined multicore scheme, including pipeline optimization across E-D layer-head-module levels and a multicore network-on-chip (NoC) architecture, to reduce inference latency and off-chip accesses; 2) a heterogeneous SRAM-HBM architecture, utilizing high-density HBM-CIM for low-arithmetic-intensity (LAI) parts and high-performance SRAM-CIM for high-arithmetic-intensity (HAI) parts; and 3) by integrating KV Cache with zero-padding in SRAM-CIM, SHMT eliminates redundant read-write operations in KV Cache, reducing idle power consumption. Experiment results show that SHMT achieves <inline-formula> <tex-math>$212\\\\times $ </tex-math></inline-formula> speedup, reduces energy consumption by <inline-formula> <tex-math>$208\\\\times \\\\sim 2000\\\\times $ </tex-math></inline-formula> per token, and achieves <inline-formula> <tex-math>$13.3\\\\times $ </tex-math></inline-formula> higher energy efficiency compared to NVIDIA A100 GPU.\",\"PeriodicalId\":13039,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems I: Regular Papers\",\"volume\":\"72 6\",\"pages\":\"2712-2725\"},\"PeriodicalIF\":5.2000,\"publicationDate\":\"2025-03-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems I: Regular Papers\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10993508/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems I: Regular Papers","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10993508/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
SHMT: An SRAM and HBM Hybrid Computing-in-Memory Architecture With Optimized KV Cache for Multimodal Transformer
Multimodal Transformer (MMT) algorithms have become the state of the art for multimodal tasks such as image captioning. The Encoder-Decoder (E-D) structure, consisting of Encoder, Decoder-causal, and Decoder-cross components, provides a flexible and effective framework for multimodal tasks. However, previous accelerators mainly focus on the dataflow and hardware optimization of the Encoder and fail to accelerate the entire E-D structure efficiently. Three challenges remain: 1) the lack of pipeline and multicore optimization at the module, layer, and E-D levels; 2) the Decoder-causal and Decoder-cross computations have lower arithmetic intensity than the Encoder, requiring a better solution for the varying arithmetic intensities; and 3) the autoregressive algorithm in Decoder-causal leads to redundant KV Cache accesses and considerable idle power. In this paper, SHMT, an SRAM and HBM hybrid computing-in-memory (CIM) architecture, is designed to efficiently support multimodal Transformers with three key contributions: 1) a multi-level pipelined multicore scheme, including pipeline optimization across the E-D, layer-head, and module levels and a multicore network-on-chip (NoC) architecture, to reduce inference latency and off-chip accesses; 2) a heterogeneous SRAM-HBM architecture, utilizing high-density HBM-CIM for low-arithmetic-intensity (LAI) parts and high-performance SRAM-CIM for high-arithmetic-intensity (HAI) parts; and 3) a zero-padded KV Cache integrated into SRAM-CIM, which eliminates redundant KV Cache read-write operations and reduces idle power consumption. Experimental results show that, compared to an NVIDIA A100 GPU, SHMT achieves a 212× speedup, reduces per-token energy consumption by 208× to 2000×, and delivers 13.3× higher energy efficiency.
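To make the arithmetic-intensity argument concrete, the following sketch (not from the paper; the shapes, INT8 data width, and threshold are assumed) estimates FLOPs per byte for an encoder-style full-sequence projection versus a single-token Decoder-causal projection, and applies a simple HAI/LAI rule of the kind the heterogeneous SRAM-HBM mapping relies on.

```python
# Hypothetical sketch: arithmetic-intensity-based mapping of Transformer matmuls
# onto SRAM-CIM (high intensity) or HBM-CIM (low intensity).
# Shapes, the INT8 assumption, and the threshold are illustrative only.

def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 1) -> float:
    """FLOPs per byte moved for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

def map_to_cim(m: int, k: int, n: int, threshold: float = 32.0) -> str:
    """Choose a CIM macro with a simple high/low arithmetic-intensity rule."""
    ai = arithmetic_intensity(m, k, n)
    return f"AI={ai:.1f} -> " + ("SRAM-CIM (HAI)" if ai >= threshold else "HBM-CIM (LAI)")

# Encoder: projects a whole 197-token sequence at once -> high arithmetic intensity.
print(map_to_cim(m=197, k=768, n=768))
# Decoder-causal: autoregressive, one token per step -> low arithmetic intensity.
print(map_to_cim(m=1, k=768, n=768))
```

Under a rule of this kind, the Encoder's full-sequence matmuls land on high-performance SRAM-CIM while the single-token Decoder-causal and Decoder-cross matmuls land on high-density HBM-CIM, matching the HAI/LAI split described in the abstract.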
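The zero-padded KV Cache idea can also be illustrated with a minimal sketch, assuming a cache pre-allocated to the maximum decode length: each autoregressive step writes only the new token's K/V row in place and masks the still-empty (zero-padded) slots, so past entries are never re-written. The function names, shapes, and NumPy implementation are illustrative, not the paper's SRAM-CIM design.

```python
import numpy as np

# Hypothetical illustration of a pre-allocated ("zero-padded") KV cache for
# autoregressive decoding: past keys/values stay in place, only the new
# token's row is written each step, so no redundant re-reads or re-writes occur.

MAX_LEN, D = 16, 8
k_cache = np.zeros((MAX_LEN, D), dtype=np.float32)  # zero-padded up front
v_cache = np.zeros((MAX_LEN, D), dtype=np.float32)

def decode_step(step: int, q: np.ndarray, k_new: np.ndarray, v_new: np.ndarray) -> np.ndarray:
    """One causal-attention step using the in-place cache."""
    k_cache[step] = k_new           # single in-place write, no cache copy
    v_cache[step] = v_new
    scores = k_cache @ q / np.sqrt(D)
    scores[step + 1:] = -np.inf     # mask the still-zero (padded) future slots
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

rng = np.random.default_rng(0)
for t in range(4):
    out = decode_step(t, rng.standard_normal(D), rng.standard_normal(D), rng.standard_normal(D))
print(out.shape)  # (8,)
```

The abstract's point is that SHMT realizes this in-place behavior inside the SRAM-CIM array itself, which is what removes the redundant KV Cache traffic and the associated idle power.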
Journal introduction:
TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
- Circuits: Analog, Digital and Mixed Signal Circuits and Systems
- Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems
- Power Electronics and Systems
- Software for Analog-and-Logic Circuits and Systems
- Control aspects of Circuits and Systems