SAL-PIM: A Subarray-Level Processing-in-Memory Architecture With LUT-Based Linear Interpolation for Transformer-Based Text Generation

IF 3.8 · Zone 2, Computer Science · Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers · Pub Date: 2025-06-04 · DOI: 10.1109/TC.2025.3576935

Wontak Han;Hyunjun Cho;Donghyuk Kim;Joo-Young Kim

Volume 74, Issue 9, pp. 2909-2922 · https://ieeexplore.ieee.org/document/11024168/

Citations: 0

Abstract

Text generation is a compelling sub-field of natural language processing that aims to generate human-readable text from input words. Although many deep learning models have been proposed, the recent emergence of transformer-based large language models has advanced both academic research and industrial development in the field, showing remarkable qualitative results in text generation. In particular, decoder-only generative models, such as the generative pre-trained transformer (GPT), are widely used for text generation, with two major computational stages: summarization and generation. Unlike the summarization stage, which can process the input tokens in parallel, the generation stage is difficult to accelerate because it generates output tokens sequentially, one per iteration. Moreover, each iteration requires reading the whole model with little opportunity for data reuse. Therefore, the workload of transformer-based text generation is severely memory-bound, making external memory bandwidth the system bottleneck. In this paper, we propose a subarray-level processing-in-memory (PIM) architecture named SAL-PIM, the first HBM-based PIM architecture for the end-to-end acceleration of transformer-based text generation. With optimized data mapping schemes for different operations, SAL-PIM utilizes higher internal bandwidth by integrating multiple subarray-level arithmetic logic units (S-ALUs) next to memory subarrays. To minimize the area overhead of the S-ALUs, it uses shared MAC units, leveraging the slow clock frequency of commands issued to the same bank. In addition, a few subarrays in each bank are used as look-up tables (LUTs) to handle non-linear functions in PIM, with support for multiple addressing to select the sections used for linear interpolation. Lastly, a channel-level arithmetic logic unit (C-ALU) is added in the buffer die of the HBM to perform accumulation and reduce-sum operations on data across multiple banks, completing end-to-end inference on PIM.
To validate the SAL-PIM architecture, we built a cycle-accurate simulator based on Ramulator. We also implemented SAL-PIM's logic units in 28-nm CMOS technology and scaled the results to DRAM technology to verify its feasibility. We measured the end-to-end latency of SAL-PIM when it runs various text generation workloads on the GPT-2 medium model (with 345 million parameters), in which the numbers of input and output tokens vary from 32 to 128 and from 1 to 256, respectively. As a result, with 4.81% area overhead, SAL-PIM achieves up to 4.72× speedup (1.83× on average) over the Nvidia Titan RTX GPU running the FasterTransformer framework.
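The abstract mentions LUT-based linear interpolation for non-linear functions without giving details. As a rough software illustration of the general technique only, the sketch below splits an input range into equal-width sections, stores one slope and one intercept per section, and evaluates the function with a single multiply-add after a table lookup. The choice of GELU, the range [-8, 8], the 64 sections, the slope/intercept storage layout, and all function names are assumptions made for this example; none of them are taken from the paper, and the code does not model the hardware.

```python
import math


def gelu(x: float) -> float:
    """Reference GELU in its exact erf form (chosen here only as an example)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_lut(fn, lo: float, hi: float, sections: int):
    """Precompute one (slope, intercept) pair per equal-width section of [lo, hi]."""
    width = (hi - lo) / sections
    table = []
    for i in range(sections):
        x0 = lo + i * width
        x1 = x0 + width
        slope = (fn(x1) - fn(x0)) / width
        intercept = fn(x0) - slope * x0
        table.append((slope, intercept))
    return table


def lut_eval(table, lo: float, hi: float, x: float) -> float:
    """Approximate fn(x): select a section, then compute slope * x + intercept."""
    sections = len(table)
    width = (hi - lo) / sections
    idx = int((x - lo) / width)
    # Inputs outside [lo, hi] reuse the nearest edge section (linear extrapolation).
    idx = max(0, min(sections - 1, idx))
    slope, intercept = table[idx]
    return slope * x + intercept


if __name__ == "__main__":
    lut = build_lut(gelu, -8.0, 8.0, 64)
    for x in (-3.0, -0.5, 0.0, 0.7, 2.5):
        approx = lut_eval(lut, -8.0, 8.0, x)
        print(f"x={x:+.2f}  lut={approx:+.6f}  exact={gelu(x):+.6f}")
```

In this formulation the accuracy depends on the section width and on the curvature of the function, so a wider range or a sharper function would need more sections for the same error.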




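Source journal

IEEE Transactions on Computers (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 6.60

Self-citation rate: 5.40%

Articles published: 199

Review time: 6.0 months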

About the journal: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.