Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-Based Generative Models
Jaewan Choi; Jaehyun Park; Kwanhee Kyung; Nam Sung Kim; Jung Ho Ahn
IEEE Computer Architecture Letters, vol. 22, no. 2, pp. 113-116, published 2023-08-15. DOI: 10.1109/LCA.2023.3305386
https://ieeexplore.ieee.org/document/10218731/
Citations: 0
Abstract
Transformer-based generative models, such as GPT, summarize an input sequence by generating key/value (KV) matrices through attention and generate the corresponding output sequence by utilizing these matrices once per generated token. Both input and output sequences tend to get longer, which improves the understanding of contexts and conversation quality. These models are also typically batched for inference to improve the serving throughput. All these trends enable the models’ weights to be reused effectively, increasing the relative importance of sequence generation, especially in processing KV matrices through attention. We identify that conventional computing platforms (e.g., GPUs) are not efficient at handling this attention part of inference because each request generates different KV matrices, the attention step has a low operation-per-byte ratio regardless of the batch size, and the aggregate size of the KV matrices can even surpass that of the entire model weights. This motivates us to propose AttAcc, which exploits the fact that the KV matrices are written once during summarization but used many times (proportional to the output sequence length), each time multiplied by the embedding vector corresponding to an output token. The volume of data entering/leaving AttAcc can be orders of magnitude smaller than what must be read internally for attention. We design AttAcc with multiple processing-in-memory devices, each multiplying the embedding vector with the portion of the KV matrices within the device, saving external (inter-device) bandwidth and energy consumption.
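The claim that the attention-over-KV step has a low operation-per-byte ratio regardless of batch size can be illustrated with a back-of-envelope calculation. The sketch below is not from the paper; the model dimensions (d_model, sequence length, fp16 operands) are assumptions chosen only to show the trend: weight (FC) layers gain arithmetic intensity with batching because the same weights serve every request, while attention reads each request's own K and V matrices, so its ratio stays flat.

```python
# Back-of-envelope comparison of arithmetic intensity (FLOPs per byte read)
# for weight (FC) layers vs. the attention-over-KV step during token generation.
# Dimensions are illustrative assumptions, not figures from the paper.

BYTES_PER_ELEM = 2          # fp16
D_MODEL = 12288             # hidden size (assumed)
SEQ_LEN = 2048              # tokens summarized so far (assumed)

def fc_intensity(batch: int) -> float:
    """FC/projection layer: one (d x d) weight matrix is shared by the whole
    batch, so FLOPs grow with batch size while bytes read stay ~constant."""
    flops = 2 * batch * D_MODEL * D_MODEL            # batched GEMV -> GEMM
    bytes_read = D_MODEL * D_MODEL * BYTES_PER_ELEM  # weights dominate traffic
    return flops / bytes_read

def attention_intensity(batch: int) -> float:
    """Attention over KV: every request has its *own* K and V matrices
    (seq_len x d each), so bytes read scale with the batch just like FLOPs;
    the ratio does not improve with batching."""
    flops = 2 * batch * 2 * SEQ_LEN * D_MODEL                    # QK^T and PV
    bytes_read = batch * 2 * SEQ_LEN * D_MODEL * BYTES_PER_ELEM  # K and V
    return flops / bytes_read

for b in (1, 8, 64):
    print(f"batch={b:3d}  FC: {fc_intensity(b):6.1f} FLOP/B   "
          f"attention: {attention_intensity(b):4.1f} FLOP/B")
```

Under these assumptions the FC intensity scales linearly with batch size (1, 8, 64 FLOP/byte), whereas attention stays at about 1 FLOP/byte for any batch, which is the memory-bound behavior that motivates placing this step inside processing-in-memory devices.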
Journal Introduction
IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network, etc.), real-time and high-availability architectures, reconfigurable systems.