Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs

IF 1.5 · Region 3: Computer Science · JCR Q4: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-08-26 DOI:10.1145/3617689

Jiangsu Du, Jiazhi Jiang, Jiang Zheng, Hongbin Zhang, Dan-E Huang, Yutong Lu

Citations: 0

Abstract

Transformer models have emerged as a leading approach in natural language processing (NLP) and are increasingly deployed in production environments. Graphics processing units (GPUs) have become a popular choice for transformer deployment and often rely on batch processing to ensure high hardware performance. Nonetheless, current practice for transformer inference incurs computational and memory redundancy because sequence lengths in NLP scenarios follow a heavy-tailed distribution, resulting in low practical performance. In this paper, we propose a unified solution for improving both the computation and memory efficiency of real-world transformer inference on GPUs. The solution eliminates redundant computation and memory footprint across a transformer model. First, a GPU-oriented computation approach is proposed to process the self-attention module in a fine-grained manner, eliminating its redundant computation. Next, the multi-layer perceptron module continues to use the word-accumulation approach to eliminate its redundant computation. Then, to better unify the fine-grained approach and the word-accumulation approach, the solution organizes the data layout of the self-attention module in block granularity. Since these approaches greatly reduce the required memory size and make it fluctuate constantly, we propose a chunk-based approach that better balances memory footprint against allocation/free efficiency. Our experimental results show that, compared with prevailing frameworks, the unified solution reduces average latency by 28% on the entire transformer model and by 63.8% on the self-attention module, and reduces the memory footprint of intermediate results by 7.8×.
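
The redundancy described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it assumes that "word accumulation" means concatenating only the real tokens of a batch into one flat buffer, and it uses scalar tokens and a toy position-wise function in place of a real multi-layer perceptron. All names (`pack`, `unpack`, `pad`, `toy_mlp`) are hypothetical.

```python
from itertools import accumulate


def pad(batch, pad_value=0.0):
    """Pad every sequence to the longest one, as conventional batching does."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


def pack(batch):
    """Concatenate the real tokens into one flat list and record offsets."""
    lengths = [len(seq) for seq in batch]
    offsets = [0, *accumulate(lengths)]
    flat = [tok for seq in batch for tok in seq]
    return flat, offsets


def unpack(flat, offsets):
    """Split a flat token list back into per-sequence lists."""
    return [flat[start:end] for start, end in zip(offsets, offsets[1:])]


def toy_mlp(x):
    """Stand-in for a position-wise layer: each token is processed independently."""
    return max(0.0, 2.0 * x - 1.0)


# One long sequence among short ones (a tiny heavy-tailed batch).
batch = [[1.0, 2.0], [0.5], [3.0, 0.25, 1.5, 2.5, 4.0, 0.75]]

padded = pad(batch)
padded_work = sum(len(row) for row in padded)  # tokens processed with padding

flat, offsets = pack(batch)
packed_work = len(flat)                        # tokens processed without padding

# Because the layer is position-wise, both layouts give the same valid outputs.
packed_out = unpack([toy_mlp(x) for x in flat], offsets)
padded_out = [[toy_mlp(x) for x in row][:len(seq)]
              for row, seq in zip(padded, batch)]

redundancy = 1 - packed_work / padded_work
print(f"padded tokens: {padded_work}, packed tokens: {packed_work}, "
      f"redundant fraction: {redundancy:.2f}")
```

In this toy batch, half of the padded positions carry no information. Packing works for the position-wise layer because no token depends on another; self-attention mixes tokens within a sequence, which is why the abstract describes a separate fine-grained approach for that module.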

Source journal

ACM Transactions on Architecture and Code Optimization (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore

3.60

Self-citation rate

6.20%

Articles published

Review time

6-12 weeks

Journal description: ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.