Characterizing and Optimizing Transformer Inference on ARM Many-core Processor

Jiazhi Jiang, Jiangsu Du, Dan-E Huang, Dongsheng Li, Jiang Zheng, Yutong Lu
{"title":"Characterizing and Optimizing Transformer Inference on ARM Many-core Processor","authors":"Jiazhi Jiang, Jiangsu Du, Dan-E Huang, Dongsheng Li, Jiang Zheng, Yutong Lu","doi":"10.1145/3545008.3545022","DOIUrl":null,"url":null,"abstract":"Transformer has experienced tremendous success and revolutionized the field of natural language processing (NLP). While GPU has become the de facto standard for deep learning computation in many cases, there are still many scenarios where using CPU for deep learning remains a prevalent choice. In particular, ARM many-core processor is emerging as a competitive candidate for HPC systems, which is promising to deploy Transformer inference. In this paper, we first position three performance bottlenecks of Transformer inference on many-core CPU, including isolated thread scheduling and configuration, inappropriate GEMM implementation and redundant computations for variable-length inputs. To tackle these problems, we proposed cross-layer optimizations for these challenges from operator to runtime layer. To improve parallel efficiency, we design NUMA-aware thread scheduling and a look-up table for optimal parallel configurations.The implementation of GEMM is tailored for some critical modules to suit the characteristics of Transformer workload. To eliminate redundant computations, a novel storage format is designed and implemented to pack sparse data and a load balancing distribution strategy is proposed for tasks with different sparsity. Our experimental results show that our implementation can outperform existing solutions by 1.1x to 6x for fixed-length inputs and 1.9x to 6x for variable-length inputs depending on different sequence lengths and batch sizes.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"234 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545022","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

The Transformer has achieved tremendous success and revolutionized the field of natural language processing (NLP). While the GPU has become the de facto standard for deep learning computation in many cases, there are still many scenarios where using the CPU for deep learning remains a prevalent choice. In particular, ARM many-core processors are emerging as competitive candidates for HPC systems, which makes them a promising platform for deploying Transformer inference. In this paper, we first identify three performance bottlenecks of Transformer inference on many-core CPUs: isolated thread scheduling and configuration, inappropriate GEMM implementations, and redundant computation for variable-length inputs. To tackle these problems, we propose cross-layer optimizations that span the operator layer to the runtime layer. To improve parallel efficiency, we design NUMA-aware thread scheduling and a look-up table of optimal parallel configurations. The GEMM implementation is tailored for several critical modules to suit the characteristics of the Transformer workload. To eliminate redundant computation, we design and implement a novel storage format that packs sparse data, together with a load-balancing distribution strategy for tasks with different sparsity. Our experimental results show that our implementation outperforms existing solutions by 1.1x to 6x for fixed-length inputs and by 1.9x to 6x for variable-length inputs, depending on the sequence length and batch size.
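The abstract does not include source code, but the NUMA-aware thread scheduling it mentions can be illustrated with a minimal, hypothetical sketch: each worker thread is pinned to a core on its local NUMA node so that it works only on data placed near that node. The two-node topology, the `pin_to_core` helper, and the worker body are assumptions for illustration, not the authors' implementation; the only real API used is the Linux `pthread_setaffinity_np` call.

```cpp
// Minimal sketch of NUMA-aware thread pinning (Linux-specific, compile with -pthread).
// Assumed topology: 2 NUMA nodes x 4 cores, purely illustrative.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>
#include <cstdio>

// Pin the calling thread to a single core.
static void pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const int nodes = 2, cores_per_node = 4;  // hypothetical machine layout
    std::vector<std::thread> workers;
    for (int node = 0; node < nodes; ++node) {
        for (int c = 0; c < cores_per_node; ++c) {
            int core_id = node * cores_per_node + c;
            workers.emplace_back([=] {
                pin_to_core(core_id);
                // ... process the work assigned to this NUMA node's threads ...
                std::printf("worker pinned to node %d, core %d\n", node, core_id);
            });
        }
    }
    for (auto& t : workers) t.join();
    return 0;
}
```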
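The "novel storage format to pack sparse data" for variable-length inputs suggests a padding-free layout. The sketch below shows one common way such packing can be realized: concatenating all tokens of a batch contiguously and keeping per-sequence offsets and lengths, so no compute is spent on padding tokens. The `PackedBatch` structure and `pack` function are illustrative assumptions, not the format described in the paper.

```cpp
// Minimal sketch of packing variable-length sequences without padding.
#include <cstdio>
#include <vector>

struct PackedBatch {
    std::vector<float> data;     // all tokens concatenated, hidden_dim floats per token
    std::vector<int>   offsets;  // offsets[i] = index of sequence i's first token
    std::vector<int>   lengths;  // lengths[i] = number of tokens in sequence i
};

PackedBatch pack(const std::vector<std::vector<float>>& seqs, int hidden_dim) {
    PackedBatch b;
    int token_count = 0;
    for (const auto& s : seqs) {
        b.offsets.push_back(token_count);
        int len = static_cast<int>(s.size()) / hidden_dim;
        b.lengths.push_back(len);
        token_count += len;
        b.data.insert(b.data.end(), s.begin(), s.end());
    }
    return b;
}

int main() {
    const int hidden_dim = 4;
    // Two sequences of lengths 3 and 1; values are placeholders.
    std::vector<std::vector<float>> seqs = {
        std::vector<float>(3 * hidden_dim, 1.0f),
        std::vector<float>(1 * hidden_dim, 2.0f),
    };
    PackedBatch b = pack(seqs, hidden_dim);
    // 4 real tokens are stored, instead of 2 * 3 = 6 slots under max-length padding.
    std::printf("total tokens stored: %zu\n", b.data.size() / hidden_dim);
    return 0;
}
```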