Characterizing and Optimizing Transformer Inference on ARM Many-core Processor
Jiazhi Jiang, Jiangsu Du, Dan-E Huang, Dongsheng Li, Jiang Zheng, Yutong Lu
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022), August 29, 2022. DOI: 10.1145/3545008.3545022
Abstract
The Transformer has achieved tremendous success and revolutionized the field of natural language processing (NLP). While the GPU has become the de facto standard for deep learning computation in many cases, there are still many scenarios where using the CPU for deep learning remains a prevalent choice. In particular, ARM many-core processors are emerging as competitive candidates for HPC systems, which makes them a promising platform for deploying Transformer inference. In this paper, we first identify three performance bottlenecks of Transformer inference on many-core CPUs: isolated thread scheduling and configuration, inappropriate GEMM implementations, and redundant computation for variable-length inputs. To tackle these problems, we propose cross-layer optimizations spanning the operator layer to the runtime layer. To improve parallel efficiency, we design NUMA-aware thread scheduling and a look-up table for optimal parallel configurations. The GEMM implementation is tailored for several critical modules to suit the characteristics of the Transformer workload. To eliminate redundant computation, we design and implement a novel storage format that packs the sparse data, together with a load-balancing distribution strategy for tasks with different sparsity. Our experimental results show that our implementation outperforms existing solutions by 1.1x to 6x for fixed-length inputs and 1.9x to 6x for variable-length inputs, depending on sequence length and batch size.
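The abstract mentions a storage format that packs variable-length inputs so no computation is wasted on padding tokens. The paper itself does not publish this code here, so the following is only a minimal, hypothetical C++ sketch of the general idea: concatenating a batch of variable-length sequences into one contiguous buffer with CSR-style row offsets. All type and function names (PackedBatch, pack) are illustrative assumptions, not the authors' implementation.

```cpp
// Hypothetical sketch of padding-free packing for variable-length sequences.
// Dense GEMM-heavy layers can then operate on (total_tokens x hidden) data,
// while the offsets keep track of where each sequence begins and ends.
#include <cstddef>
#include <vector>

struct PackedBatch {
    std::vector<float>  data;     // concatenated token embeddings, row-major
    std::vector<size_t> offsets;  // offsets[i] = first row of sequence i; size = batch + 1
    size_t              hidden;   // embedding width
};

// Concatenate sequences of different lengths into one dense matrix of shape
// (total_tokens, hidden), recording per-sequence row offsets.
PackedBatch pack(const std::vector<std::vector<float>>& seqs, size_t hidden) {
    PackedBatch out;
    out.hidden = hidden;
    out.offsets.reserve(seqs.size() + 1);
    out.offsets.push_back(0);
    size_t total_rows = 0;
    for (const auto& s : seqs) {
        total_rows += s.size() / hidden;
        out.offsets.push_back(total_rows);
    }
    out.data.reserve(total_rows * hidden);
    for (const auto& s : seqs)
        out.data.insert(out.data.end(), s.begin(), s.end());
    return out;
}

int main() {
    const size_t hidden = 4;
    // Two sequences of 3 and 1 tokens; a padded layout would waste 2 rows.
    std::vector<std::vector<float>> seqs = {
        std::vector<float>(3 * hidden, 1.0f),
        std::vector<float>(1 * hidden, 2.0f),
    };
    PackedBatch b = pack(seqs, hidden);
    // Projections and feed-forward layers can run one GEMM over b.data;
    // only attention needs b.offsets to separate tokens of different sequences.
    return b.data.size() == 4 * hidden ? 0 : 1;
}
```

This is only meant to convey why packing removes the redundant computation the abstract refers to; the paper's actual format and its load-balancing distribution across cores are more involved.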