Characterizing and Optimizing Transformer Inference on ARM Many-core Processor
Jiazhi Jiang, Jiangsu Du, Dan-E Huang, Dongsheng Li, Jiang Zheng, Yutong Lu
Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022), August 29, 2022. DOI: 10.1145/3545008.3545022
Abstract
The Transformer has achieved tremendous success and revolutionized the field of natural language processing (NLP). While the GPU has become the de facto standard for deep learning computation in many cases, there are still many scenarios where using the CPU for deep learning remains a prevalent choice. In particular, ARM many-core processors are emerging as competitive candidates for HPC systems, which makes them a promising platform for deploying Transformer inference. In this paper, we first identify three performance bottlenecks of Transformer inference on many-core CPUs: isolated thread scheduling and configuration, inappropriate GEMM implementations, and redundant computation for variable-length inputs. To tackle these problems, we propose cross-layer optimizations spanning the operator layer to the runtime layer. To improve parallel efficiency, we design NUMA-aware thread scheduling and a look-up table for optimal parallel configurations. The GEMM implementation is tailored for several critical modules to suit the characteristics of the Transformer workload. To eliminate redundant computation, we design and implement a novel storage format that packs the sparse data, together with a load-balancing distribution strategy for tasks with different sparsity. Our experimental results show that our implementation outperforms existing solutions by 1.1x to 6x for fixed-length inputs and 1.9x to 6x for variable-length inputs, depending on sequence length and batch size.
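The abstract mentions a storage format that packs variable-length inputs so no computation is wasted on padding tokens. The paper itself does not publish this code here, so the following is only a minimal, hypothetical C++ sketch of the general idea: concatenating a batch of variable-length sequences into one contiguous buffer with CSR-style row offsets. All type and function names (PackedBatch, pack) are illustrative assumptions, not the authors' implementation.

```cpp
// Hypothetical sketch of padding-free packing for variable-length sequences.
// Dense GEMM-heavy layers can then operate on (total_tokens x hidden) data,
// while the offsets keep track of where each sequence begins and ends.
#include <cstddef>
#include <vector>

struct PackedBatch {
    std::vector<float>  data;     // concatenated token embeddings, row-major
    std::vector<size_t> offsets;  // offsets[i] = first row of sequence i; size = batch + 1
    size_t              hidden;   // embedding width
};

// Concatenate sequences of different lengths into one dense matrix of shape
// (total_tokens, hidden), recording per-sequence row offsets.
PackedBatch pack(const std::vector<std::vector<float>>& seqs, size_t hidden) {
    PackedBatch out;
    out.hidden = hidden;
    out.offsets.reserve(seqs.size() + 1);
    out.offsets.push_back(0);
    size_t total_rows = 0;
    for (const auto& s : seqs) {
        total_rows += s.size() / hidden;
        out.offsets.push_back(total_rows);
    }
    out.data.reserve(total_rows * hidden);
    for (const auto& s : seqs)
        out.data.insert(out.data.end(), s.begin(), s.end());
    return out;
}

int main() {
    const size_t hidden = 4;
    // Two sequences of 3 and 1 tokens; a padded layout would waste 2 rows.
    std::vector<std::vector<float>> seqs = {
        std::vector<float>(3 * hidden, 1.0f),
        std::vector<float>(1 * hidden, 2.0f),
    };
    PackedBatch b = pack(seqs, hidden);
    // Projections and feed-forward layers can run one GEMM over b.data;
    // only attention needs b.offsets to separate tokens of different sequences.
    return b.data.size() == 4 * hidden ? 0 : 1;
}
```

This is only meant to convey why packing removes the redundant computation the abstract refers to; the paper's actual format and its load-balancing distribution across cores are more involved.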