{"title":"Deep Learning Operators Performance Tuning for Changeable Sized Input Data on Tensor Accelerate Hardware","authors":"Pengyu Mu;Yi Liu;Rui Wang;Guoxiang Liu;Hangcheng An;Qianhe Zhao;Hailong Yang;Chenhao Xie;Zhongzhi Luan;Chunye Gong;Depei Qian","doi":"10.1109/TC.2025.3551937","DOIUrl":null,"url":null,"abstract":"The operator library is the fundamental infrastructure of deep learning acceleration hardware. Automatically generating the library and tuning its performance is promising because the manual development by well-trained and skillful programmers is costly in terms of both time and money. Tensor hardware has the best computing efficiency for deep learning applications, but the operator library programs are hard to tune because the tensor hardware primitives have many limitations. Otherwise, the performance is difficult to be fully explored. The recent advancement in LLM exacerbates this problem because the size of input data is not fixed. Therefore, mapping the computing tasks of operators to tensor hardware units is a significant challenge when the shape of the input tensor is unknown before the runtime. We propose DSAT, a deep learning operator performance autotuning technique for changeable-sized input data on tensor hardware. To match the input tensor's undetermined shape, we choose a group of abstract computing units as the basic building blocks of operators for changeable-sized input tensor shapes. We design a group of programming tuning rules to construct a large exploration space of the variant implementation of the operator programs. Based on these rules, we construct an intermediate representation of computing and memory access to describe the computing process and use it to map the abstract computing units to tensor primitives. To speed up the tuning process, we narrow down the optimization space by predicting the actual hardware resource requirement and providing an optimized cost model for performance prediction. DSAT achieves performance comparable to the vendor's manually tuned operator libraries. Compared to state-of-the-art deep learning compilers, it improves the performance of inference by 13% on average and decreases the tuning time by an order of magnitude.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 6","pages":"2101-2113"},"PeriodicalIF":3.6000,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10929012/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Abstract
The operator library is the fundamental infrastructure of deep learning acceleration hardware. Automatically generating the library and tuning its performance is promising, because manual development by well-trained, skilled programmers is costly in both time and money. Tensor hardware offers the best computing efficiency for deep learning applications, but its operator library programs are hard to tune because tensor hardware primitives impose many restrictions; unless these are handled carefully, the hardware's performance cannot be fully exploited. The recent advance of LLMs exacerbates this problem because the size of the input data is not fixed, so mapping the computing tasks of operators onto tensor hardware units is a significant challenge when the shape of the input tensor is unknown before runtime. We propose DSAT, a performance auto-tuning technique for deep learning operators with changeable-sized input data on tensor hardware. To match the undetermined shape of the input tensor, we choose a group of abstract computing units as the basic building blocks of operators. We design a set of program tuning rules that construct a large exploration space of variant implementations of the operator programs. Based on these rules, we build an intermediate representation of computing and memory access to describe the computing process, and use it to map the abstract computing units to tensor primitives. To speed up tuning, we narrow the optimization space by predicting the actual hardware resource requirements, and we provide an optimized cost model for performance prediction. DSAT achieves performance comparable to the vendor's manually tuned operator libraries. Compared to state-of-the-art deep learning compilers, it improves inference performance by 13% on average and reduces tuning time by an order of magnitude.
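The pipeline the abstract describes (generate candidate implementations from tuning rules, prune those exceeding predicted hardware resources, rank the rest with an analytical cost model) can be illustrated with a minimal sketch. The Python code below is not DSAT's implementation: the GEMM operator, the 16x16x16 tensor primitive, the scratchpad limit, and the cost formula are all hypothetical stand-ins for the ideas named above.

import itertools

# Hypothetical tensor-core-style hardware parameters (illustrative values only).
MAX_SMEM_BYTES = 48 * 1024      # assumed on-chip scratchpad capacity
TENSOR_TILE = (16, 16, 16)      # assumed primitive MMA tile (m, n, k)
BYTES_PER_ELEM = 2              # fp16 operands
LAUNCH_OVERHEAD = 4096          # assumed fixed per-tile scheduling cost

def smem_footprint(tm, tn, tk):
    """Predicted scratchpad bytes for one (tm, tn, tk) work tile:
    an A tile (tm x tk) plus a B tile (tk x tn), double-buffered."""
    return 2 * (tm * tk + tk * tn) * BYTES_PER_ELEM

def predicted_cost(m, n, k, tm, tn, tk):
    """Toy analytical cost model: padded work volume plus a per-tile
    overhead. Shapes that do not divide a tile evenly pay for the
    wasted lanes, which is exactly the changeable-size case."""
    tiles = -(-m // tm) * -(-n // tn) * -(-k // tk)   # ceil-div per axis
    return tiles * (tm * tn * tk + LAUNCH_OVERHEAD)

def tune(m, n, k):
    """Pick a tile for a GEMM shape known only at runtime: enumerate
    candidates that are multiples of the tensor primitive, prune those
    exceeding predicted resources, then rank by the cost model."""
    pm, pn, pk = TENSOR_TILE
    candidates = itertools.product(
        [pm * s for s in (1, 2, 4, 8)],
        [pn * s for s in (1, 2, 4, 8)],
        [pk * s for s in (1, 2, 4)],
    )
    feasible = [c for c in candidates if smem_footprint(*c) <= MAX_SMEM_BYTES]
    return min(feasible, key=lambda c: predicted_cost(m, n, k, *c))

if __name__ == "__main__":
    # Shapes that arrive only at runtime, as with LLM sequence lengths.
    for shape in [(128, 768, 768), (513, 768, 3072)]:
        print(shape, "->", tune(*shape))

Even this toy version shows why an analytical cost model shortens tuning: candidates are ranked without compiling or executing any of them, and only the winner would be lowered to tensor primitives. The ragged shape (513, 768, 3072) also steers the search away from the largest tiles, since the padding they waste outweighs their amortized overhead.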
About the Journal
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.