Deep Learning Operators Performance Tuning for Changeable Sized Input Data on Tensor Accelerate Hardware

IF 3.6 · CAS Tier 2, Computer Science · JCR Q2, Computer Science, Hardware & Architecture
Pengyu Mu;Yi Liu;Rui Wang;Guoxiang Liu;Hangcheng An;Qianhe Zhao;Hailong Yang;Chenhao Xie;Zhongzhi Luan;Chunye Gong;Depei Qian
{"title":"Deep Learning Operators Performance Tuning for Changeable Sized Input Data on Tensor Accelerate Hardware","authors":"Pengyu Mu;Yi Liu;Rui Wang;Guoxiang Liu;Hangcheng An;Qianhe Zhao;Hailong Yang;Chenhao Xie;Zhongzhi Luan;Chunye Gong;Depei Qian","doi":"10.1109/TC.2025.3551937","DOIUrl":null,"url":null,"abstract":"The operator library is the fundamental infrastructure of deep learning acceleration hardware. Automatically generating the library and tuning its performance is promising because the manual development by well-trained and skillful programmers is costly in terms of both time and money. Tensor hardware has the best computing efficiency for deep learning applications, but the operator library programs are hard to tune because the tensor hardware primitives have many limitations. Otherwise, the performance is difficult to be fully explored. The recent advancement in LLM exacerbates this problem because the size of input data is not fixed. Therefore, mapping the computing tasks of operators to tensor hardware units is a significant challenge when the shape of the input tensor is unknown before the runtime. We propose DSAT, a deep learning operator performance autotuning technique for changeable-sized input data on tensor hardware. To match the input tensor's undetermined shape, we choose a group of abstract computing units as the basic building blocks of operators for changeable-sized input tensor shapes. We design a group of programming tuning rules to construct a large exploration space of the variant implementation of the operator programs. Based on these rules, we construct an intermediate representation of computing and memory access to describe the computing process and use it to map the abstract computing units to tensor primitives. To speed up the tuning process, we narrow down the optimization space by predicting the actual hardware resource requirement and providing an optimized cost model for performance prediction. DSAT achieves performance comparable to the vendor's manually tuned operator libraries. Compared to state-of-the-art deep learning compilers, it improves the performance of inference by 13% on average and decreases the tuning time by an order of magnitude.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 6","pages":"2101-2113"},"PeriodicalIF":3.6000,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10929012/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

The operator library is the fundamental infrastructure of deep learning acceleration hardware. Automatically generating the library and tuning its performance is promising because manual development by well-trained, skilled programmers is costly in both time and money. Tensor hardware offers the best computing efficiency for deep learning applications, but its operator library programs are hard to tune because tensor hardware primitives impose many restrictions; without careful tuning, the full performance of the hardware is difficult to realize. Recent advances in large language models (LLMs) exacerbate this problem because the size of the input data is not fixed. Mapping the computing tasks of operators to tensor hardware units is therefore a significant challenge when the shape of the input tensor is unknown before runtime. We propose DSAT, a performance autotuning technique for deep learning operators that handle changeable-sized input data on tensor hardware. To match the undetermined shape of the input tensor, we choose a group of abstract computing units as the basic building blocks of operators for changeable-sized input tensor shapes. We design a set of program tuning rules that construct a large exploration space of variant implementations of the operator programs. Based on these rules, we construct an intermediate representation of computation and memory access to describe the computing process and use it to map the abstract computing units to tensor primitives. To speed up the tuning process, we narrow the optimization space by predicting the actual hardware resource requirements and provide an optimized cost model for performance prediction. DSAT achieves performance comparable to the vendor's manually tuned operator libraries. Compared with state-of-the-art deep learning compilers, it improves inference performance by 13% on average and decreases the tuning time by an order of magnitude.
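To make the workflow above concrete, here is a minimal sketch (in Python, for exposition only) of one ingredient the abstract describes: a cost-model-guided search that picks tile sizes for a matrix multiplication whose shape is only known at runtime. The tile-size constraint, the on-chip memory limit, and the cost formula below are assumed values chosen for illustration; they are not DSAT's actual tuning rules, intermediate representation, or cost model.

```python
# Illustrative sketch only: cost-model-guided tile selection for a matmul
# whose shape (M, K, N) is only known at runtime. All constants and the cost
# formula are hypothetical assumptions, not the paper's DSAT implementation.
from itertools import product

TENSOR_TILE = 16               # assumed fixed tile size of the tensor primitive
SHARED_MEM_BYTES = 96 * 1024   # assumed on-chip memory budget
BYTES_PER_ELEM = 2             # fp16 operands

def ceil_div(a, b):
    return -(-a // b)

def resource_ok(tm, tk, tn):
    """Reject candidates whose operand tiles exceed on-chip memory
    (analogous to pruning by predicted hardware resource requirements)."""
    footprint = (tm * tk + tk * tn + tm * tn) * BYTES_PER_ELEM
    return footprint <= SHARED_MEM_BYTES

def predicted_cost(m, k, n, tm, tk, tn):
    """Toy analytical cost model: number of tile tasks scaled by a
    padding-waste penalty. A real model would be calibrated on hardware."""
    launches = ceil_div(m, tm) * ceil_div(k, tk) * ceil_div(n, tn)
    padded = (ceil_div(m, tm) * tm) * (ceil_div(k, tk) * tk) * (ceil_div(n, tn) * tn)
    waste = padded / (m * k * n)   # > 1.0 when tiles overhang the real shape
    return launches * waste

def pick_tiling(m, k, n, candidates=(16, 32, 64, 128)):
    """Pick the tile sizes with the lowest predicted cost for a runtime shape."""
    best, best_cost = None, float("inf")
    for tm, tk, tn in product(candidates, repeat=3):
        # Tiles must be multiples of the tensor primitive's fixed tile size.
        if any(t % TENSOR_TILE for t in (tm, tk, tn)):
            continue
        if not resource_ok(tm, tk, tn):
            continue
        cost = predicted_cost(m, k, n, tm, tk, tn)
        if cost < best_cost:
            best, best_cost = (tm, tk, tn), cost
    return best

if __name__ == "__main__":
    # Shapes arriving at runtime, e.g. variable sequence lengths in an LLM.
    for shape in [(128, 4096, 4096), (137, 4096, 4096), (1024, 4096, 11008)]:
        print(shape, "->", pick_tiling(*shape))
```

In this toy setting the search merely ranks a handful of tile sizes; the point of DSAT, as the abstract states, is to make such exploration tractable on real tensor hardware by constructing the candidate space with tuning rules and pruning it with predicted resource requirements and an optimized cost model.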
Source journal

IEEE Transactions on Computers (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 6.60 · Self-citation rate: 5.40% · Articles per year: 199 · Review time: 6.0 months

Journal description: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.