Deep Learning Operators Performance Tuning for Changeable Sized Input Data on Tensor Accelerate Hardware

IF 3.6 · CAS Tier 2, Computer Science · JCR Q2, Computer Science, Hardware & Architecture
Pengyu Mu;Yi Liu;Rui Wang;Guoxiang Liu;Hangcheng An;Qianhe Zhao;Hailong Yang;Chenhao Xie;Zhongzhi Luan;Chunye Gong;Depei Qian
{"title":"Deep Learning Operators Performance Tuning for Changeable Sized Input Data on Tensor Accelerate Hardware","authors":"Pengyu Mu;Yi Liu;Rui Wang;Guoxiang Liu;Hangcheng An;Qianhe Zhao;Hailong Yang;Chenhao Xie;Zhongzhi Luan;Chunye Gong;Depei Qian","doi":"10.1109/TC.2025.3551937","DOIUrl":null,"url":null,"abstract":"The operator library is the fundamental infrastructure of deep learning acceleration hardware. Automatically generating the library and tuning its performance is promising because the manual development by well-trained and skillful programmers is costly in terms of both time and money. Tensor hardware has the best computing efficiency for deep learning applications, but the operator library programs are hard to tune because the tensor hardware primitives have many limitations. Otherwise, the performance is difficult to be fully explored. The recent advancement in LLM exacerbates this problem because the size of input data is not fixed. Therefore, mapping the computing tasks of operators to tensor hardware units is a significant challenge when the shape of the input tensor is unknown before the runtime. We propose DSAT, a deep learning operator performance autotuning technique for changeable-sized input data on tensor hardware. To match the input tensor's undetermined shape, we choose a group of abstract computing units as the basic building blocks of operators for changeable-sized input tensor shapes. We design a group of programming tuning rules to construct a large exploration space of the variant implementation of the operator programs. Based on these rules, we construct an intermediate representation of computing and memory access to describe the computing process and use it to map the abstract computing units to tensor primitives. To speed up the tuning process, we narrow down the optimization space by predicting the actual hardware resource requirement and providing an optimized cost model for performance prediction. DSAT achieves performance comparable to the vendor's manually tuned operator libraries. Compared to state-of-the-art deep learning compilers, it improves the performance of inference by 13% on average and decreases the tuning time by an order of magnitude.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 6","pages":"2101-2113"},"PeriodicalIF":3.6000,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10929012/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

The operator library is the fundamental infrastructure of deep learning acceleration hardware. Automatically generating the library and tuning its performance is promising because manual development by well-trained, skilled programmers is costly in both time and money. Tensor hardware offers the best computing efficiency for deep learning applications, but its operator library programs are hard to tune because tensor hardware primitives impose many restrictions; without careful tuning, the full performance of the hardware is difficult to realize. Recent advances in large language models (LLMs) exacerbate this problem because the size of the input data is not fixed. Mapping the computing tasks of operators to tensor hardware units is therefore a significant challenge when the shape of the input tensor is unknown before runtime. We propose DSAT, a performance autotuning technique for deep learning operators that handle changeable-sized input data on tensor hardware. To match the undetermined shape of the input tensor, we choose a group of abstract computing units as the basic building blocks of operators for changeable-sized input tensor shapes. We design a set of program tuning rules that construct a large exploration space of variant implementations of the operator programs. Based on these rules, we construct an intermediate representation of computation and memory access to describe the computing process and use it to map the abstract computing units to tensor primitives. To speed up the tuning process, we narrow the optimization space by predicting the actual hardware resource requirements and provide an optimized cost model for performance prediction. DSAT achieves performance comparable to the vendor's manually tuned operator libraries. Compared with state-of-the-art deep learning compilers, it improves inference performance by 13% on average and decreases the tuning time by an order of magnitude.
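To make the workflow above concrete, here is a minimal sketch (in Python, for exposition only) of one ingredient the abstract describes: a cost-model-guided search that picks tile sizes for a matrix multiplication whose shape is only known at runtime. The tile-size constraint, the on-chip memory limit, and the cost formula below are assumed values chosen for illustration; they are not DSAT's actual tuning rules, intermediate representation, or cost model.

```python
# Illustrative sketch only: cost-model-guided tile selection for a matmul
# whose shape (M, K, N) is only known at runtime. All constants and the cost
# formula are hypothetical assumptions, not the paper's DSAT implementation.
from itertools import product

TENSOR_TILE = 16               # assumed fixed tile size of the tensor primitive
SHARED_MEM_BYTES = 96 * 1024   # assumed on-chip memory budget
BYTES_PER_ELEM = 2             # fp16 operands

def ceil_div(a, b):
    return -(-a // b)

def resource_ok(tm, tk, tn):
    """Reject candidates whose operand tiles exceed on-chip memory
    (analogous to pruning by predicted hardware resource requirements)."""
    footprint = (tm * tk + tk * tn + tm * tn) * BYTES_PER_ELEM
    return footprint <= SHARED_MEM_BYTES

def predicted_cost(m, k, n, tm, tk, tn):
    """Toy analytical cost model: number of tile tasks scaled by a
    padding-waste penalty. A real model would be calibrated on hardware."""
    launches = ceil_div(m, tm) * ceil_div(k, tk) * ceil_div(n, tn)
    padded = (ceil_div(m, tm) * tm) * (ceil_div(k, tk) * tk) * (ceil_div(n, tn) * tn)
    waste = padded / (m * k * n)   # > 1.0 when tiles overhang the real shape
    return launches * waste

def pick_tiling(m, k, n, candidates=(16, 32, 64, 128)):
    """Pick the tile sizes with the lowest predicted cost for a runtime shape."""
    best, best_cost = None, float("inf")
    for tm, tk, tn in product(candidates, repeat=3):
        # Tiles must be multiples of the tensor primitive's fixed tile size.
        if any(t % TENSOR_TILE for t in (tm, tk, tn)):
            continue
        if not resource_ok(tm, tk, tn):
            continue
        cost = predicted_cost(m, k, n, tm, tk, tn)
        if cost < best_cost:
            best, best_cost = (tm, tk, tn), cost
    return best

if __name__ == "__main__":
    # Shapes arriving at runtime, e.g. variable sequence lengths in an LLM.
    for shape in [(128, 4096, 4096), (137, 4096, 4096), (1024, 4096, 11008)]:
        print(shape, "->", pick_tiling(*shape))
```

In this toy setting the search merely ranks a handful of tile sizes; the point of DSAT, as the abstract states, is to make such exploration tractable on real tensor hardware by constructing the candidate space with tuning rules and pruning it with predicted resource requirements and an optimized cost model.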
Source journal

IEEE Transactions on Computers (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 6.60 · Self-citation rate: 5.40% · Articles per year: 199 · Review time: 6.0 months

Journal description: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.