Anxing Xie , Yonghua Hu , Yaohua Wang , Zhe Li , Yuxiang Gao , Zenghua Cheng
{"title":"GTA: Generating high-performance tensorized program with dual-task scheduling","authors":"Anxing Xie , Yonghua Hu , Yaohua Wang , Zhe Li , Yuxiang Gao , Zenghua Cheng","doi":"10.1016/j.sysarc.2025.103359","DOIUrl":null,"url":null,"abstract":"<div><div>Generating high-performance tensorized programs for deep learning accelerators (DLAs) is crucial for ensuring the efficient execution of deep neural networks. But, producing such programs for different operators across various DLAs is notoriously challenging. Existing methods utilize hardware abstraction to represent acceleration intrinsics, enabling end-to-end automated exploration of the intrinsics mapping space. However, their limited search space and inefficient exploration strategies often result in suboptimal tensorized programs and significant search time overhead.</div><div>In this paper, we propose GTA, a framework designed to generate high-performance tensorized programs for DLAs. Unlike existing deep learning compilers, we first coordinate intrinsic-based mapping abstraction with rule-based program generation strategy, followed by the application of resource-constrained rules to eliminate ineffective tensor program candidates from the search space. Second, we employ a dual-task scheduling strategy to allocate tuning resources across multiple subgraphs of deep neural networks and their mapping candidates. As a result, GTA can find high-performance tensor programs that are outside the search space of existing state-of-the-art methods. Our experiments show that GTA achieves an average speedup of more than 1.88<span><math><mo>×</mo></math></span> over AMOS and 2.29<span><math><mo>×</mo></math></span> over Ansor on NVIDIA GPU with Tensor Core, as well as 1.49<span><math><mo>×</mo></math></span> over Ansor and 2.76<span><math><mo>×</mo></math></span> over PyTorch on CPU with AVX512.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"160 ","pages":"Article 103359"},"PeriodicalIF":3.7000,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems Architecture","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1383762125000311","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0
Abstract
Generating high-performance tensorized programs for deep learning accelerators (DLAs) is crucial for ensuring the efficient execution of deep neural networks. However, producing such programs for different operators across various DLAs is notoriously challenging. Existing methods utilize hardware abstraction to represent acceleration intrinsics, enabling end-to-end automated exploration of the intrinsic mapping space. However, their limited search space and inefficient exploration strategies often result in suboptimal tensorized programs and significant search time overhead.
In this paper, we propose GTA, a framework designed to generate high-performance tensorized programs for DLAs. Unlike existing deep learning compilers, we first coordinate an intrinsic-based mapping abstraction with a rule-based program generation strategy, and then apply resource-constrained rules to eliminate ineffective tensor program candidates from the search space. Second, we employ a dual-task scheduling strategy to allocate tuning resources across multiple subgraphs of deep neural networks and their mapping candidates. As a result, GTA can find high-performance tensor programs that are outside the search space of existing state-of-the-art methods. Our experiments show that GTA achieves an average speedup of more than 1.88× over AMOS and 2.29× over Ansor on an NVIDIA GPU with Tensor Cores, as well as 1.49× over Ansor and 2.76× over PyTorch on a CPU with AVX512.
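To make the dual-task scheduling idea concrete, the following is a minimal, hypothetical Python sketch of a budget allocator that spends tuning trials on whichever (subgraph, mapping candidate) pair has recently improved the most. The function names, the improvement-based priority rule, and the synthetic cost model are illustrative assumptions for exposition only, not GTA's actual implementation.

```python
# Hypothetical sketch (not GTA's implementation): split a fixed tuning budget
# across (subgraph, mapping-candidate) pairs, preferring the pair whose best
# measured latency improved the most in its last tuning round.
import random
from collections import defaultdict

def measure(subgraph, mapping, trial):
    """Stand-in for compiling and benchmarking one tensor program candidate.
    Returns a synthetic latency that tends to improve with more trials."""
    base = hash((subgraph, mapping)) % 50 + 50            # per-pair baseline latency
    return base * (0.9 ** trial) + random.uniform(0, 5)   # noisy, improving

def dual_task_schedule(subgraphs, mappings, total_trials=60, round_size=4):
    best = defaultdict(lambda: float("inf"))   # best latency per (subgraph, mapping)
    gain = defaultdict(lambda: float("inf"))   # recent improvement, used as priority
    trials = defaultdict(int)
    pairs = [(s, m) for s in subgraphs for m in mappings[s]]
    spent = 0
    while spent < total_trials:
        # Pick the pair with the largest recent improvement (untried pairs first).
        target = max(pairs, key=lambda p: gain[p])
        prev = best[target]
        for _ in range(round_size):
            lat = measure(target[0], target[1], trials[target])
            trials[target] += 1
            best[target] = min(best[target], lat)
        gain[target] = 0.0 if prev == float("inf") else prev - best[target]
        spent += round_size
    # Report the best mapping candidate found for each subgraph.
    return {s: min(((m, best[(s, m)]) for m in mappings[s]), key=lambda x: x[1])
            for s in subgraphs}

if __name__ == "__main__":
    random.seed(0)
    subgraphs = ["conv2d_1", "dense_1"]
    mappings = {"conv2d_1": ["wmma_16x16x16", "wmma_32x8x16"],
                "dense_1": ["wmma_16x16x16"]}
    print(dual_task_schedule(subgraphs, mappings))
```

In this toy version the priority signal is simply the latency gain from the previous round; a real scheduler would presumably weight subgraphs by their contribution to end-to-end network latency, which this sketch does not model.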
Journal Introduction:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.