A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility

2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2022-05-01. DOI: 10.1109/ipdps53621.2022.00089

Jialin Li, Huang Ye, Shaobo Tian, Xinyuan Li, Jian Zhang

Citations: 1

Abstract

General Matrix Multiplication (GEMM) is one of the fundamental kernels for scientific and high-performance computing. When optimizing the performance of GEMM on GPUs, the matrix is usually partitioned into a hierarchy of tiles to fit the thread hierarchy. In practice, thread-level parallelism is affected not only by the tiling scheme but also by the resources that each tile consumes, such as registers and local data share (LDS) memory. This paper presents a fine-grained prefetching scheme that improves thread-level parallelism by balancing the usage of these resources. The gains and losses in instruction-level and thread-level parallelism are analyzed, and a mathematical model is developed to estimate the overall performance gain. Moreover, the proposed scheme is integrated into the open-source tool Tensile to automatically generate assembly and tune a collection of kernels, maximizing DGEMM performance for a family of problem sizes. Experiments show a speedup of about 1.10x over a wide range of matrix sizes for both single and batched matrix-matrix multiplication.
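To illustrate why per-tile resource usage can limit thread-level parallelism, here is a minimal sketch of a generic occupancy estimate. It is not the paper's mathematical model: the function name `resident_workgroups`, the capacity defaults, and the two kernel configurations are hypothetical values chosen for illustration, not figures from the paper or from any specific GPU.

```python
def resident_workgroups(regs_per_thread, lds_bytes_per_wg, threads_per_wg,
                        reg_file_size=65536, lds_size=65536, max_threads=2048):
    """Estimate how many work-groups fit on one compute unit at once.

    All capacity defaults are hypothetical, illustrative limits.
    The scarcest of the three resources decides the result.
    """
    by_regs = reg_file_size // (regs_per_thread * threads_per_wg)
    by_lds = lds_size // lds_bytes_per_wg
    by_threads = max_threads // threads_per_wg
    return min(by_regs, by_lds, by_threads)


# Hypothetical configuration A: prefetched data held mostly in registers.
config_a = resident_workgroups(regs_per_thread=144,
                               lds_bytes_per_wg=16 * 1024,
                               threads_per_wg=256)

# Hypothetical configuration B: 16 registers per thread traded for 16 KiB
# of extra LDS per work-group, so neither resource dominates.
config_b = resident_workgroups(regs_per_thread=128,
                               lds_bytes_per_wg=32 * 1024,
                               threads_per_wg=256)

print(config_a, config_b)
```

Under these assumed limits, configuration A is capped by registers while configuration B is capped equally by registers and LDS, so more work-groups can be resident at once. Whether such a trade pays off on real hardware also depends on the instruction-level effects that the abstract says are analyzed alongside it.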
