Strassen’s Algorithm Reloaded on GPUs
Authors: Jianyu Huang, Chenhan D. Yu, R. Geijn
Journal: ACM Transactions on Mathematical Software (TOMS), volume 54 1, pages 1 - 22
Publication date: 2020-03-20
DOI: 10.1145/3372419 (https://doi.org/10.1145/3372419)
Citations: 12
Abstract
Conventional Graphics Processing Unit (GPU) implementations of Strassen’s algorithm (Strassen) rely on the existing high-performance matrix multiplication (gemm), trading space for time. As a result, such approaches achieve practical speedup only for relatively large, “squarish” matrices because of the extra memory overhead, and their use is limited by the considerable workspace they require. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from gemm, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance of gemm and Strassen. Overall, our 1-level Strassen can achieve up to 1.11× speedup with a crossover point as small as 1,536 compared to cublasSgemm on an NVIDIA Tesla V100 GPU. With additional workspace, our 2-level Strassen can achieve 1.19× speedup with a crossover point at 7,680.
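For readers unfamiliar with the algorithm the abstract refers to, the following is a minimal CPU sketch of one level of Strassen in plain Python. It is not the authors' GPU implementation. It uses the standard seven-product formulation, and it accumulates each product directly into the affected quadrants of C instead of storing all seven products first, which loosely mirrors the abstract's idea of avoiding extra workspace. The names `matmul`, `quadrant`, `operand`, `strassen_1level`, and `STRASSEN_OPS` are hypothetical, and the sketch assumes square matrices with an even dimension.

```python
# Each entry describes one of the seven products of 1-level Strassen:
#   (A operand, B operand, C targets)
# An operand is (first quadrant, second quadrant or None, sign), meaning
# first + sign * second. A target is (quadrant, sign), meaning C_q += sign * M.
STRASSEN_OPS = [
    # M1 = (A11 + A22)(B11 + B22);  C11 += M1; C22 += M1
    (((0, 0), (1, 1), +1), ((0, 0), (1, 1), +1), [((0, 0), +1), ((1, 1), +1)]),
    # M2 = (A21 + A22) B11;         C21 += M2; C22 -= M2
    (((1, 0), (1, 1), +1), ((0, 0), None, 0), [((1, 0), +1), ((1, 1), -1)]),
    # M3 = A11 (B12 - B22);         C12 += M3; C22 += M3
    (((0, 0), None, 0), ((0, 1), (1, 1), -1), [((0, 1), +1), ((1, 1), +1)]),
    # M4 = A22 (B21 - B11);         C11 += M4; C21 += M4
    (((1, 1), None, 0), ((1, 0), (0, 0), -1), [((0, 0), +1), ((1, 0), +1)]),
    # M5 = (A11 + A12) B22;         C11 -= M5; C12 += M5
    (((0, 0), (0, 1), +1), ((1, 1), None, 0), [((0, 0), -1), ((0, 1), +1)]),
    # M6 = (A21 - A11)(B11 + B12);  C22 += M6
    (((1, 0), (0, 0), -1), ((0, 0), (0, 1), +1), [((1, 1), +1)]),
    # M7 = (A12 - A22)(B21 + B22);  C11 += M7
    (((0, 1), (1, 1), -1), ((1, 0), (1, 1), +1), [((0, 0), +1)]),
]


def matmul(a, b):
    """Plain triple-loop product, standing in for a tuned gemm."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def quadrant(x, pos, h):
    """Return the h-by-h quadrant of x at block position pos = (row, col)."""
    r, c = pos
    return [row[c * h:(c + 1) * h] for row in x[r * h:(r + 1) * h]]


def operand(x, spec, h):
    """Build one operand: a single quadrant, or quadrant + sign * quadrant."""
    first, second, sign = spec
    p = quadrant(x, first, h)
    if second is None:
        return p
    q = quadrant(x, second, h)
    return [[u + sign * v for u, v in zip(ru, rv)] for ru, rv in zip(p, q)]


def strassen_1level(a, b):
    """Multiply square matrices a and b with one level of Strassen."""
    n = len(a)
    if n % 2:
        raise ValueError("this sketch assumes an even matrix dimension")
    h = n // 2
    c = [[0] * n for _ in range(n)]
    for a_spec, b_spec, targets in STRASSEN_OPS:
        m = matmul(operand(a, a_spec, h), operand(b, b_spec, h))
        # Accumulate the product straight into C; only one M is alive at a time.
        for (r, col), sign in targets:
            for i in range(h):
                for j in range(h):
                    c[r * h + i][col * h + j] += sign * m[i][j]
    return c
```

Calling `strassen_1level([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`, the same result as the ordinary product, using seven block multiplications instead of eight. This sketch still materializes each operand sum and each product in a temporary list; removing those temporaries by fusing the additions into the gemm kernel is what the abstract describes, and it is not attempted here.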