{"title":"Design of a high-performance tensor–matrix multiplication with BLAS","authors":"Cem Savaş Başsoy","doi":"10.1016/j.jocs.2025.102568","DOIUrl":null,"url":null,"abstract":"<div><div>The tensor–matrix multiplication (TTM) is a basic tensor operation required by various tensor methods such as the HOSVD. This paper presents flexible high-performance algorithms that compute the tensor–matrix product according to the Loops-over-GEMM (LOG) approach. The proposed algorithms can process dense tensors with any linear tensor layout, arbitrary tensor order and dimensions all of which can be runtime variable. The paper discusses two slicing methods with orthogonal parallelization strategies and propose four algorithms that call BLAS with subtensors or tensor slices. It also provides a simple heuristic which selects one of the four proposed algorithms at runtime. All algorithms have been evaluated on a large set of tensors with various tensor shapes and linear tensor layouts. In case of large tensor slices, our best-performing algorithm achieves a median performance of 2.47 TFLOPS on an Intel Xeon Gold 5318Y and 2.93 TFLOPS an AMD EPYC 9354. Furthermore, it outperforms batched GEMM implementation of Intel MKL by a factor of 2.57 with large tensor slices. Our runtime tests show that our best-performing algorithm is, on average, at least 6.21% and up to 334.31% faster than frameworks implementing state-of-the-art approaches, including actively developed libraries such as Libtorch and Eigen. For the majority of tensor shapes, it is on par with TBLIS which uses optimized kernels for the TTM computation. Our algorithm performs better than all other competing implementations for the majority of real world tensors from the SDRBench, reaching a speedup of 2x or more for some tensor instances. This work is an extended version of ”Fast and Layout-Oblivious Tensor–Matrix Multiplication with BLAS” (Başsoy 2024).</div></div>","PeriodicalId":48907,"journal":{"name":"Journal of Computational Science","volume":"87 ","pages":"Article 102568"},"PeriodicalIF":3.7000,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computational Science","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1877750325000456","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0
Abstract
The tensor–matrix multiplication (TTM) is a basic tensor operation required by various tensor methods such as the HOSVD. This paper presents flexible high-performance algorithms that compute the tensor–matrix product according to the Loops-over-GEMM (LOG) approach. The proposed algorithms can process dense tensors with any linear tensor layout, arbitrary tensor order, and arbitrary dimensions, all of which can be set at runtime. The paper discusses two slicing methods with orthogonal parallelization strategies and proposes four algorithms that call BLAS with subtensors or tensor slices. It also provides a simple heuristic which selects one of the four proposed algorithms at runtime. All algorithms have been evaluated on a large set of tensors with various tensor shapes and linear tensor layouts. For large tensor slices, our best-performing algorithm achieves a median performance of 2.47 TFLOPS on an Intel Xeon Gold 5318Y and 2.93 TFLOPS on an AMD EPYC 9354. Furthermore, it outperforms the batched GEMM implementation of Intel MKL by a factor of 2.57 with large tensor slices. Our runtime tests show that our best-performing algorithm is, on average, at least 6.21% and up to 334.31% faster than frameworks implementing state-of-the-art approaches, including actively developed libraries such as Libtorch and Eigen. For the majority of tensor shapes, it is on par with TBLIS, which uses optimized kernels for the TTM computation. Our algorithm performs better than all other competing implementations for the majority of real-world tensors from the SDRBench, reaching a speedup of 2x or more for some tensor instances. This work is an extended version of "Fast and Layout-Oblivious Tensor–Matrix Multiplication with BLAS" (Başsoy, 2024).
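To make the Loops-over-GEMM (LOG) idea concrete, the following is a minimal NumPy sketch of a mode-q tensor–matrix product computed by looping over tensor slices and handing each slice to a single matrix multiplication (which NumPy dispatches to a BLAS GEMM). It assumes a row-major (C-order) layout; the function name ttm_log and the slicing scheme are illustrative only and do not reproduce the paper's layout-oblivious, parallelized algorithms.

```python
# Minimal sketch of the Loops-over-GEMM (LOG) idea for the mode-q
# tensor-matrix product C = A x_q B, assuming a row-major layout.
# Names and slicing choices are illustrative, not the paper's API.
import numpy as np

def ttm_log(A: np.ndarray, B: np.ndarray, q: int) -> np.ndarray:
    """Mode-q tensor-matrix product via loops over GEMM.

    A : order-p tensor with shape (n_1, ..., n_p)
    B : matrix with shape (m, n_q)
    q : contraction mode (0-based)
    Returns C with shape (n_1, ..., n_{q-1}, m, n_{q+1}, ..., n_p).
    """
    assert B.shape[1] == A.shape[q]
    out_shape = A.shape[:q] + (B.shape[0],) + A.shape[q + 1:]
    C = np.empty(out_shape, dtype=A.dtype)

    # Indices preceding mode q act as outer loop indices; the trailing
    # dimensions are fused into a single column dimension so that every
    # slice A[i_1, ..., i_{q-1}, :, :] becomes an (n_q x n_right) matrix
    # and one GEMM call computes the matching slice of C.
    n_right = int(np.prod(A.shape[q + 1:], dtype=np.int64))
    for outer in np.ndindex(A.shape[:q]):
        A_slice = A[outer].reshape(A.shape[q], n_right)          # n_q x n_right
        C[outer] = (B @ A_slice).reshape((B.shape[0],) + A.shape[q + 1:])
    return C

# Quick correctness check against einsum
A = np.random.rand(3, 4, 5, 6)
B = np.random.rand(7, 5)
C = ttm_log(A, B, q=2)
print(np.allclose(C, np.einsum('jk,abkd->abjd', B, A)))  # True
```

The sketch only conveys the structure of the approach: contracting a mode reduces to many independent GEMMs over slices, which is what makes orthogonal parallelization strategies (over slices versus inside each GEMM) possible; the paper's algorithms additionally handle arbitrary linear layouts and choose among slicing variants at runtime.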
Journal description:
Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory.
Recent advances in experimental techniques, such as detectors, on-line sensor networks and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed data-driven modeling and simulation.
This new discipline in science combines computational thinking, modern computational methods, devices and collateral technologies to address problems far beyond the scope of traditional numerical methods.
Computational science typically unifies three distinct elements:
• Modeling, Algorithms and Simulations (e.g. numerical and non-numerical, discrete and continuous);
• Software developed to solve problems in science (e.g., biological, physical, and social), engineering, medicine, and the humanities;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g., problem-solving environments).