{"title":"High-Performance Tensor-Train Primitives Using GPU Tensor Cores","authors":"Xiao-Yang Liu;Hao Hong;Zeliang Zhang;Weiqin Tong;Jean Kossaifi;Xiaodong Wang;Anwar Walid","doi":"10.1109/TC.2024.3441831","DOIUrl":null,"url":null,"abstract":"Learning tensor-train (TT) structure (a.k.a matrix product state (MPS) representation) from large-scale high-dimensional data has been a common task in big data analysis, deep learning, and quantum machine learning. However, tensor-train algorithms are compute-intensive, which hinders their real-world applications. In this paper, we present high-performance tensor-train primitives using GPU tensor cores and demonstrate three applications. First, we use GPU tensor cores to optimize tensor-train primitives, including tensor contraction, singular value decomposition, and data transfer and computing. Second, we utilize the optimized primitives to accelerate tensor-train decomposition algorithms for big data analysis. Further, we propose a shard mode for high-order tensor computations on multiple GPUs. Third, we apply the optimized primitives to accelerate the tensor-train layer for compressing deep neural networks. Last, we utilize the optimized primitives to accelerate a quantum machine learning algorithm called \n<i>Density Matrix Renormalization Group (DMRG)</i>\n. In performance evaluations, our third-order TT tensor decomposition achieves up to \n<inline-formula><tex-math>$3.34\\times$</tex-math></inline-formula>\n and \n<inline-formula><tex-math>$6.91\\times$</tex-math></inline-formula>\n speedups over two popular libraries (namely T3F and tntorch) on an A100 GPU, respectively. The proposed sixth-order tensor-train decomposition achieves up to a speedup of \n<inline-formula><tex-math>$5.01\\times$</tex-math></inline-formula>\n over T3F on multiple A100 GPUs. Our tensor-train layer for a fully connected neural network achieves a compression ratio of \n<inline-formula><tex-math>$65.3\\times$</tex-math></inline-formula>\n at the cost of \n<inline-formula><tex-math>$0.3\\%$</tex-math></inline-formula>\n drop in accuracy and a speedup of \n<inline-formula><tex-math>$1.53\\times$</tex-math></inline-formula>\n over a PyTorch implementation on CUDA cores. The optimized \n<i>DMRG</i>\n algorithm achieves up to a speedup of \n<inline-formula><tex-math>$14.0\\times$</tex-math></inline-formula>\n over TensorNetwork, indicating the potential of the optimized tensor primitives for the classical simulation of quantum machine learning algorithms.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2634-2648"},"PeriodicalIF":3.6000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10633902/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Abstract
Learning tensor-train (TT) structure (a.k.a. matrix product state (MPS) representation) from large-scale high-dimensional data has been a common task in big data analysis, deep learning, and quantum machine learning. However, tensor-train algorithms are compute-intensive, which hinders their real-world applications. In this paper, we present high-performance tensor-train primitives using GPU tensor cores and demonstrate three applications. First, we use GPU tensor cores to optimize tensor-train primitives, including tensor contraction, singular value decomposition, and data transfer and computing. Second, we utilize the optimized primitives to accelerate tensor-train decomposition algorithms for big data analysis. Further, we propose a shard mode for high-order tensor computations on multiple GPUs. Third, we apply the optimized primitives to accelerate the tensor-train layer for compressing deep neural networks. Last, we utilize the optimized primitives to accelerate a quantum machine learning algorithm called Density Matrix Renormalization Group (DMRG). In performance evaluations, our third-order TT tensor decomposition achieves speedups of up to $3.34\times$ and $6.91\times$ over two popular libraries (T3F and tntorch, respectively) on an A100 GPU. The proposed sixth-order tensor-train decomposition achieves a speedup of up to $5.01\times$ over T3F on multiple A100 GPUs. Our tensor-train layer for a fully connected neural network achieves a compression ratio of $65.3\times$ at the cost of a $0.3\%$ drop in accuracy and a speedup of $1.53\times$ over a PyTorch implementation on CUDA cores. The optimized DMRG algorithm achieves a speedup of up to $14.0\times$ over TensorNetwork, indicating the potential of the optimized tensor primitives for the classical simulation of quantum machine learning algorithms.
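
To make the decomposition primitive concrete, the following is a minimal TT-SVD sketch in PyTorch: it factors a dense third-order tensor into TT cores via sequential truncated SVDs (the contraction/reshape and SVD primitives the paper accelerates) and reconstructs it by contracting the cores. This is a plain PyTorch illustration, not the paper's tensor-core-optimized implementation; the tensor shape, the fixed rank cap r_max, and the helper names are illustrative assumptions.

# Minimal TT-SVD sketch (PyTorch). Illustrative only: not the paper's
# tensor-core-optimized kernels; shapes and the rank cap are assumptions.
import torch

def tt_svd(x: torch.Tensor, r_max: int):
    """Factor a dense tensor into TT cores via sequential truncated SVDs."""
    dims = list(x.shape)
    cores, r_prev, rest = [], 1, x
    for k in range(len(dims) - 1):
        # Unfold to (r_{k-1} * n_k) x (remaining modes), then truncated SVD.
        mat = rest.reshape(r_prev * dims[k], -1)
        u, s, vh = torch.linalg.svd(mat, full_matrices=False)
        r_k = min(r_max, s.numel())
        cores.append(u[:, :r_k].reshape(r_prev, dims[k], r_k))
        rest = torch.diag(s[:r_k]) @ vh[:r_k, :]  # carry the residual factor forward
        r_prev = r_k
    cores.append(rest.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a dense tensor (for an error check)."""
    out = cores[0]
    for core in cores[1:]:
        out = torch.tensordot(out, core, dims=([-1], [0]))
    return out.squeeze(0).squeeze(-1)

# Third-order example (sizes chosen only for illustration).
x = torch.randn(16, 16, 16)
cores = tt_svd(x, r_max=8)
print(torch.norm(x - tt_reconstruct(cores)) / torch.norm(x))  # relative error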
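
Similarly, the TT layer mentioned above replaces a dense fully connected weight matrix with a chain of small cores and computes the layer output by contracting the reshaped input directly against those cores, which is where the reported compression comes from. Below is a hedged three-core sketch using torch.einsum; the mode factorization, the rank, and the function name tt_linear3 are illustrative assumptions rather than the authors' implementation.

# Sketch of a three-core TT fully connected layer: y = xW is computed from
# the TT cores without ever materializing the dense weight W. Shapes, the
# rank, and tt_linear3 are illustrative assumptions.
import torch

def tt_linear3(x, g1, g2, g3):
    """x: (batch, m1*m2*m3); gk: (r_{k-1}, m_k, n_k, r_k) with r_0 = r_3 = 1."""
    batch = x.shape[0]
    m1, m2, m3 = g1.shape[1], g2.shape[1], g3.shape[1]
    x3 = x.reshape(batch, m1, m2, m3)
    # Contract the input modes and TT ranks in one chain (the tensor
    # contractions that the paper maps onto GPU tensor cores).
    y = torch.einsum('bxyz,axfc,cygd,dzhe->bfgh', x3, g1, g2, g3)
    return y.reshape(batch, -1)

# A 512 -> 512 layer factored as (8*8*8) -> (8*8*8) with TT rank 4:
# 262,144 dense weights vs. 256 + 1,024 + 256 = 1,536 TT parameters.
r = 4
g1 = torch.randn(1, 8, 8, r)
g2 = torch.randn(r, 8, 8, r)
g3 = torch.randn(r, 8, 8, 1)
x = torch.randn(32, 512)
y = tt_linear3(x, g1, g2, g3)  # (32, 512)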
Journal Introduction:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.