High-Performance Tensor-Train Primitives Using GPU Tensor Cores

IF 3.6 | CAS Quartile 2 (Computer Science) | JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Xiao-Yang Liu;Hao Hong;Zeliang Zhang;Weiqin Tong;Jean Kossaifi;Xiaodong Wang;Anwar Walid
{"title":"High-Performance Tensor-Train Primitives Using GPU Tensor Cores","authors":"Xiao-Yang Liu;Hao Hong;Zeliang Zhang;Weiqin Tong;Jean Kossaifi;Xiaodong Wang;Anwar Walid","doi":"10.1109/TC.2024.3441831","DOIUrl":null,"url":null,"abstract":"Learning tensor-train (TT) structure (a.k.a matrix product state (MPS) representation) from large-scale high-dimensional data has been a common task in big data analysis, deep learning, and quantum machine learning. However, tensor-train algorithms are compute-intensive, which hinders their real-world applications. In this paper, we present high-performance tensor-train primitives using GPU tensor cores and demonstrate three applications. First, we use GPU tensor cores to optimize tensor-train primitives, including tensor contraction, singular value decomposition, and data transfer and computing. Second, we utilize the optimized primitives to accelerate tensor-train decomposition algorithms for big data analysis. Further, we propose a shard mode for high-order tensor computations on multiple GPUs. Third, we apply the optimized primitives to accelerate the tensor-train layer for compressing deep neural networks. Last, we utilize the optimized primitives to accelerate a quantum machine learning algorithm called \n<i>Density Matrix Renormalization Group (DMRG)</i>\n. In performance evaluations, our third-order TT tensor decomposition achieves up to \n<inline-formula><tex-math>$3.34\\times$</tex-math></inline-formula>\n and \n<inline-formula><tex-math>$6.91\\times$</tex-math></inline-formula>\n speedups over two popular libraries (namely T3F and tntorch) on an A100 GPU, respectively. The proposed sixth-order tensor-train decomposition achieves up to a speedup of \n<inline-formula><tex-math>$5.01\\times$</tex-math></inline-formula>\n over T3F on multiple A100 GPUs. Our tensor-train layer for a fully connected neural network achieves a compression ratio of \n<inline-formula><tex-math>$65.3\\times$</tex-math></inline-formula>\n at the cost of \n<inline-formula><tex-math>$0.3\\%$</tex-math></inline-formula>\n drop in accuracy and a speedup of \n<inline-formula><tex-math>$1.53\\times$</tex-math></inline-formula>\n over a PyTorch implementation on CUDA cores. The optimized \n<i>DMRG</i>\n algorithm achieves up to a speedup of \n<inline-formula><tex-math>$14.0\\times$</tex-math></inline-formula>\n over TensorNetwork, indicating the potential of the optimized tensor primitives for the classical simulation of quantum machine learning algorithms.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2634-2648"},"PeriodicalIF":3.6000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10633902/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Learning the tensor-train (TT) structure (a.k.a. the matrix product state (MPS) representation) from large-scale high-dimensional data has been a common task in big data analysis, deep learning, and quantum machine learning. However, tensor-train algorithms are compute-intensive, which hinders their real-world applications. In this paper, we present high-performance tensor-train primitives using GPU tensor cores and demonstrate three applications. First, we use GPU tensor cores to optimize tensor-train primitives, including tensor contraction, singular value decomposition, and data transfer and computing. Second, we utilize the optimized primitives to accelerate tensor-train decomposition algorithms for big data analysis. Further, we propose a shard mode for high-order tensor computations on multiple GPUs. Third, we apply the optimized primitives to accelerate the tensor-train layer for compressing deep neural networks. Last, we utilize the optimized primitives to accelerate a quantum machine learning algorithm called Density Matrix Renormalization Group (DMRG). In performance evaluations, our third-order TT tensor decomposition achieves speedups of up to 3.34× and 6.91× over two popular libraries (namely T3F and tntorch) on an A100 GPU, respectively. The proposed sixth-order tensor-train decomposition achieves a speedup of up to 5.01× over T3F on multiple A100 GPUs. Our tensor-train layer for a fully connected neural network achieves a compression ratio of 65.3× at the cost of a 0.3% drop in accuracy and a speedup of 1.53× over a PyTorch implementation on CUDA cores. The optimized DMRG algorithm achieves a speedup of up to 14.0× over TensorNetwork, indicating the potential of the optimized tensor primitives for the classical simulation of quantum machine learning algorithms.
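To make the tensor-train decomposition concrete, the sketch below shows the classic TT-SVD algorithm in plain NumPy: a left-to-right sweep of truncated SVDs that factors a dense tensor into third-order TT cores, which is the kind of workload the paper's GPU tensor-core primitives accelerate. This is an illustrative CPU reference, not the authors' implementation; the function names (tt_svd, tt_reconstruct) and the fixed max_rank truncation are assumptions made for the example.

```python
# Minimal TT-SVD sketch (illustrative reference, not the paper's GPU tensor-core code).
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose a dense tensor into TT cores G_k of shape (r_{k-1}, n_k, r_k)."""
    dims = tensor.shape
    d = len(dims)
    cores = []
    unfolding = np.asarray(tensor)
    r_prev = 1
    for k in range(d - 1):
        # Unfold so that rows group the previous rank with the current mode.
        unfolding = unfolding.reshape(r_prev * dims[k], -1)
        u, s, vt = np.linalg.svd(unfolding, full_matrices=False)
        r_k = min(max_rank, s.size)                      # truncate to the target TT rank
        cores.append(u[:, :r_k].reshape(r_prev, dims[k], r_k))
        unfolding = s[:r_k, None] * vt[:r_k]             # carry the remainder rightward
        r_prev = r_k
    cores.append(unfolding.reshape(r_prev, dims[-1], 1))  # final core, boundary rank 1
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a dense tensor (to check the approximation error)."""
    result = cores[0]
    for core in cores[1:]:
        result = np.tensordot(result, core, axes=1)      # contract adjacent rank index
    return result[0, ..., 0]                             # drop the boundary ranks (both 1)

# Usage: decompose a random third-order tensor and measure the relative error.
a = np.random.rand(16, 16, 16)
cores = tt_svd(a, max_rank=8)
approx = tt_reconstruct(cores)
print([c.shape for c in cores], np.linalg.norm(a - approx) / np.linalg.norm(a))
```

For a 16×16×16 tensor with max_rank=8, the cores have shapes (1, 16, 8), (8, 16, 8), and (8, 16, 1). The dominant costs in such sweeps are the repeated SVDs and tensor contractions, which is why the paper maps these primitives onto GPU tensor cores.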
Source Journal
IEEE Transactions on Computers
Category: Engineering Technology - Engineering: Electrical & Electronic
CiteScore: 6.60
Self-citation rate: 5.40%
Articles published: 199
Review time: 6.0 months
Journal introduction: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.