{"title":"High-Performance Tensor-Train Primitives Using GPU Tensor Cores","authors":"Xiao-Yang Liu;Hao Hong;Zeliang Zhang;Weiqin Tong;Jean Kossaifi;Xiaodong Wang;Anwar Walid","doi":"10.1109/TC.2024.3441831","DOIUrl":null,"url":null,"abstract":"Learning tensor-train (TT) structure (a.k.a matrix product state (MPS) representation) from large-scale high-dimensional data has been a common task in big data analysis, deep learning, and quantum machine learning. However, tensor-train algorithms are compute-intensive, which hinders their real-world applications. In this paper, we present high-performance tensor-train primitives using GPU tensor cores and demonstrate three applications. First, we use GPU tensor cores to optimize tensor-train primitives, including tensor contraction, singular value decomposition, and data transfer and computing. Second, we utilize the optimized primitives to accelerate tensor-train decomposition algorithms for big data analysis. Further, we propose a shard mode for high-order tensor computations on multiple GPUs. Third, we apply the optimized primitives to accelerate the tensor-train layer for compressing deep neural networks. Last, we utilize the optimized primitives to accelerate a quantum machine learning algorithm called \n<i>Density Matrix Renormalization Group (DMRG)</i>\n. In performance evaluations, our third-order TT tensor decomposition achieves up to \n<inline-formula><tex-math>$3.34\\times$</tex-math></inline-formula>\n and \n<inline-formula><tex-math>$6.91\\times$</tex-math></inline-formula>\n speedups over two popular libraries (namely T3F and tntorch) on an A100 GPU, respectively. The proposed sixth-order tensor-train decomposition achieves up to a speedup of \n<inline-formula><tex-math>$5.01\\times$</tex-math></inline-formula>\n over T3F on multiple A100 GPUs. Our tensor-train layer for a fully connected neural network achieves a compression ratio of \n<inline-formula><tex-math>$65.3\\times$</tex-math></inline-formula>\n at the cost of \n<inline-formula><tex-math>$0.3\\%$</tex-math></inline-formula>\n drop in accuracy and a speedup of \n<inline-formula><tex-math>$1.53\\times$</tex-math></inline-formula>\n over a PyTorch implementation on CUDA cores. The optimized \n<i>DMRG</i>\n algorithm achieves up to a speedup of \n<inline-formula><tex-math>$14.0\\times$</tex-math></inline-formula>\n over TensorNetwork, indicating the potential of the optimized tensor primitives for the classical simulation of quantum machine learning algorithms.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"73 11","pages":"2634-2648"},"PeriodicalIF":3.6000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computers","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10633902/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Abstract
Learning tensor-train (TT) structure (a.k.a. matrix product state (MPS) representation) from large-scale high-dimensional data has been a common task in big data analysis, deep learning, and quantum machine learning. However, tensor-train algorithms are compute-intensive, which hinders their real-world applications. In this paper, we present high-performance tensor-train primitives using GPU tensor cores and demonstrate three applications. First, we use GPU tensor cores to optimize tensor-train primitives, including tensor contraction, singular value decomposition, and data transfer and computing. Second, we utilize the optimized primitives to accelerate tensor-train decomposition algorithms for big data analysis. Further, we propose a shard mode for high-order tensor computations on multiple GPUs. Third, we apply the optimized primitives to accelerate the tensor-train layer for compressing deep neural networks. Last, we utilize the optimized primitives to accelerate a quantum machine learning algorithm called Density Matrix Renormalization Group (DMRG). In performance evaluations, our third-order TT tensor decomposition achieves speedups of up to $3.34\times$ and $6.91\times$ over two popular libraries (T3F and tntorch, respectively) on an A100 GPU. The proposed sixth-order tensor-train decomposition achieves a speedup of up to $5.01\times$ over T3F on multiple A100 GPUs. Our tensor-train layer for a fully connected neural network achieves a compression ratio of $65.3\times$ at the cost of a $0.3\%$ drop in accuracy and a speedup of $1.53\times$ over a PyTorch implementation on CUDA cores. The optimized DMRG algorithm achieves a speedup of up to $14.0\times$ over TensorNetwork, indicating the potential of the optimized tensor primitives for the classical simulation of quantum machine learning algorithms.
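
To make the decomposition primitive concrete, the following is a minimal TT-SVD sketch in PyTorch: it factors a dense third-order tensor into TT cores via sequential truncated SVDs (the contraction/reshape and SVD primitives the paper accelerates) and reconstructs it by contracting the cores. This is a plain PyTorch illustration, not the paper's tensor-core-optimized implementation; the tensor shape, the fixed rank cap r_max, and the helper names are illustrative assumptions.

# Minimal TT-SVD sketch (PyTorch). Illustrative only: not the paper's
# tensor-core-optimized kernels; shapes and the rank cap are assumptions.
import torch

def tt_svd(x: torch.Tensor, r_max: int):
    """Factor a dense tensor into TT cores via sequential truncated SVDs."""
    dims = list(x.shape)
    cores, r_prev, rest = [], 1, x
    for k in range(len(dims) - 1):
        # Unfold to (r_{k-1} * n_k) x (remaining modes), then truncated SVD.
        mat = rest.reshape(r_prev * dims[k], -1)
        u, s, vh = torch.linalg.svd(mat, full_matrices=False)
        r_k = min(r_max, s.numel())
        cores.append(u[:, :r_k].reshape(r_prev, dims[k], r_k))
        rest = torch.diag(s[:r_k]) @ vh[:r_k, :]  # carry the residual factor forward
        r_prev = r_k
    cores.append(rest.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a dense tensor (for an error check)."""
    out = cores[0]
    for core in cores[1:]:
        out = torch.tensordot(out, core, dims=([-1], [0]))
    return out.squeeze(0).squeeze(-1)

# Third-order example (sizes chosen only for illustration).
x = torch.randn(16, 16, 16)
cores = tt_svd(x, r_max=8)
print(torch.norm(x - tt_reconstruct(cores)) / torch.norm(x))  # relative error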
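
Similarly, the TT layer mentioned above replaces a dense fully connected weight matrix with a chain of small cores and computes the layer output by contracting the reshaped input directly against those cores, which is where the reported compression comes from. Below is a hedged three-core sketch using torch.einsum; the mode factorization, the rank, and the function name tt_linear3 are illustrative assumptions rather than the authors' implementation.

# Sketch of a three-core TT fully connected layer: y = xW is computed from
# the TT cores without ever materializing the dense weight W. Shapes, the
# rank, and tt_linear3 are illustrative assumptions.
import torch

def tt_linear3(x, g1, g2, g3):
    """x: (batch, m1*m2*m3); gk: (r_{k-1}, m_k, n_k, r_k) with r_0 = r_3 = 1."""
    batch = x.shape[0]
    m1, m2, m3 = g1.shape[1], g2.shape[1], g3.shape[1]
    x3 = x.reshape(batch, m1, m2, m3)
    # Contract the input modes and TT ranks in one chain (the tensor
    # contractions that the paper maps onto GPU tensor cores).
    y = torch.einsum('bxyz,axfc,cygd,dzhe->bfgh', x3, g1, g2, g3)
    return y.reshape(batch, -1)

# A 512 -> 512 layer factored as (8*8*8) -> (8*8*8) with TT rank 4:
# 262,144 dense weights vs. 256 + 1,024 + 256 = 1,536 TT parameters.
r = 4
g1 = torch.randn(1, 8, 8, r)
g2 = torch.randn(r, 8, 8, r)
g3 = torch.randn(r, 8, 8, 1)
x = torch.randn(32, 512)
y = tt_linear3(x, g1, g2, g3)  # (32, 512)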
Journal Introduction:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.