On Optimizing Distributed Tucker Decomposition for Sparse Tensors

Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-04-25. DOI: 10.1145/3205289.3205315

Venkatesan T. Chakaravarthy, Jee W. Choi, Douglas J. Joseph, Prakash Murali, Yogish Sabharwal, Shivmaran S. Pandian, D. Sreedhar


Citations: 22

Abstract

The Tucker decomposition generalizes the notion of Singular Value Decomposition (SVD) to tensors, the higher-dimensional analogues of matrices. We study the problem of constructing the Tucker decomposition of sparse tensors on distributed memory systems via the HOOI procedure, a popular iterative method. The scheme used for distributing the input tensor among the processors (MPI ranks) critically influences the HOOI execution time. Prior work has proposed different distribution schemes: an offline scheme based on a sophisticated hypergraph partitioning method, and simple, lightweight alternatives that can be used in real time. While the hypergraph-based scheme typically results in faster HOOI execution, it is complex: the time taken to determine the distribution is an order of magnitude higher than the execution time of a single HOOI iteration. Our main contribution is a lightweight distribution scheme that achieves the best of both worlds. We show that the scheme is near-optimal on certain fundamental metrics associated with the HOOI procedure and, as a result, near-optimal on the computational load (FLOPs). Though the scheme may incur higher communication volume, the computation time is the dominant factor and, as a result, the scheme achieves better overall HOOI execution time. Our experimental evaluation on large real-life tensors (having up to 4 billion elements) shows that the scheme outperforms the prior schemes on HOOI execution time by a factor of up to 3x. At the same time, its distribution time is comparable to that of the prior lightweight schemes and is typically less than the execution time of a single HOOI iteration.
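
For readers unfamiliar with HOOI (Higher-Order Orthogonal Iteration), the following is a minimal single-process sketch on a small dense tensor, using NumPy. It is not the distributed sparse scheme studied in the paper: the function names (`unfold`, `mode_product`, `hooi`, `reconstruct`), the HOSVD-style initialization, and the fixed iteration count are all assumptions made for illustration. The inner loop, which multiplies the tensor by every factor matrix except one and then takes an SVD, is the step whose cost depends on how the tensor is distributed across MPI ranks in the setting the abstract describes.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: rows index the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (shape: new_dim x old_dim) along `mode`."""
    moved = np.moveaxis(tensor, mode, -1)   # bring the mode to the last axis
    result = moved @ matrix.T               # contract over old_dim
    return np.moveaxis(result, -1, mode)    # put the new axis back in place

def hooi(tensor, ranks, n_iter=10):
    """Illustrative dense HOOI; returns (core, list of factor matrices)."""
    # Assumed initialization: leading left singular vectors of each unfolding.
    factors = [
        np.linalg.svd(unfold(tensor, n), full_matrices=False)[0][:, :r]
        for n, r in enumerate(ranks)
    ]
    for _ in range(n_iter):
        for n in range(tensor.ndim):
            # Project the tensor onto every factor subspace except mode n.
            partial = tensor
            for m in range(tensor.ndim):
                if m != n:
                    partial = mode_product(partial, factors[m].T, m)
            # Update factor n with the leading left singular vectors.
            u, _, _ = np.linalg.svd(unfold(partial, n), full_matrices=False)
            factors[n] = u[:, :ranks[n]]
    # Core tensor: the input projected onto all factor subspaces.
    core = tensor
    for m in range(tensor.ndim):
        core = mode_product(core, factors[m].T, m)
    return core, factors

def reconstruct(core, factors):
    """Expand a Tucker decomposition back into a full tensor."""
    out = core
    for m, f in enumerate(factors):
        out = mode_product(out, f, m)
    return out
```

A call such as `core, factors = hooi(X, (2, 3, 2))` on a 3-mode array `X` returns a core of shape `(2, 3, 2)` and one orthonormal factor matrix per mode; `reconstruct(core, factors)` gives the low-rank approximation of `X`.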

