Performance Implication of Tensor Irregularity and Optimization for Distributed Tensor Decomposition

Impact factor: 1.2 · JCR quartile: Q3 · Category: Computer Science, Theory & Methods

ACM Transactions on Parallel Computing · Pub Date: 2023-02-07 · DOI: 10.1145/3580315

Zheng Miao, Jon C. Calhoun, Rong Ge, Jiajia Li

Volume 10, Issue 1, pp. 1-27 · Journal Article

Citations: 0

Abstract

Tensors are used by a wide variety of applications to represent multi-dimensional data, and tensor decompositions are a class of methods for latent data analytics, data compression, and similar tasks. Many of these applications generate large tensors with irregular dimension sizes and nonzero distributions. CANDECOMP/PARAFAC decomposition (CPD) is a popular low-rank tensor decomposition for discovering latent features. For large tensors, the growing memory and execution-time overhead of CPD makes distributed-memory implementations the only feasible solution. However, the sparsity and irregularity of tensors hinder the performance and scalability of these implementations. While previous work has proven successful for CPD on tensors with relatively regular dimension sizes and nonzero distributions, it either delivers unsatisfactory performance and scalability on irregular tensors or requires significant preprocessing time. In this work, we focus on medium-grained tensor distribution to address these limitations for irregular tensors. We first investigate the problem thoroughly through theoretical and experimental analysis. We find that the main cause of poor CPD performance and scalability is the imbalance among multiple types of computation and communication, together with the tradeoffs between them; sparsity and irregularity make these balances and tradeoffs difficult to achieve. We characterize the irregularity of a sparse tensor by two aspects: widely differing dimension sizes and a non-uniform nonzero distribution. For irregular tensors, optimizing one type of load imbalance typically makes the others more severe. To address these challenges, we propose irregularity-aware distributed CPD, which leverages sparsity and irregularity information to identify the best tradeoff among the different imbalances with low time overhead. We realize this idea with two optimization methods: a prediction-based grid configuration and a matrix-oriented distribution policy. The former establishes a global balance among computations and communications, and the latter further adjusts the balance among computations. Experimental results show that the proposed irregularity-aware distributed CPD is more scalable and outperforms the medium- and fine-grained distributed implementations by up to 4.4× and 11.4×, respectively, on 1,536 processors. Our optimizations support different sparse tensor formats, such as compressed sparse fiber (CSF), coordinate (COO), and hierarchical coordinate (HiCOO), and achieve good scalability for all of them.
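To make the notion of load imbalance under a medium-grained distribution concrete, the following is a minimal Python sketch. It is not the authors' prediction-based grid configuration. It assumes a COO-style list of nonzero coordinates, splits each mode into near-equal contiguous chunks according to a process grid, and scores a grid by the max-over-mean nonzero count per process. The function names (`block_of`, `nnz_imbalance`, `grids`, `best_grid`), the exhaustive search, and the toy tensor are all hypothetical choices made for illustration. Communication cost, which the abstract says must be balanced against computation, is ignored here.

```python
from collections import Counter


def block_of(index, dim_size, parts):
    """Map an index in [0, dim_size) to one of `parts` near-equal contiguous chunks."""
    return index * parts // dim_size


def nnz_imbalance(coords, dims, grid):
    """Max-over-mean nonzero count across all processes of a grid (1.0 is perfect)."""
    counts = Counter(
        tuple(block_of(i, d, p) for i, d, p in zip(c, dims, grid))
        for c in coords
    )
    n_procs = 1
    for p in grid:
        n_procs *= p
    mean = len(coords) / n_procs  # empty processes still count toward the mean
    return max(counts.values()) / mean


def grids(n_procs, n_modes):
    """Yield every ordered factorization of n_procs into n_modes positive factors."""
    if n_modes == 1:
        yield (n_procs,)
        return
    for p in range(1, n_procs + 1):
        if n_procs % p == 0:
            for rest in grids(n_procs // p, n_modes - 1):
                yield (p,) + rest


def best_grid(coords, dims, n_procs):
    """Brute-force stand-in for grid selection: pick the grid with the least imbalance."""
    candidates = [
        g for g in grids(n_procs, len(dims))
        if all(p <= d for p, d in zip(g, dims))  # no more chunks than indices per mode
    ]
    return min(candidates, key=lambda g: nnz_imbalance(coords, dims, g))


# Toy irregular tensor: one long mode (8) and two short ones (2, 2),
# with every nonzero concentrated in the first half of the long mode.
dims = (8, 2, 2)
coords = [(i, j, k) for i in range(4) for j in range(2) for k in range(2)]

print(nnz_imbalance(coords, dims, (4, 1, 1)))  # 2.0: half the processes hold no nonzeros
print(nnz_imbalance(coords, dims, (1, 2, 2)))  # 1.0: perfectly balanced
print(best_grid(coords, dims, 4))              # (1, 2, 2)
```

In this toy case, splitting along the longest mode is the worst choice because the nonzeros are skewed along it. That is a small instance of the two irregularity aspects named in the abstract, differing dimension sizes and a non-uniform nonzero distribution, interacting with each other.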




Source journal

ACM Transactions on Parallel Computing (Computer Science, Theory & Methods)

CiteScore

4.10

Self-citation rate

0.00%

Publication count