{"title":"Avoiding communication in sparse matrix computations","authors":"J. Demmel, M. Hoemmen, M. Mohiyuddin, K. Yelick","doi":"10.1109/IPDPS.2008.4536305","DOIUrl":null,"url":null,"abstract":"The performance of sparse iterative solvers is typically limited by sparse matrix-vector multiplication, which is itself limited by memory system and network performance. As the gap between computation and communication speed continues to widen, these traditional sparse methods will suffer. In this paper we focus on an alternative building block for sparse iterative solvers, the \"matrix powers kernel\" [x, Ax, A2x, ..., Akx], and show that by organizing computations around this kernel, we can achieve near-minimal communication costs. We consider communication very broadly as both network communication in parallel code and memory hierarchy access in sequential code. In particular, we introduce a parallel algorithm for which the number of messages (total latency cost) is independent of the power k, and a sequential algorithm, that reduces both the number and volume of accesses, so that it is independent of k in both latency and bandwidth costs. This is part of a larger project to develop \"communication-avoiding Krylov subspace methods,\" which also addresses the numerical issues associated with these methods. Our algorithms work for general sparse matrices that \"partition well\". We introduce parallel performance models of matrices arising from 2D and 3D problems and show predicted speedups over a conventional algorithm of up to 7times on a petaflop-scale machine and up to 22times on computation across the grid. Analogous sequential performance models of the same problems predict speedups over a conventional algorithm of up to 10times on an out-of-core implementation, and up to 2.5times when we use our ideas to reduce off-chip latency and bandwidth to DRAM. Finally, we validate the model on an out-of-core sequential implementation and measured a speedup of over 3times, which is close to the predicted speedup.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"151","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 IEEE International Symposium on Parallel and Distributed Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2008.4536305","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 151
Abstract
The performance of sparse iterative solvers is typically limited by sparse matrix-vector multiplication, which is itself limited by memory system and network performance. As the gap between computation and communication speed continues to widen, these traditional sparse methods will suffer. In this paper we focus on an alternative building block for sparse iterative solvers, the "matrix powers kernel" [x, Ax, A^2x, ..., A^kx], and show that by organizing computations around this kernel, we can achieve near-minimal communication costs. We consider communication very broadly, as both network communication in parallel code and memory hierarchy access in sequential code. In particular, we introduce a parallel algorithm for which the number of messages (total latency cost) is independent of the power k, and a sequential algorithm that reduces both the number and volume of accesses, so that it is independent of k in both latency and bandwidth costs. This is part of a larger project to develop "communication-avoiding Krylov subspace methods," which also addresses the numerical issues associated with these methods. Our algorithms work for general sparse matrices that "partition well". We introduce parallel performance models of matrices arising from 2D and 3D problems and show predicted speedups over a conventional algorithm of up to 7x on a petaflop-scale machine and up to 22x when computation is distributed across the grid. Analogous sequential performance models of the same problems predict speedups over a conventional algorithm of up to 10x on an out-of-core implementation, and up to 2.5x when we use our ideas to reduce off-chip latency and bandwidth to DRAM. Finally, we validate the model on an out-of-core sequential implementation and measure a speedup of over 3x, which is close to the predicted speedup.
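For context, the following is a minimal sketch (not the paper's communication-avoiding algorithm) of the conventional way to compute the matrix powers kernel [x, Ax, A^2x, ..., A^kx]: k back-to-back sparse matrix-vector products, each of which streams the entire matrix through the memory hierarchy and, in a parallel setting, exchanges boundary values. That repeated traffic is the communication cost the paper's reorganization targets. The function name and the example matrix are illustrative choices, not from the paper.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers_kernel(A, x, k):
    """Return the n-by-(k+1) array whose columns are x, Ax, A^2 x, ..., A^k x,
    computed the conventional way: one full sweep over A per power."""
    n = x.shape[0]
    V = np.empty((n, k + 1))
    V[:, 0] = x
    for j in range(1, k + 1):
        V[:, j] = A @ V[:, j - 1]  # each SpMV reads all of A again
    return V

# Example: a 1D Laplacian, a simple instance of a matrix that "partitions well"
n, k = 1000, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
x = np.ones(n)
V = matrix_powers_kernel(A, x, k)
print(V.shape)  # (1000, 5)
```

The communication-avoiding algorithms described in the abstract reorganize this computation so that, for well-partitioned matrices, the matrix (or each partition plus a small ghost region) is communicated once rather than k times.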