Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis Pub Date : 2014-11-16 DOI:10.1109/SC.2014.82

Jongsoo Park, M. Smelyanskiy, K. Vaidyanathan, A. Heinecke, Dhiraj D. Kalamkar, Xing Liu, Md. Mostofa Ali Patwary, Yutong Lu, P. Dubey

{"title":"Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices","authors":"Jongsoo Park, M. Smelyanskiy, K. Vaidyanathan, A. Heinecke, Dhiraj D. Kalamkar, Xing Liu, Md. Mostofa Ali Patwary, Yutong Lu, P. Dubey","doi":"10.1109/SC.2014.82","DOIUrl":null,"url":null,"abstract":"A new sparse high performance conjugate gradient benchmark (HPCG) has been recently released to address challenges in the design of sparse linear solvers for the next generation extreme-scale computing systems. Key computation, data access, and communication pattern in HPCG represent building blocks commonly found in today's HPC applications. While it is a well known challenge to efficiently parallelize Gauss-Seidel smoother, the most time-consuming kernel in HPCG, our algorithmic and architecture-aware optimizations deliver 95% and 68% of the achievable bandwidth on Xeon and Xeon Phi, respectively. Based on available parallelism, our Xeon Phi shared-memory implementation of Gauss-Seidel smoother selectively applies block multi-color reordering. Combined with MPI parallelization, our implementation balances parallelism, data access locality, CG convergence rate, and communication overhead. Our implementation achieved 580 TFLOPS (82% parallelization efficiency) on Tianhe-2 system, ranking first on the most recent HPCG list in July 2014. In addition, we demonstrate that our optimizations not only benefit HPCG original dataset, which is based on structured 3D grid, but also a wide range of unstructured matrices.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"49","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC.2014.82","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 49

Abstract

A new sparse high performance conjugate gradient benchmark (HPCG) has been recently released to address challenges in the design of sparse linear solvers for the next generation extreme-scale computing systems. Key computation, data access, and communication pattern in HPCG represent building blocks commonly found in today's HPC applications. While it is a well known challenge to efficiently parallelize Gauss-Seidel smoother, the most time-consuming kernel in HPCG, our algorithmic and architecture-aware optimizations deliver 95% and 68% of the achievable bandwidth on Xeon and Xeon Phi, respectively. Based on available parallelism, our Xeon Phi shared-memory implementation of Gauss-Seidel smoother selectively applies block multi-color reordering. Combined with MPI parallelization, our implementation balances parallelism, data access locality, CG convergence rate, and communication overhead. Our implementation achieved 580 TFLOPS (82% parallelization efficiency) on Tianhe-2 system, ranking first on the most recent HPCG list in July 2014. In addition, we demonstrate that our optimizations not only benefit HPCG original dataset, which is based on structured 3D grid, but also a wide range of unstructured matrices.

查看原文本刊更多论文

高性能共轭梯度基准的共享内存实现及其在非结构化矩阵中的应用

最近发布了一种新的稀疏高性能共轭梯度基准(HPCG)，以解决下一代极端规模计算系统中稀疏线性求解器设计中的挑战。HPCG中的密钥计算、数据访问和通信模式是当今HPC应用中常见的构建模块。虽然高效地并行化Gauss-Seidel平滑是一个众所周知的挑战，但我们的算法和架构感知优化分别在Xeon和Xeon Phi上提供了95%和68%的可实现带宽。基于可用的并行性，我们的Xeon Phi共享内存实现Gauss-Seidel平滑有选择地应用块多色重排序。结合MPI并行化，我们的实现平衡了并行性、数据访问局部性、CG收敛率和通信开销。我们的实现在天河二号系统上实现了580 TFLOPS(82%的并行化效率)，在2014年7月的最新HPCG排行榜上排名第一。此外，我们还证明了我们的优化不仅有利于基于结构化3D网格的HPCG原始数据集，也有利于广泛的非结构化矩阵。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis

自引率

0.00%

发文量