Asynchronous Task-Based Parallelization of Algebraic Multigrid

Proceedings of the Platform for Advanced Scientific Computing Conference. Pub Date: 2017-06-26. DOI: 10.1145/3093172.3093230

Amani Alonazi, George S. Markomanolis, D. Keyes


Citations: 7

Abstract

As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap, while retaining convergence properties close to those of standard multiplicative AMG, but remain bulk synchronous. We extend the Vassilevski and Yang additive AMG to asynchronous task-based parallelism using hybrid MPI+OmpSs (OmpSs is from the Barcelona Supercomputing Center) within a node, along with MPI for internode communication. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in the hypre library for the 3D Laplacian operator and electromagnetic diffusion, respectively. In time to solution for a full solve, the MPI+OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per single Haswell node of the Cray XC40) and maintains this per-node advantage as both approaches are weakly scaled to thousands of cores, with MPI between nodes.
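The tiling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use OmpSs or hypre: it is a plain-Python analogy in which a single weighted Jacobi smoothing sweep on a 1D Laplacian is split into contiguous tiles, and each tile is submitted as an independent task. The function names (`make_tiles`, `jacobi_tile`, `tiled_sweep`), the 1D model problem, and the use of `ThreadPoolExecutor` are all assumptions made for illustration. Python threads will not show real speedup here; the point is only the decomposition into tasks that read a shared old vector and write disjoint parts of the new one.

```python
from concurrent.futures import ThreadPoolExecutor


def make_tiles(n, tile_size):
    """Split row indices 0..n-1 into contiguous [start, stop) tiles."""
    return [(s, min(s + tile_size, n)) for s in range(0, n, tile_size)]


def jacobi_tile(x_old, b, start, stop, omega=2.0 / 3.0):
    """Weighted Jacobi update for rows [start, stop) of tridiag(-1, 2, -1).

    Zero Dirichlet boundaries are assumed. The task only reads x_old,
    so tiles have no write conflicts with one another.
    """
    n = len(x_old)
    out = []
    for i in range(start, stop):
        left = x_old[i - 1] if i > 0 else 0.0
        right = x_old[i + 1] if i < n - 1 else 0.0
        residual = b[i] - (2.0 * x_old[i] - left - right)
        out.append(x_old[i] + omega * residual / 2.0)  # diagonal entry is 2
    return start, out


def tiled_sweep(x_old, b, tile_size, workers=4):
    """Run one smoothing sweep with one task per tile (hypothetical helper)."""
    x_new = [0.0] * len(x_old)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(jacobi_tile, x_old, b, start, stop)
            for start, stop in make_tiles(len(x_old), tile_size)
        ]
        # Collect results; each tile writes a disjoint slice of x_new.
        for fut in futures:
            start, values = fut.result()
            x_new[start:start + len(values)] = values
    return x_new


if __name__ == "__main__":
    b = [1.0] * 10
    x0 = [0.0] * 10
    print(tiled_sweep(x0, b, tile_size=3))
```

Because every tile reads only the previous iterate, the result is independent of the tile size; in a task-based runtime the same property is what lets tiles be scheduled in any order without a global barrier inside the sweep.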
