Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime

Yu Pei, Qinglei Cao, G. Bosilca, P. Luszczek, V. Eijkhout, J. Dongarra
{"title":"Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime","authors":"Yu Pei, Qinglei Cao, G. Bosilca, P. Luszczek, V. Eijkhout, J. Dongarra","doi":"10.1109/IPDPSW50202.2020.00127","DOIUrl":null,"url":null,"abstract":"Stencil computation or general sparse matrix-vector product (SpMV) are key components in many algorithms like geometric multigrid or Krylov solvers. But their low arithmetic intensity means that memory bandwidth and network latency will be the performance limiting factors. The current architectural trend favors computations over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme to minimize network latency in repeated sparse vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation and communication overlapping that is inherent in a dataflow task-based runtime system such as PaRSEC to demonstrate their combined benefits. We implemented the 2D five point stencil (Jacobi iteration) in PETSc, and over PaRSEC in two flavors, full communications (base-PaRSEC) and CA-PaRSEC which operate directly on a 2D compute grid. Our results running on two clusters, NaCL and Stampede2 indicate that we can achieve 2X speedup over the standard SpMV solution implemented in PETSc, and in certain cases when kernel execution is not dominating the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over base-PaRSEC implementation on NaCL and Stampede2 respectively.","PeriodicalId":398819,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW50202.2020.00127","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Stencil computations and general sparse matrix-vector products (SpMV) are key components of many algorithms, such as geometric multigrid and Krylov solvers. Their low arithmetic intensity means that memory bandwidth and network latency become the performance-limiting factors. The current architectural trend favors computation over bandwidth, worsening this already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme that minimizes network latency in repeated sparse matrix-vector multiplication, replicating remote work in order to delay communications on the critical path. Focusing on minimizing the communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation-communication overlap that is inherent in a dataflow task-based runtime system such as PaRSEC, to demonstrate their combined benefits. We implemented the 2D five-point stencil (Jacobi iteration) in PETSc, and over PaRSEC in two flavors: full communication (base-PaRSEC) and CA-PaRSEC, both operating directly on a 2D compute grid. Our results on two clusters, NaCL and Stampede2, indicate that we can achieve a 2X speedup over the standard SpMV solution implemented in PETSc; in certain cases, when kernel execution does not dominate the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over the base-PaRSEC implementation on NaCL and Stampede2, respectively.
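
The abstract names two ingredients: the 2D five-point Jacobi stencil and a CA scheme that replicates remote work to delay communication. The sketch below is a hypothetical, shared-memory C illustration of both, not the authors' PaRSEC or PETSc code: jacobi_sweep is the standard five-point Jacobi update, and ca_tile_step mimics the CA idea of giving each tile an s-cell-deep ghost halo so that s sweeps can run back-to-back between halo exchanges, at the cost of redundantly recomputing cells owned by neighboring tiles. The tile size nb and CA depth s are illustrative parameters of this sketch.

/* Hypothetical sketch of the 2D five-point Jacobi stencil and the
 * communication-avoiding deep-halo idea described in the abstract. */
#include <stdio.h>
#include <stdlib.h>

#define IDX(i, j, ld) ((i) * (ld) + (j))

/* One five-point Jacobi sweep over rows/columns [lo, hi] of an ld x ld buffer. */
static void jacobi_sweep(const double *in, double *out, int lo, int hi, int ld)
{
    for (int i = lo; i <= hi; i++)
        for (int j = lo; j <= hi; j++)
            out[IDX(i, j, ld)] = 0.25 * (in[IDX(i - 1, j, ld)] +
                                         in[IDX(i + 1, j, ld)] +
                                         in[IDX(i, j - 1, ld)] +
                                         in[IDX(i, j + 1, ld)]);
}

/* CA step on one nb x nb tile whose s-deep halo was filled by a single
 * exchange: each sweep shrinks the region it may touch by one cell per side,
 * so after s sweeps the nb x nb interior is still exact and no message was
 * needed in between.  Returns the buffer holding the latest sweep. */
static double *ca_tile_step(double *u, double *v, int nb, int s)
{
    int ld = nb + 2 * s;
    for (int k = 0; k < s; k++) {
        jacobi_sweep(u, v, 1 + k, ld - 2 - k, ld);
        double *tmp = u; u = v; v = tmp;   /* ping-pong the two buffers */
    }
    return u;
}

int main(void)
{
    int nb = 64, s = 4;                    /* tile size and CA depth (illustrative) */
    int ld = nb + 2 * s;
    double *u = calloc((size_t)ld * ld, sizeof *u);
    double *v = calloc((size_t)ld * ld, sizeof *v);
    if (!u || !v) return 1;

    for (int i = 0; i < ld; i++)           /* pretend the left halo column is hot */
        u[IDX(i, 0, ld)] = 1.0;

    double *out = ca_tile_step(u, v, nb, s);   /* s sweeps, zero halo exchanges */
    printf("first owned column after %d sweeps: %f\n",
           s, out[IDX(ld / 2, s, ld)]);

    free(u); free(v);
    return 0;
}

Compared with exchanging a one-cell halo before every sweep, this trades s messages for one larger message plus redundant work in the 2*s-wide overlap region, which is the latency-versus-flops trade-off the paper evaluates on top of PaRSEC's task-level overlap of computation and communication.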