UPC++: A High-Performance Communication Framework for Asynchronous Computation

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2019-05-01. DOI: 10.1109/IPDPS.2019.00104

J. Bachan, S. Baden, S. Hofmeyr, M. Jacquelin, A. Kamil, D. Bonachea, Paul H. Hargrove, H. Ahmed

Citations: 43

Abstract

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC). We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40, UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ for irregular applications through a pair of application motifs: a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x. UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.
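The distributed hash table motif combines two of the ideas named in the abstract: an operation is shipped to the rank that owns a key (RPC), and the caller receives a future that becomes ready later. The sketch below is a single-process analogue written with the C++ standard library only. It does not use the UPC++ API, and every name in it (`Future`, `Rank`, `DistHashTable`, `progress`, `demo`) is hypothetical and chosen for illustration. Ranks are simulated as entries in a vector, and queued operations run only when `progress()` is called, which loosely stands in for a runtime servicing incoming requests.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a future: a shared slot filled in later,
// when the owning simulated rank makes progress. Not the UPC++ type.
template <class T>
struct Future {
    struct State { bool ready = false; T value{}; };
    std::shared_ptr<State> state = std::make_shared<State>();
    bool ready() const { return state->ready; }
    const T& result() const { return state->value; }  // meaningful once ready()
};

// One simulated rank: a local shard of the table plus a queue of pending requests.
struct Rank {
    std::unordered_map<std::string, int> shard;
    std::deque<std::function<void()>> inbox;
};

class DistHashTable {
public:
    explicit DistHashTable(std::size_t nranks) : ranks_(nranks) {}

    // The owner of a key is chosen by hashing the key.
    std::size_t owner(const std::string& key) const {
        return std::hash<std::string>{}(key) % ranks_.size();
    }

    // RPC-style insert: enqueue the work on the owner and return at once.
    Future<bool> insert(const std::string& key, int value) {
        Future<bool> f;
        Rank& r = ranks_[owner(key)];
        auto st = f.state;
        r.inbox.push_back([&r, key, value, st] {
            r.shard[key] = value;
            st->value = true;
            st->ready = true;
        });
        return f;
    }

    // RPC-style lookup: yields the stored value, or `missing` if the key is absent.
    Future<int> find(const std::string& key, int missing) {
        Future<int> f;
        Rank& r = ranks_[owner(key)];
        auto st = f.state;
        r.inbox.push_back([&r, key, missing, st] {
            auto it = r.shard.find(key);
            st->value = (it != r.shard.end()) ? it->second : missing;
            st->ready = true;
        });
        return f;
    }

    // Execute every queued request on every rank; returns how many ran.
    std::size_t progress() {
        std::size_t executed = 0;
        for (Rank& r : ranks_) {
            while (!r.inbox.empty()) {
                std::function<void()> task = std::move(r.inbox.front());
                r.inbox.pop_front();
                task();
                ++executed;
            }
        }
        return executed;
    }

private:
    std::vector<Rank> ranks_;  // sized once in the constructor, never resized
};

// Small demonstration: the future is not ready until progress is made.
inline void demo() {
    DistHashTable table(4);
    Future<bool> ins = table.insert("alpha", 7);
    assert(!ins.ready());             // request queued, not yet executed
    table.progress();
    assert(ins.ready() && ins.result());
    Future<int> hit = table.find("alpha", -1);
    table.progress();
    assert(hit.result() == 7);
}
```

The point of the design is that `insert` and `find` never block: a caller can issue many requests, keep the returned futures, and consume the results after progress has been made. In a real distributed runtime the requests would cross the network and the ranks would be separate processes; none of that is modeled here.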
