Yutaka Watanabe, M. Sato, Miwako Tsuji, H. Murai, T. Boku
{"title":"Design and Performance Evaluation of UCX for Tofu-D Interconnect with OpenSHMEM-UCX on Fugaku","authors":"Yutaka Watanabe, M. Sato, Miwako Tsuji, H. Murai, T. Boku","doi":"10.1109/PAW-ATM56565.2022.00010","DOIUrl":"https://doi.org/10.1109/PAW-ATM56565.2022.00010","url":null,"abstract":"The partitioned global address space (PGAS) model with one-sided communication has recently received attention as an easy and intuitive method for describing remote data access in nodes. PGAS can be implemented using remote direct memory access, which provides lightweight one-sided communication and low overhead synchronization semantics. In this paper, to enable portable, lightweight, and efficient one-sided communication on the Fugaku supercomputer, we designed and implemented Universal Communication X (UCX) for Tofu Interconnect D. An evaluation using OpenSHMEM-UCX and OSHMPI indicates that OpenSHMEM with UCX on Tofu Interconnect D enables smaller latency and better efficiency compared with that for OpenSHMEM with MPI and that it is beneficial for several applications based on PGAS models.","PeriodicalId":231452,"journal":{"name":"2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115394257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Hérault, Joseph Schuchart, Edward F. Valeev, G. Bosilca
{"title":"Composition of Algorithmic Building Blocks in Template Task Graphs","authors":"T. Hérault, Joseph Schuchart, Edward F. Valeev, G. Bosilca","doi":"10.1109/PAW-ATM56565.2022.00008","DOIUrl":"https://doi.org/10.1109/PAW-ATM56565.2022.00008","url":null,"abstract":"In this paper, we explore the composition capabilities of the Template Task Graph (TTG) programming model. We show how fine-grain composition of tasks is possible in TTG between DAGs belonging to different libraries, even in a distributed setup. We illustrate the benefits of this fine-grain composition on a linear algebra operation, the matrix inversion via the Cholesky method, which consists of three operations that need to be applied in sequence.Evaluation on a cluster of many core shows that the transparent fine-grain composition implements the complex operation without introducing unnecessary synchronizations, increasing the overlap of communication and computation, and thus improving significantly the performance of the entire composed operation.","PeriodicalId":231452,"journal":{"name":"2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM)","volume":"208 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131435819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extending OpenMP and OpenSHMEM for Efficient Heterogeneous Computing","authors":"Wen-wei Lu, Shilei Tian, Tony Curtis, B. Chapman","doi":"10.1109/PAW-ATM56565.2022.00006","DOIUrl":"https://doi.org/10.1109/PAW-ATM56565.2022.00006","url":null,"abstract":"Heterogeneous supercomputing systems are becoming mainstream thanks to their powerful accelerators. However, the accelerators’ special memory model and APIs increase the development complexity, and calls for innovative programming model designs. To address this issue, OpenMP has added target offloading for portable accelerator programming, and MPI allows transparent send-receive of accelerator memory buffers. Meanwhile, Partitioned Global Address Space (PGAS) languages like OpenSHMEM are falling behind for heterogeneous computing because their special memory models pose additional challenges.We propose language and runtime interoperability extensions for both OpenMP and OpenSHMEM to enable portable remote access on GPU buffers, with minimal amount of code changes. Our modified runtime systems work in coordination to manage accelerator memory, eliminating the need for staging communication buffers. Compared to the standard implementation, our extensions attain 6x point-to-point latency improvement, 1.3x better collective operation latency, 4.9x random access throughput, and up to 12.5% better performance in strong scaling configurations.","PeriodicalId":231452,"journal":{"name":"2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132941413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task Fusion in Distributed Runtimes","authors":"S. Sundram, Wonchan Lee, A. Aiken","doi":"10.1109/PAW-ATM56565.2022.00007","DOIUrl":"https://doi.org/10.1109/PAW-ATM56565.2022.00007","url":null,"abstract":"We present distributed task fusion, a run-time optimization for task-based runtimes operating on parallel and heterogeneous systems. Distributed task fusion dynamically performs an efficient buffering, analysis, and fusion of asynchronously-evaluated distributed operations, reducing the overheads inherent to scheduling distributed tasks in implicitly parallel frameworks and runtimes. We identify the constraints under which distributed task fusion is permissible and describe an implementation in Legate, a domain-agnostic library for constructing portable and scalable task-based libraries. We present performance results using cuNumeric, a Legate library that enables scalable execution of NumPy pipelines on parallel and heterogeneous systems. We realize speedups up to 1.5x with task fusion enabled on up to 32 P100 GPUs, thus demonstrating efficient execution of pipelines involving many successive fine-grained tasks. Finally, we discuss potential future work, including complementary optimizations that could result in additional performance improvements.","PeriodicalId":231452,"journal":{"name":"2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130255134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asynchronous Workload Balancing through Persistent Work-Stealing and Offloading for a Distributed Actor Model Library","authors":"Yakup Budanaz, Mario Wille, M. Bader","doi":"10.1109/PAW-ATM56565.2022.00009","DOIUrl":"https://doi.org/10.1109/PAW-ATM56565.2022.00009","url":null,"abstract":"With dynamic imbalances caused by both software and ever more complex hardware, applications and runtime systems must adapt to dynamic load imbalances. We present a diffusion-based, reactive, fully asynchronous, and decentralized dynamic load balancer for a distributed actor library. With the asynchronous execution model, features such as remote procedure calls, and support for serialization of arbitrary types, UPC++ is especially feasible for the implementation of the actor model. While providing a substantial speedup for small- to medium-sized jobs with both predictable and unpredictable workload imbalances, the scalability of the diffusion-based approaches remains below expectations in most presented test cases.","PeriodicalId":231452,"journal":{"name":"2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124587510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}