Implementing the Broadcast Operation in a Distributed Task-based Runtime
Rodrigo Ceccato, H. Yviquel, M. Pereira, Alan Souza, G. Araújo
2022 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), November 2022
DOI: 10.1109/SBAC-PADW56527.2022.00014
Abstract
Scientific applications that require high performance rely on multi-node, multi-core systems equipped with accelerators. Code for these heterogeneous architectures often mixes different programming paradigms and is hard to read and maintain. Task-based distributed runtimes can improve portability and readability by allowing programmers to write tasks that are automatically scheduled and offloaded for execution. Nevertheless, in large systems, communication can dominate execution time. To mitigate this, such systems usually implement collective operation algorithms that efficiently execute common data movement patterns across a group of processes. This work studies the use of different broadcast strategies in the OpenMP Cluster (OMPC) task-based runtime. In addition to OMPC's default behavior of on-demand data delivery, we introduce a routine that automatically detects data movement equivalent to a broadcast in the task graph and actively sends the data through a specialized algorithm. In our largest test setup, with 64 worker nodes broadcasting 64 GB of data on the Santos Dumont cluster using an extended version of Task Bench, the Dynamic Broadcast algorithm achieved a 2.02x speedup and the MPI broadcast routine a 2.49x speedup over the default on-demand delivery.
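To make the comparison concrete, the sketch below contrasts the two communication patterns the abstract refers to: on-demand delivery, where the head node sends the same buffer to each worker individually, and a collective broadcast, where a single library call distributes the data using a tree or pipeline algorithm. This is only an illustrative MPI example under simplifying assumptions, not the OMPC runtime implementation or its Dynamic Broadcast algorithm; the buffer size and program structure are hypothetical.

/* Minimal sketch contrasting on-demand point-to-point delivery with a
 * collective broadcast. Not OMPC code; it only illustrates the two
 * data movement patterns compared in the paper.
 * Build with an MPI compiler, e.g.: mpicc -o bcast_demo bcast_demo.c */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)  /* example buffer size: 1M doubles per worker */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *data = malloc(N * sizeof(double));
    if (rank == 0)
        for (int i = 0; i < N; i++) data[i] = (double)i;

    /* (a) On-demand style: the head node sends the same buffer to each
     *     worker separately, so its link carries the data (size - 1) times. */
    if (rank == 0) {
        for (int dest = 1; dest < size; dest++)
            MPI_Send(data, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(data, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* (b) Collective broadcast: one call lets the MPI library pick an
     *     efficient tree or pipeline schedule across the nodes. */
    MPI_Bcast(data, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(data);
    MPI_Finalize();
    return 0;
}

With many workers and large buffers, pattern (a) serializes all traffic through the root's network link, which is why the paper reports substantial speedups when a broadcast-style algorithm is used instead.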