Design and Characterization of Shared Address Space MPI Collectives on Modern Architectures

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Pub Date: 2019-05-14, DOI: 10.1109/CCGRID.2019.00055

J. Hashmi, S. Chakraborty, Mohammadreza Bayatpour, H. Subramoni, D. Panda


Citations: 6

Abstract

Emerging multi-/many-core processors such as Intel Xeon and Xeon Phi are being widely adopted in modern large-scale supercomputing systems. The architectural features of these systems, such as high core density, mesh interconnects, deeper memory hierarchies, and hardware multi-threading, give application developers opportunities to exploit more parallelism. However, they also pose significant challenges for MPI runtimes in optimizing communication performance. One of the major challenges is optimizing collective communication for dense multi-/many-core processors. Traditionally, MPI runtimes have used send/recv, direct shared-memory ("double-copy"), or kernel-assisted ("single-copy") mechanisms for intra-node collective communication. However, existing collective designs based on these mechanisms suffer from several bottlenecks, such as multiple copies, per-message handshakes, and kernel-level lock contention, that limit their performance. In this paper, we first characterize the bottlenecks associated with the aforementioned approaches to designing collectives in MPI. Then, we propose efficient "shared-address-space"-based designs to implement different MPI collectives. Finally, we show the efficacy of our approach by implementing various MPI collectives. Our proposed designs show performance improvements of up to 11x, 50x, 17x, and 5x for Bcast, Scatter, Gather, and Alltoall, respectively, over other state-of-the-art MPI libraries on different multi-/many-core architectures.
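To make the "double-copy" versus "single-copy" distinction concrete, the following is a minimal toy model in Python, not the paper's implementation. It assumes a single in-process simulation in which a `bytearray` stands in for a shared-memory staging segment and a `memoryview` stands in for a directly readable peer buffer. The function names `bcast_double_copy` and `bcast_single_copy` are hypothetical. The sketch only counts bytes copied for a broadcast; it does not model per-message handshakes, kernel-level lock contention, or the proposed shared-address-space designs, whose details the abstract does not give.

```python
"""Toy model of copy volume in an intra-node broadcast (illustrative only)."""


def bcast_double_copy(src: bytes, n_ranks: int):
    """Root stages the message in a shared buffer; each receiver copies it out."""
    shared = bytearray(len(src))      # stand-in for a shared-memory segment
    shared[:] = src                   # first copy: root -> shared buffer
    bytes_copied = len(src)
    received = []
    for _ in range(n_ranks - 1):      # second copy, once per receiver
        dst = bytearray(len(src))
        dst[:] = shared               # shared buffer -> receiver's private buffer
        bytes_copied += len(src)
        received.append(bytes(dst))
    return received, bytes_copied


def bcast_single_copy(src: bytes, n_ranks: int):
    """Each receiver copies straight from the root's buffer, with no staging."""
    root_view = memoryview(src)       # stand-in for a directly readable peer buffer
    bytes_copied = 0
    received = []
    for _ in range(n_ranks - 1):
        dst = bytearray(len(src))
        dst[:] = root_view            # the only copy: root -> receiver
        bytes_copied += len(src)
        received.append(bytes(dst))
    return received, bytes_copied


if __name__ == "__main__":
    message = b"x" * 1024
    _, staged = bcast_double_copy(message, n_ranks=4)
    _, direct = bcast_single_copy(message, n_ranks=4)
    print(f"double-copy bytes moved: {staged}")
    print(f"single-copy bytes moved: {direct}")
```

In this model, a broadcast to p ranks moves p message-sized copies with staging and p - 1 without it. The difference in copy volume is only one of the bottlenecks the abstract lists, so the sketch should not be read as a performance prediction.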



