An MPI Halo-Cell Implementation for Zero-Copy Abstraction

Proceedings of the 22nd European MPI Users' Group Meeting. Pub Date: 2015-09-21. DOI: 10.1145/2802658.2802669

Jean-Baptiste Besnard, A. Malony, S. Shende, Marc Pérache, Patrick Carribault, Julien Jaeger


Citations: 8

Abstract

In the race for Exascale, the advent of many-core processors will shift parallel computing architectures toward systems with much higher concurrency but relatively less memory per thread. This shift raises concerns about whether the current generation of HPC software can adapt to this brave new world. In this paper, we study domain splitting over an increasing number of memory areas as an example problem where a negative performance impact on computation could arise. We identify the specific parameters that drive scalability for this problem, and then model the halo-cell ratio on common mesh topologies to study the memory and communication implications. Such analysis argues for the use of shared-memory parallelism, for example with OpenMP, to address the performance problems that could occur. In contrast, we propose an original solution based entirely on MPI programming semantics that still provides the performance advantages of hybrid parallel programming. Our solution transparently replaces halo-cell transfers with pointer exchanges when MPI tasks are running on the same node, effectively removing memory copies. The results we present demonstrate gains in memory and computation time on Xeon Phi (compared to OpenMP-only and MPI-only) using a representative domain decomposition benchmark.
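
The halo-cell ratio mentioned in the abstract can be illustrated with a small sketch. This is not the model from the paper, which the abstract does not spell out; it assumes a regular Cartesian mesh, cubic subdomains, and a halo of fixed width on every face. The names `halo_ratio` and `ratio_after_split` are hypothetical and chosen only for illustration.

```python
def halo_ratio(side: int, dims: int = 3, width: int = 1) -> float:
    """Halo cells divided by interior cells for one cubic subdomain.

    Assumption (not from the paper): the subdomain has `side` cells
    along each of `dims` axes and is padded by `width` halo cells on
    every face, corners included.
    """
    interior = side ** dims
    padded = (side + 2 * width) ** dims
    return (padded - interior) / interior


def ratio_after_split(global_side: int, parts_per_side: int, dims: int = 3) -> float:
    """Halo ratio of each subdomain when a cubic global mesh is cut into
    `parts_per_side` equal pieces along every axis (assumed to divide evenly)."""
    if global_side % parts_per_side != 0:
        raise ValueError("parts_per_side must divide global_side")
    return halo_ratio(global_side // parts_per_side, dims)


if __name__ == "__main__":
    # Splitting a fixed 128^3 mesh into more and more memory areas:
    # each area gets smaller, so halo cells take a growing share of memory.
    for parts in (1, 2, 4, 8, 16, 32):
        tasks = parts ** 3
        print(f"{tasks:6d} subdomains: halo ratio = {ratio_after_split(128, parts):.3f}")
```

Under these assumptions the ratio grows as the subdomains shrink, which is the memory pressure that motivates sharing halo cells between tasks on the same node rather than copying them.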

