rmalloc() and rpipe(): a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging

U. Wickramasinghe, A. Lumsdaine
{"title":"rmalloc() and rpipe(): a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging","authors":"U. Wickramasinghe, A. Lumsdaine","doi":"10.1145/3217189.3217191","DOIUrl":null,"url":null,"abstract":"Optimizing communication is essential for high-performance computing because synchronization bottlenecks inhibit the overall performance and scalability of parallel applications. Today's cutting-edge computing hardware, as well as networking interfaces like Cray Aries/Gemini, features extremely low latency and high bandwidth remote memory access (RMA) operations for optimized data movement. However for any efficient data movement to occur between two logical processing units, software substrates must be able to properly exploit hardware resources for the underlying fabric. Overheads due to coarse granular synchronization and stalls during irregular access of remote memory regions may hint at two adverse effects of resource under-utilization in time and space. We introduce a uGNI-based distributed remote memory allocator called \"rmalloc\" which expands RDMA-enabled memory utilization, and a communication substrate called \"rpipe\" that tries to mitigate synchronization bottlenecks. Our UNIX-inspired RMA programming model is simple to use and equally applicable to both higher-level applications as well as lower-level runtime systems for enabling efficient data movement. Our micro-benchmark results suggest that \"rmalloc\" default next-fit allocator outperforms MPI-3.0 RMA by 1.5X and up to 6X in most cases, while other variants of \"rmalloc\" (i.e. best-fit, worst-fit) reduce external fragmentation and perform comparably or better than the default \"rmalloc\" allocator for irregular RMA.","PeriodicalId":183802,"journal":{"name":"Proceedings of the 8th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th International Workshop on Runtime and Operating Systems for Supercomputers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3217189.3217191","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Optimizing communication is essential for high-performance computing because synchronization bottlenecks inhibit the overall performance and scalability of parallel applications. Today's cutting-edge computing hardware, together with networking interfaces such as Cray Aries/Gemini, offers extremely low-latency, high-bandwidth remote memory access (RMA) operations for optimized data movement. However, for any efficient data movement to occur between two logical processing units, the software substrate must properly exploit the hardware resources of the underlying fabric. Overheads due to coarse-grained synchronization, and stalls during irregular access to remote memory regions, point to two adverse effects of resource under-utilization: one in time and one in space. We introduce a uGNI-based distributed remote memory allocator called "rmalloc", which expands RDMA-enabled memory utilization, and a communication substrate called "rpipe", which aims to mitigate synchronization bottlenecks. Our UNIX-inspired RMA programming model is simple to use and equally applicable to higher-level applications and lower-level runtime systems for enabling efficient data movement. Our micro-benchmark results suggest that the default next-fit "rmalloc" allocator outperforms MPI-3.0 RMA by 1.5X, and by up to 6X in most cases, while the other "rmalloc" variants (i.e., best-fit and worst-fit) reduce external fragmentation and perform comparably to or better than the default "rmalloc" allocator for irregular RMA.
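To make the "UNIX-inspired" programming model concrete, the sketch below shows how an application might carve out an RDMA-visible region and then stream data through a pipe-like one-sided channel. This is a minimal illustration only: the names (rmalloc_region, rpipe_open, rpipe_write, rpipe_read), the policy enum, and the stub bodies are assumptions made for this sketch, not the paper's published API, and the stubs fall back to plain malloc()/memcpy() so the example compiles and runs without a Cray uGNI fabric.

/*
 * Hypothetical usage sketch in the spirit of rmalloc()/rpipe().
 * All names and bodies below are illustrative assumptions, not the
 * paper's actual interface; the stubs operate on local memory only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum rm_policy { RM_NEXT_FIT, RM_BEST_FIT, RM_WORST_FIT };  /* allocator variants */

/* Hypothetical handle to an RDMA-registered remote region. */
typedef struct { void *base; size_t len; } rmem_t;

/* Hypothetical handle to a pipe-like one-sided channel over that region. */
typedef struct { rmem_t *region; size_t cursor; } rpipe_t;

/* --- stand-in implementations (local memory only, for illustration) --- */
static rmem_t *rmalloc_region(size_t bytes, enum rm_policy policy) {
    (void)policy;                      /* a real allocator would apply next/best/worst fit */
    rmem_t *r = malloc(sizeof *r);
    r->base = calloc(1, bytes);
    r->len  = bytes;
    return r;
}
static rpipe_t *rpipe_open(rmem_t *region, int peer_rank) {
    (void)peer_rank;                   /* a real channel would target a remote PE */
    rpipe_t *p = malloc(sizeof *p);
    p->region = region;
    p->cursor = 0;
    return p;
}
static void rpipe_write(rpipe_t *p, const void *buf, size_t n) {  /* stands in for a one-sided put */
    memcpy((char *)p->region->base + p->cursor, buf, n);
    p->cursor += n;
}
static void rpipe_read(rpipe_t *p, void *buf, size_t n) {         /* stands in for a one-sided get */
    memcpy(buf, p->region->base, n);
}

int main(void) {
    rmem_t  *region = rmalloc_region(1 << 20, RM_NEXT_FIT);  /* 1 MiB region, default policy */
    rpipe_t *chan   = rpipe_open(region, /*peer_rank=*/1);

    const char msg[] = "hello over RMA";
    rpipe_write(chan, msg, sizeof msg);

    char reply[sizeof msg];
    rpipe_read(chan, reply, sizeof reply);
    printf("round-tripped: %s\n", reply);

    free(chan); free(region->base); free(region);
    return 0;
}

In the actual library one would expect the region handle to wrap a uGNI memory registration and the write/read calls to map to one-sided puts and gets against a peer's registered memory, but the abstract does not spell out that interface, so the sketch above should be read only as an analogy to the familiar malloc()/pipe() pattern it invokes.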