rmalloc() and rpipe(): a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging

U. Wickramasinghe, A. Lumsdaine
{"title":"rmalloc() and rpipe(): a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging","authors":"U. Wickramasinghe, A. Lumsdaine","doi":"10.1145/3217189.3217191","DOIUrl":null,"url":null,"abstract":"Optimizing communication is essential for high-performance computing because synchronization bottlenecks inhibit the overall performance and scalability of parallel applications. Today's cutting-edge computing hardware, as well as networking interfaces like Cray Aries/Gemini, features extremely low latency and high bandwidth remote memory access (RMA) operations for optimized data movement. However for any efficient data movement to occur between two logical processing units, software substrates must be able to properly exploit hardware resources for the underlying fabric. Overheads due to coarse granular synchronization and stalls during irregular access of remote memory regions may hint at two adverse effects of resource under-utilization in time and space. We introduce a uGNI-based distributed remote memory allocator called \"rmalloc\" which expands RDMA-enabled memory utilization, and a communication substrate called \"rpipe\" that tries to mitigate synchronization bottlenecks. Our UNIX-inspired RMA programming model is simple to use and equally applicable to both higher-level applications as well as lower-level runtime systems for enabling efficient data movement. Our micro-benchmark results suggest that \"rmalloc\" default next-fit allocator outperforms MPI-3.0 RMA by 1.5X and up to 6X in most cases, while other variants of \"rmalloc\" (i.e. best-fit, worst-fit) reduce external fragmentation and perform comparably or better than the default \"rmalloc\" allocator for irregular RMA.","PeriodicalId":183802,"journal":{"name":"Proceedings of the 8th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th International Workshop on Runtime and Operating Systems for Supercomputers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3217189.3217191","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Optimizing communication is essential for high-performance computing because synchronization bottlenecks inhibit the overall performance and scalability of parallel applications. Today's cutting-edge computing hardware, together with networking interfaces such as Cray Aries/Gemini, offers extremely low-latency, high-bandwidth remote memory access (RMA) operations for optimized data movement. However, for any efficient data movement to occur between two logical processing units, the software substrate must properly exploit the hardware resources of the underlying fabric. Overheads due to coarse-grained synchronization, and stalls during irregular access to remote memory regions, point to two adverse effects of resource under-utilization: one in time and one in space. We introduce a uGNI-based distributed remote memory allocator called "rmalloc", which expands RDMA-enabled memory utilization, and a communication substrate called "rpipe", which aims to mitigate synchronization bottlenecks. Our UNIX-inspired RMA programming model is simple to use and equally applicable to higher-level applications and lower-level runtime systems for enabling efficient data movement. Our micro-benchmark results suggest that the default next-fit "rmalloc" allocator outperforms MPI-3.0 RMA by 1.5X, and by up to 6X in most cases, while the other "rmalloc" variants (i.e., best-fit and worst-fit) reduce external fragmentation and perform comparably to or better than the default "rmalloc" allocator for irregular RMA.
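To make the "UNIX-inspired" programming model concrete, the sketch below shows how an application might carve out an RDMA-visible region and then stream data through a pipe-like one-sided channel. This is a minimal illustration only: the names (rmalloc_region, rpipe_open, rpipe_write, rpipe_read), the policy enum, and the stub bodies are assumptions made for this sketch, not the paper's published API, and the stubs fall back to plain malloc()/memcpy() so the example compiles and runs without a Cray uGNI fabric.

/*
 * Hypothetical usage sketch in the spirit of rmalloc()/rpipe().
 * All names and bodies below are illustrative assumptions, not the
 * paper's actual interface; the stubs operate on local memory only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum rm_policy { RM_NEXT_FIT, RM_BEST_FIT, RM_WORST_FIT };  /* allocator variants */

/* Hypothetical handle to an RDMA-registered remote region. */
typedef struct { void *base; size_t len; } rmem_t;

/* Hypothetical handle to a pipe-like one-sided channel over that region. */
typedef struct { rmem_t *region; size_t cursor; } rpipe_t;

/* --- stand-in implementations (local memory only, for illustration) --- */
static rmem_t *rmalloc_region(size_t bytes, enum rm_policy policy) {
    (void)policy;                      /* a real allocator would apply next/best/worst fit */
    rmem_t *r = malloc(sizeof *r);
    r->base = calloc(1, bytes);
    r->len  = bytes;
    return r;
}
static rpipe_t *rpipe_open(rmem_t *region, int peer_rank) {
    (void)peer_rank;                   /* a real channel would target a remote PE */
    rpipe_t *p = malloc(sizeof *p);
    p->region = region;
    p->cursor = 0;
    return p;
}
static void rpipe_write(rpipe_t *p, const void *buf, size_t n) {  /* stands in for a one-sided put */
    memcpy((char *)p->region->base + p->cursor, buf, n);
    p->cursor += n;
}
static void rpipe_read(rpipe_t *p, void *buf, size_t n) {         /* stands in for a one-sided get */
    memcpy(buf, p->region->base, n);
}

int main(void) {
    rmem_t  *region = rmalloc_region(1 << 20, RM_NEXT_FIT);  /* 1 MiB region, default policy */
    rpipe_t *chan   = rpipe_open(region, /*peer_rank=*/1);

    const char msg[] = "hello over RMA";
    rpipe_write(chan, msg, sizeof msg);

    char reply[sizeof msg];
    rpipe_read(chan, reply, sizeof reply);
    printf("round-tripped: %s\n", reply);

    free(chan); free(region->base); free(region);
    return 0;
}

In the actual library one would expect the region handle to wrap a uGNI memory registration and the write/read calls to map to one-sided puts and gets against a peer's registered memory, but the abstract does not spell out that interface, so the sketch above should be read only as an analogy to the familiar malloc()/pipe() pattern it invokes.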