Runtime support for CPU-GPU high-performance computing on distributed memory platforms

Polykarpos Thomadakis, Nikos Chrisochoides
{"title":"Runtime support for CPU-GPU high-performance computing on distributed memory platforms","authors":"Polykarpos Thomadakis, Nikos Chrisochoides","doi":"10.3389/fhpcp.2024.1417040","DOIUrl":null,"url":null,"abstract":"Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures.This work introduces a runtime framework that enables effortless programming for heterogeneous systems while efficiently utilizing hardware resources. The framework is integrated within a distributed and scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and optimizations performed, achieving up to 300% improvement on a single device and linear scalability on a node equipped with four GPUs.The framework in a distributed memory environment offers portable abstractions that enable efficient inter-node communication among devices with varying capabilities. It delivers superior performance compared to MPI+CUDA by up to 20% for large messages while keeping the overheads for small messages within 10%. Furthermore, the results of our performance evaluation in a distributed Jacobi proxy application demonstrate that our software imposes minimal overhead and achieves a performance improvement of up to 40%.This is accomplished by the optimizations at the library level and by creating opportunities to leverage application-specific optimizations like over-decomposition.","PeriodicalId":399190,"journal":{"name":"Frontiers in High Performance Computing","volume":" 923","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in High Performance Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fhpcp.2024.1417040","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures.

This work introduces a runtime framework that enables effortless programming for heterogeneous systems while efficiently utilizing hardware resources. The framework is integrated within a distributed and scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and the optimizations performed, achieving up to 300% improvement on a single device and linear scalability on a node equipped with four GPUs.

In a distributed memory environment, the framework offers portable abstractions that enable efficient inter-node communication among devices with varying capabilities. It outperforms MPI+CUDA by up to 20% for large messages while keeping the overhead for small messages within 10%. Furthermore, the results of our performance evaluation on a distributed Jacobi proxy application demonstrate that our software imposes minimal overhead and achieves a performance improvement of up to 40%. This is accomplished through optimizations at the library level and by creating opportunities to leverage application-specific optimizations such as over-decomposition.
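
For context on the MPI+CUDA baseline and the Jacobi proxy application the abstract refers to, the sketch below shows the conventional pattern such frameworks are measured against: a 1-D-decomposed Jacobi sweep whose halo rows are staged through host buffers around each MPI exchange. All names, the decomposition, and the host-staging path are illustrative assumptions for this sketch, not the paper's runtime API.

```cpp
// jacobi_halo.cu -- a minimal sketch of a conventional MPI+CUDA Jacobi
// baseline (assumed, not the paper's framework). The grid is split in
// 1-D strips across ranks; each strip is (nx x ny) interior cells padded
// with one-cell halos, so rows have width w = nx + 2.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

// 5-point Jacobi update over the interior cells of the padded grid.
__global__ void jacobi_step(const double* in, double* out, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // 1..nx
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;  // 1..ny
    if (i <= nx && j <= ny) {
        int w = nx + 2;
        out[j * w + i] = 0.25 * (in[j * w + i - 1] + in[j * w + i + 1]
                               + in[(j - 1) * w + i] + in[(j + 1) * w + i]);
    }
}

// Exchange top/bottom halo rows with the up/down neighbor ranks,
// staging device rows through host buffers (the non-CUDA-aware MPI
// path). Boundary ranks pass MPI_PROC_NULL for a missing neighbor.
void exchange_halos(double* d_grid, int nx, int ny,
                    int up, int down, MPI_Comm comm) {
    int w = nx + 2;
    std::vector<double> send(w), recv(w);

    // My first interior row (j = 1) becomes the up rank's bottom halo;
    // my bottom halo (j = ny + 1) is filled from the down rank.
    cudaMemcpy(send.data(), d_grid + 1 * w, w * sizeof(double),
               cudaMemcpyDeviceToHost);
    MPI_Sendrecv(send.data(), w, MPI_DOUBLE, up, 0,
                 recv.data(), w, MPI_DOUBLE, down, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_grid + (ny + 1) * w, recv.data(), w * sizeof(double),
               cudaMemcpyHostToDevice);

    // My last interior row (j = ny) goes down; my top halo (j = 0)
    // is filled from the up rank.
    cudaMemcpy(send.data(), d_grid + ny * w, w * sizeof(double),
               cudaMemcpyDeviceToHost);
    MPI_Sendrecv(send.data(), w, MPI_DOUBLE, down, 1,
                 recv.data(), w, MPI_DOUBLE, up, 1, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_grid + 0 * w, recv.data(), w * sizeof(double),
               cudaMemcpyHostToDevice);
}
```

With a CUDA-aware MPI build, the device pointers could be passed to MPI_Sendrecv directly and the host staging skipped; hiding such device-capability differences behind portable abstractions is one axis along which a runtime like the one described here can improve on the hand-written baseline.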