Implementation Techniques for SPMD Kernels on CPUs

Joachim Meyer, Aksel Alpay, Sebastian Hack, H. Fröning, Vincent Heuveline
{"title":"Implementation Techniques for SPMD Kernels on CPUs","authors":"Joachim Meyer, Aksel Alpay, Sebastian Hack, H. Fröning, Vincent Heuveline","doi":"10.1145/3585341.3585342","DOIUrl":null,"url":null,"abstract":"More and more frameworks and simulations are developed using heterogeneous programming models such as OpenCL, SYCL, CUDA, or HIP. A significant hurdle to mapping these models to CPUs in a performance-portable manner is that implementing work-group barriers for such kernels requires providing forward-progress guarantees so that all work-items can reach the barrier. This work provides guidance for implementations of single-program multiple-data (SPMD) programming models, such as OpenCL, SYCL, CUDA, or HIP, on non-SPMD devices, such as CPUs. We discuss the trade-offs of multiple approaches to handling work-group-level barriers. We present our experience with the integration of two known compiler-based approaches for low-overhead work-group synchronization on CPUs. Thereby we discuss a general design flaw in deep loop fission approaches, as used in the popular Portable Computing Language (PoCL) project, that makes them miscompile certain kernels. For our evaluation, we integrate PoCL’s “loopvec” kernel compiler into hipSYCL and implement continuation-based synchronization (CBS) in the same. We compare both against hipSYCL’s library-only fiber implementation using diverse hardware: we use recent AMD Rome and Intel Icelake server CPUs but also two Arm server CPUs, namely Fujitsu’s A64FX and Marvell’s ThunderX2. We show that compiler-based approaches outperform library-only implementations by up to multiple orders of magnitude. Further, we adapt our CBS implementation into PoCL and compare it against its loopvec approach in both, PoCL and hipSYCL. We find that our implementation of CBS, while being more general than PoCL’s approach, gives comparable performance in PoCL and even surpasses it in hipSYCL. Therefore we recommend its use in general.","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2023 International Workshop on OpenCL","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3585341.3585342","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

More and more frameworks and simulations are developed using heterogeneous programming models such as OpenCL, SYCL, CUDA, or HIP. A significant hurdle to mapping these models to CPUs in a performance-portable manner is that implementing work-group barriers for such kernels requires providing forward-progress guarantees so that all work-items can reach the barrier. This work provides guidance for implementations of single-program multiple-data (SPMD) programming models, such as OpenCL, SYCL, CUDA, or HIP, on non-SPMD devices, such as CPUs. We discuss the trade-offs of multiple approaches to handling work-group-level barriers. We present our experience with the integration of two known compiler-based approaches for low-overhead work-group synchronization on CPUs. Thereby we discuss a general design flaw in deep loop fission approaches, as used in the popular Portable Computing Language (PoCL) project, that makes them miscompile certain kernels. For our evaluation, we integrate PoCL’s “loopvec” kernel compiler into hipSYCL and implement continuation-based synchronization (CBS) in the same. We compare both against hipSYCL’s library-only fiber implementation using diverse hardware: we use recent AMD Rome and Intel Icelake server CPUs but also two Arm server CPUs, namely Fujitsu’s A64FX and Marvell’s ThunderX2. We show that compiler-based approaches outperform library-only implementations by up to multiple orders of magnitude. Further, we adapt our CBS implementation into PoCL and compare it against its loopvec approach in both, PoCL and hipSYCL. We find that our implementation of CBS, while being more general than PoCL’s approach, gives comparable performance in PoCL and even surpasses it in hipSYCL. Therefore we recommend its use in general.
SPMD内核在cpu上的实现技术
越来越多的框架和仿真是使用异构编程模型(如OpenCL、SYCL、CUDA或HIP)开发的。以性能可移植的方式将这些模型映射到cpu的一个重要障碍是,为这样的内核实现工作组屏障需要提供前进的保证,以便所有工作项都能到达屏障。这项工作为在非SPMD设备(如cpu)上实现单程序多数据(SPMD)编程模型(如OpenCL、SYCL、CUDA或HIP)提供了指导。我们讨论了处理工作组级别障碍的多种方法的权衡。我们介绍了在cpu上集成两种已知的基于编译器的低开销工作组同步方法的经验。因此,我们讨论了流行的便携式计算语言(PoCL)项目中使用的深回路裂变方法的一般设计缺陷,该缺陷使它们错误地编译某些内核。为了进行评估,我们将PoCL的“loopvec”内核编译器集成到hipSYCL中,并在其中实现基于连续的同步(CBS)。我们将两者与使用不同硬件的hipSYCL的仅库光纤实现进行比较:我们使用最近的AMD Rome和Intel Icelake服务器cpu,但也使用两个Arm服务器cpu,即富士通的A64FX和Marvell的ThunderX2。我们表明,基于编译器的方法比仅库实现的性能高出多个数量级。此外,我们将CBS实现调整到PoCL中,并将其与PoCL和hipSYCL中的loopvec方法进行比较。我们发现我们的CBS实现虽然比PoCL的方法更通用,但在PoCL中提供了相当的性能,甚至在hipSYCL中超过了它。因此我们建议在一般情况下使用它。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信