Particle track reconstruction on heterogeneous platforms with SYCL

Bartosz Sobol, G. Korcyl
Published in: Proceedings of the 2023 International Workshop on OpenCL, 2023-04-18
DOI: 10.1145/3585341.3585344 (https://doi.org/10.1145/3585341.3585344)

Abstract

The SYCL programming model promises relatively easy development of parallel and accelerated code, along with out-of-the-box portability across hardware platforms from different vendors. Particle physics experiments stand to benefit greatly from these characteristics: large amounts of data must be processed in multiple stages by a wide variety of algorithms with different profiles, and the data processing pipeline is often required to consume streaming data from the detectors online. Modern hardware platforms and accelerators, with their increasing performance, give collaborations an opportunity to collect and analyze more data, more effectively and with better accuracy. On the other hand, building a complex software stack with a small team of developers becomes increasingly challenging in a multi-vendor landscape where new programming models and APIs keep emerging. Because physics experiments are designed, and computing solutions evaluated, many years ahead of the actual run, the codebase of such scientific software also needs to be future-proof, e.g., able to run on a next-generation computing cluster that uses GPU accelerators from different vendors, or on entirely different platforms such as upcoming powerful APU devices. In this project, we begin with a simple single-threaded implementation of a particle track reconstruction algorithm proposed for one of the subdetectors in the PANDA experiment, which is under development as part of the FAIR facility at GSI, Darmstadt, Germany. We first set out to port the algorithm to SYCL with minimal effort, i.e., keeping the kernel code as close to the original implementation as possible while maintaining good parallelization and competitive performance in an accelerated environment.
After many iterations and experiments with different memory layouts, as well as with various approaches to expressing parallelism and data flow to tame the memory-bound character of the algorithm, we arrived at a final version that is still similar in code structure to the original implementation and achieves satisfactory performance across a broad range of targets. This final implementation, comprising 7 kernels and multiple auxiliary accelerated functions, was evaluated using two major SYCL implementations: hipSYCL and DPC++. Benchmarks were conducted on a wide variety of platforms from leading vendors, including NVIDIA V100, NVIDIA A100, and AMD MI250 GPUs, AMD EPYC Rome and Intel Cascade Lake CPUs, and an AMD/Xilinx Alveo U280 FPGA accelerator card. For the FPGA, an experimental AMD/Xilinx compiler based on Intel’s LLVM version was used. We also compare the performance against a CUDA implementation built in the same manner as the final SYCL one, showing that the SYCL version can achieve performance comparable to the native one. We show that developing performant and portable code from a truly single source for CPU and GPU is possible, and accessible to developers with an intermediate understanding of parallelization and of how to interact effectively with GPU-based accelerators. Finally, for more exotic devices such as FPGA-based accelerators, some host code modifications are required to compile and execute the software successfully. While the FPGA results are not competitive in terms of performance, we believe that the ability to run this kind of algorithm on an FPGA without significant adjustments is an achievement in itself.
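
The abstract mentions experimenting with different memory layouts but does not say which ones were tried. As a purely illustrative sketch, and not the authors' code, the following plain C++ contrasts an array-of-structures layout with a structure-of-arrays layout, a choice that commonly matters for memory-bound kernels. All names here (`HitAoS`, `HitsSoA`, `to_soa`, `radius_squared`, `compute_radii_squared`) and the per-hit computation are hypothetical. The serial loop stands in for a parallel launch; in a SYCL port, the loop body would typically become the body of a `parallel_for` over the hit index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical hit record (array-of-structures layout).
// Field names are illustrative and not taken from the paper.
struct HitAoS {
    float x;
    float y;
    float z;
};

// Structure-of-arrays layout: each coordinate is stored contiguously,
// so neighbouring work items reading x[i], x[i+1], ... touch adjacent memory.
struct HitsSoA {
    std::vector<float> x;
    std::vector<float> y;
    std::vector<float> z;
};

// Convert an AoS container into the SoA layout.
inline HitsSoA to_soa(const std::vector<HitAoS>& hits) {
    HitsSoA out;
    out.x.reserve(hits.size());
    out.y.reserve(hits.size());
    out.z.reserve(hits.size());
    for (const HitAoS& h : hits) {
        out.x.push_back(h.x);
        out.y.push_back(h.y);
        out.z.push_back(h.z);
    }
    return out;
}

// Per-element "kernel body": squared transverse distance of hit i
// from the z axis. It only reads the x and y arrays.
inline float radius_squared(const HitsSoA& hits, std::size_t i) {
    return hits.x[i] * hits.x[i] + hits.y[i] * hits.y[i];
}

// Serial stand-in for a parallel launch over all hits.
inline std::vector<float> compute_radii_squared(const HitsSoA& hits) {
    assert(hits.x.size() == hits.y.size());
    std::vector<float> r2(hits.x.size());
    for (std::size_t i = 0; i < r2.size(); ++i) {
        r2[i] = radius_squared(hits, i);
    }
    return r2;
}
```

Keeping the per-element function separate from the loop that drives it mirrors the stated goal of keeping kernel code close to the original single-threaded implementation: only the driver would need to change when moving to an accelerated backend.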