Porting SYCL accelerated neural network frameworks to edge devices

Dylan Angus, S. Georgiev, Hector Arroyo Gonzalez, J. Riordan, P. Keir, M. Goli
{"title":"将SYCL加速的神经网络框架移植到边缘设备","authors":"Dylan Angus, S. Georgiev, Hector Arroyo Gonzalez, J. Riordan, P. Keir, M. Goli","doi":"10.1145/3585341.3585346","DOIUrl":null,"url":null,"abstract":"Portable hardware acceleration has become increasingly necessary with the rise of the popularity of edge computing. Edge computing, referring to the distributed computing paradigm that encourages data to be processed and stored as close to the source of origination as possible, is needed in areas where bandwidth and latency are restricted and network stability, privacy, or security are unreliable or insecure. Examples of such situations are autonomous mobile robotics, such as autonomous tractors, which often have numerous cameras connected to the host, all needing processing in areas where there can be no reliable connection to a cloud-based platform. Additionally, bridge surveying drones, where mapping and path-planning are needed with low latency, can benefit from a lightweight, compact, low-powered device, especially when there are size and energy consumption requirements. Thus, edge devices, which work as small but compact computers, leverage onboard accelerators to tackle various Robotics, Computer Vision and AI tasks directly on the device without needing an external connection. These accelerators often take the popular form of a GPU like Nvidia’s Jetson development kit series, which are driven by the same workflows of Nvidia’s AI software and cloud-native frameworks while staying lean, compact and less energy-demanding. However, with the increasing popularity of FPGAs, in the future we could see more edge devices like AMD and Xilinx’s KR260 robotics development kit, that operate at low power. Hence, with the surge of the usefulness of edge devices and variety in the brand and type of accelerators, the need for hardware portability in edge devices expands as well. Thus, as we will show in this talk, SYCL as an open-standard, high-level parallel programming model which provides portability not only at the API level but also at the compiler level provides this hardware portability by enabling the same software to be run on both CPU, GPU and FPGA-based edge devices. Additionally, we will show how we maintain performance through device-specific kernel specialisation. The Open Neural Network Exchange (ONNX) is an open-source artificial intelligence ecosystem of technology companies and research organizations that establish open standards for representing machine learning algorithms and software tools. ONNX is available on GitHub. This presentation will explain how we used DPC++, an open source SYCL implementation, to compile the SYCL backend of the ONNX runtime, to target NVIDIA’s Jetson series architecture. DPC++ allows us to compile for the ONNX runtime SYCL backend and use the Jetson’s onboard GPU and also use ComputeAorta, Codeplay’s multi-target, multi-platform framework, as an OpenCL implementation to target the Jetson’s onboard CPU. We will show the performance we get using the ONNX runtime CPU backend and the SYCL backend targeting Jetson’s GPU and CPU. The ONNX runtime SYCL backend is implemented using the lightweight templated SYCL-BLAS and SYCL-DNN libraries that include kernels with tuning parameters such as cache size, workgroup size and local memory size based on the device-specific hardware. Once tuned for the Jetson, the SYCL backend showed comparable performance with the native CUDA backend used by ONNX. 
Finally, using the ONNX runtime SYCL backend and an Nvidia Jetson Xavier NX edge device, we will discuss ongoing work of aerial classification using image/radar data. Furthermore, we will discuss preliminary lab results to show how our stack affects latency and energy consumption and why it is so important in this use case. For future work, we hope to enable and tune SYCL-DNN/SYCL-BLAS for other Jetson devices as well as FPGA and RISC-V-based edge devices.","PeriodicalId":360830,"journal":{"name":"Proceedings of the 2023 International Workshop on OpenCL","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Porting SYCL accelerated neural network frameworks to edge devices\",\"authors\":\"Dylan Angus, S. Georgiev, Hector Arroyo Gonzalez, J. Riordan, P. Keir, M. Goli\",\"doi\":\"10.1145/3585341.3585346\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Portable hardware acceleration has become increasingly necessary with the rise of the popularity of edge computing. Edge computing, referring to the distributed computing paradigm that encourages data to be processed and stored as close to the source of origination as possible, is needed in areas where bandwidth and latency are restricted and network stability, privacy, or security are unreliable or insecure. Examples of such situations are autonomous mobile robotics, such as autonomous tractors, which often have numerous cameras connected to the host, all needing processing in areas where there can be no reliable connection to a cloud-based platform. Additionally, bridge surveying drones, where mapping and path-planning are needed with low latency, can benefit from a lightweight, compact, low-powered device, especially when there are size and energy consumption requirements. Thus, edge devices, which work as small but compact computers, leverage onboard accelerators to tackle various Robotics, Computer Vision and AI tasks directly on the device without needing an external connection. These accelerators often take the popular form of a GPU like Nvidia’s Jetson development kit series, which are driven by the same workflows of Nvidia’s AI software and cloud-native frameworks while staying lean, compact and less energy-demanding. However, with the increasing popularity of FPGAs, in the future we could see more edge devices like AMD and Xilinx’s KR260 robotics development kit, that operate at low power. Hence, with the surge of the usefulness of edge devices and variety in the brand and type of accelerators, the need for hardware portability in edge devices expands as well. Thus, as we will show in this talk, SYCL as an open-standard, high-level parallel programming model which provides portability not only at the API level but also at the compiler level provides this hardware portability by enabling the same software to be run on both CPU, GPU and FPGA-based edge devices. Additionally, we will show how we maintain performance through device-specific kernel specialisation. The Open Neural Network Exchange (ONNX) is an open-source artificial intelligence ecosystem of technology companies and research organizations that establish open standards for representing machine learning algorithms and software tools. ONNX is available on GitHub. 
This presentation will explain how we used DPC++, an open source SYCL implementation, to compile the SYCL backend of the ONNX runtime, to target NVIDIA’s Jetson series architecture. DPC++ allows us to compile for the ONNX runtime SYCL backend and use the Jetson’s onboard GPU and also use ComputeAorta, Codeplay’s multi-target, multi-platform framework, as an OpenCL implementation to target the Jetson’s onboard CPU. We will show the performance we get using the ONNX runtime CPU backend and the SYCL backend targeting Jetson’s GPU and CPU. The ONNX runtime SYCL backend is implemented using the lightweight templated SYCL-BLAS and SYCL-DNN libraries that include kernels with tuning parameters such as cache size, workgroup size and local memory size based on the device-specific hardware. Once tuned for the Jetson, the SYCL backend showed comparable performance with the native CUDA backend used by ONNX. Finally, using the ONNX runtime SYCL backend and an Nvidia Jetson Xavier NX edge device, we will discuss ongoing work of aerial classification using image/radar data. Furthermore, we will discuss preliminary lab results to show how our stack affects latency and energy consumption and why it is so important in this use case. For future work, we hope to enable and tune SYCL-DNN/SYCL-BLAS for other Jetson devices as well as FPGA and RISC-V-based edge devices.\",\"PeriodicalId\":360830,\"journal\":{\"name\":\"Proceedings of the 2023 International Workshop on OpenCL\",\"volume\":\"34 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-04-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2023 International Workshop on OpenCL\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3585341.3585346\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2023 International Workshop on OpenCL","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3585341.3585346","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Portable hardware acceleration has become increasingly necessary with the rising popularity of edge computing. Edge computing, the distributed computing paradigm in which data is processed and stored as close to its source as possible, is needed where bandwidth is limited, latency is critical, and network stability, privacy, or security cannot be relied upon. Examples include autonomous mobile robots, such as autonomous tractors, which often have numerous cameras connected to the host computer, all requiring processing in areas with no reliable connection to a cloud platform. Similarly, bridge-surveying drones, which need low-latency mapping and path planning, benefit from a lightweight, compact, low-power device, especially under size and energy-consumption constraints. Edge devices therefore act as small, self-contained computers that leverage onboard accelerators to tackle robotics, computer vision and AI tasks directly on the device, without an external connection. These accelerators most often take the form of a GPU, as in Nvidia’s Jetson development kit series, which is driven by the same workflows as Nvidia’s AI software and cloud-native frameworks while remaining lean, compact and less energy-demanding. With the increasing popularity of FPGAs, however, we may see more low-power edge devices such as AMD Xilinx’s KR260 robotics development kit. As edge devices become more useful and the brands and types of onboard accelerators diversify, the need for hardware portability on these devices grows as well.

As we will show in this talk, SYCL, an open-standard, high-level parallel programming model, provides this hardware portability not only at the API level but also at the compiler level, enabling the same software to run on CPU-, GPU- and FPGA-based edge devices. We will also show how we maintain performance through device-specific kernel specialisation. The Open Neural Network Exchange (ONNX) is an open-source artificial intelligence ecosystem of technology companies and research organizations that establishes open standards for representing machine learning algorithms and software tools; ONNX is available on GitHub.

This presentation explains how we used DPC++, an open-source SYCL implementation, to compile the SYCL backend of the ONNX Runtime to target NVIDIA’s Jetson series architecture. DPC++ lets us compile the ONNX Runtime SYCL backend for the Jetson’s onboard GPU, and also use ComputeAorta, Codeplay’s multi-target, multi-platform framework, as an OpenCL implementation to target the Jetson’s onboard CPU. We will present the performance obtained with the ONNX Runtime CPU backend and with the SYCL backend targeting the Jetson’s GPU and CPU. The ONNX Runtime SYCL backend is implemented using the lightweight, templated SYCL-BLAS and SYCL-DNN libraries, whose kernels expose tuning parameters such as cache size, work-group size and local memory size based on the device-specific hardware. Once tuned for the Jetson, the SYCL backend showed performance comparable to the native CUDA backend used by ONNX Runtime.

Finally, using the ONNX Runtime SYCL backend and an Nvidia Jetson Xavier NX edge device, we will discuss ongoing work on aerial classification using image/radar data. We will also discuss preliminary lab results showing how our stack affects latency and energy consumption, and why this matters in this use case. For future work, we hope to enable and tune SYCL-DNN/SYCL-BLAS for other Jetson devices as well as FPGA- and RISC-V-based edge devices.
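To make the portability claim concrete, the sketch below shows how a single SYCL kernel source can be dispatched to whichever device a queue is bound to; with a SYCL 2020 toolchain such as DPC++, switching between, say, a Jetson GPU and a CPU exposed through an OpenCL implementation like ComputeAorta amounts to changing the device selector. This is a minimal illustrative example, not code from the talk or from the ONNX Runtime backend, and the vector-addition kernel is an assumption made purely for illustration.

```cpp
#include <sycl/sycl.hpp>

#include <iostream>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  // Bind the queue to a device. Swapping the selector (e.g. gpu_selector_v
  // for a GPU, cpu_selector_v for an OpenCL CPU device) is the only change
  // needed to retarget the same kernel.
  sycl::queue q{sycl::default_selector_v};
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << '\n';

  {
    sycl::buffer<float> buf_a(a.data(), sycl::range<1>(n));
    sycl::buffer<float> buf_b(b.data(), sycl::range<1>(n));
    sycl::buffer<float> buf_c(c.data(), sycl::range<1>(n));

    q.submit([&](sycl::handler& h) {
      sycl::accessor A(buf_a, h, sycl::read_only);
      sycl::accessor B(buf_b, h, sycl::read_only);
      sycl::accessor C(buf_c, h, sycl::write_only, sycl::no_init);
      // The kernel body is identical on every backend the compiler targets.
      h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        C[i] = A[i] + B[i];
      });
    });
  }  // Buffer destruction synchronises results back into the host vectors.

  std::cout << "c[0] = " << c[0] << '\n';  // expected: 3
  return 0;
}
```

With the open-source DPC++ compiler this kind of source is typically built with `clang++ -fsycl`, and the backend actually used at run time depends on which SYCL/OpenCL platforms are installed on the device.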
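The abstract also notes that the SYCL-BLAS and SYCL-DNN kernels expose tuning parameters (cache size, work-group size, local memory size) so the backend can be specialised per device. The sketch below illustrates only the general idea with a hypothetical helper that caps a preferred work-group size by a device query; it does not reproduce the actual SYCL-BLAS/SYCL-DNN tuning interfaces, and the AXPY-style kernel is an illustrative assumption.

```cpp
#include <sycl/sycl.hpp>

#include <algorithm>
#include <vector>

// Hypothetical helper: cap a preferred work-group size by what the device
// actually supports. Tuned libraries pick such values per target device.
size_t pick_work_group_size(const sycl::device& dev, size_t preferred) {
  const size_t max_wg =
      dev.get_info<sycl::info::device::max_work_group_size>();
  return std::min(preferred, max_wg);
}

int main() {
  constexpr size_t n = 1 << 20;
  const float alpha = 0.5f;
  std::vector<float> x(n, 2.0f), y(n, 3.0f);

  sycl::queue q;
  // "Specialise" the launch configuration for whichever device was chosen.
  const size_t wg = pick_work_group_size(q.get_device(), 256);
  const size_t global = ((n + wg - 1) / wg) * wg;  // round n up to a multiple of wg

  {
    sycl::buffer<float> buf_x(x.data(), sycl::range<1>(n));
    sycl::buffer<float> buf_y(y.data(), sycl::range<1>(n));

    // AXPY-style kernel (y = alpha * x + y), launched with an nd_range whose
    // local size came from the device query above.
    q.submit([&](sycl::handler& h) {
      sycl::accessor X(buf_x, h, sycl::read_only);
      sycl::accessor Y(buf_y, h, sycl::read_write);
      h.parallel_for(
          sycl::nd_range<1>(sycl::range<1>(global), sycl::range<1>(wg)),
          [=](sycl::nd_item<1> item) {
            const size_t i = item.get_global_id(0);
            if (i < n) {
              Y[i] = alpha * X[i] + Y[i];
            }
          });
    });
  }  // y now holds the result on the host.

  return 0;
}
```

In practice, libraries of this kind typically fix such parameters at compile time per target (for example as template arguments chosen by benchmarking), so the specialisation happens before deployment rather than through a runtime query as in this simplified sketch.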