Fabian Kreß, Julian Höfer, Tim Hotfilter, Iris Walter, V. Sidorenko, T. Harbaum, J. Becker
{"title":"嵌入式人工智能应用中卷积神经网络推理的硬件感知划分","authors":"Fabian Kreß, Julian Höfer, Tim Hotfilter, Iris Walter, V. Sidorenko, T. Harbaum, J. Becker","doi":"10.1109/DCOSS54816.2022.00034","DOIUrl":null,"url":null,"abstract":"Embedded image processing applications like multicamera-based object detection or semantic segmentation are often based on Convolutional Neural Networks (CNNs) to provide precise and reliable results. The deployment of CNNs in embedded systems, however, imposes additional constraints such as latency restrictions and limited energy consumption in the sensor platform. These requirements have to be considered during hardware/software co-design of embedded Artifical Intelligence (AI) applications. In addition, the transmission of uncompressed image data from the sensor to a central edge node requires large bandwidth on the link, which must also be taken into account during the design phase.Therefore, we present a simulation toolchain for fast evaluation of hardware-aware CNN partitioning for embedded AI applications. This approach explores an efficient workload distribution between sensor nodes and a central edge node. Neither processing all layers close to the sensor nor transmitting all uncompressed raw data to the edge node is an optimal solution for each use case. Hence, our proposed simulation toolchain evaluates power and performance metrics for each reasonable partitioning point in a CNN. In contrast to the state of the art, our approach does not only consider the neural network architecture. In the evaluation, our simulation toolchain additionally takes into account hardware components such as special accelerators and memories that are implemented in the sensor node.Exemplary, we show the simulation results for three commonly used CNNs in embedded systems. Thereby, we identify advantageous partitioning points regarding inference latency and energy consumption. 
With the support of the toolchain, we are able to identify three beneficial partitioning points for FCN ResNet-50 and two for GoogLeNet as well as for SqueezeNet V1.1.","PeriodicalId":300416,"journal":{"name":"2022 18th International Conference on Distributed Computing in Sensor Systems (DCOSS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Hardware-aware Partitioning of Convolutional Neural Network Inference for Embedded AI Applications\",\"authors\":\"Fabian Kreß, Julian Höfer, Tim Hotfilter, Iris Walter, V. Sidorenko, T. Harbaum, J. Becker\",\"doi\":\"10.1109/DCOSS54816.2022.00034\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Embedded image processing applications like multicamera-based object detection or semantic segmentation are often based on Convolutional Neural Networks (CNNs) to provide precise and reliable results. The deployment of CNNs in embedded systems, however, imposes additional constraints such as latency restrictions and limited energy consumption in the sensor platform. These requirements have to be considered during hardware/software co-design of embedded Artifical Intelligence (AI) applications. In addition, the transmission of uncompressed image data from the sensor to a central edge node requires large bandwidth on the link, which must also be taken into account during the design phase.Therefore, we present a simulation toolchain for fast evaluation of hardware-aware CNN partitioning for embedded AI applications. This approach explores an efficient workload distribution between sensor nodes and a central edge node. Neither processing all layers close to the sensor nor transmitting all uncompressed raw data to the edge node is an optimal solution for each use case. 
Hence, our proposed simulation toolchain evaluates power and performance metrics for each reasonable partitioning point in a CNN. In contrast to the state of the art, our approach does not only consider the neural network architecture. In the evaluation, our simulation toolchain additionally takes into account hardware components such as special accelerators and memories that are implemented in the sensor node.Exemplary, we show the simulation results for three commonly used CNNs in embedded systems. Thereby, we identify advantageous partitioning points regarding inference latency and energy consumption. With the support of the toolchain, we are able to identify three beneficial partitioning points for FCN ResNet-50 and two for GoogLeNet as well as for SqueezeNet V1.1.\",\"PeriodicalId\":300416,\"journal\":{\"name\":\"2022 18th International Conference on Distributed Computing in Sensor Systems (DCOSS)\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 18th International Conference on Distributed Computing in Sensor Systems (DCOSS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DCOSS54816.2022.00034\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 18th International Conference on Distributed Computing in Sensor Systems 
(DCOSS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCOSS54816.2022.00034","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hardware-aware Partitioning of Convolutional Neural Network Inference for Embedded AI Applications
Embedded image processing applications such as multi-camera-based object detection or semantic segmentation are often based on Convolutional Neural Networks (CNNs) to provide precise and reliable results. The deployment of CNNs in embedded systems, however, imposes additional constraints such as latency restrictions and limited energy consumption in the sensor platform. These requirements have to be considered during hardware/software co-design of embedded Artificial Intelligence (AI) applications. In addition, transmitting uncompressed image data from the sensor to a central edge node requires large bandwidth on the link, which must also be taken into account during the design phase.

Therefore, we present a simulation toolchain for fast evaluation of hardware-aware CNN partitioning for embedded AI applications. This approach explores an efficient workload distribution between sensor nodes and a central edge node. Neither processing all layers close to the sensor nor transmitting all uncompressed raw data to the edge node is the optimal solution for every use case. Hence, our proposed simulation toolchain evaluates power and performance metrics for each reasonable partitioning point in a CNN. In contrast to the state of the art, our approach considers not only the neural network architecture: in the evaluation, the toolchain additionally takes into account hardware components, such as dedicated accelerators and memories, implemented in the sensor node.

As examples, we show simulation results for three CNNs commonly used in embedded systems and identify advantageous partitioning points with regard to inference latency and energy consumption. With the support of the toolchain, we identify three beneficial partitioning points for FCN ResNet-50 and two each for GoogLeNet and SqueezeNet V1.1.
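The partitioning search the abstract describes can be sketched as a simple cost model: for every candidate cut point, sum the sensor-side compute latency, the link transmission time for the cut layer's output, and the edge-side compute latency, then pick the cheapest cut. This is a minimal illustrative sketch only; the layer costs, link rate, and raw-frame size below are invented placeholders, not values or APIs from the paper.

```python
# Hypothetical cost model for choosing a CNN partitioning point between a
# sensor node and an edge node. All numbers are illustrative assumptions.

LAYERS = [
    # (name, sensor_latency_ms, edge_latency_ms, output_kbytes)
    ("conv1", 4.0, 0.5, 800.0),
    ("pool1", 1.0, 0.1, 200.0),
    ("conv2", 6.0, 0.8, 400.0),
    ("fc",    2.0, 0.2,   4.0),
]

LINK_KBYTES_PER_MS = 12.5   # assumed 100 Mbit/s sensor-to-edge link
RAW_FRAME_KBYTES = 1500.0   # assumed uncompressed input frame size

def total_latency(cut: int) -> float:
    """End-to-end latency if layers [0, cut) run on the sensor node
    and layers [cut, n) run on the edge node."""
    sensor = sum(l[1] for l in LAYERS[:cut])
    edge = sum(l[2] for l in LAYERS[cut:])
    # The link carries the output of the last sensor-side layer,
    # or the raw frame if everything runs on the edge (cut == 0).
    tx_kb = LAYERS[cut - 1][3] if cut > 0 else RAW_FRAME_KBYTES
    return sensor + tx_kb / LINK_KBYTES_PER_MS + edge

# Enumerate every cut point (0 = all on edge, len(LAYERS) = all on sensor).
best_cut = min(range(len(LAYERS) + 1), key=total_latency)
```

With these placeholder numbers the full-sensor cut wins, because the final layer's small output makes transmission cheap; with a faster link or a heavier sensor accelerator the optimum shifts, which is exactly the hardware-dependent trade-off the toolchain evaluates per use case.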