Architectural Analysis of Deep Learning on Edge Accelerators

2020 IEEE High Performance Extreme Computing Conference (HPEC) Pub Date : 2020-09-22 DOI:10.1109/HPEC43674.2020.9286209

Luke Kljucaric, A. Johnson, A. George

Citations: 9

Abstract

As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum app acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine-learning (ML) hardware performance, is to standardize a fair comparison of different hardware architectures. However, many apps are not well represented by these standards and require different workloads, such as ML models and datasets, to achieve similar goals. Additionally, many devices feature hardware optimized for data types other than 32-bit floating-point numbers, the standard representation defined by MLPerf. Edge-computing devices often feature app-specific hardware to offload common operations found in ML apps from the constrained CPU. This research analyzes multiple low-power compute architectures that feature ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency for optical character recognition. Because these models are custom and not the most widely used, many architectures are not specifically optimized for them. These models can stress devices in different, yet insightful, ways, from which generalizations about the performance of other models can be drawn. The NVIDIA Jetson AGX Xavier (AGX), Intel Neural Compute Stick 2 (NCS2), and Google Edge TPU architectures are analyzed with respect to their performance. The design of the AGX and TPU devices showcased the lowest streaming latency for AlexNet and GoogLeNet, respectively. Additionally, the tightly integrated NCS2 design showed the best generalizability in performance and efficiency across neural networks.
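
The abstract reports streaming latency but does not describe the measurement harness. As a rough illustration only, and not the authors' method, streaming latency can be approximated by timing one inference call per input (batch size 1) and summarizing the per-sample times. The names `streaming_latency_ms` and `dummy_infer` and the `warmup` parameter below are hypothetical; `dummy_infer` merely stands in for whatever device-specific inference call a given accelerator exposes.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List, TypeVar

T = TypeVar("T")


def streaming_latency_ms(
    infer: Callable[[T], object],
    samples: Iterable[T],
    warmup: int = 5,
) -> Dict[str, float]:
    """Time one inference call per sample (batch size 1), in milliseconds.

    The first `warmup` calls are executed but excluded from the statistics,
    since early calls often include one-time setup costs.
    """
    latencies: List[float] = []
    for i, sample in enumerate(samples):
        start = time.perf_counter()
        infer(sample)
        elapsed_ms = (time.perf_counter() - start) * 1e3
        if i >= warmup:
            latencies.append(elapsed_ms)
    if not latencies:
        raise ValueError("need more samples than warm-up iterations")
    return {
        "count": float(len(latencies)),
        "mean_ms": statistics.fmean(latencies),
        "median_ms": statistics.median(latencies),
        "max_ms": max(latencies),
    }


def dummy_infer(image: List[int]) -> int:
    # Hypothetical placeholder for a real accelerator call; it only sums pixels.
    return sum(image)


if __name__ == "__main__":
    fake_images = [[1, 2, 3, 4]] * 20  # stand-in inputs, not real character images
    print(streaming_latency_ms(dummy_infer, fake_images, warmup=5))
```

Reporting the median and maximum alongside the mean is a design choice of this sketch: on a shared or thermally constrained edge device, occasional slow calls can skew the mean.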

Source journal

2020 IEEE High Performance Extreme Computing Conference (HPEC)

Self-citation rate

0.00%

Number of publications