Demystifying TensorRT: Characterizing Neural Network Inference Engine on Nvidia Edge Devices

Omais Shafi, Chinmay Rai, Rijurekha Sen, Gayathri Ananthanarayanan
{"title":"Demystifying TensorRT: Characterizing Neural Network Inference Engine on Nvidia Edge Devices","authors":"Omais Shafi, Chinmay Rai, Rijurekha Sen, Gayathri Ananthanarayanan","doi":"10.1109/IISWC53511.2021.00030","DOIUrl":null,"url":null,"abstract":"Edge devices are seeing tremendous growth in sensing and computational capabilities. Running state-of-the-art deep neural network (NN) based data processing on multi-core CPU processors, embedded Graphics Processing Units (GPU), Tensor Processing Units (TPU), Neural Processing Units (NPU), Deep Learning Accelerators (DLA) etc., edge devices are now able to handle heavy data computations with limited or without cloud connectivity. In addition to hardware resources, software frameworks that optimize a trained neural network (NN) model through weight clustering and pruning, weight and input-output quantization to fewer bits, fusing NN layers etc., for more efficient execution of NN inferences on edge platforms, play an important role in making machine learning at the edge (namely EdgeML) a reality. This paper is a first effort in characterizing these software frameworks for DNN inference optimizations on edge devices, especially edge GPUs which are now ubiquitously used in all embedded deep learning systems. The interactions between software optimizations and the underlying GPU hardware is carefully examined. As most NN optimization engines are proprietary softwares with undocumented internal details in the public domain, our empirical analysis on real embedded GPU platforms using a variety of widely used DNNs, provide various interesting findings. We observe tremendous performance gain and non-negligible accuracy gain from the software optimizations, but also find highly unexpected non-deterministic behaviors such as different outputs on same inputs or increased execution latency for same NN model on more powerful hardware platforms. Application developers using these proprietary software optimization engines, would benefit from our analysis and the discussed implications of our findings, with examples from real applications like intelligent traffic intersection control and Advanced Driving Assistance Systems (ADAS). There are important implications of our findings on performance modeling and prediction research too, that focus on micro-architecture modeling based application performance prediction, but should now additionally consider optimization engines that this paper examines.","PeriodicalId":203713,"journal":{"name":"2021 IEEE International Symposium on Workload Characterization (IISWC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Symposium on Workload Characterization (IISWC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IISWC53511.2021.00030","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 18

Abstract

Edge devices are seeing tremendous growth in sensing and computational capabilities. Running state-of-the-art deep neural network (NN) based data processing on multi-core CPU processors, embedded Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), Neural Processing Units (NPUs), Deep Learning Accelerators (DLAs), etc., edge devices are now able to handle heavy data computations with limited or no cloud connectivity. In addition to hardware resources, software frameworks that optimize a trained neural network (NN) model for more efficient execution of NN inferences on edge platforms, through weight clustering and pruning, weight and input-output quantization to fewer bits, fusing NN layers, etc., play an important role in making machine learning at the edge (EdgeML) a reality. This paper is a first effort in characterizing these software frameworks for DNN inference optimization on edge devices, especially edge GPUs, which are now ubiquitous in embedded deep learning systems. The interactions between the software optimizations and the underlying GPU hardware are carefully examined. As most NN optimization engines are proprietary software with internal details undocumented in the public domain, our empirical analysis on real embedded GPU platforms, using a variety of widely used DNNs, provides various interesting findings. We observe tremendous performance gain and non-negligible accuracy gain from the software optimizations, but also find highly unexpected non-deterministic behaviors, such as different outputs on the same inputs, or increased execution latency for the same NN model on more powerful hardware platforms. Application developers using these proprietary software optimization engines would benefit from our analysis and the discussed implications of our findings, with examples from real applications like intelligent traffic intersection control and Advanced Driving Assistance Systems (ADAS). Our findings also have important implications for performance modeling and prediction research, which focuses on micro-architecture-modeling-based application performance prediction but should now additionally consider the optimization engines that this paper examines.
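As a concrete illustration of the workflow the paper characterizes, the sketch below builds a TensorRT inference engine from an ONNX model with reduced-precision optimization enabled, which triggers exactly the kinds of layer fusion and quantization passes the abstract describes. This is a minimal sketch assuming TensorRT's Python API (8.x era) and a hypothetical `model.onnx` file; it is not the paper's actual experimental setup.

```python
import tensorrt as trt

# Build a TensorRT engine from an ONNX model, letting the optimizer
# apply layer fusion and (optionally) reduced-precision kernels.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, use_fp16=True):
    builder = trt.Builder(TRT_LOGGER)
    # Explicit-batch networks are required for ONNX models.
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse ONNX model")

    config = builder.create_builder_config()
    # Allow up to 1 GiB of workspace for the layer/tactic optimizer.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    if use_fp16 and builder.platform_has_fast_fp16:
        # Opt in to FP16 kernels; TensorRT may still mix precisions
        # per layer, one of the proprietary behaviors the paper probes.
        config.set_flag(trt.BuilderFlag.FP16)

    # Returns a serialized engine that can be saved and deployed.
    return builder.build_serialized_network(network, config)

if __name__ == "__main__":
    engine_bytes = build_engine("model.onnx")  # hypothetical model file
    with open("model.engine", "wb") as f:
        f.write(bytes(engine_bytes))
```

One design note: during this build step TensorRT auto-tunes by timing candidate kernels ("tactics") on the target GPU, so the resulting engine is hardware-specific, and this auto-tuning is one plausible source of the platform-dependent and non-deterministic behaviors the abstract reports.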