Enabling Latency-Sensitive DNN Inference via Joint Optimization of Model Surgery and Resource Allocation in Heterogeneous Edge

Zhaowu Huang, Fang Dong, Dian Shen, Huitian Wang, Xiaolin Guo, Shucun Fu
{"title":"Enabling Latency-Sensitive DNN Inference via Joint Optimization of Model Surgery and Resource Allocation in Heterogeneous Edge","authors":"Zhaowu Huang, Fang Dong, Dian Shen, Huitian Wang, Xiaolin Guo, Shucun Fu","doi":"10.1145/3545008.3545071","DOIUrl":null,"url":null,"abstract":"Nowadays, edge computing is widely adopted to resolve the emerging deep neural networks (DNNs)-driven intelligence scenarios with the requirement of low-latency and high-accuracy, which includes heterogeneous end devices and DNNs. In such scenarios, the influx of data and computation into a shared edge server incurs prohibitive latency. Thus, we exploit the advantage of Multi-exit DNNs (ME-DNNs) that tasks can exit early at appropriate depths to save inference time. However, naively using ME-DNNs in the heterogeneous edge still fails to deliver fast inference due to improper model surgery and resource allocation. In this paper, we propose an Acceleration scheme for Inference based on ME-DNNs with Adaptive model surgery and resource allocation (AIMA) to accelerate DNN inferences. We model this problem as a mixed-integer programming problem that involves jointly optimizing model surgery and resource allocation to minimize the task completion time. We first determine the optimal resource allocation policy with a given model surgery decision profile, and then the model surgery decision-making is modeled as a weighted congestion game. We prove the existence of the Nash equilibrium and propose a decentralized algorithm. Extensive experimental results show that AIMA significantly outperforms the state-of-the-art methods, achieving up to 6.01 × speedup.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545071","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Edge computing is widely adopted to support emerging deep neural network (DNN)-driven intelligence scenarios that require low latency and high accuracy and involve heterogeneous end devices and DNNs. In such scenarios, the influx of data and computation into a shared edge server incurs prohibitive latency. We therefore exploit multi-exit DNNs (ME-DNNs), in which tasks can exit early at appropriate depths to save inference time. However, naively using ME-DNNs at the heterogeneous edge still fails to deliver fast inference because of improper model surgery and resource allocation. In this paper, we propose AIMA, an Acceleration scheme for Inference based on ME-DNNs with Adaptive model surgery and resource allocation, to accelerate DNN inference. We formulate the problem as a mixed-integer program that jointly optimizes model surgery and resource allocation to minimize task completion time. We first derive the optimal resource allocation policy for a given model surgery decision profile, and then model the model surgery decision-making as a weighted congestion game. We prove the existence of a Nash equilibrium and propose a decentralized algorithm to reach it. Extensive experiments show that AIMA significantly outperforms state-of-the-art methods, achieving up to a 6.01× speedup.
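
To make the early-exit idea concrete, below is a minimal, hedged sketch of a multi-exit DNN in PyTorch: a backbone with an intermediate exit branch, where a sample returns at the first exit whose softmax confidence clears a threshold. The layer sizes, the two-exit layout, and the 0.9 threshold are illustrative assumptions, not the architectures or the model surgery policy used in the paper.

```python
# Minimal multi-exit DNN sketch (illustrative only; not the paper's models).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiExitMLP(nn.Module):
    def __init__(self, in_dim=784, hidden=256, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.exit1 = nn.Linear(hidden, num_classes)   # intermediate (early) exit
        self.block2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.exit2 = nn.Linear(hidden, num_classes)   # final exit

    @torch.no_grad()
    def infer(self, x, threshold=0.9):
        """Return (logits, exit_index), stopping at the first confident exit."""
        h = self.block1(x)
        logits = self.exit1(h)
        if F.softmax(logits, dim=-1).max().item() >= threshold:
            return logits, 1                           # exit early, skip block2
        h = self.block2(h)
        return self.exit2(h), 2                        # fall through to the last exit


if __name__ == "__main__":
    model = MultiExitMLP().eval()
    sample = torch.randn(1, 784)                       # one task's input
    logits, exit_idx = model.infer(sample, threshold=0.9)
    print(f"predicted class {logits.argmax().item()} at exit {exit_idx}")
```

In this view, "model surgery" amounts to choosing which exit (depth) each task's inference is allowed to take, trading accuracy for latency.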
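The abstract also casts the model surgery decisions as a weighted congestion game solved by a decentralized algorithm. The sketch below shows generic best-response dynamics for a weighted singleton congestion game over identical resources with load-proportional cost; the cost model, the strategy space, and the example weights are assumptions for illustration, not the paper's actual formulation or algorithm.

```python
# Generic best-response dynamics for a weighted congestion game (illustrative).
from typing import List


def best_response_dynamics(weights: List[float], num_resources: int,
                           max_rounds: int = 100) -> List[int]:
    """Each player repeatedly moves to the least-loaded resource until no
    player can improve, i.e., the profile is a pure Nash equilibrium."""
    choice = [0] * len(weights)              # start every player on resource 0
    load = [0.0] * num_resources
    load[0] = sum(weights)
    for _ in range(max_rounds):
        changed = False
        for i, w in enumerate(weights):
            load[choice[i]] -= w             # tentatively leave current resource
            best = min(range(num_resources), key=lambda r: load[r] + w)
            if best != choice[i]:
                changed = True
            choice[i] = best
            load[best] += w
        if not changed:
            break                            # no player wants to deviate
    return choice


if __name__ == "__main__":
    # Four tasks with different weights contend for two resources.
    print(best_response_dynamics([3.0, 2.0, 2.0, 1.0], num_resources=2))
```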