Dělen: Enabling Flexible and Adaptive Model-serving for Multi-tenant Edge AI

Qianlin Liang, Walid A. Hanafy, Noman Bashir, A. Ali-Eldin, David E. Irwin, P. Shenoy
{"title":"为多租户边缘AI实现灵活和自适应的模型服务","authors":"Qianlin Liang, Walid A. Hanafy, Noman Bashir, A. Ali-Eldin, David E. Irwin, P. Shenoy","doi":"10.1145/3576842.3582375","DOIUrl":null,"url":null,"abstract":"Model-serving systems expose machine learning (ML) models to applications programmatically via a high-level API. Cloud platforms use these systems to mask the complexities of optimally managing resources and servicing inference requests across multiple applications. Model serving at the edge is now also becoming increasingly important to support inference workloads with tight latency requirements. However, edge model serving differs substantially from cloud model serving in its latency, energy, and accuracy constraints: these systems must support multiple applications with widely different latency and accuracy requirements on embedded edge accelerators with limited computational and energy resources. To address the problem, this paper presents Dělen,1 a flexible and adaptive model-serving system for multi-tenant edge AI. Dělen exposes a high-level API that enables individual edge applications to specify a bound at runtime on the latency, accuracy, or energy of their inference requests. We efficiently implement Dělen using conditional execution in multi-exit deep neural networks (DNNs), which enables granular control over inference requests, and evaluate it on a resource-constrained Jetson Nano edge accelerator. 
We evaluate Dělen flexibility by implementing state-of-the-art adaptation policies using Dělen’s API, and evaluate its adaptability under different workload dynamics and goals when running single and multiple applications.","PeriodicalId":266438,"journal":{"name":"Proceedings of the 8th ACM/IEEE Conference on Internet of Things Design and Implementation","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Dělen: Enabling Flexible and Adaptive Model-serving for Multi-tenant Edge AI\",\"authors\":\"Qianlin Liang, Walid A. Hanafy, Noman Bashir, A. Ali-Eldin, David E. Irwin, P. Shenoy\",\"doi\":\"10.1145/3576842.3582375\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Model-serving systems expose machine learning (ML) models to applications programmatically via a high-level API. Cloud platforms use these systems to mask the complexities of optimally managing resources and servicing inference requests across multiple applications. Model serving at the edge is now also becoming increasingly important to support inference workloads with tight latency requirements. However, edge model serving differs substantially from cloud model serving in its latency, energy, and accuracy constraints: these systems must support multiple applications with widely different latency and accuracy requirements on embedded edge accelerators with limited computational and energy resources. To address the problem, this paper presents Dělen,1 a flexible and adaptive model-serving system for multi-tenant edge AI. Dělen exposes a high-level API that enables individual edge applications to specify a bound at runtime on the latency, accuracy, or energy of their inference requests. 
We efficiently implement Dělen using conditional execution in multi-exit deep neural networks (DNNs), which enables granular control over inference requests, and evaluate it on a resource-constrained Jetson Nano edge accelerator. We evaluate Dělen flexibility by implementing state-of-the-art adaptation policies using Dělen’s API, and evaluate its adaptability under different workload dynamics and goals when running single and multiple applications.\",\"PeriodicalId\":266438,\"journal\":{\"name\":\"Proceedings of the 8th ACM/IEEE Conference on Internet of Things Design and Implementation\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 8th ACM/IEEE Conference on Internet of Things Design and Implementation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3576842.3582375\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th ACM/IEEE Conference on Internet of Things Design and Implementation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3576842.3582375","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

Model-serving systems expose machine learning (ML) models to applications programmatically via a high-level API. Cloud platforms use these systems to mask the complexities of optimally managing resources and servicing inference requests across multiple applications. Model serving at the edge is now also becoming increasingly important to support inference workloads with tight latency requirements. However, edge model serving differs substantially from cloud model serving in its latency, energy, and accuracy constraints: these systems must support multiple applications with widely different latency and accuracy requirements on embedded edge accelerators with limited computational and energy resources. To address the problem, this paper presents Dělen, a flexible and adaptive model-serving system for multi-tenant edge AI. Dělen exposes a high-level API that enables individual edge applications to specify a bound at runtime on the latency, accuracy, or energy of their inference requests. We efficiently implement Dělen using conditional execution in multi-exit deep neural networks (DNNs), which enables granular control over inference requests, and evaluate it on a resource-constrained Jetson Nano edge accelerator. We evaluate Dělen's flexibility by implementing state-of-the-art adaptation policies using Dělen's API, and evaluate its adaptability under different workload dynamics and goals when running single and multiple applications.
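The core mechanism the abstract describes — conditional execution in a multi-exit DNN, where a per-request bound decides how deep into the network an inference runs — can be illustrated with a minimal sketch. The class and method names below are hypothetical, chosen for illustration; they are not Dělen's actual API, and the "stages" here are toy callables standing in for real DNN layer blocks with attached exit heads.

```python
# Illustrative sketch of conditional execution in a multi-exit network,
# in the spirit of Dělen's design. All names here are assumptions made
# for this example, not the paper's interface.
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class MultiExitModel:
    """A toy multi-exit model: each stage refines the input and has an
    exit head producing class logits. In a real system, exit heads are
    small classifiers attached to intermediate DNN layers."""

    def __init__(self, stages, exit_heads):
        self.stages = stages          # feature transforms, run in order
        self.exit_heads = exit_heads  # one logit function per stage

    def infer(self, x, confidence_bound=0.9, max_exits=None):
        """Run stages sequentially; return at the first exit whose top-1
        softmax probability meets the caller's confidence bound. A caller
        with a tight latency or energy budget can instead cap max_exits,
        trading accuracy for fewer stages executed."""
        max_exits = max_exits or len(self.stages)
        for i, (stage, head) in enumerate(zip(self.stages, self.exit_heads)):
            x = stage(x)
            probs = softmax(head(x))
            conf = max(probs)
            if conf >= confidence_bound or i + 1 == max_exits:
                return probs.index(conf), conf, i  # (prediction, confidence, exit used)
```

A request with a loose accuracy bound exits at an early head once it is confident enough, while a request capped at one exit (a proxy for a tight latency budget) stops after the first stage regardless of confidence — this per-request knob is the kind of granular control the abstract refers to.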