E. Smirni
Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion
Published: 2017-04-18 · DOI: 10.1145/3053600.3053620
Powering the Service Responsiveness of Deep Neural Networks: How Queueing Models can Help
Deep neural networks (DNNs) enable a host of artificial intelligence applications. These applications are supported by large DNN models running in serving mode, often on cloud computing infrastructure. Given the compute-intensive nature of large DNN models, a key challenge for DNN serving systems is to minimize user request response latencies. We show and model two important properties of DNN workloads that allow the use of queueing network models for predicting user request latencies: homogeneous request service demands, and performance interference among concurrently running requests due to cache/memory contention. These properties motivate the design of a dynamic scheduling framework that is powered by an interference-aware, queueing-based analytic model. The framework is evaluated in the context of an image classification service using several well-known benchmarks. The results demonstrate its accurate latency prediction and its ability to adapt to changing load conditions, thanks to the fast deployment and accuracy of analytic queueing models. This work is in collaboration with Feng Yan of the University of Nevada at Reno, and Yuxiong He and Olatunji Ruwase of Microsoft Research.
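To illustrate why the two workload properties above make analytic queueing models attractive, here is a minimal sketch (not the authors' actual model): because DNN request service demands are homogeneous, a deterministic-service M/D/1 queue is a natural first approximation for a serving node, and an inflation factor on the service demand can stand in for the cache/memory-contention interference the abstract describes. The `interference_factor` parameter is a hypothetical knob introduced here for illustration only.

```python
def md1_response_time(arrival_rate: float,
                      base_service_time: float,
                      interference_factor: float = 1.0) -> float:
    """Mean response time (seconds) of an M/D/1 queue.

    arrival_rate        -- request arrival rate, requests/second (lambda)
    base_service_time   -- isolated per-request service demand, seconds
    interference_factor -- >= 1.0; hypothetical multiplier that inflates
                           the service demand to mimic slowdown from
                           co-running requests (cache/memory contention)
    """
    s = base_service_time * interference_factor   # effective service demand
    rho = arrival_rate * s                        # server utilization
    if rho >= 1.0:
        raise ValueError("system unstable: utilization >= 1")
    # Pollaczek-Khinchine mean wait for deterministic service:
    #   W = rho * s / (2 * (1 - rho))
    wait = rho * s / (2.0 * (1.0 - rho))
    return s + wait                               # response = service + wait

# Example: 50 req/s, 10 ms per inference, 20% interference slowdown
latency = md1_response_time(50.0, 0.010, 1.2)     # ~21 ms mean response time
```

A model this cheap to evaluate is what enables the fast, adaptive scheduling the abstract highlights: recomputing predicted latency under a candidate load takes microseconds, versus minutes for measurement-based profiling.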