E. Smirni
Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion
Published: 2017-04-18 · DOI: 10.1145/3053600.3053620
Powering the Service Responsiveness of Deep Neural Networks: How Queueing Models can Help
Deep neural networks (DNNs) enable a host of artificial intelligence applications. These applications are supported by large DNN models running in serving mode, often on cloud computing infrastructure. Given the compute-intensive nature of large DNN models, a key challenge for DNN serving systems is to minimize user request response latencies. We show and model two important properties of DNN workloads that allow the use of queueing network models for predicting user request latencies: homogeneous request service demands, and performance interference among concurrently running requests due to cache/memory contention. These properties motivate the design of a dynamic scheduling framework that is powered by an interference-aware, queueing-based analytic model. The framework is evaluated in the context of an image classification service using several well-known benchmarks. The results demonstrate its accurate latency prediction and its ability to adapt to changing load conditions, thanks to the fast deployment and accuracy of analytic queueing models. This work is in collaboration with Feng Yan of the University of Nevada at Reno, and Yuxiong He and Olatunji Ruwase of Microsoft Research.
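To illustrate why the two workload properties above make analytic queueing models attractive, here is a minimal sketch (not the authors' actual model): because DNN request service demands are homogeneous, a deterministic-service M/D/1 queue is a natural first approximation for a serving node, and an inflation factor on the service demand can stand in for the cache/memory-contention interference the abstract describes. The `interference_factor` parameter is a hypothetical knob introduced here for illustration only.

```python
def md1_response_time(arrival_rate: float,
                      base_service_time: float,
                      interference_factor: float = 1.0) -> float:
    """Mean response time (seconds) of an M/D/1 queue.

    arrival_rate        -- request arrival rate, requests/second (lambda)
    base_service_time   -- isolated per-request service demand, seconds
    interference_factor -- >= 1.0; hypothetical multiplier that inflates
                           the service demand to mimic slowdown from
                           co-running requests (cache/memory contention)
    """
    s = base_service_time * interference_factor   # effective service demand
    rho = arrival_rate * s                        # server utilization
    if rho >= 1.0:
        raise ValueError("system unstable: utilization >= 1")
    # Pollaczek-Khinchine mean wait for deterministic service:
    #   W = rho * s / (2 * (1 - rho))
    wait = rho * s / (2.0 * (1.0 - rho))
    return s + wait                               # response = service + wait

# Example: 50 req/s, 10 ms per inference, 20% interference slowdown
latency = md1_response_time(50.0, 0.010, 1.2)     # ~21 ms mean response time
```

A model this cheap to evaluate is what enables the fast, adaptive scheduling the abstract highlights: recomputing predicted latency under a candidate load takes microseconds, versus minutes for measurement-based profiling.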