Interference-Aware Scheduling for Inference Serving
Daniel Mendoza, Francisco Romero, Qian Li, N. Yadwadkar, C. Kozyrakis
Proceedings of the 1st Workshop on Machine Learning and Systems, April 2021
DOI: 10.1145/3437984.3458837
Abstract
Machine learning inference applications have proliferated across diverse domains such as healthcare, security, and analytics. Recent work has proposed inference serving systems to improve the deployment and scalability of models. To improve resource utilization, multiple models can be co-located on the same backend machine. However, co-location can degrade latency due to interference and can consequently violate latency requirements. Although interference-aware schedulers for general workloads have been introduced, they do not scale to heterogeneous inference serving systems, where the number of co-location configurations grows exponentially with the number of models and machine types. This paper proposes an interference-aware scheduler for heterogeneous inference serving systems that reduces the latency degradation caused by co-location interference. We characterize the challenges in predicting the impact of co-location interference on inference latency (e.g., latency degradation that varies across machine types), and identify properties of models and hardware that should be considered during scheduling. We then propose a unified prediction model that estimates an inference model's latency degradation under co-location, and develop an interference-aware scheduler that leverages this predictor. Our preliminary results show that our interference-aware scheduler achieves 2× lower latency degradation than a commonly used least-loaded scheduler. We also discuss future research directions for interference-aware schedulers for inference serving systems.
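The abstract describes the core mechanism as a unified predictor of co-location latency degradation that drives placement decisions. The sketch below illustrates that idea in Python; it is a minimal, hypothetical rendering, not the paper's actual system: the Machine type, the HW_SENSITIVITY table, and the predict_degradation heuristic are invented stand-ins for the learned predictor, which the paper builds from model and hardware properties.

```python
from dataclasses import dataclass, field

# All names below (Machine, HW_SENSITIVITY, predict_degradation, schedule)
# are hypothetical illustrations; the abstract does not specify the paper's
# actual interfaces or predictor features.

@dataclass
class Machine:
    name: str
    hw_type: str                                        # e.g., "cpu-skylake", "gpu-v100"
    resident: list[str] = field(default_factory=list)   # models already placed here

# Toy stand-in for the unified prediction model: degradation grows with the
# number of co-located models, scaled per hardware type (the abstract notes
# degradation varies across machine types). A real predictor would be
# learned from model and hardware properties.
HW_SENSITIVITY = {"cpu-skylake": 0.30, "gpu-v100": 0.10}

def predict_degradation(model: str, machine: Machine) -> float:
    """Estimated multiplicative latency slowdown (1.0 = no interference)
    if `model` is co-located with the machine's resident models."""
    return 1.0 + HW_SENSITIVITY[machine.hw_type] * len(machine.resident)

def schedule(model: str, machines: list[Machine], slo_slowdown: float) -> Machine:
    """Interference-aware placement: choose the machine with the lowest
    predicted degradation that still fits the slowdown budget."""
    candidates = [(predict_degradation(model, m), m) for m in machines]
    feasible = [(deg, m) for deg, m in candidates if deg <= slo_slowdown]
    if not feasible:
        raise RuntimeError("no machine meets the latency-degradation budget")
    deg, best = min(feasible, key=lambda t: t[0])
    best.resident.append(model)
    return best

if __name__ == "__main__":
    cluster = [Machine("m0", "cpu-skylake", ["resnet50"]),
               Machine("m1", "gpu-v100", ["bert-base", "resnet50"])]
    placed = schedule("mobilenet", cluster, slo_slowdown=1.25)
    print(f"placed on {placed.name}")   # m1: predicted 1.20x, within budget
```

For contrast, the least-loaded baseline the paper compares against would pick min(machines, key=lambda m: len(m.resident)); the interference-aware sketch differs only in ranking machines by predicted latency degradation rather than by load.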