SLOpt: Serving Real-Time Inference Pipeline With Strict Latency Constraint

IF 3.8 | Region 2: Computer Science | JCR Q2: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computers Pub Date : 2025-01-10 DOI:10.1109/TC.2025.3528125

Zhixin Zhao;Yitao Hu;Guotao Yang;Ziqi Gong;Chen Shen;Laiping Zhao;Wenxin Li;Xiulong Liu;Wenyu Qu

IEEE Transactions on Computers, vol. 74, no. 4, pp. 1431-1445. Journal article. https://ieeexplore.ieee.org/document/10836842/

Citations: 0

Abstract

The rise of machine learning as a service (MLaaS) has driven demand for complex, customized real-time inference tasks, which often require cascading multiple deep neural network (DNN) models into inference pipelines. However, these pipelines are hard to schedule, particularly when they must maintain strict latency service level objectives (SLOs). Existing systems serve pipelines with model-independent scheduling policies, which ignore the unique workload characteristics introduced by model cascading in the inference pipeline, leading to SLO violations and resource inefficiencies. In this paper, we propose that the serving system should exploit the model-cascading nature and inter-model workload dependency of the inference pipeline to meet strict latency SLOs cost-effectively. Based on this, we design and implement SLOpt, a serving system optimized for real-time inference pipelines with a three-stage codesign of workload estimation, resource provisioning, and request execution. SLOpt proposes cascade workload estimation and ahead-of-time tuning, which together address cascade blocking and head-of-line blocking in workload estimation and resource provisioning. SLOpt further implements an adaptive batch drop policy to mitigate latency amplification within the pipeline. These innovations enable SLOpt to reduce the 99th percentile latency (P99 latency) by 1.4 to 2.5 times compared to state-of-the-art systems while lowering serving costs by up to 29%. Moreover, to achieve comparable P99 latency, SLOpt requires up to 70% less cost than existing systems. Extensive evaluations on a 64-GPU cluster demonstrate SLOpt's effectiveness in meeting strict P99 latency SLOs under diverse real-world workloads.
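As a reading aid for the P99 metric used above, the following is a minimal sketch of how a 99th percentile latency and an SLO check could be computed from per-request latency samples. It is not taken from the paper: the function names `p99_latency` and `meets_slo`, the millisecond unit, and the nearest-rank percentile definition are all assumptions chosen for illustration, and SLOpt's own measurement method may differ.

```python
def p99_latency(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty collection of latencies.

    Assumption: the nearest-rank definition is used; other percentile
    definitions (e.g. with interpolation) give slightly different values.
    """
    ordered = sorted(latencies_ms)
    if not ordered:
        raise ValueError("need at least one latency sample")
    # 1-based rank = ceil(0.99 * n), computed with integers to avoid
    # floating-point rounding surprises.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def meets_slo(latencies_ms, slo_ms):
    """True if the P99 latency of the samples is within the SLO (hypothetical check)."""
    return p99_latency(latencies_ms) <= slo_ms
```

With 100 samples the nearest-rank P99 is the 99th smallest value, so a single slow request does not violate the SLO, but two do.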




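Source journal

IEEE Transactions on Computers (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 6.60

Self-citation rate: 5.40%

Articles published: 199

Review time: 6.0 months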

About the journal: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.