Cost-Efficient Serverless Inference Serving with Joint Batching and Multi-Processing

Shen Cai, Zhi Zhou, Kongyange Zhao, Xu Chen
Proceedings of the 14th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '23), published 2023-08-24. DOI: 10.1145/3609510.3609816

Abstract

With the emergence of machine learning, many commercial companies increasingly use machine learning inference systems as backend services to improve their products. Serverless computing is a modern paradigm that provides auto-scaling, event-driven services, making it particularly well-suited to various domains, including video stream analysis, IoT services, and machine learning applications. The flexible scaling of serverless computing is adept at handling the burstiness of ML workloads. However, despite this compatibility with ML inference tasks, the cost of serverless inference systems remains relatively high compared to traditional serving paradigms, primarily due to under-utilization of the CPU resources offered by serverless platforms. To tackle this challenge, we design and deploy a serverless inference serving system that incorporates batching and multi-process mechanisms to improve cost efficiency. By applying a change-point detection algorithm to manage bursty workloads, the system optimizes resource usage and achieves lower cost. We employ an Amazon EC2 server to handle request packaging and to run the core Bayesian Optimization algorithm, which requires no prior information. The preliminary system, implemented on AWS Lambda, significantly reduces expenses, saving up to 62% compared to the original serverless inference system.
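The joint configuration problem the abstract describes — choosing a batch size and a per-invocation process count to minimize cost per request under a latency bound — can be illustrated with a toy cost model. AWS Lambda bills roughly in GB-seconds and allocates vCPU share in proportion to configured memory, so batching amortizes fixed invocation overhead while extra worker processes exploit the extra vCPUs. The paper searches this space with Bayesian Optimization; the exhaustive sweep below is only a stand-in, and all constants (`overhead_s`, `per_req_s`, memory per worker) are illustrative assumptions, not values from the paper:

```python
def invocation_cost(batch_size, n_procs, price_per_gb_s=0.0000166667):
    """Toy Lambda billing model: memory_GB * duration_s * unit price."""
    mem_gb = 0.5 * n_procs      # assume ~0.5 GB configured per worker process
    overhead_s = 0.2            # assumed fixed per-invocation overhead
    per_req_s = 0.05            # assumed single-process inference time per request
    # Batched work is split across processes; assume near-linear speedup.
    duration_s = overhead_s + per_req_s * batch_size / n_procs
    return mem_gb * duration_s * price_per_gb_s

def cost_per_request(batch_size, n_procs):
    return invocation_cost(batch_size, n_procs) / batch_size

def best_config(max_batch=32, max_procs=6, latency_slo_s=1.0):
    """Sweep (batch size, process count); keep the cheapest config
    whose invocation duration stays within the latency SLO."""
    best = None
    for b in range(1, max_batch + 1):
        for p in range(1, max_procs + 1):
            duration = 0.2 + 0.05 * b / p
            if duration > latency_slo_s:
                continue
            c = cost_per_request(b, p)
            if best is None or c < best[0]:
                best = (c, b, p)
    return best
```

Under this model the per-request cost works out to `(0.1 * p / b + 0.025) * price`, so the optimum pushes the batch size as high as the latency bound allows for a given process count — which matches the intuition that batching amortizes overhead while multi-processing buys back latency.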
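The abstract does not say which change-point detection algorithm is used to track bursty workloads; a classic choice for this kind of online rate monitoring is a two-sided CUSUM detector over per-second request counts. The sketch below is a minimal stand-in with hypothetical `drift` and `threshold` values, not the paper's method:

```python
def cusum_changepoints(rates, target=None, drift=2.0, threshold=10.0):
    """Return indices where the request rate shifts up or down,
    using a two-sided CUSUM with re-anchoring after each detection."""
    if target is None:
        target = rates[0]       # initial baseline rate
    pos, neg = 0.0, 0.0         # cumulative sums for upward / downward shifts
    points = []
    for i, r in enumerate(rates):
        pos = max(0.0, pos + (r - target) - drift)
        neg = max(0.0, neg - (r - target) - drift)
        if pos > threshold or neg > threshold:
            points.append(i)
            target = r          # re-anchor the baseline on the new level
            pos, neg = 0.0, 0.0
    return points
```

On a workload that jumps from a steady 10 req/s to 40 req/s, the detector flags the first sample of the burst, at which point a serving system could switch to a configuration provisioned for the higher rate.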