PERSEUS: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models

Matthew LeMay, Shijian Li, Tian Guo
{"title":"PERSEUS: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models","authors":"Matthew LeMay, Shijian Li, Tian Guo","doi":"10.1109/IC2E48712.2020.00014","DOIUrl":null,"url":null,"abstract":"Deep learning models are increasingly used for end-user applications, supporting both novel features such as facial recognition, and traditional features, e.g. web search. To accommodate high inference throughput, it is common to host a single pre-trained Convolutional Neural Network (CNN) in dedicated cloud-based servers with hardware accelerators such as Graphics Processing Units (GPUs). However, GPUs can be orders of magnitude more expensive than traditional Central Processing Unit (CPU) servers. These resources could also be under-utilized facing dynamic workloads, which may result in inflated serving costs. One potential way to alleviate this problem is by allowing hosted models to share the underlying resources, which we refer to as multi-tenant inference serving. One of the key challenges is maximizing the resource efficiency for multi-tenant serving given hardware with diverse characteristics, models with unique response time Service Level Agreement (SLA), and dynamic inference workloads. In this paper, we present PERSEUS, a measurement framework that provides the basis for understanding the performance and cost trade-offs of multi-tenant model serving. We implemented PERSEUS in Python atop a popular cloud inference server called Nvidia TensorRT Inference Server. Leveraging PERSEUS, we evaluated the inference throughput and cost for serving various models and demonstrated that multi-tenant model serving led to up to 12% cost reduction.","PeriodicalId":173494,"journal":{"name":"2020 IEEE International Conference on Cloud Engineering (IC2E)","volume":"316 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Conference on Cloud Engineering (IC2E)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IC2E48712.2020.00014","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

Deep learning models are increasingly used in end-user applications, supporting both novel features such as facial recognition and traditional features such as web search. To sustain high inference throughput, it is common to host a single pre-trained Convolutional Neural Network (CNN) on dedicated cloud servers equipped with hardware accelerators such as Graphics Processing Units (GPUs). However, GPUs can be orders of magnitude more expensive than traditional Central Processing Unit (CPU) servers, and under dynamic workloads these resources can sit under-utilized, inflating serving costs. One way to alleviate this problem is to let hosted models share the underlying resources, which we refer to as multi-tenant inference serving. A key challenge is maximizing resource efficiency for multi-tenant serving given hardware with diverse characteristics, models with distinct response-time Service Level Agreements (SLAs), and dynamic inference workloads. In this paper, we present PERSEUS, a measurement framework that provides the basis for understanding the performance and cost trade-offs of multi-tenant model serving. We implemented PERSEUS in Python atop a popular cloud inference server, the Nvidia TensorRT Inference Server. Using PERSEUS, we evaluated the inference throughput and cost of serving various models and demonstrated that multi-tenant model serving can reduce cost by up to 12%.
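To make the measurement idea concrete, below is a minimal sketch of a PERSEUS-style throughput-and-cost probe. This is not the authors' implementation: it assumes a Triton Inference Server (the successor of the Nvidia TensorRT Inference Server used in the paper) running on localhost:8000, a hypothetical model named "resnet50" with tensors "input"/"output", and a purely illustrative instance price of $0.90 per hour.

```python
# Hedged sketch of a PERSEUS-style measurement client (not the authors' code).
# Assumptions: a Triton Inference Server is serving a model named "resnet50"
# with tensors "input"/"output" on localhost:8000; the $0.90/hour price is
# illustrative, not a quoted cloud rate.
import time

import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "resnet50"          # hypothetical model name
INSTANCE_PRICE_PER_HOUR = 0.90   # hypothetical on-demand price (USD)
DURATION_S = 30                  # length of the measurement window

client = httpclient.InferenceServerClient(url="localhost:8000")

# One synthetic 224x224 RGB image, a standard CNN input shape.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# Issue requests back-to-back for a fixed window and count completions.
completed = 0
start = time.time()
while time.time() - start < DURATION_S:
    client.infer(model_name=MODEL_NAME, inputs=[infer_input])
    completed += 1

throughput = completed / DURATION_S                     # inferences/second
cost_per_million = (INSTANCE_PRICE_PER_HOUR / 3600.0    # dollars per second
                    / throughput) * 1_000_000           # dollars per 1M inferences
print(f"throughput: {throughput:.1f} inf/s, "
      f"cost: ${cost_per_million:.2f} per million inferences")
```

A single closed-loop client like this measures a lower bound on throughput; repeating the loop for several models loaded on the same server, and comparing each model's cost share against a dedicated-server baseline, is the kind of comparison behind the paper's up-to-12% cost-reduction result.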