SplitRPC: A {Control + Data} Path Splitting RPC Stack for ML Inference Serving

Adithya Kumar, A. Sivasubramaniam, T. Zhu
{"title":"SplitRPC:一个用于ML推理服务的{Control + Data}路径分裂RPC堆栈","authors":"Adithya Kumar, A. Sivasubramaniam, T. Zhu","doi":"10.1145/3589974","DOIUrl":null,"url":null,"abstract":"The growing adoption of hardware accelerators driven by their intelligent compiler and runtime system counterparts has democratized ML services and precipitously reduced their execution times. This motivates us to shift our attention to efficiently serve these ML services under distributed settings and characterize the overheads imposed by the RPC mechanism ('RPC tax') when serving them on accelerators. The RPC implementations designed over the years implicitly assume the host CPU services the requests, and we focus on expanding such works towards accelerator-based services. While recent proposals calling for SmartNICs to take on this task are reasonable for simple kernels, serving complex ML models requires a more nuanced view to optimize both the data-path and the control/orchestration of these accelerators. We program today's commodity network interface cards (NICs) to split the control and data paths for effective transfer of control while efficiently transferring the payload to the accelerator. As opposed to unified approaches that bundle these paths together, limiting the flexibility in each of these paths, we design and implement SplitRPC - a control + data path optimizing RPC mechanism for ML inference serving. SplitRPC allows us to optimize the datapath to the accelerator while simultaneously allowing the CPU to maintain full orchestration capabilities. We implement SplitRPC on both commodity NICs and SmartNICs and demonstrate how GPU-based ML services running different compiler/runtime systems can benefit. For a variety of ML models served using different inference runtimes, we demonstrate that SplitRPC is effective in minimizing the RPC tax while providing significant gains in throughput and latency over existing kernel by-pass approaches, without requiring expensive SmartNIC devices.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SplitRPC: A {Control + Data} Path Splitting RPC Stack for ML Inference Serving\",\"authors\":\"Adithya Kumar, A. Sivasubramaniam, T. Zhu\",\"doi\":\"10.1145/3589974\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The growing adoption of hardware accelerators driven by their intelligent compiler and runtime system counterparts has democratized ML services and precipitously reduced their execution times. This motivates us to shift our attention to efficiently serve these ML services under distributed settings and characterize the overheads imposed by the RPC mechanism ('RPC tax') when serving them on accelerators. The RPC implementations designed over the years implicitly assume the host CPU services the requests, and we focus on expanding such works towards accelerator-based services. While recent proposals calling for SmartNICs to take on this task are reasonable for simple kernels, serving complex ML models requires a more nuanced view to optimize both the data-path and the control/orchestration of these accelerators. 
We program today's commodity network interface cards (NICs) to split the control and data paths for effective transfer of control while efficiently transferring the payload to the accelerator. As opposed to unified approaches that bundle these paths together, limiting the flexibility in each of these paths, we design and implement SplitRPC - a control + data path optimizing RPC mechanism for ML inference serving. SplitRPC allows us to optimize the datapath to the accelerator while simultaneously allowing the CPU to maintain full orchestration capabilities. We implement SplitRPC on both commodity NICs and SmartNICs and demonstrate how GPU-based ML services running different compiler/runtime systems can benefit. For a variety of ML models served using different inference runtimes, we demonstrate that SplitRPC is effective in minimizing the RPC tax while providing significant gains in throughput and latency over existing kernel by-pass approaches, without requiring expensive SmartNIC devices.\",\"PeriodicalId\":426760,\"journal\":{\"name\":\"Proceedings of the ACM on Measurement and Analysis of Computing Systems\",\"volume\":\"49 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM on Measurement and Analysis of Computing Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3589974\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3589974","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

The growing adoption of hardware accelerators, driven by their intelligent compiler and runtime counterparts, has democratized ML services and dramatically reduced their execution times. This motivates us to shift our attention to serving these ML services efficiently in distributed settings and to characterize the overheads imposed by the RPC mechanism (the 'RPC tax') when serving them on accelerators. RPC implementations designed over the years implicitly assume that the host CPU services the requests; we focus on extending such work to accelerator-based services. While recent proposals calling for SmartNICs to take on this task are reasonable for simple kernels, serving complex ML models requires a more nuanced view that optimizes both the data path and the control/orchestration of these accelerators. We program today's commodity network interface cards (NICs) to split the control and data paths, effectively transferring control while efficiently moving the payload to the accelerator. In contrast to unified approaches that bundle these paths together, limiting the flexibility of each, we design and implement SplitRPC, a control + data path optimizing RPC mechanism for ML inference serving. SplitRPC lets us optimize the data path to the accelerator while the CPU retains full orchestration capabilities. We implement SplitRPC on both commodity NICs and SmartNICs and demonstrate how GPU-based ML services running different compiler/runtime systems can benefit. For a variety of ML models served with different inference runtimes, we show that SplitRPC minimizes the RPC tax and delivers significant gains in throughput and latency over existing kernel-bypass approaches, without requiring expensive SmartNIC devices.
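To make the contrast concrete, the sketch below illustrates, in CUDA C++, what splitting the control and data paths means for a single inference request. It is a simulation of the idea, not SplitRPC's implementation: the `nic_rx_control` and `nic_rx_payload_to_gpu` helpers, the `RpcHeader` layout, and the payload size are hypothetical stand-ins for a NIC that has been programmed to deliver the small RPC header to the CPU while steering the tensor payload straight into GPU memory.

```cpp
// Minimal, illustrative sketch of the split {control + data} path idea.
// NOT the SplitRPC implementation: the two nic_rx_* stubs below merely
// simulate a NIC programmed to steer the small RPC header to the CPU
// (control path) and the tensor payload toward GPU memory (data path),
// instead of delivering one bundled message into host RPC buffers.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <vector>

struct RpcHeader {            // tiny control message, parsed on the CPU
    uint64_t request_id;
    uint32_t model_id;
    uint32_t payload_bytes;
};

// --- Hypothetical NIC hooks (stand-ins for programmed NIC behavior) ---

// Control path: only the header reaches the CPU, so orchestration
// (model selection, batching, kernel scheduling) stays on the host.
RpcHeader nic_rx_control() {
    return RpcHeader{/*request_id=*/1, /*model_id=*/7, /*payload_bytes=*/1u << 20};
}

// Data path: the payload is placed into a GPU buffer. A real split path
// would DMA directly from the wire (GPUDirect-style); the async copy here
// only simulates that placement, without a second host staging copy.
void nic_rx_payload_to_gpu(void* gpu_dst, const void* wire_payload,
                           uint32_t bytes, cudaStream_t stream) {
    cudaMemcpyAsync(gpu_dst, wire_payload, bytes,
                    cudaMemcpyHostToDevice, stream);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Receive the control message; the CPU keeps full orchestration capability.
    RpcHeader hdr = nic_rx_control();

    // GPU input buffer that the "NIC" deposits the payload into.
    void* gpu_input = nullptr;
    cudaMalloc(&gpu_input, hdr.payload_bytes);

    // Fake on-the-wire payload, used only to drive the simulation.
    std::vector<uint8_t> wire(hdr.payload_bytes, 0xAB);
    nic_rx_payload_to_gpu(gpu_input, wire.data(), hdr.payload_bytes, stream);

    // At this point the CPU would enqueue the inference kernels for
    // hdr.model_id on `stream`; the payload never took a detour through
    // an intermediate host RPC buffer as it would on a unified path.
    cudaStreamSynchronize(stream);
    printf("request %llu: %u-byte payload ready on GPU for model %u\n",
           (unsigned long long)hdr.request_id, hdr.payload_bytes, hdr.model_id);

    cudaFree(gpu_input);
    cudaStreamDestroy(stream);
    return 0;
}
```

A unified RPC stack would instead deserialize header and payload together in host memory and only then copy the tensor to the device, adding a staging copy and coupling both paths to the same transport choices; the split design keeps the cheap control message on the CPU while letting the bulky data path take the shortest route to the accelerator.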