Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training

Anqi Guo, Y. Hao, Chunshu Wu, Pouya Haghi, Zhenyu Pan, Min Si, Dingwen Tao, Ang Li, Martin C. Herbordt, Tong Geng
{"title":"Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training","authors":"Anqi Guo, Y. Hao, Chunshu Wu, Pouya Haghi, Zhenyu Pan, Min Si, Dingwen Tao, Ang Li, Martin C. Herbordt, Tong Geng","doi":"10.1145/3577193.3593724","DOIUrl":null,"url":null,"abstract":"Deep Learning Recommendation Models (DLRMs) are important applications in various domains and have evolved into one of the largest and most important machine learning applications. With their trillions of parameters necessarily exceeding the high bandwidth memory (HBM) capacity of GPUs, ever more massive DLRMs require large-scale multi-node systems for distributed training and inference. However, these all suffer from the all-to-all communication bottleneck, which limits scalability. SmartNICs couple computation and communication capabilities to provide powerful network-facing heterogeneous devices that reduce communication overhead. There has not, however, been a distributed system design that fully leverages SmartNIC resources to address scalability of DLRMs. We propose a software-hardware co-design of a heterogeneous SmartNIC system that overcomes the communication bottleneck of distributed DLRMs, mitigates the pressure on memory bandwidth, and improves computation efficiency. We provide a set of SmartNIC designs of cache systems (including local cache and remote cache) and SmartNIC computation kernels that reduce data movement, relieve memory lookup intensity, and improve the GPU's computation efficiency. In addition, we propose a graph algorithm that improves the data locality of queries within batches and optimizes the overall system performance with higher data reuse. Our evaluation shows that the system achieves 2.1× latency speedup for inference and 1.6× throughput speedup for training.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"87 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577193.3593724","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Deep Learning Recommendation Models (DLRMs) are important applications in various domains and have evolved into some of the largest and most important machine learning workloads. With trillions of parameters that necessarily exceed the high-bandwidth memory (HBM) capacity of GPUs, ever more massive DLRMs require large-scale multi-node systems for distributed training and inference. However, these systems all suffer from the all-to-all communication bottleneck, which limits scalability. SmartNICs couple computation and communication capabilities to provide powerful network-facing heterogeneous devices that reduce communication overhead. There has not, however, been a distributed system design that fully leverages SmartNIC resources to address the scalability of DLRMs. We propose a software-hardware co-design of a heterogeneous SmartNIC system that overcomes the communication bottleneck of distributed DLRMs, mitigates the pressure on memory bandwidth, and improves computation efficiency. We provide a set of SmartNIC cache designs (including a local cache and a remote cache) and SmartNIC computation kernels that reduce data movement, relieve memory-lookup intensity, and improve the GPU's computation efficiency. In addition, we propose a graph algorithm that improves the data locality of queries within batches and optimizes overall system performance through higher data reuse. Our evaluation shows that the system achieves a 2.1× latency speedup for inference and a 1.6× throughput speedup for training.
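The abstract describes these mechanisms only at a high level. The toy Python sketch below is not the authors' implementation; it simply illustrates, under assumed names (`LocalEmbeddingCache`, `reorder_batch_for_reuse`), the two ideas the abstract mentions: caching hot embedding rows near the network device so fewer lookups reach GPU HBM, and reordering queries within a batch so that queries sharing embedding IDs are served back-to-back for higher data reuse (a greedy stand-in for the paper's graph-based locality optimization).

```python
# Minimal illustrative sketch only -- NOT the paper's implementation.
# All class/function names are hypothetical, chosen for this example.

from collections import OrderedDict


class LocalEmbeddingCache:
    """LRU cache over embedding rows; a stand-in for a SmartNIC-side local cache."""

    def __init__(self, capacity, backing_table):
        self.capacity = capacity
        self.backing_table = backing_table   # full embedding table (e.g., in GPU HBM)
        self.cache = OrderedDict()           # embedding id -> embedding row
        self.hits = 0
        self.misses = 0

    def lookup(self, emb_id):
        if emb_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(emb_id)   # mark as most recently used
            return self.cache[emb_id]
        self.misses += 1                      # would trigger a remote/HBM fetch
        row = self.backing_table[emb_id]
        self.cache[emb_id] = row
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used row
        return row


def reorder_batch_for_reuse(batch):
    """Greedy stand-in for graph-based reordering: place queries that share
    embedding IDs next to each other to raise cache hit rates."""
    remaining = list(batch)
    ordered = [remaining.pop(0)]
    while remaining:
        last_ids = set(ordered[-1])
        # pick the remaining query with the most IDs in common with the last one
        nxt = max(remaining, key=lambda q: len(last_ids & set(q)))
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered


if __name__ == "__main__":
    table = {i: [float(i)] * 4 for i in range(1000)}          # toy embedding table
    batch = [[1, 2, 3], [7, 8, 9], [2, 3, 4], [8, 9, 10]]     # toy lookup queries
    cache = LocalEmbeddingCache(capacity=8, backing_table=table)
    for query in reorder_batch_for_reuse(batch):
        for emb_id in query:
            cache.lookup(emb_id)
    print(f"hits={cache.hits} misses={cache.misses}")
```

In the actual system, the cache and the reordering would live on the SmartNIC's heterogeneous resources rather than in host Python; the sketch only shows why locality-aware batching raises the hit rate of a small near-network cache.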