EMS-i : An Efficient Memory System Design with Specialized Caching Mechanism for Recommendation Inference

IF 2.8 3区计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Embedded Computing Systems Pub Date : 2023-09-09 DOI:10.1145/3609384

Yitu Wang, Shiyu Li, Qilin Zheng, Andrew Chang, Hai Li, Yiran Chen

{"title":"<scp>EMS-i</scp> : An Efficient Memory System Design with Specialized Caching Mechanism for Recommendation Inference","authors":"Yitu Wang, Shiyu Li, Qilin Zheng, Andrew Chang, Hai Li, Yiran Chen","doi":"10.1145/3609384","DOIUrl":null,"url":null,"abstract":"Recommendation systems have been widely embedded into many Internet services. For example, Meta’s deep learning recommendation model (DLRM) shows high prefictive accuracy of click-through rate in processing large-scale embedding tables. The SparseLengthSum (SLS) kernel of the DLRM dominates the inference time of the DLRM due to intensive irregular memory accesses to the embedding vectors. Some prior works directly adopt near data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchy induces low performance-cost ratio and fails to fully exploit the data locality. Although some software-managed cache policies were proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable considering the high overheads of executing the corresponding programs and the communication between the host and the accelerator. To address the issues aforementioned, we propose EMS-i , an efficient memory system design that integrates Solide State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve the performance. In addition, we delicately design the inference kernel and develop a customized mapping scheme for SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to the state-of-the-art NDP solutions, EMS-i achieves up to 10.9× speedup over RecSSD and the performance comparable to RecNMP with 72% energy savings. EMS-i also saves up to 8.7× and 6.6 × memory cost w.r.t. RecSSD and RecNMP, respectively.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":"695 1","pages":"0"},"PeriodicalIF":2.8000,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Embedded Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3609384","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

Abstract

Recommendation systems have been widely embedded into many Internet services. For example, Meta’s deep learning recommendation model (DLRM) shows high prefictive accuracy of click-through rate in processing large-scale embedding tables. The SparseLengthSum (SLS) kernel of the DLRM dominates the inference time of the DLRM due to intensive irregular memory accesses to the embedding vectors. Some prior works directly adopt near data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchy induces low performance-cost ratio and fails to fully exploit the data locality. Although some software-managed cache policies were proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable considering the high overheads of executing the corresponding programs and the communication between the host and the accelerator. To address the issues aforementioned, we propose EMS-i , an efficient memory system design that integrates Solide State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve the performance. In addition, we delicately design the inference kernel and develop a customized mapping scheme for SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to the state-of-the-art NDP solutions, EMS-i achieves up to 10.9× speedup over RecSSD and the performance comparable to RecNMP with 72% energy savings. EMS-i also saves up to 8.7× and 6.6 × memory cost w.r.t. RecSSD and RecNMP, respectively.

查看原文本刊更多论文

EMS-i:一种高效的内存系统设计，具有专门的推荐推理缓存机制

推荐系统已经被广泛地嵌入到许多互联网服务中。例如，Meta的深度学习推荐模型(DLRM)在处理大规模嵌入表时显示出很高的点击率预测准确率。DLRM的SparseLengthSum (SLS)内核由于对嵌入向量的密集不规则内存访问而支配了DLRM的推理时间。先前的一些工作直接采用近数据处理(NDP)解决方案，以获得更高的内存带宽来加速SLS。然而，它们较低的内存层次结构导致了较低的性能成本比，并且不能充分利用数据的局部性。尽管提出了一些软件管理的缓存策略来提高缓存命中率，但考虑到执行相应程序的高开销以及主机和加速器之间的通信，所产生的缓存丢失惩罚是不可接受的。为了解决上述问题，我们提出了EMS-i，这是一种高效的内存系统设计，它将固态硬盘(SSD)集成到内存层次结构中，使用Compute Express Link (CXL)进行推荐系统推理。根据不同DLRM工作负载的特点，对缓存机制进行了细化，提出了一种新的预取机制，进一步提高了性能。此外，考虑到SLS中的多级并行性和批量查询中的数据局部性，我们精心设计了推理内核，并为SLS操作开发了定制的映射方案。与最先进的NDP解决方案相比，EMS-i实现了比RecSSD高达10.9倍的加速，性能与RecNMP相当，节能72%。与RecSSD和RecNMP相比，EMS-i还可分别节省高达8.7倍和6.6倍的内存成本。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ACM Transactions on Embedded Computing Systems 工程技术-计算机：软件工程

CiteScore

3.70

自引率

0.00%

发文量

138

审稿时长

6 months

期刊介绍： The design of embedded computing systems, both the software and hardware, increasingly relies on sophisticated algorithms, analytical models, and methodologies. ACM Transactions on Embedded Computing Systems (TECS) aims to present the leading work relating to the analysis, design, behavior, and experience with embedded computing systems.