PIMPR: PIM-based Personalized Recommendation with Heterogeneous Memory Hierarchy

Tao Yang, Hui Ma, Yilong Zhao, Fangxin Liu, Zhezhi He, Xiaoli Sun, Li Jiang
DOI: 10.23919/DATE56975.2023.10137249
Published in: 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Publication date: 2023-04-01

Abstract

Deep learning-based personalized recommendation models (DLRMs) dominate AI workloads in data centers. The performance bottleneck of typical DLRMs lies mainly in the memory-bound embedding layers. Resistive Random Access Memory (ReRAM)-based Processing-in-Memory (PIM) architectures are a natural fit for DLRMs thanks to their in-situ computation and high computational density. However, two challenges remain before DLRMs can fully embrace ReRAM-based PIM architectures: 1) the size of a DLRM's embedding tables can reach tens of GBs, far beyond the memory capacity of typical ReRAM chips; 2) the irregular sparsity of the embedding layers is difficult to exploit in ReRAM crossbar architectures. In this paper, we present a PIM-based DLRM accelerator named PIMPR. PIMPR has a heterogeneous memory hierarchy: ReRAM crossbar-based PIM modules serve as computing caches with high computational parallelism, while DIMM modules hold the entire embedding table, leveraging the data locality of DLRM embedding layers. Moreover, we propose a runtime strategy that skips the useless computation induced by the sparsity and an offline strategy that balances the workload across ReRAM crossbars. Compared to the state-of-the-art DLRM accelerators SPACE and TRiM, PIMPR achieves on average 2.02× and 1.79× speedup and 5.6× and 5.1× energy reduction, respectively.
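The heterogeneous-hierarchy idea in the abstract can be pictured as a two-tier embedding lookup: a small set of frequently accessed rows is served from a fast compute-capable tier (the ReRAM PIM "computing cache"), while the full table resides in a larger, slower tier (DIMM). The following is a minimal illustrative sketch of that caching pattern only, not the authors' implementation; the hot-row selection, table sizes, and tier representation here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ROWS, DIM, HOT = 1000, 8, 100              # table rows, embedding width, cached rows

table = rng.standard_normal((ROWS, DIM))    # full embedding table: the DIMM tier
hot_ids = set(range(HOT))                   # most-frequent rows (assumed chosen offline)
pim_cache = {i: table[i] for i in hot_ids}  # fast tier: stand-in for the ReRAM PIM cache

def lookup_pooled(ids):
    """Gather the rows for one sparse feature and sum-pool them,
    counting how many lookups the fast tier served."""
    hits = 0
    acc = np.zeros(DIM)
    for i in ids:
        if i in pim_cache:   # hot row: served by the PIM computing cache
            acc += pim_cache[i]
            hits += 1
        else:                # cold row: falls back to the DIMM tier
            acc += table[i]
    return acc, hits

pooled, hits = lookup_pooled([3, 7, 500])   # rows 3 and 7 are hot, 500 is cold
```

Because embedding accesses in recommendation workloads are highly skewed, even a small cached fraction can absorb most lookups, which is the locality the abstract says PIMPR exploits.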