PIMPR: PIM-based Personalized Recommendation with Heterogeneous Memory Hierarchy

Tao Yang, Hui Ma, Yilong Zhao, Fangxin Liu, Zhezhi He, Xiaoli Sun, Li Jiang
DOI: 10.23919/DATE56975.2023.10137249
Published in: 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Publication date: 2023-04-01

Abstract

Deep learning-based personalized recommendation models (DLRMs) dominate AI workloads in data centers. The performance bottleneck of typical DLRMs lies mainly in the memory-bound embedding layers. Resistive Random Access Memory (ReRAM)-based Processing-in-Memory (PIM) architectures are a natural fit for DLRMs thanks to their in-situ computation and high computational density. However, two challenges remain before DLRMs can fully embrace ReRAM-based PIM architectures: 1) the size of a DLRM's embedding tables can reach tens of GBs, far beyond the memory capacity of typical ReRAM chips; 2) the irregular sparsity of the embedding layers is difficult to exploit in ReRAM crossbar architectures. In this paper, we present a PIM-based DLRM accelerator named PIMPR. PIMPR has a heterogeneous memory hierarchy: ReRAM crossbar-based PIM modules serve as computing caches with high computational parallelism, while DIMM modules hold the entire embedding table, leveraging the data locality of DLRM embedding layers. Moreover, we propose a runtime strategy that skips the useless computation induced by the sparsity and an offline strategy that balances the workload across ReRAM crossbars. Compared to the state-of-the-art DLRM accelerators SPACE and TRiM, PIMPR achieves on average 2.02× and 1.79× speedup and 5.6× and 5.1× energy reduction, respectively.
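The heterogeneous-hierarchy idea in the abstract can be pictured as a two-tier embedding lookup: a small set of frequently accessed rows is served from a fast compute-capable tier (the ReRAM PIM "computing cache"), while the full table resides in a larger, slower tier (DIMM). The following is a minimal illustrative sketch of that caching pattern only, not the authors' implementation; the hot-row selection, table sizes, and tier representation here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ROWS, DIM, HOT = 1000, 8, 100              # table rows, embedding width, cached rows

table = rng.standard_normal((ROWS, DIM))    # full embedding table: the DIMM tier
hot_ids = set(range(HOT))                   # most-frequent rows (assumed chosen offline)
pim_cache = {i: table[i] for i in hot_ids}  # fast tier: stand-in for the ReRAM PIM cache

def lookup_pooled(ids):
    """Gather the rows for one sparse feature and sum-pool them,
    counting how many lookups the fast tier served."""
    hits = 0
    acc = np.zeros(DIM)
    for i in ids:
        if i in pim_cache:   # hot row: served by the PIM computing cache
            acc += pim_cache[i]
            hits += 1
        else:                # cold row: falls back to the DIMM tier
            acc += table[i]
    return acc, hits

pooled, hits = lookup_pooled([3, 7, 500])   # rows 3 and 7 are hot, 500 is cold
```

Because embedding accesses in recommendation workloads are highly skewed, even a small cached fraction can absorb most lookups, which is the locality the abstract says PIMPR exploits.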