ARCHER: a ReRAM-based accelerator for compressed recommendation systems

IF 3.4 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS
Xinyang Shen, Xiaofei Liao, Long Zheng, Yu Huang, Dan Chen, Hai Jin
{"title":"ARCHER:基于 ReRAM 的压缩推荐系统加速器","authors":"Xinyang Shen, Xiaofei Liao, Long Zheng, Yu Huang, Dan Chen, Hai Jin","doi":"10.1007/s11704-023-3397-x","DOIUrl":null,"url":null,"abstract":"<p>Modern recommendation systems are widely used in modern data centers. The random and sparse embedding lookup operations are the main performance bottleneck for processing recommendation systems on traditional platforms as they induce abundant data movements between computing units and memory. ReRAM-based processing-in-memory (PIM) can resolve this problem by processing embedding vectors where they are stored. However, the embedding table can easily exceed the capacity limit of a monolithic ReRAM-based PIM chip, which induces off-chip accesses that may offset the PIM profits. Therefore, we deploy the decomposed model on-chip and leverage the high computing efficiency of ReRAM to compensate for the decompression performance loss. In this paper, we propose ARCHER, a ReRAM-based PIM architecture that implements fully on-chip recommendations under resource constraints. First, we make a full analysis of the computation pattern and access pattern on the decomposed table. Based on the computation pattern, we unify the operations of each layer of the decomposed model in multiply-and-accumulate operations. Based on the access observation, we propose a hierarchical mapping schema and a specialized hardware design to maximize resource utilization. Under the unified computation and mapping strategy, we can coordinate the inter-processing elements pipeline. The evaluation shows that ARCHER outperforms the state-of-the-art GPU-based DLRM system, the state-of-the-art near-memory processing recommendation system RecNMP, and the ReRAM-based recommendation accelerator REREC by 15.79×, 2.21×, and 1.21× in terms of performance and 56.06×, 6.45×, and 1.71× in terms of energy savings, respectively.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"36 1","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2023-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ARCHER: a ReRAM-based accelerator for compressed recommendation systems\",\"authors\":\"Xinyang Shen, Xiaofei Liao, Long Zheng, Yu Huang, Dan Chen, Hai Jin\",\"doi\":\"10.1007/s11704-023-3397-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Modern recommendation systems are widely used in modern data centers. The random and sparse embedding lookup operations are the main performance bottleneck for processing recommendation systems on traditional platforms as they induce abundant data movements between computing units and memory. ReRAM-based processing-in-memory (PIM) can resolve this problem by processing embedding vectors where they are stored. However, the embedding table can easily exceed the capacity limit of a monolithic ReRAM-based PIM chip, which induces off-chip accesses that may offset the PIM profits. Therefore, we deploy the decomposed model on-chip and leverage the high computing efficiency of ReRAM to compensate for the decompression performance loss. In this paper, we propose ARCHER, a ReRAM-based PIM architecture that implements fully on-chip recommendations under resource constraints. First, we make a full analysis of the computation pattern and access pattern on the decomposed table. Based on the computation pattern, we unify the operations of each layer of the decomposed model in multiply-and-accumulate operations. 
Based on the access observation, we propose a hierarchical mapping schema and a specialized hardware design to maximize resource utilization. Under the unified computation and mapping strategy, we can coordinate the inter-processing elements pipeline. The evaluation shows that ARCHER outperforms the state-of-the-art GPU-based DLRM system, the state-of-the-art near-memory processing recommendation system RecNMP, and the ReRAM-based recommendation accelerator REREC by 15.79×, 2.21×, and 1.21× in terms of performance and 56.06×, 6.45×, and 1.71× in terms of energy savings, respectively.</p>\",\"PeriodicalId\":12640,\"journal\":{\"name\":\"Frontiers of Computer Science\",\"volume\":\"36 1\",\"pages\":\"\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2023-12-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers of Computer Science\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s11704-023-3397-x\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers of Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11704-023-3397-x","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citation count: 0

Abstract


Recommendation systems are widely deployed in modern data centers. Random, sparse embedding lookups are the main performance bottleneck when running recommendation systems on traditional platforms, because they induce abundant data movement between compute units and memory. ReRAM-based processing-in-memory (PIM) can resolve this problem by processing embedding vectors where they are stored. However, the embedding table can easily exceed the capacity of a monolithic ReRAM-based PIM chip, which induces off-chip accesses that may offset the benefits of PIM. Compressing the embedding table through model decomposition lets it fit on chip, at the cost of extra computation to reconstruct embedding vectors. We therefore deploy the decomposed model on chip and leverage the high computing efficiency of ReRAM to compensate for this decompression overhead. In this paper, we propose ARCHER, a ReRAM-based PIM architecture that implements fully on-chip recommendation under resource constraints. First, we thoroughly analyze the computation and access patterns of the decomposed table. Based on the computation pattern, we unify the operations of each layer of the decomposed model as multiply-and-accumulate operations. Based on the access pattern, we propose a hierarchical mapping scheme and a specialized hardware design to maximize resource utilization. Under the unified computation and mapping strategy, we coordinate the inter-processing-element pipeline. The evaluation shows that ARCHER outperforms the state-of-the-art GPU-based DLRM system, the state-of-the-art near-memory-processing recommendation system RecNMP, and the ReRAM-based recommendation accelerator REREC by 15.79×, 2.21×, and 1.21× in performance, with 56.06×, 6.45×, and 1.71× energy savings, respectively.
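
The core trade-off can be illustrated with a small sketch. The Python/NumPy snippet below is a minimal illustration, not the paper's implementation: it assumes a simple two-factor low-rank decomposition of the embedding table (the abstract does not specify the exact decomposition or dimensions, so the sizes here are hypothetical) and shows why reconstructing an embedding row from the decomposed factors reduces to multiply-and-accumulate operations, which is exactly the kind of work a ReRAM crossbar can perform in place.

```python
# Minimal sketch (assumptions, not the paper's method): a two-factor
# low-rank decomposition E ~= A @ B of an embedding table E.
import numpy as np

num_rows, dim, rank = 1_000_000, 64, 16

# Full table: num_rows * dim values must reside in memory, and a lookup
# is a random row read (memory-bound, essentially no arithmetic).
# Decomposed factors: num_rows * rank + rank * dim values -- much smaller,
# so they can fit on chip.
A = np.random.randn(num_rows, rank).astype(np.float32)
B = np.random.randn(rank, dim).astype(np.float32)

def lookup(row_ids: np.ndarray) -> np.ndarray:
    # Reconstructing the requested rows is a small matrix multiply:
    # rank * dim multiply-and-accumulate operations per row.
    return A[row_ids] @ B

vecs = lookup(np.array([3, 42, 7]))
print(vecs.shape)  # (3, 64)
```

The sparse row gather that dominated the original workload is replaced by dense multiply-and-accumulate work over much smaller on-chip factors, which is the pattern ReRAM-based PIM executes efficiently.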

Source journal: Frontiers of Computer Science (COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, SOFTWARE ENGINEERING)
CiteScore: 8.60
Self-citation rate: 2.40%
Annual article count: 799
Review time: 6-12 weeks
Journal description: Frontiers of Computer Science aims to provide a forum for the publication of peer-reviewed papers to promote rapid communication and exchange between computer scientists. The journal publishes research papers and review articles on a wide range of topics, including architecture, software, artificial intelligence, theoretical computer science, networks and communication, information systems, multimedia and graphics, information security, and interdisciplinary work. The journal especially encourages papers from newly emerging and multidisciplinary areas, as well as papers reflecting international trends in research and development and special topics reporting progress made by Chinese computer scientists.