Training personalized recommendation systems from (GPU) scratch: look forward not backwards

Youngeun Kwon, Minsoo Rhu
{"title":"Training personalized recommendation systems from (GPU) scratch: look forward not backwards","authors":"Youngeun Kwon, Minsoo Rhu","doi":"10.1145/3470496.3527386","DOIUrl":null,"url":null,"abstract":"Personalized recommendation models (RecSys) are one of the most popular machine learning workload serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirements, reaching hundreds of GBs to TBs of model size. In RecSys, the so-called embedding layers account for the majority of memory usage so current systems employ a hybrid CPU-GPU design to have the large CPU memory store the memory hungry embedding layers. Unfortunately, training embeddings involve several memory bandwidth intensive operations which is at odds with the slow CPU memory, causing performance overheads. Prior work proposed to cache frequently accessed embeddings inside GPU memory as means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such cache design. In this work, we present a fundamentally different approach in designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that not only sees the past but also the \"future\" cache accesses. 
ScratchPipe exploits such property to guarantee that the active working set of embedding layers can \"always\" be captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"117 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 49th Annual International Symposium on Computer Architecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3470496.3527386","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 11

Abstract

Personalized recommendation models (RecSys) are one of the most popular machine learning workloads served by hyperscalers. A critical challenge in training RecSys is its high memory capacity requirement, with model sizes reaching hundreds of GBs to TBs. In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers. Unfortunately, training embeddings involves several memory-bandwidth-intensive operations, which are at odds with the slow CPU memory and cause performance overheads. Prior work proposed caching frequently accessed embeddings inside GPU memory as a means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such a cache design. In this work, we present a fundamentally different approach to designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that sees not only the past but also the "future" cache accesses. ScratchPipe exploits this property to guarantee that the active working set of the embedding layers can "always" be captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.
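The key observation behind the "future" cache accesses is that training inputs are read from a dataset, so the embedding indices of upcoming batches can be inspected before those batches actually train. The minimal sketch below illustrates that idea with a lookahead-driven GPU-side cache; all names (`LookaheadEmbeddingCache`, `prefetch`, etc.) and the LRU policy are illustrative assumptions for exposition, not the paper's actual ScratchPipe implementation.

```python
# Illustrative sketch (not the paper's implementation): a GPU-resident
# embedding cache that is told about *future* batches, so the working set
# of the next training steps is staged before any lookup happens.
from collections import OrderedDict

class LookaheadEmbeddingCache:
    def __init__(self, capacity, cpu_table):
        self.capacity = capacity        # max rows resident in (fast) GPU memory
        self.cpu_table = cpu_table      # embedding id -> vector, in (slow) CPU memory
        self.gpu_cache = OrderedDict()  # LRU order: id -> vector

    def prefetch(self, future_batches):
        """Runahead step: inspect upcoming batches and stage their embeddings."""
        for batch in future_batches:
            for idx in batch:
                self._insert(idx)

    def lookup(self, idx):
        """Training-time access; hits are served at GPU-memory speed."""
        if idx in self.gpu_cache:
            self.gpu_cache.move_to_end(idx)  # refresh LRU position
            return self.gpu_cache[idx]
        # Miss: fall back to slow CPU memory -- exactly what lookahead avoids.
        return self._insert(idx)

    def _insert(self, idx):
        if idx not in self.gpu_cache and len(self.gpu_cache) >= self.capacity:
            self.gpu_cache.popitem(last=False)  # evict least-recently used row
        self.gpu_cache.setdefault(idx, self.cpu_table[idx])
        self.gpu_cache.move_to_end(idx)
        return self.gpu_cache[idx]

# Because the next batches were prefetched, every lookup below is a hit:
cpu_table = {i: [float(i)] * 4 for i in range(1000)}
cache = LookaheadEmbeddingCache(capacity=64, cpu_table=cpu_table)
next_batches = [[3, 17, 256], [256, 42]]
cache.prefetch(next_batches)
assert all(i in cache.gpu_cache for i in [3, 17, 256, 42])
```

The contrast with a purely history-based (reactive) cache is that here the prefetcher never guesses: the indices it stages are exactly those the imminent batches will touch, which is what lets the working set "always" be captured.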