Training personalized recommendation systems from (GPU) scratch: look forward not backwards

Youngeun Kwon, Minsoo Rhu
{"title":"Training personalized recommendation systems from (GPU) scratch: look forward not backwards","authors":"Youngeun Kwon, Minsoo Rhu","doi":"10.1145/3470496.3527386","DOIUrl":null,"url":null,"abstract":"Personalized recommendation models (RecSys) are one of the most popular machine learning workload serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirements, reaching hundreds of GBs to TBs of model size. In RecSys, the so-called embedding layers account for the majority of memory usage so current systems employ a hybrid CPU-GPU design to have the large CPU memory store the memory hungry embedding layers. Unfortunately, training embeddings involve several memory bandwidth intensive operations which is at odds with the slow CPU memory, causing performance overheads. Prior work proposed to cache frequently accessed embeddings inside GPU memory as means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such cache design. In this work, we present a fundamentally different approach in designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that not only sees the past but also the \"future\" cache accesses. 
ScratchPipe exploits such property to guarantee that the active working set of embedding layers can \"always\" be captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"117 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 49th Annual International Symposium on Computer Architecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3470496.3527386","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 11

Abstract

Personalized recommendation models (RecSys) are one of the most popular machine learning workloads served by hyperscalers. A critical challenge in training RecSys is its high memory capacity requirement, with model sizes reaching hundreds of GBs to TBs. In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers. Unfortunately, training embeddings involves several memory-bandwidth-intensive operations, which are at odds with the slow CPU memory and cause performance overheads. Prior work proposed caching frequently accessed embeddings inside GPU memory as a means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such a cache design. In this work, we present a fundamentally different approach to designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that sees not only the past but also the "future" cache accesses. ScratchPipe exploits this property to guarantee that the active working set of the embedding layers can "always" be captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.
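The key observation behind the "future" cache accesses is that training inputs are read from a dataset, so the embedding indices of upcoming batches can be inspected before those batches actually train. The minimal sketch below illustrates that idea with a lookahead-driven GPU-side cache; all names (`LookaheadEmbeddingCache`, `prefetch`, etc.) and the LRU policy are illustrative assumptions for exposition, not the paper's actual ScratchPipe implementation.

```python
# Illustrative sketch (not the paper's implementation): a GPU-resident
# embedding cache that is told about *future* batches, so the working set
# of the next training steps is staged before any lookup happens.
from collections import OrderedDict

class LookaheadEmbeddingCache:
    def __init__(self, capacity, cpu_table):
        self.capacity = capacity        # max rows resident in (fast) GPU memory
        self.cpu_table = cpu_table      # embedding id -> vector, in (slow) CPU memory
        self.gpu_cache = OrderedDict()  # LRU order: id -> vector

    def prefetch(self, future_batches):
        """Runahead step: inspect upcoming batches and stage their embeddings."""
        for batch in future_batches:
            for idx in batch:
                self._insert(idx)

    def lookup(self, idx):
        """Training-time access; hits are served at GPU-memory speed."""
        if idx in self.gpu_cache:
            self.gpu_cache.move_to_end(idx)  # refresh LRU position
            return self.gpu_cache[idx]
        # Miss: fall back to slow CPU memory -- exactly what lookahead avoids.
        return self._insert(idx)

    def _insert(self, idx):
        if idx not in self.gpu_cache and len(self.gpu_cache) >= self.capacity:
            self.gpu_cache.popitem(last=False)  # evict least-recently used row
        self.gpu_cache.setdefault(idx, self.cpu_table[idx])
        self.gpu_cache.move_to_end(idx)
        return self.gpu_cache[idx]

# Because the next batches were prefetched, every lookup below is a hit:
cpu_table = {i: [float(i)] * 4 for i in range(1000)}
cache = LookaheadEmbeddingCache(capacity=64, cpu_table=cpu_table)
next_batches = [[3, 17, 256], [256, 42]]
cache.prefetch(next_batches)
assert all(i in cache.gpu_cache for i in [3, 17, 256, 42])
```

The contrast with a purely history-based (reactive) cache is that here the prefetcher never guesses: the indices it stages are exactly those the imminent batches will touch, which is what lets the working set "always" be captured.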